Some trivia about how GPU registers are organized.

I have written about the GPU (the Graphics Processing Unit) at length. Going only by what I have written so far, my readers might think that its registers are defined the same way as those of any main CPU. To the contrary, GPU registers are organized differently at the hardware level, in a way optimized for raster-based graphics output.

Within GPU / graphics-oriented / shader coding, there exists a type of language which is ‘closest to the machine’, a kind of ‘Assembler Language for GPUs’, called “ARB”. Few shader-designers actually use it anymore; they use a high-level language instead, such as ‘HLSL’ for the DirectX platform, or ‘GLSL’ for the OpenGL platform. Yet, especially since drivers have been designed that use the GPU for general-purpose (data-oriented) programming, it might be good to glance at what ARB defines.

One major difference between main CPU registers and GPU registers is that, by default, each GPU register is organized as a 4-element vector of 32-bit, floating-point numbers. The GPU is designed at the hardware level to be able to perform certain math operations on the entire 4-element vector, in one step if need be. And within ARB, a notation exists by which the register name can be given a dot, and can then be followed by such pieces of text as:

  • .xyz – Referring to the set of the first 3 elements (for scene or model coordinates),
  • .uv – Referring to the set of the first 2 elements (for textures),
  • .rst – Referring to the set of the first 3 elements – again (for 3D textures, volume-texture coordinates).

Also, notations exist in which the order of these elements gets switched around. Therefore, if the ARB code specifies this:

  • r0.uv

It is specifying not only register (0), but the first 2, 32-bit, floating-point elements, within (r0), in their natural order.
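To make the idea of selecting and reordering elements concrete, here is a minimal sketch in Python. It is not ARB code. It models a register as a plain 4-element tuple, and the names `COMPONENT_INDEX` and `swizzle` are hypothetical, chosen only for this illustration. It also assumes only the component names x, y, z and w.

```python
# Sketch only: a GPU register modelled as a 4-element tuple of floats.
# COMPONENT_INDEX and swizzle() are illustrative names, not part of any API.
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(register, pattern):
    """Return the elements of `register` named by `pattern`, in that order."""
    return tuple(register[COMPONENT_INDEX[name]] for name in pattern)


r0 = (1.0, 2.0, 3.0, 4.0)
first_three = swizzle(r0, "xyz")  # the first 3 elements, in natural order
reordered = swizzle(r0, "zyx")    # the same 3 elements, switched around
```

The point of the sketch is only that the text after the dot names both which elements are meant, and in what order.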

This observation needs to be modified somewhat, before it gives an accurate picture of modern GPU registers.

Firstly, I have written elsewhere on my blog that, as data passes from a Vertex Shader to a Fragment Shader, that data needs to be interpolated (:1), so that the interpolated value gets used by the FS to render one pixel to the screen. By default this data may contain texture coordinates, but it can really consist of virtually any combination of values. The interpolation is carried out by specialized hardware in a GPU core group, and for that reason, some upper limit exists on how many such registers can be interpolated.

(Updated 5/04/2019, 23h35 … )

(As of 13h25 : )

Additionally, GPU operations exist which use their parameters as integers, just as it is with the main CPU, and modern GPUs also support 64-bit floating-point values. This last detail is usually not necessary for 3D graphics, but is certainly necessary whenever the GPU is being used to perform plain computations that could form part of some scientific work.
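A small sketch of why the extra width can matter for computation: the Python snippet below runs on the CPU, not the GPU, and simply rounds a 64-bit value to 32 bits and back using the standard `struct` module. The helper name `to_float32` is hypothetical.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


exact_in_64_bits = 16777217.0  # 2**24 + 1, exactly representable in 64 bits
rounded_in_32_bits = to_float32(exact_in_64_bits)
```

A 32-bit float has a 24-bit significand, so the odd integer just above 2**24 cannot be stored exactly and gets rounded to a neighbour.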

I’m not entirely sure how 64-bit floating-point values are arranged, on the GPU. I only know that modern GPUs support them.



(:1) What I read was that a bit can be sent to a rendering pipeline, which changes the method of interpolation from ‘perspective-corrected’ to ‘linear’, or which switches it off completely, so that only one provoking vertex defines the value to be received by an FS.

Because I have not seen any examples in which this bit is set, with results useful for graphics, I’m just ignoring this possibility for the moment.
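For readers who want to see the difference between these three modes, the following is a minimal sketch in Python of the arithmetic involved, for one value across one triangle. The function name, the parameter names, and the choice of vertex 0 as the provoking vertex are all assumptions made for this illustration. Real hardware does this in fixed-function units.

```python
def interpolate(values, weights, clip_w, mode="perspective"):
    """Interpolate one per-vertex value for one fragment of a triangle.

    values  : the value output by the Vertex Shader at each of the 3 vertices
    weights : screen-space barycentric weights of the fragment (they sum to 1)
    clip_w  : the clip-space w of each of the 3 vertices
    """
    if mode == "flat":
        # Assumption: vertex 0 is the provoking vertex.
        return values[0]
    if mode == "linear":
        return sum(b * v for b, v in zip(weights, values))
    # Perspective-corrected: interpolate v/w and 1/w linearly, then divide.
    numerator = sum(b * v / w for b, v, w in zip(weights, values, clip_w))
    denominator = sum(b / w for b, w in zip(weights, clip_w))
    return numerator / denominator


values = (0.0, 1.0, 0.0)
weights = (0.5, 0.5, 0.0)  # halfway along the edge between vertices 0 and 1
clip_w = (1.0, 3.0, 1.0)   # vertex 1 is three times farther from the camera

linear_result = interpolate(values, weights, clip_w, mode="linear")
perspective_result = interpolate(values, weights, clip_w, mode="perspective")
flat_result = interpolate(values, weights, clip_w, mode="flat")
```

When all three vertices have the same w, the perspective-corrected and linear results coincide. They only differ when the triangle recedes into the scene.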


I suppose that I should also add, that if a 4×4 matrix is represented by GPU-registers, this is usually done as a series of 4 registers…
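As a sketch of why 4 registers are a natural fit, the Python below treats each row of the matrix as one 4-element register, and transforms a vector with 4 dot products, one per row. Storing the rows (rather than the columns) one per register is an assumption made here, and `dp4` and `transform` are hypothetical names.

```python
def dp4(a, b):
    """4-element dot product of two registers."""
    return sum(p * q for p, q in zip(a, b))


def transform(matrix_rows, vector):
    """Multiply a 4x4 matrix, held as 4 row registers, by a 4-element vector."""
    return tuple(dp4(row, vector) for row in matrix_rows)


# A translation by (5, 6, 7), stored one row per register.
translate = (
    (1.0, 0.0, 0.0, 5.0),
    (0.0, 1.0, 0.0, 6.0),
    (0.0, 0.0, 1.0, 7.0),
    (0.0, 0.0, 0.0, 1.0),
)
point = (1.0, 2.0, 3.0, 1.0)
moved = transform(translate, point)
```

Each output element depends on one whole register of the matrix and on the whole input vector, which is the kind of operation a 4-wide register layout handles in few steps.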

Further, according to what I just wrote, each GPU-register has a width of 4×32 = 128 bits. According to what I read elsewhere, modern GPUs additionally offer so-called ‘double-width’ registers, that are 256 bits wide. But what I do not know specifically, is whether they are meant to be dereferenced as 8×32 bits, or only as 4×64 bits…


(Update 5/04/2019, 17h00 : )

One observation which I should add, describes what happens in the Vertex Array, which is also referred to as the Vertex List, in which each element is a structure describing one model-vertex.

What does not happen is that positions are wasted in this Vertex List. This is especially good because by now, high-polygon models exist, where any waste of positions could represent a major waste of graphics memory.

Additionally, the ‘shader constants’, which in OpenGL coding are referred to as “Uniforms”, don’t need to take up multiples of 4 values.

A typical example of what the structure of one vertex may hold, is a 3-element position vector, in model-space, a 3-element normal vector, in model space, and a (UV) texture coordinate set, that consists of 2 elements. As common sense would have it, such a vertex structure will only need to take up 8, 32-bit members, contiguously.
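The tight packing can be checked with a short sketch, using Python's standard `struct` module to lay out such a vertex in memory. The names `VERTEX_FORMAT`, `pack_vertex` and `pack_vertex_list` are hypothetical, and little-endian byte order is assumed.

```python
import struct

# 3 floats of position, 3 floats of normal, 2 floats of UV, with no padding.
VERTEX_FORMAT = "<3f3f2f"
VERTEX_SIZE = struct.calcsize(VERTEX_FORMAT)


def pack_vertex(position, normal, uv):
    """Pack one vertex structure into 8 contiguous 32-bit floats."""
    return struct.pack(VERTEX_FORMAT, *position, *normal, *uv)


def pack_vertex_list(vertices):
    """Concatenate packed vertices into one contiguous Vertex List."""
    return b"".join(pack_vertex(*vertex) for vertex in vertices)


vertex_list = pack_vertex_list([
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0)),
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0)),
])
```

Each vertex takes 8 × 4 = 32 bytes, where padding every attribute out to 4 elements would have taken 12 × 4 = 48 bytes.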

Further, assuming that the platform version is high enough to support a Vertex Shader, the actual contents – the attributes – of one vertex structure may be arbitrary to some degree, and defined by the game engine as well as by the model editor. It’s then up to the Vertex Shader to read those attributes, and to apply the relevant calculations to them. If a Vertex Shader has been defined, a Pixel / Fragment Shader must also be defined, that processes whatever ‘Varying’ variables have been output by the Vertex Shader. In some cases, Attributes in the Vertex Structure may have a reserved function, either unknown to the shader-designer, or useless to the one shader.

But what a typical Vertex Shader will do, when programmed to read this vertex structure is:

  • Read 3 floating-point numbers from the vertex array, assign them to the first 3 elements of a certain register, and set the 4th element of that register to its default, which in OpenGL is (1.0),
  • Read the next 3 floating-point numbers from the vertex array, assign them to the first 3 elements of a different register, and again set the 4th element of that register to its default,
  • (…)
  • Read 2 more floating-point numbers from the vertex array, assign them to the first 2 elements of yet another register, and set the last 2 elements of that register to their defaults, which in OpenGL are (0.0) and (1.0).
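The steps above can be sketched in Python as follows. The function `load_attribute` is a hypothetical name, and the fill values are passed in explicitly rather than hard-coded, because they depend on the API. The tuple used here, (0.0, 0.0, 0.0, 1.0), is what OpenGL specifies for missing attribute components, and is labeled as an assumption in the code.

```python
# Assumption: OpenGL's defaults for attribute components that are not supplied.
OPENGL_DEFAULTS = (0.0, 0.0, 0.0, 1.0)


def load_attribute(values, fill):
    """Copy `values` into a 4-element register, taking the rest from `fill`."""
    return tuple(values) + tuple(fill[len(values):])


position_register = load_attribute((1.0, 2.0, 3.0), OPENGL_DEFAULTS)
uv_register = load_attribute((0.25, 0.75), OPENGL_DEFAULTS)
```

Whatever the fill values are, the register seen by the shader is always 4 elements wide, even though only 3 or 2 values were stored in the Vertex List.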

This is fully consistent with how the GPU core can read, as input, a number of elements which is fewer than 4. In a similar vein, I’ve seen some shader-programmers put bad code into their shaders, that includes the following two types of instructions:

  • mov r0.x, r1.xyz
  • mov r0.x, r1.x

Those two instructions, if the compiler accepts both, will not have the same effect. The first would read 3 values from (r1) and write them into elements belonging to (r0), starting at element (x). The second will only read 1 value from (r1), and write that to element (x) of (r0). The first of these two forms is written badly, even if the programmer knows what effect it will have. It should be rewritten:

  • mov r0.xyz, r1.xyz

One assumption which I should add to this posting would be, that the GPU core is much less flexible, in how to write the result from a computation – the output – to a destination register, than it is, in how to read input to a computation, from a source register. If only one element was being written, it might be fine to specify which element to write it to ad-hoc. But as soon as more than one element is to be written as output, most of the defined operations will have zero flexibility, about which element of the output register to start from.
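Here is a minimal sketch, in Python, of a move instruction with this kind of inflexible output. It is modelled on how I understand write masks to behave in ARB assembly, which is an assumption on my part: the source operand is first expanded to 4 elements, and the mask on the destination then only decides which elements get written, never where they land. The names `INDEX` and `mov` are hypothetical.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def mov(dest, write_mask, source, swizzle="xyzw"):
    """Return the new contents of `dest` after a masked, element-wise move.

    `swizzle` must name either 1 or 4 elements of `source`.
    """
    if len(swizzle) == 1:
        swizzle = swizzle * 4  # a single element is replicated to all 4
    operand = [source[INDEX[name]] for name in swizzle]
    result = list(dest)
    for name in write_mask:
        # Element i of the operand can only go to element i of the destination.
        result[INDEX[name]] = operand[INDEX[name]]
    return tuple(result)


r0 = (0.0, 0.0, 0.0, 0.0)
r1 = (1.0, 2.0, 3.0, 4.0)

one_element = mov(r0, "x", r1, "x")  # like: mov r0.x, r1.x
three_elements = mov(r0, "xyz", r1)  # like: mov r0.xyz, r1.xyz
middle_two = mov(r0, "yz", r1)       # y goes to y, and z goes to z
```

In this model there is no way to say ‘write 3 values starting at element (x)’. The flexibility sits entirely on the input side, in the swizzle.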


(Update 5/04/2019, 19h25 : )

I should also mention how this arrangement differs today, given a modern GPU and Unified Shader Model, from how it once existed, when DirectX 9 / OpenGL 2 ruled the day, and when the GPU had a fixed set of Vertex Pipelines, in addition to a fixed set of Pixel Pipelines.

According to ‘the old way’ of handling hardware-accelerated graphics, the coordinates at which texture images were to be sampled needed to stem from these interpolation registers, which accepted input from the Vertex Pipelines. Actual textures needed to be bound to a specific interpolation unit – as they still do – at least until one shader-pass was complete, and communication between shader-passes was limited for the most part, to alpha-blending with the output alpha channel.

There needed to be an exception to this, which would be used with ‘environment bump-mapping’. What would happen if ‘in-scene’ surface reflections were computed was, the environment-image to be reflected was assigned to a cube-map, and a Fragment Shader / Pixel Shader stage would compute where on the cube-map, a camera-space reflection vector was to land, so that a later stage of the same shader-pass could sample the cube-map at the point struck, and incorporate the texel fetched from there, into final output from the shader-pass.
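The computation which such a stage performs can be sketched in Python as below. The reflection formula R = I − 2(N·I)N is the standard one for a unit-length normal, while the function names `reflect` and `cube_face`, and the face labels, are hypothetical. The sketch only picks the face of the cube which the reflection vector strikes, and leaves out the coordinates within that face.

```python
def reflect(incident, normal):
    """Reflect the incident vector about a unit-length surface normal."""
    d = sum(i * n for i, n in zip(incident, normal))
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))


def cube_face(direction):
    """Name the cube-map face struck by `direction` (its major axis)."""
    axis = max(range(3), key=lambda k: abs(direction[k]))
    sign = "+" if direction[axis] >= 0.0 else "-"
    return sign + "XYZ"[axis]


view_ray = (0.0, 0.0, -1.0)       # looking straight down the negative Z axis
surface_normal = (0.0, 0.0, 1.0)  # surface facing the camera
reflection = reflect(view_ray, surface_normal)
face = cube_face(reflection)
```

The important detail is that the direction used to sample the cube-map is computed per fragment, instead of being passed down from the vertices and interpolated.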

This was a big deal because it implied that Fragment Shaders were able to write changes to the samplers (once for the same output-fragment), the coordinates of which were no longer just passed down from the Vertex Pipelines, and interpolated. It meant that FS-stage-to-FS-stage communication could take place in an additional way.

By default, only Texture Coordinate Index Zero was designed to support this. The texture bound to it, had to correspond to the environment cube.

In modern times, no such limit exists anymore, and a Fragment Shader can define a set of coordinates as often as desired, for the same Fragment Shader invocation to sample a texture at.



