## Some trivia about how GPU registers are organized.

I have written about the GPU – the Graphics Processing Unit – at length. And just going by what I wrote so far, my readers might think that its registers are defined the same way, as those of any main CPU. But to the contrary, GPU registers are organized differently at the hardware level, in a way most-optimized for raster-based graphics output.

Within GPU / graphics-oriented / shader coding, there exists a type of language which is ‘closest to the machine’, and which is a kind of ‘Assembler Language for GPUs’, that being called “ARB“. Few shader-designers actually use it anymore, instead using a high-level language, such as ‘HLSL’ for the DirectX platform, or such as ‘GLSL’ for the OpenGL platform… Yet, especially since drivers have been designed that use the GPU for general-purpose (data-oriented) programming, it might be good to glance at what ARB defines.

And so one major difference that exists between main CPU registers, and GPU registers by default, is that each GPU register is organized into a 4-element vector of 32-bit, floating-point numbers. The GPU is designed at the hardware level, to be able to perform certain Math operations on the entire 4-element vector, in one step if need be. And within ARB, a notation exists by which the register name can be given a dot, and can then be followed by such pieces of text as:

• .xyz – Referring to the set of the first 3 elements (for scene or model coordinates),
• .uv – Referring to the set of the first 2 elements (for textures),
• .rst – Referring to the set of the first 3 elements – again (for 3D textures, volume-texture coordinates).

Also, notations exist in which the order of these elements gets switched around. Therefore, if the ARB code specifies this:

• r0.uv

It is specifying not only register (0), but the first 2, 32-bit, floating-point elements, within (r0), in their natural order.

This observation needs to be modified somewhat, before an accurate representation of modern GPU registers has been defined.

Firstly, I have written elsewhere on my blog, that as data passes from a Vertex Shader to a Fragment Shader, that data, which may contain texture coordinates by default, but which can really consist of virtually any combination of values, needs to be interpolated (:1), so that the interpolated value gets used by the FS, to render one pixel to the screen. This interpolation is carried out by specialized hardware in a GPU core group, and for that reason, some upward limit exists, on how many such registers can be interpolated.

(Updated 5/04/2019, 23h35 … )

## The PC Graphics Cards have specifically been made Memory-Addressable.

Please note that this posting does not describe

• Android GPUs, or
• Graphics Chips on PCs and Laptops, which use shared memory.

I am writing about the big graphics cards which power-users and gamers install into their PCs, which have a special bus-slot, and which cost as much money in themselves, as some computers cost.

The way those are organized physically, they possess one or more GPU, and DDR Graphics RAM, which loosely correspond to the CPU and RAM on the motherboard of your PC.

The GPU itself contains registers, which are essentially of two types:

• Per-core, and
• Shared

When coding shaders for 3D games, the GPU-registers do not fulfill the same function, as addresses in GRAM. The addresses in Graphics RAM typically store texture images, vertex arrays in their various formats, and index buffers, as well as frame-buffers for the output. In other words, the GRAM typically stores model-geometry and 2D or 3D images. The registers on the GPU are typically used as temporary storage-locations, for the work of shaders, which are again, separately loaded onto the GPUs, after they are compiled by the device-drivers.

A major feature which the designers of graphics cards have given them, is to extend the system memory of the PC onto the graphics card, in such a way that most of its memory actually has hardware-addresses as well.

This might not include the GPU-registers that are specific to one core, but I think does include shared GPU-registers.

## Observations about the Z-Buffer

Any game-engine currently on the market, uses the GPU of your computer – or your tablet – to do most of the work of rendering 3D scenes to a 2D screen, that also represents a virtual camera-position. There are two constants about this process which the game-engine defines, which are the closest distance at which fragments are allowed to be rendered, which I will name ‘clip-near’, and the maximum distance rendering is to be extended to, which I will name ‘clip-far’.

Therefore, what some users might expect, is that the Z-buffer, which determines the final outcome of the occlusion of the fragments, should contain a simple value from [ clip-near … clip-far ) . However, this is not truly how the Z-buffer works. And the reason why has to do with its origins. The Z-buffer belonging to the earliest rendering-hardware was only a 16-bit value, associated with each output pixel! And so a system needed to be developed that could use this extremely low resolution, according to which distances closer to (clip-near) would be spaced closer together, and according to which distance closer to (clip-far) could receive a smaller number of Z-values, since at that distance, the ability of the player even to distinguish differences in distances, was also diminished.

And so the way hardware-rendering began, was in this Z-buffer-value representing a fractional value between [ 0.0 … 1.0 ) . In other words, it was decided early-on, that these 16 bits followed a decimal point – even though they were ones and zeros – and that while (0) could be reached exactly, (1.0) could never be reached. And, because game-engine developers love to use 4×4 matrices, there could exist a matrix which defines conversion from the model-view matrix to the model-view-projection matrix, just so that a single matrix could minimally be sent to the graphics card for any one model to render, which would do all the necessary work, including to determine screen-positions and to determine Z-buffer-values.

The rasterizer is given a triangle to render, and rasterizes the 2D space between, to include all the pixels, and to interpolate all the parameters, according to an algorithm which does not need to be specialized, for one sort of parameter or another. The pixel-coordinates it generates are then sent to any Fragment Shader (in modern times), and three main reasons their number does not actually equal the number of screen-pixels are:

1. Occlusion obviates the need for many FS-calls.
2. Either Multi-Sampling or Super-Sampling tampers with the true number of fragments that need to be computed, and in the case of Multi-Sampling, in a non-constant way.
3. Alpha Entities“, whose textures have an Alpha channel in addition to R, G, B per texel, are translucent and do not write the Z-buffer, thereby requiring that Entities behind them additionally be rendered.

And so there exists a projection-matrix which I can suggest which will do this (vertex-related) work:


| 1.0 0.0 0.0 0.0 |
| 0.0 1.0 0.0 0.0 |
| 0.0 0.0 1.0 0.0 |
| 0.0 0.0  a   b  |

a = clip-far / (clip-far - clip-near)
b = - (clip-far * clip-near) / (clip-far - clip-near)




One main assumption I am making, is that a standard, 4-component position-vector is to be multiplied by this matrix, which has the components named X, Y, Z and W, and the (W) component of which equals (1.0), just as it should. But as you can see, now, the output-vector has a (W) component, which will no longer equal (1.0).

The other assumption which I am making here, is that the rasterizer will divide (W) by (Z), once for every output fragment. This last request is not unreasonable. In the real world, when objects move further away from us, they seem to get smaller in the distance. Well in the game-world, we can expect the same thing. Therefore by default, we would already be dividing (X) and (Y) by (Z), to arrive at screen-coordinates from ( -1.0 … +1.0 ), regardless of what the real-world distances from the camera were, that also led to (Z) values.

This gives the game-engine something which photographic cameras fail to achieve at wide angles: Flat Field. The position from the center of the screen, becomes the tangent-function, of a view-angle from the Z-coordinate.

Well, to divide (X) by (Z), and then to divide (Y) by (Z), would actually be two GPU-operations, where to scalar-multiply the entire output-vector, including (X, Y, Z, W) by (1 / Z), would only be one GPU-operation.

Well in the example above, as (Z -> clip-far), the operation would compute:



W = a * Z + b

= (clip-far * clip-far) / (clip-far - clip-near) -
(clip-far * clip-near) / (clip-far - clip-near)

= clip-far * (clip-far - clip-near) /
(clip-far - clip-near)

= clip-far

Therefore,
(W / Z) = (W / clip-far) = 1.0




And, when (Z == clip-near), the operation would compute:



W = a * Z + b

= (clip-far * clip-near) / (clip-far - clip-near) -
(clip-far * clip-near) / (clip-far - clip-near)

= 0.0




Of course I understand that a modern graphics card will have a 32-bit Z-buffer. But then all that needs to be done, for backwards-compatibility with the older system, is to receive a fractional value that has 32 bits instead of 16.

Now, there are two main derivations of this approach, which some game engines offer as features, but which can be achieved just by feeding in a slightly different set of constants to a matrix, which the GPU can work with in an unchanging way:

• Rendering to infinite world coordinates,
• Orthogonal camera-views.

The values that are needed for the same matrix will be:

## Why R2VB Should Not Simply be Deprecated

The designers of certain graphics cards / GPUs, have decided that Render-To-Vertex-Buffer is deprecated. In order to appreciate why I believe this to be a mistake, the reader first needs to know what R2VB is – or was.

The rendering pipeline of DirectX 9 versus DirectX 11 is somewhat different, yet also very similar, and DirectX 9 was extremely versatile, with a wide range of applications written that use it, while the fancier Dx 11 pipeline is more powerful, but has less of an established base of algorithms.

Dx 9 is approximated in OpenGL 2, while Dx 10 and Dx 11 are approximated in OpenGL 3(+) .