Observations about the Z-Buffer

Any game-engine currently on the market uses the GPU of your computer (or your tablet) to do most of the work of rendering 3D scenes to a 2D screen, as seen from a virtual camera-position. The game-engine defines two constants for this process: the closest distance at which fragments are allowed to be rendered, which I will name ‘clip-near’, and the maximum distance to which rendering is extended, which I will name ‘clip-far’.

Therefore, what some users might expect is that the Z-buffer, which determines the final outcome of the occlusion of the fragments, should contain a simple value from [ clip-near … clip-far ) . However, this is not truly how the Z-buffer works, and the reason has to do with its origins. The Z-buffer belonging to the earliest rendering-hardware held only a 16-bit value for each output pixel! And so a system needed to be developed that could make use of this extremely low resolution, according to which distances closer to (clip-near) would be spaced closer together, and according to which distances closer to (clip-far) could receive a smaller number of Z-values, since at that distance, the ability of the player even to distinguish differences in distance was also diminished.

And so the way hardware-rendering began was with this Z-buffer-value representing a fractional value in [ 0.0 … 1.0 ) . In other words, it was decided early on that these 16 bits followed a decimal point, even though they were ones and zeros, and that while (0.0) could be reached exactly, (1.0) could never be reached. And, because game-engine developers love to use 4×4 matrices, there could exist a matrix which defines the conversion from the model-view matrix to the model-view-projection matrix, so that only a single matrix needs to be sent to the graphics card for any one model to render, which would do all the necessary work, including determining screen-positions and Z-buffer-values.
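
The fixed-point idea can be sketched in a few lines of Python. This only illustrates the scheme as described here, not the behaviour of any particular GPU; the names `encode_depth` and `decode_depth` are hypothetical.

```python
BITS = 16
STEPS = 1 << BITS  # 65536 distinct bit patterns, all standing for values below 1.0


def encode_depth(fraction: float) -> int:
    """Store a depth fraction from [0.0, 1.0) as 16 bits after the point."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("depth fraction must lie in [0.0, 1.0)")
    return int(fraction * STEPS)


def decode_depth(stored: int) -> float:
    """Recover the fraction that a stored 16-bit pattern stands for."""
    return stored / STEPS


# With all 16 bits set, the value is 65535 / 65536, which is still below 1.0.
largest = decode_depth(STEPS - 1)
```

Under this reading, `encode_depth(0.0)` is representable exactly, while `encode_depth(1.0)` is rejected because no 16-bit pattern stands for it.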

The rasterizer is given a triangle to render, and rasterizes the 2D space between its vertices, to include all the pixels and to interpolate all the parameters, according to an algorithm which does not need to be specialized for one sort of parameter or another. The pixel-coordinates it generates are then sent to a Fragment Shader (in modern times), and three main reasons why their number does not actually equal the number of screen-pixels are:

  1. Occlusion obviates the need for many FS-calls.
  2. Either Multi-Sampling or Super-Sampling tampers with the true number of fragments that need to be computed, and in the case of Multi-Sampling, in a non-constant way.
  3. “Alpha Entities”, whose textures have an Alpha channel in addition to R, G, B per texel, are translucent and do not write to the Z-buffer, thereby requiring that Entities behind them additionally be rendered.

And so there exists a projection-matrix which I can suggest which will do this (vertex-related) work:


| 1.0 0.0 0.0 0.0 |
| 0.0 1.0 0.0 0.0 |
| 0.0 0.0 1.0 0.0 |
| 0.0 0.0  a   b  |

a = clip-far / (clip-far - clip-near)
b = - (clip-far * clip-near) / (clip-far - clip-near)


One main assumption I am making is that a standard, 4-component position-vector, with components named X, Y, Z and W, is to be multiplied by this matrix, and that its (W) component equals (1.0), just as it should. But as you can see, the output-vector now has a (W) component which will no longer equal (1.0).

The other assumption which I am making here is that the rasterizer will divide (W) by (Z), once for every output fragment. This last request is not unreasonable. In the real world, when objects move further away from us, they seem to get smaller in the distance. In the game-world, we can expect the same thing. Therefore by default, we would already be dividing (X) and (Y) by (Z), to arrive at screen-coordinates from ( -1.0 … +1.0 ), regardless of what the real-world distances from the camera were that led to the (Z) values.

This gives the game-engine something which photographic cameras fail to achieve at wide angles: Flat Field. The distance of a point from the center of the screen becomes the tangent of its view-angle, measured from the Z-axis.

Well, to divide (X) by (Z), and then to divide (Y) by (Z), would actually be two GPU-operations, whereas to scalar-multiply the entire output-vector (X, Y, Z, W) by (1 / Z) would only be one GPU-operation.
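
As a rough sketch of these two steps, the following Python applies the suggested matrix and then scales the whole output-vector by (1 / Z). The helper names (`mat_vec`, `make_projection`, `project`) and the sample values are assumptions made for illustration; positive Z is taken to point away from the camera, as in the rest of this posting.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def make_projection(clip_near, clip_far):
    """The hypothetical projection matrix, with a and b placed in row 4."""
    a = clip_far / (clip_far - clip_near)
    b = -(clip_far * clip_near) / (clip_far - clip_near)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, a, b],
    ]


def project(x, y, z, clip_near, clip_far):
    out = mat_vec(make_projection(clip_near, clip_far), [x, y, z, 1.0])
    inv_z = 1.0 / out[2]             # one reciprocal ...
    return [c * inv_z for c in out]  # ... then one scalar multiply of (X, Y, Z, W)


# The same direction, 30 degrees off-axis, sampled at two different distances.
angle = math.radians(30.0)
near_point = project(5.0 * math.tan(angle), 0.0, 5.0, 1.0, 100.0)
far_point = project(50.0 * math.tan(angle), 0.0, 50.0, 1.0, 100.0)
```

Both points land on the same screen X, equal to the tangent of the 30-degree view-angle, while the farther one receives the larger depth fraction in its (W) component.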

Well in the example above, as (Z -> clip-far), the operation would compute:


W = a * Z + b

  = (clip-far * clip-far) / (clip-far - clip-near) -
    (clip-far * clip-near) / (clip-far - clip-near)

  = clip-far * (clip-far - clip-near) /
            (clip-far - clip-near)

  = clip-far

  (W / Z) = (W / clip-far) = 1.0


And, when (Z == clip-near), the operation would compute:


W = a * Z + b

  = (clip-far * clip-near) / (clip-far - clip-near) -
    (clip-far * clip-near) / (clip-far - clip-near)

  = 0.0
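
The two endpoint results can be checked numerically, and the same function shows how unevenly the Z-values are spread in between. This is a minimal sketch; `depth_fraction` and the sample distances are hypothetical names and values chosen for illustration.

```python
def depth_fraction(z, clip_near, clip_far):
    """Evaluate (W / Z) for a view-space distance z, using a and b as defined above."""
    a = clip_far / (clip_far - clip_near)
    b = -(clip_far * clip_near) / (clip_far - clip_near)
    return (a * z + b) / z


near, far = 1.0, 100.0
samples = {z: depth_fraction(z, near, far) for z in (1.0, 2.0, 10.0, 50.0, 100.0)}
```

With clip-near = 1 and clip-far = 100, `samples[2.0]` comes out just above 0.5, so roughly half of all available Z-values are spent on the first unit of distance beyond clip-near, which is the kind of spacing that the early 16-bit buffers needed.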


Of course I understand that a modern graphics card will have a 32-bit Z-buffer. But then all that needs to be done, for backwards-compatibility with the older system, is to receive a fractional value that has 32 bits instead of 16.

Now, there are two main variations of this approach, which some game engines offer as features, but which can be achieved just by feeding a slightly different set of constants into a matrix, which the GPU can work with in an unchanging way:

  • Rendering to infinite world coordinates,
  • Orthogonal camera-views.

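For rendering to infinite world coordinates, the values that are needed for the same matrix will be:


  a = 1.0
  b = - clip-near


For an orthogonal camera-view, rows 3 and 4 of the matrix are arranged differently:


| 1.0 0.0 0.0 0.0 |
| 0.0 1.0 0.0 0.0 |
| 0.0 0.0 0.0  a  |
| 0.0 0.0 1.0  b  |

  a = Notional Distance
  b = - (0.5 * Notional Distance)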
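
Both variations can be tried out with the same divide-by-(Z) step. The sketch below reads the first pair of constants as the infinite-distance case and the second matrix as the orthogonal case; that reading, and all of the function names, are assumptions made for illustration.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def divide_by_z(v):
    inv_z = 1.0 / v[2]
    return [c * inv_z for c in v]


def infinite_projection(clip_near):
    # Same layout as the first matrix, with a = 1.0 and b = -clip-near.
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0, -clip_near],
    ]


def orthogonal_projection(notional_distance):
    a = notional_distance
    b = -0.5 * notional_distance
    # Output Z becomes the constant a, so the later division no longer shrinks X and Y.
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, a],
        [0.0, 0.0, 1.0, b],
    ]


very_far = divide_by_z(mat_vec(infinite_projection(1.0), [0.0, 0.0, 1.0e6, 1.0]))
ortho_a = divide_by_z(mat_vec(orthogonal_projection(10.0), [5.0, 0.0, 5.0, 1.0]))
ortho_b = divide_by_z(mat_vec(orthogonal_projection(10.0), [5.0, 0.0, 12.0, 1.0]))
```

In the infinite case the depth fraction approaches (1.0) without reaching it, however far away the point is. In the orthogonal case `ortho_a` and `ortho_b` share the same screen X even though their distances differ, and only their depth fractions change.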


There is a huge caveat in trying to apply this theory, which is not due to any consistency problems within the theory itself. The rasterizer, which forms part of the GPU core-groups, is not reprogrammable, but must work for all game-engine designers. And what makes most sense to humans does not always reflect how implementations work at the machine-level.

Because most of the interpolations which are carried out at the pre-FS stage, are corrected for perspective, they need to be given the true value of (Z):


U' = Blend(U1 / Z1, U2 / Z2)
Z' = Blend(1  / Z1, 1  / Z2)

U'' = U' / Z'


(Edit 01/12/2017 : In reality, this interpolation only requires (1 / Z) . But as I wrote above, this was already computed – once per vertex.)
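
The three formulas above can be sketched as follows, taking Blend to be a plain linear mix between two vertices. The function names and the sample values are hypothetical.

```python
def blend(p, q, t):
    """Linear mix between two per-vertex values, with t in [0.0, 1.0]."""
    return (1.0 - t) * p + t * q


def interpolate_perspective(u1, z1, u2, z2, t):
    """Perspective-correct interpolation of a parameter U between two vertices."""
    u_over_z = blend(u1 / z1, u2 / z2, t)      # U'
    one_over_z = blend(1.0 / z1, 1.0 / z2, t)  # Z'
    return u_over_z / one_over_z               # U''


# Halfway across the screen between a vertex at Z = 1 and a vertex at Z = 3.
corrected = interpolate_perspective(0.0, 1.0, 1.0, 3.0, 0.5)
naive = blend(0.0, 1.0, 0.5)
```

Here `corrected` is 0.25 while `naive` is 0.5: the screen midpoint lies much closer to the near vertex in the scene itself, which is why the true (Z) values have to be available to the interpolator.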

So according to human thinking, it would make most sense if the rasterizer divided the output-vector by (Z), and used the unmodified (Z) at the same time for its interpolations. But according to how registers work, this actually only makes the second-most amount of sense.

This is because, when working with transformation matrices instead of only with 3×3 rotation matrices, it was always our assumption that (W) belonging to each position-vector would be equal to (1.0), and that element (4,4) of the matrices contained (1.0), thus preserving this value for (W), and making the 4th column of the more-normal matrices an additive vector, since it always gets multiplied by (W) before being added to the output vector. Column (4) is therefore the displacement, while the inner 3×3 performs any rotation. Thus, if a GPU was to divide any such vector by its own (W), doing so would have no effect, and would do no damage when instructed at inconvenient times.
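
A small sketch of this property, using a pure displacement matrix with made-up values:

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


# Inner 3x3 is the identity (no rotation); column 4 holds the displacement.
translation = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, -2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]

moved = mat_vec(translation, [1.0, 1.0, 1.0, 1.0])
divided = [c / moved[3] for c in moved]  # dividing by W = 1.0 changes nothing
```

Because element (4,4) is (1.0) and the rest of row 4 is zero, `moved` keeps (W) at (1.0), and `divided` is identical to `moved`.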

But, once we are computing the projection matrix, suddenly (W) does not remain equal to (1.0). So now, dividing the output by (W), would have an effect.

And so according to Linear Algebra, the hypothetical system above could have the meaning of (Z) and (W) in the output vector reversed. Switching them would simply be a question of switching the 3rd and 4th rows of each matrix, and our GPU would be allowed at any time, to divide by (W).

Also, (W) happens to be the 4th component, and performing special operations through it makes most sense according to register-logic.
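
To illustrate the switch, here is the same hypothetical matrix with rows 3 and 4 exchanged, followed by a division by (W) instead of by (Z). The names are again assumptions made for this sketch.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def make_projection_w(clip_near, clip_far):
    a = clip_far / (clip_far - clip_near)
    b = -(clip_far * clip_near) / (clip_far - clip_near)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, a, b],      # the depth numerator now lands in output Z
        [0.0, 0.0, 1.0, 0.0],  # the view distance now lands in output W
    ]


def divide_by_w(v):
    inv_w = 1.0 / v[3]
    return [c * inv_w for c in v]


at_near = divide_by_w(mat_vec(make_projection_w(1.0, 100.0), [0.0, 0.0, 1.0, 1.0]))
at_far = divide_by_w(mat_vec(make_projection_w(1.0, 100.0), [0.0, 0.0, 100.0, 1.0]))
```

The depth fraction is now found in the (Z) component after the division, running from 0.0 at clip-near towards 1.0 at clip-far, exactly as (W / Z) did before the rows were switched.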

There are many examples in which game-designers are told to sample a texture image that was once render-output, and to use it as input. In such examples, there is a matrix which we can simply use to account for the fact that texture-coordinates go from [ 0.0 … 1.0 ) , whereas render-output went from ( -1.0 … +1.0 ) :


| 0.5 0.0 0.0 0.5 |
| 0.0 0.5 0.0 0.5 |
| 0.0 0.0 1.0 0.0 |
| 0.0 0.0 0.0 1.0 |


The game-designer is then simply told to apply the model-view-matrix, then to apply the above, then to divide his output vector by its own (-Z), and then to cast everything down to a ‘Vec2’. The reason he can get away with that is that he does not need to concern himself with how the Z-buffer worked when the Render-To-Texture was performed. He only needs to know at what U,V coordinate-pair to sample, which the earlier stage has generated.
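
The remapping which this matrix performs can be seen in isolation by feeding it coordinates that are assumed to be already normalized to the ( -1.0 … +1.0 ) range. `NDC_TO_TEX` and `ndc_to_uv` are hypothetical names.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


NDC_TO_TEX = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def ndc_to_uv(x, y):
    """Map render-output coordinates onto U,V texture coordinates."""
    u, v, _, _ = mat_vec(NDC_TO_TEX, [x, y, 0.0, 1.0])
    return (u, v)
```

For instance, `ndc_to_uv(-1.0, -1.0)` gives `(0.0, 0.0)` and `ndc_to_uv(0.0, 0.0)` gives `(0.5, 0.5)`, the center of the texture.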

But if we are working in the other direction, and trying to produce screen-output coordinates that somehow started as U,V texture coordinates, we run into the problem of having to know what system was in fact used by the engine-designers, and finally, by the GPU-architects. Because then, whatever position vector the game-designer outputs from his Vertex Shader will be put through the exact mechanics that the engine uses, not what he prefers, and any Z-buffer will still be applied to it after his Vertex Shader is done. So it would be tempting to suggest that the following matrix might work:


| 2.0 0.0 0.0 -1.0 |
| 0.0 2.0 0.0 -1.0 |
| 0.0 0.0 1.0  0.0 |
| 0.0 0.0 1.0  0.0 |


I can personally guarantee to the reader, that if he does use this matrix, his rendering system will produce no output at all !

The reason for this will be, the fact that either (Z) or (W) is going to be used to set up the Z-buffer, and while (0.0) will be rendered just fine, (+1.0) does not belong to the allowed range. This is because neither our 16-bit nor our 32-bit fraction can ever equal (1.0) !

So for each fragment, (Z == clip-near) will render just fine, but (Z == clip-far) will get clipped, as being just outside the range of allowed distances.

Depending on which system is being used, either row 3 or row 4 of this matrix would need to be set to something else, at the very minimum.
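
A quick check makes the failure visible. Since rows 3 and 4 of the tempting matrix are identical, the ratio between output (W) and output (Z) is the same whichever of the two systems is in use. The names below are hypothetical, and the range test simply restates the [ 0.0 … 1.0 ) rule from earlier in this posting.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


TEMPTING = [
    [2.0, 0.0, 0.0, -1.0],
    [0.0, 2.0, 0.0, -1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]


def depth_fraction(z):
    out = mat_vec(TEMPTING, [0.5, 0.5, z, 1.0])
    return out[3] / out[2]  # rows 3 and 4 match, so this is z / z


def is_renderable(depth):
    return 0.0 <= depth < 1.0


depths = [depth_fraction(z) for z in (0.5, 1.0, 7.0)]
```

Every entry of `depths` is exactly 1.0, so `is_renderable` rejects every fragment, whatever the input (Z) was.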


Note: In the examples for projection matrices I have assumed a viewing arc of +/-45°. Because game-engines often give the player other viewing arcs (i.e. greater or less zoom, plus screens with different aspect ratios), the values in rows 1 and 2 are made to scale (X) and (Y) trivially, with respect to (Z). When reusing RTT output as input, the assumption is usually made nevertheless that the RTT camera-constants defined a +/-45° viewing arc.
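
One common way to obtain those two scale factors is sketched below. It is an assumption on my part, not something any specific engine is known to do here: the scale is taken to be the cotangent of the half-angle, with the horizontal one additionally divided by a width-to-height aspect ratio.

```python
import math


def xy_scales(half_angle_degrees, aspect_ratio=1.0):
    """Return the values for element (1,1) and element (2,2) of the matrix."""
    scale_y = 1.0 / math.tan(math.radians(half_angle_degrees))
    scale_x = scale_y / aspect_ratio
    return scale_x, scale_y
```

With a half-angle of 45 degrees and a square screen, both factors come out as (1.0) to within rounding, which matches the matrices shown in this posting.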

Also, there exists a completely different definition for what a projection matrix is, according to which this is any sort of matrix for computing a linear transformation, but which has a determinant of (0), thus collapsing 3D positions onto a 2D surface. That type of matrix is mathematically more-correct and also has its uses, especially if simple ways are sought to compute shadows, stencils, etc.

However the matrix I have defined in this posting is meant for more-specialized use, involving a 3D perspective projection, and no longer involving linear operations on coordinates. A division is expected here.
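
For contrast, the determinant-zero kind of projection matrix can be as simple as the following sketch, which flattens positions onto the plane Y = 0. The example plane and the names are chosen only for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def mat_vec3(m, v):
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


# Keeps X and Z, discards Y: every position is dropped straight onto the ground.
FLATTEN_TO_GROUND = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

shadow = mat_vec3(FLATTEN_TO_GROUND, [3.0, 4.0, 5.0])
```

This stays a purely linear operation with no division, and applying it a second time to `shadow` changes nothing further.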

(Edited 4/21/2019, 11h10 : )

Further, some but not all rendering engines compute a naked projection matrix so that the content-designer can use it. This does not refer to the view matrix, nor to the model-view matrix, but would distinctly be stated as the projection matrix, which corresponds to the current view-camera constants, and by which the model-view is already multiplied to arrive at the model-view-projection matrix. If your rendering system has this feature, and if the texture was previously rendered to with a 90°x90° viewing arc, then you can multiply your texture coordinates by that, to make sure that everything is done correctly, after multiplying by:

(End of edit, 4/21/2019, 11h10. )


//  Engine-supplied matProjection
//  Engine-supplied clip-near (written clip_near below)

//  In world coordinate units:
float small_distance;
Vec2 TexCoords;
Vec4 ViewPosition
  = Vec4(Vec3(TexCoords, 1.0) *
    (clip_near + small_distance), 1.0);

Mat4x4 Tex2Screen =
{{ 2.0, 0.0,  0.0, -1.0 },
 { 0.0, 2.0,  0.0, -1.0 },
 { 0.0, 0.0, -1.0,  0.0 },
 { 0.0, 0.0,  0.0,  1.0 }} ;

Position = ViewPosition *
  Tex2Screen * matProjection;


This would go in your Vertex Shader, in order to convert from U,V texture-coordinates to compatible screen-coordinate outputs.

Also, practical hardware-graphics works such that positive view-Z faces towards the player, while more-negative view-Z is farther out, in front of the camera. Hence, visible scene-details end up with negative values for Z. This needs to be considered when dividing anything by (Z), but I omitted it for the moment to make this posting clearer. In the hypothetical examples of camera-projection matrices I provided, this can be solved by negating column 3. My RTT examples have already taken care of this.
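
As a sketch of what negating column 3 does to the first hypothetical matrix, the following takes a view-space position with negative Z and produces the same results as before. The helper names are assumptions made for illustration.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def make_projection_negated(clip_near, clip_far):
    a = clip_far / (clip_far - clip_near)
    b = -(clip_far * clip_near) / (clip_far - clip_near)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, -1.0, 0.0],  # column 3 negated: output Z is the positive distance
        [0.0, 0.0, -a, b],      # column 3 negated here as well
    ]


def project(x, y, view_z, clip_near, clip_far):
    out = mat_vec(make_projection_negated(clip_near, clip_far), [x, y, view_z, 1.0])
    inv_z = 1.0 / out[2]
    return [c * inv_z for c in out]


at_near = project(0.0, 0.0, -1.0, 1.0, 100.0)
at_far = project(0.0, 0.0, -100.0, 1.0, 100.0)
in_front = project(2.0, 0.0, -4.0, 1.0, 100.0)
```

A point at view-Z = -4 still lands at screen X = 0.5, and the depth fraction still runs from 0.0 at clip-near towards 1.0 at clip-far.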

