## Computers and Floating-Point Numbers: In Layman’s Terms

There exist WiKiPedia pages that do explain how single- and double-precision floating-point numbers are formatted – both used by computers – but which are so heavily bogged down with tedious details, that the reader would need to be a Computer Scientist already, to be able to understand them. And in that case, those articles can act as a quick reference. But they would do little, to explain the subject to laypeople. The mere fact that single- and double-precision numbers are explained on the WiKi in two separate articles, could already act as a deterrent for most people to try understanding the basic concepts.

I will try to explain this subject in basic terms.

While computers store data in bits that are organized into words – those words either being 32-bit or 64-bit words on most popular architectures – even by the CPU, those words are interpreted as representing numbers in different ways. One way is either as signed or as unsigned ‘integers’, which is another way of saying ‘whole numbers’. And another is either as 32-bit or as 64-bit floating-point numbers. Obviously, the floating-point numbers are used to express fractions, as well as very large or very small values, as well as fractions which are accurate to a high number of digits ‘after the decimal point’.

A CPU must be given an exact opcode, to perform Math on the different representations of numbers, where what type of number they are, is already reflected at compile-time, by which opcode has been encoded, to use the numbers. So obviously, some non-trivial Math goes into defining, how these different number-formats work. I’m to focus on the two most-popular floating-point formats.

Understanding how floating-point numbers work on computers, first requires understanding how Scientists use Scientific Notation. In the Engineering world, as well as in the Household, what most people are used to, is that the number of digits a number has to the left of the decimal-point, be grouped in threes, and that the magnitude of the number is expressed with prefixes such as kilo- , mega- , giga- , tera- , peta- , or, going in the other direction, with milli- , micro- , nano- , pico- , femto- or atto- .

In Science, this notation is so encumbering that the Scientists try to avoid it. What Scientists will do, is state a field of decimal digits, which will always begin with a single (non-zero) digit, followed by the decimal point, followed by an arbitrary number of fractional digits, followed by a multiplication-symbol, followed by the base of 10 raised to either a positive or a negative power. This power also states, how many places further right, or how many places further left, the reader should visualize the decimal point. For example, Avogadro’s number is expressed as

6.022 × 1023

IF we are told to limit our precision to 3 places after the decimal point. If we were told to give 6 places behind the decimal point, we would give it as

6.022141 × 1023

What this means, is that relative to where it is written, the decimal point would need to be shifted to the right 23 places, to arrive at a number, that has the correct order of magnitude.

When I went to High-School, we were drilled to use this notation ad nauseum, so that even if it seemed ridiculous, we would answer in our sleep that to express how much ‘a dozen’ was, using Scientific Notation, yielded

1.2 × 10+1

More importantly, Scientists feel comfortable using the format, because they can express such ideas as ‘how many atoms of regular matter are thought to exist in the known universe’, as long as they were not ashamed to write a ridiculous power of ten:

1 × 1080

Or, how many stars are thought to exist in our galaxy:

( 2 × 1011 … 4 × 1011 )

The latter of which should read, ‘from 200 billion to 400 billion’.

When Computing started, its Scientists had the idea to adapt Scientific Notation to the Binary Number System. What they did was to break down the available word-sizes, essentially, into three fields:

1. A so-called “Significand”, which would correspond to the Mantissa,
2. An Exponent,
3. A Sign-bit for the entire number.

The main difference to Scientific Notation however was, that floating-point numbers on computers, would do everything in powers of two, rather than in powers of ten.

A standard, 32-bit floating-point number reserves 23 bits for the fraction, and 8 bits for the exponent of 2, while a standard, 64-bit floating-point number reserves 52 bits for the fraction, and 11 bits for the exponent of 2. This assignment is arbitrary, but sometimes necessary to know, for implementing certain types of subroutines or hardware.

But one thing that works as well in binary as it does in decimal, is that bits could occur after a point, as easily as they could occur before a point.

Hence, this would be the number 3 in binary:

11

While this would be the fraction 3/4 in binary:

0.11

(Updated 11/09/2019 : )

Any game-engine currently on the market, uses the GPU of your computer – or your tablet – to do most of the work of rendering 3D scenes to a 2D screen, that also represents a virtual camera-position. There are two constants about this process which the game-engine defines, which are the closest distance at which fragments are allowed to be rendered, which I will name ‘clip-near’, and the maximum distance rendering is to be extended to, which I will name ‘clip-far’.

Therefore, what some users might expect, is that the Z-buffer, which determines the final outcome of the occlusion of the fragments, should contain a simple value from [ clip-near … clip-far ) . However, this is not truly how the Z-buffer works. And the reason why has to do with its origins. The Z-buffer belonging to the earliest rendering-hardware was only a 16-bit value, associated with each output pixel! And so a system needed to be developed that could use this extremely low resolution, according to which distances closer to (clip-near) would be spaced closer together, and according to which distance closer to (clip-far) could receive a smaller number of Z-values, since at that distance, the ability of the player even to distinguish differences in distances, was also diminished.

And so the way hardware-rendering began, was in this Z-buffer-value representing a fractional value between [ 0.0 … 1.0 ) . In other words, it was decided early-on, that these 16 bits followed a decimal point – even though they were ones and zeros – and that while (0) could be reached exactly, (1.0) could never be reached. And, because game-engine developers love to use 4×4 matrices, there could exist a matrix which defines conversion from the model-view matrix to the model-view-projection matrix, just so that a single matrix could minimally be sent to the graphics card for any one model to render, which would do all the necessary work, including to determine screen-positions and to determine Z-buffer-values.

The rasterizer is given a triangle to render, and rasterizes the 2D space between, to include all the pixels, and to interpolate all the parameters, according to an algorithm which does not need to be specialized, for one sort of parameter or another. The pixel-coordinates it generates are then sent to any Fragment Shader (in modern times), and three main reasons their number does not actually equal the number of screen-pixels are:

1. Occlusion obviates the need for many FS-calls.
2. Either Multi-Sampling or Super-Sampling tampers with the true number of fragments that need to be computed, and in the case of Multi-Sampling, in a non-constant way.
3. Alpha Entities“, whose textures have an Alpha channel in addition to R, G, B per texel, are translucent and do not write the Z-buffer, thereby requiring that Entities behind them additionally be rendered.

And so there exists a projection-matrix which I can suggest which will do this (vertex-related) work:


| 1.0 0.0 0.0 0.0 |
| 0.0 1.0 0.0 0.0 |
| 0.0 0.0 1.0 0.0 |
| 0.0 0.0  a   b  |

a = clip-far / (clip-far - clip-near)
b = - (clip-far * clip-near) / (clip-far - clip-near)




One main assumption I am making, is that a standard, 4-component position-vector is to be multiplied by this matrix, which has the components named X, Y, Z and W, and the (W) component of which equals (1.0), just as it should. But as you can see, now, the output-vector has a (W) component, which will no longer equal (1.0).

The other assumption which I am making here, is that the rasterizer will divide (W) by (Z), once for every output fragment. This last request is not unreasonable. In the real world, when objects move further away from us, they seem to get smaller in the distance. Well in the game-world, we can expect the same thing. Therefore by default, we would already be dividing (X) and (Y) by (Z), to arrive at screen-coordinates from ( -1.0 … +1.0 ), regardless of what the real-world distances from the camera were, that also led to (Z) values.

This gives the game-engine something which photographic cameras fail to achieve at wide angles: Flat Field. The position from the center of the screen, becomes the tangent-function, of a view-angle from the Z-coordinate.

Well, to divide (X) by (Z), and then to divide (Y) by (Z), would actually be two GPU-operations, where to scalar-multiply the entire output-vector, including (X, Y, Z, W) by (1 / Z), would only be one GPU-operation.

Well in the example above, as (Z -> clip-far), the operation would compute:



W = a * Z + b

= (clip-far * clip-far) / (clip-far - clip-near) -
(clip-far * clip-near) / (clip-far - clip-near)

= clip-far * (clip-far - clip-near) /
(clip-far - clip-near)

= clip-far

Therefore,
(W / Z) = (W / clip-far) = 1.0




And, when (Z == clip-near), the operation would compute:



W = a * Z + b

= (clip-far * clip-near) / (clip-far - clip-near) -
(clip-far * clip-near) / (clip-far - clip-near)

= 0.0




Of course I understand that a modern graphics card will have a 32-bit Z-buffer. But then all that needs to be done, for backwards-compatibility with the older system, is to receive a fractional value that has 32 bits instead of 16.

Now, there are two main derivations of this approach, which some game engines offer as features, but which can be achieved just by feeding in a slightly different set of constants to a matrix, which the GPU can work with in an unchanging way:

• Rendering to infinite world coordinates,
• Orthogonal camera-views.

The values that are needed for the same matrix will be: