Computers and Floating-Point Numbers: In Layman’s Terms

There are WiKiPedia pages that explain how single- and double-precision floating-point numbers are formatted, both of which are used by computers, but those pages are so heavily bogged down with tedious details that the reader would need to be a Computer Scientist already to understand them. For such a reader, the articles can act as a quick reference, but they do little to explain the subject to laypeople. The mere fact that single- and double-precision numbers are explained on the WiKi in two separate articles could already deter most people from trying to understand the basic concepts.

I will try to explain this subject in basic terms.

While computers store data in bits that are organized into words, those words being either 32-bit or 64-bit words on most popular architectures, even the CPU interprets those words as representing numbers in different ways. One way is as signed or unsigned ‘integers’, which is another way of saying ‘whole numbers’. Another is as 32-bit or 64-bit floating-point numbers. The floating-point numbers are used to express fractions, very large or very small values, and values which are accurate to a high number of digits ‘after the decimal point’.

A CPU must be given an exact opcode to perform Math on the different representations of numbers, so that the type of each number is already fixed at compile-time, by which opcode has been encoded to use it. Some non-trivial Math goes into defining how these different number-formats work. I’m going to focus on the two most-popular floating-point formats.

Understanding how floating-point numbers work on computers, first requires understanding how Scientists use Scientific Notation. In the Engineering world, as well as in the Household, what most people are used to, is that the number of digits a number has to the left of the decimal-point, be grouped in threes, and that the magnitude of the number is expressed with prefixes such as kilo- , mega- , giga- , tera- , peta- , or, going in the other direction, with milli- , micro- , nano- , pico- , femto- or atto- .

In Science, this notation is so encumbering that the Scientists try to avoid it. What Scientists will do, is state a field of decimal digits, which will always begin with a single (non-zero) digit, followed by the decimal point, followed by an arbitrary number of fractional digits, followed by a multiplication-symbol, followed by the base of 10 raised to either a positive or a negative power. This power also states, how many places further right, or how many places further left, the reader should visualize the decimal point. For example, Avogadro’s number is expressed as

6.022 × 10^23

if we are told to limit our precision to 3 places after the decimal point. If we were told to give 6 places behind the decimal point, we would give it as

6.022141 × 10^23

What this means, is that relative to where it is written, the decimal point would need to be shifted to the right 23 places, to arrive at a number, that has the correct order of magnitude.

When I went to High-School, we were drilled to use this notation ad nauseam, so that even if it seemed ridiculous, we would answer in our sleep that to express how much ‘a dozen’ was, using Scientific Notation, yielded

1.2 × 10^+1

More importantly, Scientists feel comfortable using the format, because they can express such ideas as ‘how many atoms of regular matter are thought to exist in the known universe’, as long as they were not ashamed to write a ridiculous power of ten:

1 × 10^80

Or, how many stars are thought to exist in our galaxy:

( 2 × 10^11 … 4 × 10^11 )

The latter of which should read, ‘from 200 billion to 400 billion’.

When Computing started, its Scientists had the idea to adapt Scientific Notation to the Binary Number System. What they did was to break down the available word-sizes, essentially, into three fields:

  1. A so-called “Significand”, which would correspond to the Mantissa,
  2. An Exponent,
  3. A Sign-bit for the entire number.

The main difference to Scientific Notation however was, that floating-point numbers on computers, would do everything in powers of two, rather than in powers of ten.

A standard, 32-bit floating-point number reserves 23 bits for the fraction, and 8 bits for the exponent of 2, while a standard, 64-bit floating-point number reserves 52 bits for the fraction, and 11 bits for the exponent of 2. This assignment is arbitrary, but sometimes necessary to know, for implementing certain types of subroutines or hardware.

But one thing that works as well in binary as it does in decimal, is that bits could occur after a point, as easily as they could occur before a point.

Hence, this would be the number 3 in binary:

11

While this would be the fraction 3/4 in binary:

0.11

(Updated 11/13/2017 : )

Thus, if the fractional part of a 32-bit floating-point number was to store 23 binary digits, equivalent to standard expectations in decimal form, then a bit of weirdness that needs to be taken care of, is that in effect, there would also be 23 different possible ways to store the number (1). Each of them would have a single bit equal to (1), all the other bits equal to (0), and the required exponent that repositions the non-zero bit, as required, to yield a product of (1).

Such oddities do not exist in Computing for very long, because at the very least, they’d lead to a decrease in efficiency. And so a little trick which takes place in Computing, is that an unstated bit of (1) is assumed to precede the stored fraction. That way, there is exactly one way to store the value

1.0 × 10^0

That being

0  01111111  0000 0000 0000 0000 0000 000

The way the exponent is stored reflects the fact that Computer Science wants the format to work well when these numbers are multiplied. This means that exponents must be easy to add. And so in principle, the exponent could be stored in two’s complement. But in practice, it actually gets stored as an integer, the value of which is offset, which would almost be the same thing as two’s complement, except for the fact that the offset can be arbitrary, and is chosen to maximize efficiency. Typically, either

1000 0000 (in the 8-bit exponent field)

or

100 0000 0000 (in the 11-bit exponent field)

are used to denote 2^+1.

But one fact which programmers must deal with every time they write source code, that uses floating-point Math, is that in the source code, they write the constants in Base-10. While the compiler can do the work of translating between binary and decimal forms, the programmer must at least know what his available ranges are. And to do that, there exist two coarse approximations, of how binary numbers can be visualized in decimal:

  1. 2^10 == 1024 ~= 1000
  2. 4 bits ~= 1 decimal digit.

Hence, if we knew that the field of bits of actual precision were equal to 24, then we could estimate that the number of decimal digits this would give us, is a disappointing 6 decimal places.

And if we could say that the exponent ranged from -127 to +127, then this would roughly correspond to the powers of ten

10^-32 … 10^+32

But because the analogy is only approximate, the actual values that result just from these exponents are

1.2 × 10^-38 … 1.7 × 10^+38

(Edited 11/08/2017 , Commented on 11/11/2017 … )


So obviously, this “single-precision” format needed replacement early in the History of Computing, with a longer format, and so the 64-bit format, which is also referred to as “double-precision”, is strongly favored today, because its field of significant bits is approximately

53 / 4 ~= 13 decimal digits (worked out exactly, between 15 and 16)

and its powers of ten are approximately

+/- 1000  /  4  ~= 250 (worked out exactly, about +/- 308)

There exists the unanswered question so far, as to how one would actually store the number zero, since what I have written so far would imply, that the assumed digit of (1) needs to be right-shifted an infinite number of times, so that its ‘real value’ diminishes towards zero.

The convention that gets used is, that the most-negative exponent, which would normally signal the smallest order of magnitude that can be represented, actually signals either that the numeral zero is meant, or that some anomalous result has been obtained. And the highest-possible exponent effectively, signals an overflow.


According to the way I was taught Computing, the CPU was able to distinguish between an underflow, and a valid representation of the number (0). The latter (did) occur when all the fractional bits were set to zeroes.

But according to the way I was taught Computing, there was no analogous way to distinguish between an overflow, and some other, corresponding, ‘meaningful result’. According to the WiKiPedia today, if the fractional bits are all set to zeroes, and the exponent is its maximum-possible value, this actually signals ‘the symbol infinity’.

If that were true, then I’d expect that this symbol behaves, exactly as the Algebraic symbol would behave, if nothing else was known, than the values passed to one operation – i.e., if no further context was given.

This means that operations between infinity and ‘ordinary numbers’ would have predictable results, while operations between opposing infinities would continue to result in error-messages, because according to Math, those can only be resolved, if at all, using Computer Algebra Systems, and if given an entire system of equations. A CPU is generally only fed the contents of a small number of registers, usually two, and not an entire equation. Based only on those two terms, the answer remains undefined.


It has always been possible, for the CPU to signal an error due to numeric values it was instructed to perform an operation on, and for that error message not to have been an overflow, nor an underflow. I.e., traditionally, dividing by zero simply resulted in one such ‘illegal operation’ message. It did not result in an ‘overflow’ message, because the CPU would not attempt to compute its value.

If the symbol ‘infinity’ were recognized by the CPU, then those sorts of messages would become less frequent, although I’m not sure how useful it would be, if code which was expected to return a numeric value, was allowed to return ‘infinity’ instead.

But then, actually dividing by zero would result in infinity, and in code that for the time being, continues to run. But, trying to multiply infinity by zero, would finally result in an ‘illegal operation’.


I find this version of ‘the concept of infinity’ to be a misguided effort on the part of the WiKiPedia, because Infinity is not a number; it’s just an Algebraic symbol.

(Erratum 11/08/2017 : )

The fact that Infinity is one out of many existing Algebraic symbols, caused me to misinterpret, what the latest IEEE standard means, when they write, that the bits of a floating-point number can stand for “Not A Number”. Those bits would include the most-positive exponent possible, plus a field of fractional bits, not all of which are zeroes.

According to the IEEE standard, this is equivalent to having an error-code, which carries forward through a series of operations.

Apparently, one main reason for which the IEEE did this, was the fact that for the CPU to throw an exception is a big problem in massively parallel computing, and in fact largely unsupported there. Instead, a core can write to its output that an error has taken place, and keep running. If one of the two input operands already states this condition, the next output is also set to this condition. And finally, when some final output is examined and contains this code, humans or more code can analyze where in the attempted computations the problem took place.

Therefore, according to the new guidelines:

  1. Underflows are not supposed to take place anymore. Instead, what used to lead to underflows, now leads to Signed Zero.
  2. What used to lead to overflows, now leads to Not A Number.
  3. Other operations that cannot be resolved, now lead to Not A Number.
  4. If the exponent is at its most-negative, but the fractional bits are not all zeroes, then those fractional bits now represent a Denormalized Number. This means that there is no longer a preceding, unstated (1), so that leading zeroes become possible within the mantissa, which in turn allows even-smaller values to be represented.

I suppose that one question this leaves unanswered, concerns the fact that to subtract one floating-point number from another, can sometimes lead to an apparent zero, and that the representation now prefers to know whether this leads to a positive or a negative zero.

This scenario is aggravated by the fact that by default, each floating-point number has a non-zero error-margin, so that even if the known bits did cancel, we could not assume safely that the real value left, should actually become zero. Instead, the real value which the operation fails to find, could be another real number, several orders of magnitude smaller than either operand, but a number that could nevertheless be expressed accurately by itself, in the same format, if it had been found.

If this was treated as Not A Number, then innocent Math could lead to error messages, since subtraction may take place naively. According to the WiKi, this is resolved as Positive Zero, unless rounding takes place negatively, in which case it gets resolved as Negative Zero. IMHO, this should be resolved as Negative Zero, if the rounding that led to the zero was positive.

(Erratum 11/10/2017 : )

A personal friend of mine has pointed out to me, that my recent version of how floating-point numbers work, still contained an error:

Apparently, when a numerical result is obtained, which is too large to be expressed, but not necessarily a division of an ordinary number by zero, this can still be referred to as a ‘Regular Overflow’, but is in fact treated by the CPU as equivalent to Infinity. Meaning, that this result can be used in later operations, as this posting describes the usage of ‘Infinity’, and not, that the result is taken out of the computation, as this posting describes the usage of ‘Not A Number’.

On such a fine detail, I thought that the best way to test this person’s claim would be, just to try it out. Because, even the WiKiPedia could be in error, and, the actual, formal documents, are harder for me to analyze, than it was just to write a few lines of code. So this was the result:

// This exercise is to test, whether a general overflow simply
// leads to infinity, and whether my CPU supports
// denormalized numbers.

#include <iostream>

using std::cout;
using std::endl;

int main() {
	float num1 = 0.0F;
	float num2 = 1.0e+20F;
	double num3 = 1.0e+20;
	// Dividing by zero, to obtain infinity.
	float infinity = 1 / num1;
	// Too large for single-precision: a general overflow.
	float overflow = num2 * num2;
	// The same product fits into double-precision.
	double reg_num = num3 * num3;
	// Too small for an ordinary single-precision number.
	float denorm = 1.0e-20F / num2;
	cout << "Result 1: " << ( 1 / infinity ) << endl;
	cout << "Result 2: " << ( 1 / overflow ) << endl;
	cout << "Result 3: " << ( 1 / reg_num ) << endl;
	cout << "Result 4: " << denorm << endl;
	return 0;
}

dirk@Plato:~/Programs$ ./infin_test_2
Result 1: 0
Result 2: 0
Result 3: 1e-40
Result 4: 9.99995e-41


(Exercise Augmented 11/13/2017 . )

And as the reader can see, my friend was correct.

This usage of the code ‘infinity’ could be contested, because according to certain logic, a number could just be ridiculously large, and not stand for infinity in real life. My example above, of

1 × 10^80

described how this applies to certain problems in Physics and Astronomy. But apparently, what is more practical in Computing is to accept that a result which becomes

1 × 10^40

is just too large to be computed, if we make the mistake of using single-precision floating-point numbers, and that to keep using it as ‘infinity’ then ‘makes more sense’.


If the reader chooses as I did, to test certain low-level behaviors of the CPU, by writing a program in a high-level language such as C++, the fact needs to be considered, that any modern compiler worth its salt, will optimize our C++, in this case. This also means, that if we write numeric literals, these will be expressions within the source-code, which a compiler may recognize the value of before the program even runs, in which case the compiler will simplify, before generating machine-code.

My reason for putting the upper-case letter ‘F’ at the end of the single-precision numeric literals was, the fact that by default, the compiler will start by taking the literals to be double-precision. This means that by default, the compiler will already convert these double-precision constants into single-precision, just because I declared the variables on the left-hand side of the initialization as single-precision. To put an ‘F’ actually forces the numeric value in the source-code, to be read as single-precision by the compiler. ( :1 )

What I found was that if I declared variables of type ‘float’ or ‘double’, and if I initialized those variables from right-hand values that are themselves variables and not constants, this will fool the compiler into allowing the CPU to compute the right-hand side of those definitions at run-time. But if compilers become much more intelligent than they currently are, their translation of C++ into machine-language could just as easily short-circuit my future attempts to test what the CPU does.


(Comment 11/11/2017 : )

The above posting states, that if the exponent is at its most-negative value possible, but the fractional bits not all zero, a denormalized number results, in which the preceding (1) is no longer assumed, before the stored fractional bits.

Even though this detail might seem trivial, I should point out, that this needs to take place in a particular way.

In the case of a 32-bit, single-precision, floating-point number, the smallest-possible exponent, that still leads to ‘an ordinary number’, is actually (-126), and, the smallest-possible, positive number that can exist in this form, will be represented as

0 00000001 0000 0000 0000 0000 0000 000

What we need to watch out for, is that even though

1.0 × 2^-126

was still an ordinary number, the next range of denormalized numbers which need to be possible, would be of the form

0.5 × 2^-126

which would be represented in binary, as

0 00000000 1000 0000 0000 0000 0000 000

What this effectively means is that in practice, stating a field of exponent bits as ‘0’ instead of as ‘1’, will still imply that a power of two is being applied, which stays at (-126) and does not become (-127), since (0.5) will still need to be multiplied by (2^-126) and allow the full range of possible numbers to be represented. The range of denormalized numbers needs to be continuous, with the smallest-possible, ‘ordinary number’.

This could run counter to what the reader might expect, since the numeric value of (0) is still smaller than the numeric value of (1). But the way the binary, floating-point format works, the (power of 2) that results, is the same.

The equivalent phenomenon will take place with 64-bit, double-precision floating-point numbers, when those next lead to their denormalized numbers. The most-negative power of two they can express, will be (-1022), even though the corresponding bit-field could express the number (-1023).


1: ) By default, a C or a C++ compiler will allow a numeric literal to initialize a variable, even if the data-type of the variable is not as precise, as the literal was, without generating any messages.

More specifically, a double-precision literal, or even an integer, can be used to initialize a single-precision floating-point variable in this way, because those glyphs may be the easiest way for the programmer to write, what he wants the variable to be initialized to.

  • In C or C++, any built-in computation performed between an integer and a floating-point number, will lead to a floating-point output, which has the highest precision already specified in the input-values, that are called parameters. And this is called a ‘promotion’, which will take place silently. I used it in the code above.

But the convention above is implemented by the compiler, and not by the CPU itself. Hence, a compiler will convert each of the parameters to the required data-type, before putting opcodes into the machine-language representation of the program, that finally perform the computation between the intended parameters, which will already be of the same data-type as the output. Thus, the CPU’s instruction-set only needs to include opcodes, that convert a single parameter of one type, to the equivalent of another type.

But these fine details are best learned, by taking courses in C or in C++ .


