This section provides an overview of the IEEE-754 32-bit binary floating point format.[?]
Recall that the place values for integer binary numbers are:
... 128 64 32 16 8 4 2 1
We can extend this to the right in binary similar to the way we do for decimal numbers:
... 128 64 32 16 8 4 2 1 . 1/2 1/4 1/8 1/16 1/32 1/64 1/128 ...

The ‘.’ in a binary number is a binary point, not a decimal point.
We use scientific notation, as in \(2.7 \times 10^{-47}\), to express either small fractions or large numbers when we are not concerned with every last digit needed to represent the entire, exact value of a number.
The format of a number in scientific notation is \(mantissa \times base^{exponent}\)
In binary we have \(mantissa \times 2^{exponent}\)
IEEE-754 format requires binary numbers to be normalized to \(1.significand \times 2^{exponent}\) where the significand is the portion of the mantissa that is to the right of the binary-point.
The unnormalized binary value of \(-2.625\) is \(-10.101\)
The normalized value of \(-2.625\) is \(-1.0101 \times 2^1\)
We need not store the ‘1.’ part because all normalized floating point numbers start that way. Thus we can save memory when storing normalized values by omitting the ‘1.’ and reinserting it to the left of the significand whenever a stored value is interpreted.
For example, interpreting the stored fields of \(-2.625\) (sign 1, excess-127 exponent 128, significand \(0101\) followed by zeros) gives:

\(-((1 + \frac {1}{4} + \frac {1}{16}) \times 2^{128-127}) = -((1 + \frac {1}{4} + \frac {1}{16}) \times 2^1) = -(2 + \frac {1}{2} + \frac {1}{8}) = -(2 + .5 + .125) = -2.625\)
IEEE-754 formats:
             | IEEE-754 32-bit     | IEEE-754 64-bit
sign         | 1 bit               | 1 bit
exponent     | 8 bits (excess-127) | 11 bits (excess-1023)
mantissa     | 23 bits             | 52 bits
max exponent | 127                 | 1023
min exponent | -126                | -1022
When the exponent is all ones, the significand is all zeros, and the sign is zero, the number represents positive infinity.
When the exponent is all ones, the significand is all zeros, and the sign is one, the number represents negative infinity.
Observe that a pair of IEEE-754 numbers (when one or both are positive) can be compared for magnitude by treating their bit patterns as two’s complement signed integers. This works because an IEEE number is stored in sign-magnitude format: positive floating point values grow upward and downward in the same fashion as unsigned integers, and a negative floating point value has its MSB set, so it will ‘appear’ to be less than any positive floating point value.

When comparing two negative IEEE float values by treating them both as two’s complement signed integers, the order is reversed: IEEE float values with larger (that is, increasingly negative) magnitudes appear to decrease in value when interpreted as signed integers.

This behavior is intentional; it is why excess notation is used in the format of the exponent and why the sign bit is located on the left of the exponent.1
Note that zero is a special case number. Recall that a normalized number has an implied 1-bit to the left of the significand… which means that there is no way to represent zero! Zero is represented by an exponent of all-zeros and a significand of all-zeros. This definition allows for a positive and a negative zero if we observe that the sign can be either 1 or 0.
On the number-line, numbers between zero and the smallest fraction in either direction are in the underflow areas.
On the number line, numbers whose magnitude exceeds that of a mantissa of all-ones paired with the largest allowed exponent are in the overflow areas.
Note that numbers have a higher resolution on the number line when the exponent is smaller.
The largest and smallest possible exponent values are reserved to represent things requiring special cases: the infinities, values representing “not a number” (such as the result of dividing zero by zero), and values that are not normalized. For more information on special cases see [?].
Due to the finite number of bits used to store the value of a floating point number, it is not possible to represent every one of the infinite values on the real number line. The following C programs illustrate this point.
Just like the integer numbers, the powers of two that have bits to represent them can be represented perfectly… as can their sums (provided that, after normalization, the fraction fits within the 23 stored significand bits.)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x;
    union floatbin y;

    x.f = 1.0;
    while (x.f > 1.0/1024.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f/2.0;
    }
    return 0;
}
1.0000000000 = 3f800000    -1.0000000000 = bf800000
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.2500000000 = 3e800000    -0.2500000000 = be800000
0.1250000000 = 3e000000    -0.1250000000 = be000000
0.0625000000 = 3d800000    -0.0625000000 = bd800000
0.0312500000 = 3d000000    -0.0312500000 = bd000000
0.0156250000 = 3c800000    -0.0156250000 = bc800000
0.0078125000 = 3c000000    -0.0078125000 = bc000000
0.0039062500 = 3b800000    -0.0039062500 = bb800000
0.0019531250 = 3b000000    -0.0019531250 = bb000000
When dealing with decimal values, you will find that they don’t map simply into binary floating point values.
Note how the decimal numbers are not accurately represented as they get larger. The decimal number on line 10 of subsubsection B.4 can be perfectly represented in IEEE format. However, a problem arises in the 11th loop iteration: the binary number cannot be represented accurately in IEEE format. Its least significant bits were truncated in a best-effort attempt at rounding the value to fit into the bits provided. This is an example of low order truncation. Once this happens, the value of x.f is no longer as precise as it could be given more bits in which to save its value.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 10;
    while (x.f <= 10000000000000.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f*10.0;
    }
    return 0;
}
 1             10.0000000000 = 41200000              -10.0000000000 = c1200000
 2            100.0000000000 = 42c80000             -100.0000000000 = c2c80000
 3           1000.0000000000 = 447a0000            -1000.0000000000 = c47a0000
 4          10000.0000000000 = 461c4000           -10000.0000000000 = c61c4000
 5         100000.0000000000 = 47c35000          -100000.0000000000 = c7c35000
 6        1000000.0000000000 = 49742400         -1000000.0000000000 = c9742400
 7       10000000.0000000000 = 4b189680        -10000000.0000000000 = cb189680
 8      100000000.0000000000 = 4cbebc20       -100000000.0000000000 = ccbebc20
 9     1000000000.0000000000 = 4e6e6b28      -1000000000.0000000000 = ce6e6b28
10    10000000000.0000000000 = 501502f9     -10000000000.0000000000 = d01502f9
11    99999997952.0000000000 = 51ba43b7     -99999997952.0000000000 = d1ba43b7
12   999999995904.0000000000 = 5368d4a5    -999999995904.0000000000 = d368d4a5
13  9999999827968.0000000000 = 551184e7   -9999999827968.0000000000 = d51184e7
These rounding errors can be exaggerated when the number we multiply the x.f value by is, itself, something that cannot be accurately represented in IEEE form.2
For example, if we multiply our x.f value by \(\frac {1}{10}\) each time, we can never be accurate and we start accumulating errors immediately.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = .1;
    while (x.f <= 2.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f += .1;
    }
    return 0;
}
0.1000000015 = 3dcccccd    -0.1000000015 = bdcccccd
0.2000000030 = 3e4ccccd    -0.2000000030 = be4ccccd
0.3000000119 = 3e99999a    -0.3000000119 = be99999a
0.4000000060 = 3ecccccd    -0.4000000060 = becccccd
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.6000000238 = 3f19999a    -0.6000000238 = bf19999a
0.7000000477 = 3f333334    -0.7000000477 = bf333334
0.8000000715 = 3f4cccce    -0.8000000715 = bf4cccce
0.9000000954 = 3f666668    -0.9000000954 = bf666668
1.0000001192 = 3f800001    -1.0000001192 = bf800001
1.1000001431 = 3f8cccce    -1.1000001431 = bf8cccce
1.2000001669 = 3f99999b    -1.2000001669 = bf99999b
1.3000001907 = 3fa66668    -1.3000001907 = bfa66668
1.4000002146 = 3fb33335    -1.4000002146 = bfb33335
1.5000002384 = 3fc00002    -1.5000002384 = bfc00002
1.6000002623 = 3fcccccf    -1.6000002623 = bfcccccf
1.7000002861 = 3fd9999c    -1.7000002861 = bfd9999c
1.8000003099 = 3fe66669    -1.8000003099 = bfe66669
1.9000003338 = 3ff33336    -1.9000003338 = bff33336
To use floating point numbers in a program without causing excessive rounding problems, an algorithm can be redesigned so that the accumulation is eliminated. This example is similar to the previous one, but this time we recalculate the desired value from a known-accurate integer value. Some rounding errors remain present, but they cannot accumulate.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;
    int i;

    i = 1;
    while (i <= 20)
    {
        x.f = i/10.0;
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        i++;
    }
    return 0;
}
0.1000000015 = 3dcccccd    -0.1000000015 = bdcccccd
0.2000000030 = 3e4ccccd    -0.2000000030 = be4ccccd
0.3000000119 = 3e99999a    -0.3000000119 = be99999a
0.4000000060 = 3ecccccd    -0.4000000060 = becccccd
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.6000000238 = 3f19999a    -0.6000000238 = bf19999a
0.6999999881 = 3f333333    -0.6999999881 = bf333333
0.8000000119 = 3f4ccccd    -0.8000000119 = bf4ccccd
0.8999999762 = 3f666666    -0.8999999762 = bf666666
1.0000000000 = 3f800000    -1.0000000000 = bf800000
1.1000000238 = 3f8ccccd    -1.1000000238 = bf8ccccd
1.2000000477 = 3f99999a    -1.2000000477 = bf99999a
1.2999999523 = 3fa66666    -1.2999999523 = bfa66666
1.3999999762 = 3fb33333    -1.3999999762 = bfb33333
1.5000000000 = 3fc00000    -1.5000000000 = bfc00000
1.6000000238 = 3fcccccd    -1.6000000238 = bfcccccd
1.7000000477 = 3fd9999a    -1.7000000477 = bfd9999a
1.7999999523 = 3fe66666    -1.7999999523 = bfe66666
1.8999999762 = 3ff33333    -1.8999999762 = bff33333
2.0000000000 = 40000000    -2.0000000000 = c0000000
1 I know this is true and was done on purpose because Bill Cody, chairman of IEEE committee P754 that designed the IEEE-754 standard, told me so personally circa 1991.
2 Applications requiring accurate decimal values, such as financial accounting systems, can use a packed-decimal numeric format to avoid unexpected oddities caused by the use of binary numbers.