Appendix B
Floating Point Numbers

B.1 IEEE-754 Floating Point Number Representation

This section provides an overview of the IEEE-754 32-bit binary floating point format.[?]

B.1.1 Floating Point Number Accuracy

Due to the finite number of bits used to store the value of a floating point number, it is not possible to represent every one of the infinite values on the real number line. The following C programs illustrate this point.

B.1.1.1 Powers Of Two

Just as with integers, the powers of two that the format has bits to represent can be represented perfectly, as can their sums (provided that the significand requires no more than 24 bits: the 23 stored fraction bits plus the implicit leading 1).

     
    #include <stdio.h>

    /* Overlay a float and an unsigned int so the float's bits can be printed. */
    union floatbin
    {
        unsigned int    i;
        float           f;
    };

    int main()
    {
        union floatbin  x;
        union floatbin  y;

        x.f = 1.0;
        while (x.f > 1.0/1024.0)    /* print 1.0 down to 1/512 */
        {
            y.f = -x.f;
            printf("%25.10f = %08x     %25.10f = %08x\n", x.f, x.i, y.f, y.i);
            x.f = x.f/2.0;          /* step to the next lower power of two */
        }
        return(0);
    }
     
             1.0000000000 = 3f800000                 -1.0000000000 = bf800000
             0.5000000000 = 3f000000                 -0.5000000000 = bf000000
             0.2500000000 = 3e800000                 -0.2500000000 = be800000
             0.1250000000 = 3e000000                 -0.1250000000 = be000000
             0.0625000000 = 3d800000                 -0.0625000000 = bd800000
             0.0312500000 = 3d000000                 -0.0312500000 = bd000000
             0.0156250000 = 3c800000                 -0.0156250000 = bc800000
             0.0078125000 = 3c000000                 -0.0078125000 = bc000000
             0.0039062500 = 3b800000                 -0.0039062500 = bb800000
             0.0019531250 = 3b000000                 -0.0019531250 = bb000000
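The bit patterns above can be taken apart to see why these values are exact. The following is a minimal sketch rather than one of the original listings: the helper names float_bits, float_exponent, float_fraction, and print_fields are made up for illustration, and it assumes that float is a 32-bit IEEE-754 value. It reads the bits with memcpy instead of a union. A power of two has an all-zero fraction field, and a sum of a few powers of two, such as 0.5 + 0.25, sets only a few fraction bits.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the raw bits of a float.
 * Assumes float is a 32-bit IEEE-754 value. */
uint32_t float_bits(float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Unbiased exponent (meaningful for normalized values only). */
int float_exponent(float f)
{
    return (int)((float_bits(f) >> 23) & 0xff) - 127;
}

/* The 23 stored fraction bits (the leading 1 is implicit). */
uint32_t float_fraction(float f)
{
    return float_bits(f) & 0x7fffff;
}

/* Print the sign, exponent, and fraction fields of one value. */
void print_fields(float f)
{
    printf("%14.10f = %08x  sign=%u exponent=%4d fraction=%06x\n",
           f, (unsigned)float_bits(f), (unsigned)(float_bits(f) >> 31),
           float_exponent(f), (unsigned)float_fraction(f));
}
```

Calling print_fields(0.75f) from a main function should report an exponent of -1 and a fraction of 400000: the implicit leading 1 supplies the 0.5 term and a single fraction bit supplies the 0.25 term. By contrast, 16777217 (2^24 + 1) would need 25 significant bits, so it cannot be stored exactly in this format.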

B.1.1.2 Clean Decimal Numbers

When dealing with decimal values, you will find that they don’t map simply into binary floating point values.

Note how the decimal numbers are no longer represented accurately once they get large enough. The decimal number on line 10 of the output below can be represented perfectly in IEEE format. However, a problem arises in the 11th loop iteration: the value 100000000000 needs 26 significant bits, two more than the 24 that the format provides, so it cannot be represented exactly. Its least significant bits were truncated in a best-effort attempt at rounding the value off in order to fit it into the bits provided. This is an example of low order truncation. Once this happens, the value of x.f is no longer as precise as it could be given more bits in which to save its value.

     
    #include <stdio.h>

    /* Overlay a float and an unsigned int so the float's bits can be printed. */
    union floatbin
    {
        unsigned int    i;
        float           f;
    };

    int main()
    {
        union floatbin  x, y;

        x.f = 10;
        while (x.f <= 10000000000000.0)
        {
            y.f = -x.f;
            printf("%25.10f = %08x     %25.10f = %08x\n", x.f, x.i, y.f, y.i);
            x.f = x.f*10.0;     /* step to the next power of ten */
        }
        return(0);
    }
     
 1             10.0000000000 = 41200000                -10.0000000000 = c1200000
 2            100.0000000000 = 42c80000               -100.0000000000 = c2c80000
 3           1000.0000000000 = 447a0000              -1000.0000000000 = c47a0000
 4          10000.0000000000 = 461c4000             -10000.0000000000 = c61c4000
 5         100000.0000000000 = 47c35000            -100000.0000000000 = c7c35000
 6        1000000.0000000000 = 49742400           -1000000.0000000000 = c9742400
 7       10000000.0000000000 = 4b189680          -10000000.0000000000 = cb189680
 8      100000000.0000000000 = 4cbebc20         -100000000.0000000000 = ccbebc20
 9     1000000000.0000000000 = 4e6e6b28        -1000000000.0000000000 = ce6e6b28
10    10000000000.0000000000 = 501502f9       -10000000000.0000000000 = d01502f9
11    99999997952.0000000000 = 51ba43b7       -99999997952.0000000000 = d1ba43b7
12   999999995904.0000000000 = 5368d4a5      -999999995904.0000000000 = d368d4a5
13  9999999827968.0000000000 = 551184e7     -9999999827968.0000000000 = d51184e7
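One way to see why 10^10 survives while 10^11 does not is to count how many significant bits each power of ten needs. The sketch below is illustrative only, and the function names significant_bits and print_power_of_ten_bits are hypothetical. It discards the trailing zero bits (which the exponent absorbs) and counts what remains; any value needing more than 24 bits cannot fit in the significand of the 32-bit format.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: number of bits from the most significant 1 bit
 * of v down to its least significant 1 bit, inclusive. */
int significant_bits(uint64_t v)
{
    int bits = 0;

    if (v == 0)
        return 0;
    while ((v & 1) == 0)    /* trailing zeros are absorbed by the exponent */
        v >>= 1;
    while (v != 0)          /* count the bits that remain */
    {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Print the bit count for 10^1 through 10^13. */
void print_power_of_ten_bits(void)
{
    uint64_t p = 10;
    int      n;

    for (n = 1; n <= 13; n++)
    {
        int bits = significant_bits(p);

        printf("10^%-2d needs %2d significant bits%s\n", n, bits,
               bits > 24 ? "  (more than 24: must be rounded)" : "");
        p *= 10;
    }
}
```

Under these assumptions 10^10 needs exactly 24 bits (it is 2^10 times 5^10, and 5^10 = 9765625 is a 24-bit number), while 10^11 needs 26 (5^11 = 48828125), which is consistent with the output above going wrong on line 11.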

B.1.1.3 Accumulation of Error

These rounding errors are compounded when the value we repeatedly combine with x.f is, itself, something that cannot be accurately represented in IEEE form (see footnote 2).

For example, if we add \(\frac {1}{10}\) to our x.f value each time, the increment itself is already inexact, so we start accumulating errors immediately.

     
    #include <stdio.h>

    /* Overlay a float and an unsigned int so the float's bits can be printed. */
    union floatbin
    {
        unsigned int    i;
        float           f;
    };

    int main()
    {
        union floatbin  x, y;

        x.f = .1;
        while (x.f <= 2.0)
        {
            y.f = -x.f;
            printf("%25.10f = %08x     %25.10f = %08x\n", x.f, x.i, y.f, y.i);
            x.f += .1;          /* each addition rounds the running total again */
        }
        return(0);
    }
     
             0.1000000015 = 3dcccccd                 -0.1000000015 = bdcccccd
             0.2000000030 = 3e4ccccd                 -0.2000000030 = be4ccccd
             0.3000000119 = 3e99999a                 -0.3000000119 = be99999a
             0.4000000060 = 3ecccccd                 -0.4000000060 = becccccd
             0.5000000000 = 3f000000                 -0.5000000000 = bf000000
             0.6000000238 = 3f19999a                 -0.6000000238 = bf19999a
             0.7000000477 = 3f333334                 -0.7000000477 = bf333334
             0.8000000715 = 3f4cccce                 -0.8000000715 = bf4cccce
             0.9000000954 = 3f666668                 -0.9000000954 = bf666668
             1.0000001192 = 3f800001                 -1.0000001192 = bf800001
             1.1000001431 = 3f8cccce                 -1.1000001431 = bf8cccce
             1.2000001669 = 3f99999b                 -1.2000001669 = bf99999b
             1.3000001907 = 3fa66668                 -1.3000001907 = bfa66668
             1.4000002146 = 3fb33335                 -1.4000002146 = bfb33335
             1.5000002384 = 3fc00002                 -1.5000002384 = bfc00002
             1.6000002623 = 3fcccccf                 -1.6000002623 = bfcccccf
             1.7000002861 = 3fd9999c                 -1.7000002861 = bfd9999c
             1.8000003099 = 3fe66669                 -1.8000003099 = bfe66669
             1.9000003338 = 3ff33336                 -1.9000003338 = bff33336
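To put a number on the drift shown above, the running total can be compared with the value it is meant to approximate. This is a rough sketch under the same assumptions as the listing (the increment is the double constant .1 and float is a 32-bit IEEE-754 type); the function names accumulate_tenths, drift_after, and print_drift are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: start at 0.1 and keep adding 0.1 until the
 * total nominally equals tenths/10, as the loop above does. */
float accumulate_tenths(int tenths)
{
    float x = .1;
    int   i;

    for (i = 1; i < tenths; i++)
        x += .1;
    return x;
}

/* Difference between the running total and tenths/10 computed in double. */
double drift_after(int tenths)
{
    return (double)accumulate_tenths(tenths) - tenths / 10.0;
}

/* Print the drift for each of the 19 totals in the output above. */
void print_drift(void)
{
    int tenths;

    for (tenths = 1; tenths <= 19; tenths++)
        printf("%2d tenths: drift = %.10e\n", tenths, drift_after(tenths));
}
```

With the values in the output above, the total that should be 1.0 is stored as 3f800001, one step above 3f800000, and the gap has grown further by the time the total reaches 1.9.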

B.1.2 Reducing Error Accumulation

To use floating point numbers in a program without causing excessive rounding problems, an algorithm can be redesigned so that the accumulation is eliminated. This example is similar to the previous one, but this time we recalculate the desired value from a known-accurate integer value on every iteration. Some rounding errors remain present, but they cannot accumulate.

     
    #include <stdio.h>

    /* Overlay a float and an unsigned int so the float's bits can be printed. */
    union floatbin
    {
        unsigned int    i;
        float           f;
    };

    int main()
    {
        union floatbin  x, y;
        int             i;

        i = 1;
        while (i <= 20)
        {
            x.f = i/10.0;       /* recalculate from the exact integer each time */
            y.f = -x.f;
            printf("%25.10f = %08x     %25.10f = %08x\n", x.f, x.i, y.f, y.i);
            i++;
        }
        return(0);
    }
     
             0.1000000015 = 3dcccccd                 -0.1000000015 = bdcccccd
             0.2000000030 = 3e4ccccd                 -0.2000000030 = be4ccccd
             0.3000000119 = 3e99999a                 -0.3000000119 = be99999a
             0.4000000060 = 3ecccccd                 -0.4000000060 = becccccd
             0.5000000000 = 3f000000                 -0.5000000000 = bf000000
             0.6000000238 = 3f19999a                 -0.6000000238 = bf19999a
             0.6999999881 = 3f333333                 -0.6999999881 = bf333333
             0.8000000119 = 3f4ccccd                 -0.8000000119 = bf4ccccd
             0.8999999762 = 3f666666                 -0.8999999762 = bf666666
             1.0000000000 = 3f800000                 -1.0000000000 = bf800000
             1.1000000238 = 3f8ccccd                 -1.1000000238 = bf8ccccd
             1.2000000477 = 3f99999a                 -1.2000000477 = bf99999a
             1.2999999523 = 3fa66666                 -1.2999999523 = bfa66666
             1.3999999762 = 3fb33333                 -1.3999999762 = bfb33333
             1.5000000000 = 3fc00000                 -1.5000000000 = bfc00000
             1.6000000238 = 3fcccccd                 -1.6000000238 = bfcccccd
             1.7000000477 = 3fd9999a                 -1.7000000477 = bfd9999a
             1.7999999523 = 3fe66666                 -1.7999999523 = bfe66666
             1.8999999762 = 3ff33333                 -1.8999999762 = bff33333
             2.0000000000 = 40000000                 -2.0000000000 = c0000000
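The two approaches can be compared directly by counting how many representable values separate their results. What follows is a minimal sketch, assuming 32-bit IEEE-754 floats and positive finite inputs (for which neighboring values have neighboring bit patterns); ulp_distance and the other function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Raw bits of a float; assumes a 32-bit IEEE-754 float. */
uint32_t bits_of(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    return u;
}

/* Number of steps from a to b, counted in representable values.
 * Only meaningful when both are positive and finite. */
uint32_t ulp_distance(float a, float b)
{
    uint32_t ua = bits_of(a), ub = bits_of(b);

    return ua > ub ? ua - ub : ub - ua;
}

/* Running sum: 0.1 added repeatedly, so errors can accumulate. */
float by_accumulation(int i)
{
    float x = .1;
    int   k;

    for (k = 1; k < i; k++)
        x += .1;
    return x;
}

/* Recalculated from the integer each time: one rounding per value. */
float by_recalculation(int i)
{
    return i / 10.0;
}

/* Print both results and their separation for 1/10 through 19/10. */
void compare_methods(void)
{
    int i;

    for (i = 1; i <= 19; i++)
        printf("%2d/10: accumulated %08x  recalculated %08x  apart by %u\n",
               i, (unsigned)bits_of(by_accumulation(i)),
               (unsigned)bits_of(by_recalculation(i)),
               (unsigned)ulp_distance(by_accumulation(i), by_recalculation(i)));
}
```

Going by the two output listings, the results for 19/10 are 3ff33336 and 3ff33333, three steps apart, while the recalculated value for 10/10 is exactly 1.0.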

1. I know this is true and was done on purpose because Bill Cody, chairman of IEEE committee P754 that designed the IEEE-754 standard, told me so personally circa 1991.

2. Applications requiring accurate decimal values, such as financial accounting systems, can use a packed-decimal numeric format to avoid unexpected oddities caused by the use of binary numbers.
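As a rough illustration of the packed-decimal idea mentioned in footnote 2, the sketch below stores one decimal digit in each 4-bit nibble of a word. The function names and the choice of a 32-bit word holding eight digits are assumptions made for this example, not the layout of any particular system. Amounts are kept as whole cents, so every decimal digit is stored exactly.

```c
#include <stdint.h>

/* Hypothetical helper: pack the low 8 decimal digits of n, one per nibble.
 * Digits beyond the eighth are silently dropped. */
uint32_t to_packed_bcd(uint32_t n)
{
    uint32_t bcd = 0;
    int      shift;

    for (shift = 0; shift < 32 && n > 0; shift += 4)
    {
        bcd |= (n % 10) << shift;
        n /= 10;
    }
    return bcd;
}

/* Hypothetical helper: convert 8 packed digits back to a binary integer. */
uint32_t from_packed_bcd(uint32_t bcd)
{
    uint32_t n = 0, scale = 1;
    int      shift;

    for (shift = 0; shift < 32; shift += 4)
    {
        n += ((bcd >> shift) & 0xf) * scale;
        scale *= 10;
    }
    return n;
}

/* Add two packed values by way of binary integers; the result is exact
 * as long as it still fits in 8 digits. */
uint32_t packed_bcd_add(uint32_t a, uint32_t b)
{
    return to_packed_bcd(from_packed_bcd(a) + from_packed_bcd(b));
}
```

In this layout 1234 cents is stored as 0x1234, and adding 10 cents (0x10) to 90 cents (0x90) gives exactly 0x100, with none of the drift seen when 0.1 is added repeatedly in binary floating point.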