This section provides an overview of the IEEE-754 32-bit binary floating point format.[?]
Recall that the place values for integer binary numbers are:
... 128 64 32 16 8 4 2 1
We can extend this to the right in binary similar to the way we do for decimal numbers:
... 128 64 32 16 8 4 2 1 . 1/2 1/4 1/8 1/16 1/32 1/64 1/128 ...

The ‘.’ in a binary number is a binary point, not a decimal point.
We use scientific notation, as in \(2.7 \times 10^{-47}\), to express either small fractions or large numbers when we are not concerned with every last digit needed to represent the entire, exact value of a number.
The format of a number in scientific notation is \(mantissa \times base^{exponent}\)
In binary we have \(mantissa \times 2^{exponent}\)
IEEE-754 format requires binary numbers to be normalized to \(1.significand \times 2^{exponent}\) where the significand is the portion of the mantissa that is to the right of the binary-point.
The unnormalized binary value of \(-2.625\) is \(-10.101\)
The normalized value of \(-2.625\) is \(-1.0101 \times 2^1\)
We need not store the ‘1.’ part because all normalized floating point numbers start that way. Thus we can save memory when storing normalized values by omitting the ‘1.’ and reinserting it to the left of the significand whenever a stored value is interpreted.
For example, interpreting the stored fields of \(-2.625\) (sign 1, excess-127 exponent 128, significand \(0101\) followed by zeros) gives:

\(-((1 + \frac {1}{4} + \frac {1}{16}) \times 2^{128-127}) = -((1 + \frac {1}{4} + \frac {1}{16}) \times 2^1) = -(2 + \frac {1}{2} + \frac {1}{8}) = -(2 + .5 + .125) = -2.625\)
IEEE-754 formats:
             | IEEE-754 32-bit     | IEEE-754 64-bit
sign         | 1 bit               | 1 bit
exponent     | 8 bits (excess-127) | 11 bits (excess-1023)
mantissa     | 23 bits             | 52 bits
max exponent | 127                 | 1023
min exponent | -126                | -1022
When the exponent is all ones, the significand is all zeros, and the sign is zero, the number represents positive infinity.
When the exponent is all ones, the significand is all zeros, and the sign is one, the number represents negative infinity.
Observe that a pair of IEEE-754 numbers (when one or both are positive) can be compared for magnitude by treating their bit patterns as two’s complement signed integers. This works because an IEEE number is stored in sign-magnitude format: positive floating point values grow upward and downward in the same fashion as unsigned integers, and a negative floating point value has its MSB set, so it will ‘appear’ to be less than any positive floating point value.

When comparing two negative IEEE float values by treating them both as two’s complement signed integers, the order is reversed: IEEE float values with larger (that is, increasingly negative) magnitudes appear to decrease in value when interpreted as signed integers.

This behavior is intentional; it is why excess notation is used in the format of the exponent and why the sign bit is located on the left of the exponent.1
Note that zero is a special case number. Recall that a normalized number has an implied 1-bit to the left of the significand… which means that there is no way to represent zero! Zero is represented by an exponent of all-zeros and a significand of all-zeros. This definition allows for a positive and a negative zero if we observe that the sign can be either 1 or 0.
On the number-line, numbers between zero and the smallest fraction in either direction are in the underflow areas.
On the number line, numbers whose magnitude exceeds that of a mantissa of all-ones paired with the largest allowed exponent are in the overflow areas.
Note that numbers have a higher resolution on the number line when the exponent is smaller.
The largest and smallest possible exponent values are reserved to represent things requiring special cases: the infinities, values representing “not a number” (such as the result of dividing zero by zero), and values that are not normalized. For more information on special cases see [?].
Due to the finite number of bits used to store the value of a floating point number, it is not possible to represent every one of the infinite values on the real number line. The following C programs illustrate this point.
Just like the integer numbers, the powers of two that have bits to represent them can be represented perfectly… as can their sums (provided that, after normalization, the fraction fits within the 23 stored significand bits.)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x;
    union floatbin y;

    x.f = 1.0;
    while (x.f > 1.0/1024.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f/2.0;
    }
    return 0;
}
1.0000000000 = 3f800000    -1.0000000000 = bf800000
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.2500000000 = 3e800000    -0.2500000000 = be800000
0.1250000000 = 3e000000    -0.1250000000 = be000000
0.0625000000 = 3d800000    -0.0625000000 = bd800000
0.0312500000 = 3d000000    -0.0312500000 = bd000000
0.0156250000 = 3c800000    -0.0156250000 = bc800000
0.0078125000 = 3c000000    -0.0078125000 = bc000000
0.0039062500 = 3b800000    -0.0039062500 = bb800000
0.0019531250 = 3b000000    -0.0019531250 = bb000000
When dealing with decimal values, you will find that they don’t map simply into binary floating point values.
Note how the decimal numbers are not accurately represented as they get larger. The decimal number on line 10 of subsubsection B.4 can be perfectly represented in IEEE format. However, a problem arises in the 11th loop iteration: the binary number cannot be represented accurately in IEEE format. Its least significant bits were truncated in a best-effort attempt at rounding the value to fit into the bits provided. This is an example of low order truncation. Once this happens, the value of x.f is no longer as precise as it could be given more bits in which to save its value.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 10;
    while (x.f <= 10000000000000.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f*10.0;
    }
    return 0;
}
 1             10.0000000000 = 41200000              -10.0000000000 = c1200000
 2            100.0000000000 = 42c80000             -100.0000000000 = c2c80000
 3           1000.0000000000 = 447a0000            -1000.0000000000 = c47a0000
 4          10000.0000000000 = 461c4000           -10000.0000000000 = c61c4000
 5         100000.0000000000 = 47c35000          -100000.0000000000 = c7c35000
 6        1000000.0000000000 = 49742400         -1000000.0000000000 = c9742400
 7       10000000.0000000000 = 4b189680        -10000000.0000000000 = cb189680
 8      100000000.0000000000 = 4cbebc20       -100000000.0000000000 = ccbebc20
 9     1000000000.0000000000 = 4e6e6b28      -1000000000.0000000000 = ce6e6b28
10    10000000000.0000000000 = 501502f9     -10000000000.0000000000 = d01502f9
11    99999997952.0000000000 = 51ba43b7     -99999997952.0000000000 = d1ba43b7
12   999999995904.0000000000 = 5368d4a5    -999999995904.0000000000 = d368d4a5
13  9999999827968.0000000000 = 551184e7   -9999999827968.0000000000 = d51184e7
These rounding errors can be exaggerated when the number we multiply the x.f value by is, itself, something that cannot be accurately represented in IEEE form.2
For example, if we multiply our x.f value by \(\frac {1}{10}\) each time, we can never be accurate and we start accumulating errors immediately.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = .1;
    while (x.f <= 2.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f += .1;
    }
    return 0;
}
0.1000000015 = 3dcccccd    -0.1000000015 = bdcccccd
0.2000000030 = 3e4ccccd    -0.2000000030 = be4ccccd
0.3000000119 = 3e99999a    -0.3000000119 = be99999a
0.4000000060 = 3ecccccd    -0.4000000060 = becccccd
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.6000000238 = 3f19999a    -0.6000000238 = bf19999a
0.7000000477 = 3f333334    -0.7000000477 = bf333334
0.8000000715 = 3f4cccce    -0.8000000715 = bf4cccce
0.9000000954 = 3f666668    -0.9000000954 = bf666668
1.0000001192 = 3f800001    -1.0000001192 = bf800001
1.1000001431 = 3f8cccce    -1.1000001431 = bf8cccce
1.2000001669 = 3f99999b    -1.2000001669 = bf99999b
1.3000001907 = 3fa66668    -1.3000001907 = bfa66668
1.4000002146 = 3fb33335    -1.4000002146 = bfb33335
1.5000002384 = 3fc00002    -1.5000002384 = bfc00002
1.6000002623 = 3fcccccf    -1.6000002623 = bfcccccf
1.7000002861 = 3fd9999c    -1.7000002861 = bfd9999c
1.8000003099 = 3fe66669    -1.8000003099 = bfe66669
1.9000003338 = 3ff33336    -1.9000003338 = bff33336
To use floating point numbers in a program without causing excessive rounding problems, an algorithm can be redesigned so that the accumulation is eliminated. This example is similar to the previous one, but this time we recalculate the desired value from a known-accurate integer value. Some rounding errors remain present, but they cannot accumulate.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;
    int i;

    i = 1;
    while (i <= 20)
    {
        x.f = i/10.0;
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        i++;
    }
    return 0;
}
0.1000000015 = 3dcccccd    -0.1000000015 = bdcccccd
0.2000000030 = 3e4ccccd    -0.2000000030 = be4ccccd
0.3000000119 = 3e99999a    -0.3000000119 = be99999a
0.4000000060 = 3ecccccd    -0.4000000060 = becccccd
0.5000000000 = 3f000000    -0.5000000000 = bf000000
0.6000000238 = 3f19999a    -0.6000000238 = bf19999a
0.6999999881 = 3f333333    -0.6999999881 = bf333333
0.8000000119 = 3f4ccccd    -0.8000000119 = bf4ccccd
0.8999999762 = 3f666666    -0.8999999762 = bf666666
1.0000000000 = 3f800000    -1.0000000000 = bf800000
1.1000000238 = 3f8ccccd    -1.1000000238 = bf8ccccd
1.2000000477 = 3f99999a    -1.2000000477 = bf99999a
1.2999999523 = 3fa66666    -1.2999999523 = bfa66666
1.3999999762 = 3fb33333    -1.3999999762 = bfb33333
1.5000000000 = 3fc00000    -1.5000000000 = bfc00000
1.6000000238 = 3fcccccd    -1.6000000238 = bfcccccd
1.7000000477 = 3fd9999a    -1.7000000477 = bfd9999a
1.7999999523 = 3fe66666    -1.7999999523 = bfe66666
1.8999999762 = 3ff33333    -1.8999999762 = bff33333
2.0000000000 = 40000000    -2.0000000000 = c0000000
1 I know this is true and was done on purpose because Bill Cody, chairman of IEEE committee P754 that designed the IEEE-754 standard, told me so personally circa 1991.
2 Applications requiring accurate decimal values, such as financial accounting systems, can use a packed-decimal numeric format to avoid unexpected oddities caused by the use of binary numbers.