Floating-Point Format Examples page 1 of 2
ENCM 369 Winter 2008: Floating-Point Format Examples
(handout for L01 lecture Thu., Mar. 6)
Author: Dr. S. A. Norman. Electronic copies of handouts for this course can be found at
http://www.enel.ucalgary.ca/People/Norman/encm369winter2008/
Introduction: This handout provides some useful examples of numbers being either approximately or exactly
represented in IEEE 754 floating-point formats. The material is complementary to the lecture notes—please read it
on your own time.
IEEE 754 representations of π: Most students oriented towards mathematics and science know that π ≈
3.14159265358979323846 can’t be represented exactly with a finite number of base ten digits. It also can’t be
represented in base two with a finite number of bits. Here is an approximation, with 64 bits to the right of the
“binary point”:
π ≈ 11.0010 0100 0011 1111 0110 1010 1000 1000 1000 0101 1010 0011 0000 1000 1101 0011
If we write this in base two normalized scientific notation we have:
π ≈ 1.1001 0010 0001 1111 1011 0101 0100 0100 0100 0010 1101 0001 1000 0100 0110 1001 1 × 2^1
Above we have 65 fraction bits.
Let’s determine the best possible IEEE 754 single-precision approximation to π. We can only put 23 fraction bits
into our number. From above, the fraction bits should be based on the fraction
0.1001 0010 0001 1111 1011 010   1 0100 0100 0100 0010 1101 0001 1000 0100 0110 1001 1,
where I have put a wide space between the 23rd and 24th bits to the right of the binary point. The bits from the
24th onward begin 1010, which is more than half of one unit in the 23rd place, so rounding up will be slightly
more accurate than rounding down, and the fraction bits in our single-precision number should be
1001_0010_0001_1111_1011_011
π is positive, so the sign bit will be 0. The real exponent is 1 and the exponent bias for single precision is 127ten,
so the biased exponent will be 128ten, providing
exponent bits 1000_0000. So the overall bit pattern for the best 32-bit IEEE 754 approximation to π is
0_1000_0000_1001_0010_0001_1111_1011_011
Note that this number is slightly larger than the actual value of π.
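The bit pattern above can be checked mechanically. The following Python sketch is not part of the original handout. It assumes the standard struct module and uses math.pi (itself a double-precision approximation to π) as the input; the helper name single_fields is made up for this illustration.

```python
import math
import struct

def single_fields(x):
    """Return (sign, exponent, fraction) bit strings of x rounded to IEEE 754 single precision."""
    # Pack x as a big-endian 32-bit float, then reinterpret the same
    # four bytes as an unsigned 32-bit integer so we can look at the bits.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    # 1 sign bit, 8 exponent bits, 23 fraction bits.
    return bits[0], bits[1:9], bits[9:]

sign, exponent, fraction = single_fields(math.pi)
print(sign, exponent, fraction)

# Subtracting the single-precision bias (127) from the biased exponent
# recovers the real exponent.
print(int(exponent, 2) - 127)
```

The three strings printed should match the sign, exponent, and fraction fields worked out by hand above.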
Now let’s try a 64-bit IEEE 754 double-precision approximation. Again, the fraction should be based on
0.1001 0010 0001 1111 1011 0101 0100 0100 0100 0010 1101 0001 1000   0100 0110 1001 1;
this time the big space is between the 52nd and 53rd bits of the fraction, because we can now use 52 bits of fraction.
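In this case the bits from the 53rd onward begin 0100, which is less than half of one unit in the 52nd place, so the best approximation will be obtained by rounding down, and we should use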
1001_0010_0001_1111_1011_0101_0100_0100_0100_0010_1101_0001_1000
as the fraction bits. The sign bit will be 0, just as in the single-precision case. The exponent bias for double precision
is 1023ten, so the biased exponent is 1024ten, and the eleven-bit pattern for the exponent will be 100_0000_0000. So
the overall 64-bit pattern will be
0_100_0000_0000_1001_0010_0001_1111_1011_0101_0100_0100_0100_0010_1101_0001_1000
This number is a tiny amount smaller than the actual value of π; it’s a much better approximation than the single-
precision approximation, but it’s still just an approximation.
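The two rounding decisions for π (up for single precision, down for double precision) can be reproduced with integer arithmetic on the 65 fraction bits listed earlier. This is a minimal sketch, not the handout's own method: the names PI_FRACTION and round_fraction are invented here, and the function assumes round-to-nearest with ties going to the even neighbour, which is the usual IEEE 754 default.

```python
# The 65 fraction bits of pi from the normalized form in the text.
PI_FRACTION = (
    "1001 0010 0001 1111 1011 0101 0100 0100 "
    "0100 0010 1101 0001 1000 0100 0110 1001 1"
).replace(" ", "")

def round_fraction(bits, keep):
    """Round a string of fraction bits to `keep` bits (round to nearest, ties to even).

    Assumes 0 < keep < len(bits). A carry out of the fraction (all kept bits
    being 1 and rounding up) is not handled in this sketch.
    """
    kept = int(bits[:keep], 2)
    rest = bits[keep:]
    rest_value = int(rest, 2)
    half = 1 << (len(rest) - 1)   # value of exactly half a unit in the last kept place
    if rest_value > half or (rest_value == half and kept % 2 == 1):
        kept += 1                 # round up
    return format(kept, "0{}b".format(keep))

print(round_fraction(PI_FRACTION, 23))   # single-precision fraction bits
print(round_fraction(PI_FRACTION, 52))   # double-precision fraction bits
```

With 23 bits kept the result differs from a plain truncation in its last bit; with 52 bits kept it is identical to the truncation.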
IEEE 754 representations of 0.6: The number 0.6 is of course much less famous and interesting than π, but trying
to represent it in base two floating-point will make an important point: Some numbers that are easily represented
with a finite number of base ten digits can’t be represented exactly in base two. If we try to write 0.6ten in base two,
it turns out that
0.6 = 1/2 + 0/4 + 0/8 + 1/16 + 1/2^5 + 0/2^6 + 0/2^7 + 1/2^8 + 1/2^9 + 0/2^10 + 0/2^11 + 1/2^12 + ··· = (0.1001 1001 1001 ···)two,
where the bit sequence 1001 repeats forever. So it’s not going to be possible to represent 0.6 exactly in base two
floating-point.
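One way to see the repeating pattern is to generate the bits by repeated doubling. The sketch below is an illustration added to the handout text, not something from the lecture; it uses Python's fractions.Fraction so that 0.6 is held exactly as 3/5, and the function name binary_fraction_bits is hypothetical.

```python
from fractions import Fraction

def binary_fraction_bits(value, count):
    """Return the first `count` bits to the right of the binary point, for 0 <= value < 1."""
    bits = []
    for _ in range(count):
        value *= 2              # shift the binary point one place to the right
        bit = int(value >= 1)   # the integer part is the next bit
        bits.append(str(bit))
        value -= bit            # keep only the fractional part
    return "".join(bits)

# 0.6 is exactly 3/5; using Fraction avoids any floating-point rounding here.
print(binary_fraction_bits(Fraction(3, 5), 12))
```

After four doublings the remaining fractional part is 3/5 again, which is why the group 1001 must repeat forever.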
Normalizing and keeping 56 fraction bits we have 0.6ten as approximately
1.0011 0011 0011 0011 0011 001   1 0011 0011 0011 0011 0011 0011 0011   0011 × 2^−1
Above there are wide spaces between the 23rd and 24th fraction bits and between the 52nd and 53rd fraction bits.
The best 23-bit fraction to use will be obtained by rounding up, but the best 52-bit fraction will involve rounding
down. The sign bit will of course be 0 in both cases. The exponent bits will be appropriately biased representations
of −1: 0111_1110 for single precision and 011_1111_1110 for double precision. So the overall bit pattern for the
best single-precision approximation to 0.6 is
0_0111_1110_0011_0011_0011_0011_0011_010
And the best double-precision approximation will be
0_011_1111_1110_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011
Taking our floating-point representations back to base ten, it turns out that the single-precision value is exactly equal
to
0.60000002384185791015625
and the double-precision value is exactly
0.59999999999999997779553950749686919152736663818359375.
Obviously, both of these numbers are very close to but not exactly equal to 0.6.
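The exact base ten values quoted above can be reproduced in a few lines. This sketch is an addition to the handout and assumes a Python implementation whose float type is an IEEE 754 double, which is the case on typical platforms; it relies on the fact that constructing a decimal.Decimal from a float converts the stored binary value exactly.

```python
import struct
from decimal import Decimal

# Round 0.6 to single precision by packing it into 4 bytes and unpacking again.
(single,) = struct.unpack(">f", struct.pack(">f", 0.6))

# A Python float literal is stored as a double-precision value.
double = 0.6

# Decimal(float) performs an exact conversion of the stored binary value.
print(Decimal(single))
print(Decimal(double))
```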
IEEE 754 representations of 1.0: The single-precision and double-precision bit patterns for 1.0 are
0_0111_1111_0000_0000_0000_0000_0000_000
and
0_011_1111_1111_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000.
(You should do the work to understand exactly why these are correct.)
Of course, the 32-bit and 64-bit integer representations of 1 would be:
0000_0000_0000_0000_0000_0000_0000_0001
and
0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0001
This is a simple example illustrating a simple point: Although integers can be represented in floating-point format,
the floating-point bit patterns look totally different from the bit patterns that would be used in integer representations
of the same numbers.
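The contrast between the floating-point and integer patterns for 1 can also be printed directly. The following Python sketch is an illustrative addition, assuming the standard struct module with big-endian, standard-size formats; the helper name bits_of is made up here.

```python
import struct

def bits_of(fmt, value):
    """Pack `value` using struct format `fmt` (big-endian, standard size) and return its bits."""
    raw = struct.pack(">" + fmt, value)
    return "".join(format(byte, "08b") for byte in raw)

print(bits_of("f", 1.0))   # single-precision floating-point 1.0
print(bits_of("i", 1))     # 32-bit integer 1
print(bits_of("d", 1.0))   # double-precision floating-point 1.0
print(bits_of("q", 1))     # 64-bit integer 1
```

The strings come out without the underscore separators used in the handout, but the bits themselves should agree with the four patterns shown above.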