# Floating Point Representation
 D. Banerji

Department of Computing & Information Science
University of Guelph

Floating point is used for representing real numbers, that is, numbers containing both an
integer part and a fraction part. There are four parts to a floating point representation:
• Sign of the number (convention: 0 for positive, 1 for negative)
• Fraction (or mantissa)
• Base
• Signed exponent

In general, a number $X$ can be expressed as:
$X = (-1)^s \, f \times b^e$, where $s \in \{0,1\}$; $f$ is the fraction, $0 \le f < 1$; $b$ is the base; and $e$ is the
exponent. For example, in the decimal system, $X = 123.45$ can be expressed as:
$X = 0.12345 \times 10^3$. This is not a unique representation. In fact, there is an infinite number
of representations for this number. For example: $X = 0.012345 \times 10^4 = 0.0012345 \times 10^5$,
etc.

In order to have a standard, unique representation (for arithmetic circuits), we use
a normalized representation, where the first digit of the fraction is always non-zero. This
also helps represent the fraction with greater precision because the leading 0s of the
fraction are removed, allowing any trailing, non-zero digits of the fraction to be included
in a fixed word size used for floating point representation. Thus, $0.12345 \times 10^3$ is a
normalized representation. Note:
• More digits allowed in the fraction implies greater precision.
• A larger allowed value for the exponent means a larger range of representation.
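
The normalization rule can be sketched in a few lines of Python. This is only an illustration: the helper name `normalize` is hypothetical, it assumes a positive input, and it uses exact rational arithmetic (`fractions.Fraction`) so that the decimal example is not disturbed by binary rounding.

```python
from fractions import Fraction

def normalize(x, base):
    """Return (f, e) with x == f * base**e and 1/base <= f < 1.

    Hypothetical helper for illustration; assumes x > 0.
    """
    f = Fraction(x)
    e = 0
    while f >= 1:                  # value too large: move the point left
        f /= base
        e += 1
    while f < Fraction(1, base):   # leading zero digit: move the point right
        f *= base
        e -= 1
    return f, e

f, e = normalize(Fraction("123.45"), 10)
print(float(f), e)   # the fraction 0.12345 and the exponent 3
```

Any of the unnormalized forms, such as 0.012345 with exponent 4, reduces to the same pair once it is passed through the two loops.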

Computers have a fixed word length and, therefore, floating point numbers have a
fixed size. Within this fixed size, fraction bits (and, hence, precision) can be traded off
with exponent bits (and, hence, range). Before discussing machine floating point
representation, let us discuss integer vs floating point representation for the same word
size.

Integer representation and arithmetic are exact as long as the result is within the
range of representation. There is no loss of precision in integer arithmetic. On a machine,
it is also faster because of fewer steps involved in integer operations. Hence, for speed
and precision, it is the preferred mode of arithmetic. But, it is limited to integer operands
and results.

Floating point representation is necessary when real values are involved. It can
give us a higher range of representation (depending on the number of exponent bits).
But precision is affected because not all real values can be represented exactly (e.g.,
$0.3_{10}$ does not have an exact representation in base 2).
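
The inexactness of $0.3_{10}$ in base 2 can be observed directly. The sketch below assumes Python's built-in `float`, which is an IEEE double rather than any of the formats discussed here, but the effect is the same: the stored value is the nearest binary fraction, not 3/10 itself.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.3)          # exact value of the binary number actually stored
print(stored == Fraction(3, 10))            # False: 0.3 is only approximated
print(Decimal(0.3))                         # the stored value written out in decimal
print(0.1 + 0.2 == 0.3)                     # False: the rounding errors differ
print(Fraction(0.5) == Fraction(1, 2))      # True: 0.5 is a power of 2, so it is exact
```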

To compare integer and floating point representations, consider a 16-bit word size.
Integer representation:

Bit 15: s (sign)
Bits 14 to 0: value

Range $\approx -2^{15}$ to $2^{15}$ (smallest to largest values)

Floating point representation:

Bit 15: s (sign)
Bits 14 to 9: exponent
Bits 8 to 0: fraction

Exponent: 6 bits ($-31$ to $+31$ in sign-magnitude notation, or $-32$ to $+31$ in 2's
complement).
Fraction (max) $= 0.111111111_2$ ($\approx 1$, but $< 1$).
Hence, the range of representation is $\approx -1 \times 2^{31}$ to $1 \times 2^{31}$, considering sign-magnitude notation
for the exponent. Thus, we have a larger range, but lower precision. We essentially have
only 9 bits in the fraction (bits 0 to 8) to represent the magnitude or value of a number, as
opposed to 15 bits in the integer representation. Thus, we have traded off precision for a
larger range of representation. We could, for example, increase precision by 1 more bit
by stealing 1 bit from the exponent field; this would, however, halve the largest exponent
(from 31 to 15), shrinking the range from about $\pm 2^{31}$ to about $\pm 2^{15}$. These are some of
the rather difficult choices that a designer/computer architect must make.
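
The trade-off can be tabulated with a small sketch. It assumes the 16-bit layout of this example (one sign bit, a sign-magnitude exponent, and a fraction $0 \le f < 1$); the function name `toy_format` is hypothetical.

```python
def toy_format(exp_bits, frac_bits):
    """Largest exponent and smallest fraction step for a toy format.

    Assumes a sign-magnitude exponent and a fraction 0 <= f < 1,
    as in the 16-bit example; hypothetical helper for illustration.
    """
    max_exp = 2 ** (exp_bits - 1) - 1   # one exponent bit holds the exponent's sign
    step = 2.0 ** -frac_bits            # spacing between adjacent fraction values
    return max_exp, step

for exp_bits, frac_bits in [(6, 9), (5, 10)]:
    max_exp, step = toy_format(exp_bits, frac_bits)
    print(f"{exp_bits} exponent bits, {frac_bits} fraction bits: "
          f"range about +/-2^{max_exp}, fraction step {step}")
```

Moving one bit from the exponent to the fraction halves the fraction step but lowers the largest exponent from 31 to 15.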

IBM 370 Floating Point Representation

Single precision format

Uses 32 bits, with the fields laid out as follows:

Bit 0: s (sign)
Bits 1 to 7: exponent
Bits 8 to 31: fraction

For floating point operations, IBM uses base 16 (hex) arithmetic.

Example: $X = 123.45_{10}$

Integer part, by repeated division by 16:

16 | 123    Rem.
16 |   7     B
     0     7

Hence, $123_{10} = 7B_{16}$.

Fraction part, by repeated multiplication by 16 (hex fraction digits):

$0.45 \times 16 = 7.20$    →  0.7
$0.2 \times 16 = 3.2$      →  0.73
$0.2 \times 16 = 3.2$      →  0.733

Therefore, $123.45_{10} \approx 7B.7333_{16} = 0.7B7333 \times 16^2$
s = 0 (positive number)
f = 7B7333
The exponent is in excess-64 notation. It is concatenated with the sign bit and then
written as 2 hex digits.
Excess-64 exponent $= 64 + 2 = 66_{10} = 42_{16}$ (expressed as 7 bits)

0 100 0010  →  $42_{16}$
(sign bit, then exponent)

Therefore, the hex representation for $123.45_{10}$ is: 427B7333

Another Example
$0.5_{10} = 0.8_{16} = 0.8 \times 16^0$
Excess-64 exponent $= 64_{10} = 40_{16}$
Therefore, we have 40800000 as the representation for $0.5_{10}$.
$-0.5_{10}$ → C0800000

Range of representation:
The exponent is 7 bits, so the stored values run from 0 to 127.
• Stored values 0 to 63 represent negative exponents ($-64$ to $-1$) in excess-64 form.
• Stored values 64 to 127 represent positive exponents (0 to 63) in excess-64 form.
Maximum fraction $= 0.FFFFFF_{16} \approx 1$ (but $< 1$)
Therefore, Range $\approx -16^{63}$ to $16^{63}$
In terms of magnitude, very small numbers can be represented, with exponents down to $16^{-64}$.
This representation trades off precision for range. This is because in normalized hex
representation, we can lose up to 3 bits of precision compared to normalized binary representation.
To see this, consider a normalized hex fraction such as:

$0.1XXXXX_{16}$, where $0 \le X \le F$.

In binary: $0.0001\,xxxx\ldots$

The 3 leading zeros in normalized hex cause a loss of precision of 3 bits compared to
normalized binary, where a normalized fraction would have the form $0.1xxx\ldots$
However, with the same number of exponent bits, normalized binary would have a range
of only $\approx -2^{63}$ to $+2^{63}$. Thus, the IBM representation provides greater range at the expense of
precision. IBM tries to address this problem by defining a double-precision representation
using 64 bits, where 56 bits are used for the fraction; however, up to 3 bits of precision are still
lost, compared to double-precision binary representation.
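
How many bits are lost depends on the first hex digit of the normalized fraction. The following sketch (the function name is hypothetical) counts the leading zero bits for each possible first digit; the worst case of 3 bits occurs when that digit is 1.

```python
def leading_zero_bits(first_hex_digit):
    """Number of zero bits before the first 1 in a nonzero hex digit (1..15)."""
    return 4 - first_hex_digit.bit_length()

for digit in range(1, 16):
    print(f"first digit {digit:X}: {digit:04b} -> "
          f"{leading_zero_bits(digit)} leading zero bit(s)")
```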

IEEE Floating Point Representation

This addresses both precision and range:
• 4 floating point formats
• Uses base 2 representation (normalized)
• Normalized mantissa values are $1.000\ldots0_2$ to $1.111\ldots1_2$

Thus, the leading bit is always 1 and, hence, can always be 'assumed' to be
present. Therefore, the leading bit is absent from the representation but the arithmetic
circuits must 'remember' that it exists. This trick allows us to gain one extra bit of
precision.

Single-Precision Representation
s: 1 bit    Exponent: 8 bits    Fraction: 23 bits

The mantissa is really 24 bits long, with the leading 1 bit treated as a phantom bit. The
exponent is in excess-127 form.

Double Precision
s: 1 bit    Exponent: 11 bits    Fraction: 52 bits
The exponent is in excess-1023 form.

Single precision representation for $10.0_{10}$ would be:
$10.0_{10} = 1010.00_2 = 1.01000_2 \times 2^3$
The leading bit is not stored since it is always there.
Exponent in excess-127 form $= 127 + 3 = 130_{10} = 1000\,0010_2$
Sign bit = 0
Hence, the 32-bit representation is:
0100 0001 0010 0000 .... 0 $= 41200000_{16}$

$-10.0_{10}$ would be: $C1200000_{16}$
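
The same bit pattern can be checked with Python's standard `struct` module, which packs a value as an IEEE single. The helper name `ieee_single_fields` is hypothetical; the sketch simply splits the 32-bit word into the three fields described above.

```python
import struct

def ieee_single_fields(x):
    """Return (word, sign, exponent, fraction) for x stored as an IEEE single."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes
    sign = word >> 31
    exponent = (word >> 23) & 0xFF      # 8 bits, excess-127
    fraction = word & 0x7FFFFF          # 23 stored bits, leading 1 is implied
    return word, sign, exponent, fraction

word, sign, exponent, fraction = ieee_single_fields(10.0)
print(f"{word:08X}", sign, exponent, f"{fraction:023b}")
print(f"{ieee_single_fields(-10.0)[0]:08X}")
```

For 10.0 the stored exponent is 130 and the stored fraction begins 010, since the leading 1 of 1.01 is not stored.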

Zero is represented as: sign (0 or 1), exponent of all zero bits, mantissa of all zero bits
(i.e., as $00000000_{16}$ or $80000000_{16}$). Thus, we can have a positive or negative zero.

Special Representations in IEEE Format

$+\infty$ : 0 11111111 000…0
$-\infty$ : 1 11111111 000…0
$1$ : 0 01111111 000…0
$-1$ : 1 01111111 000…0
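
These special patterns, together with the two zeros, can be printed in hex as a quick check. The sketch uses the standard `struct` module; the helper name `single_hex` is hypothetical.

```python
import struct

def single_hex(x):
    """Hex string of x stored as a big-endian IEEE single."""
    return struct.pack(">f", x).hex().upper()

for label, value in [("+inf", float("inf")), ("-inf", float("-inf")),
                     ("+1", 1.0), ("-1", -1.0), ("+0", 0.0), ("-0", -0.0)]:
    print(f"{label:>4}: {single_hex(value)}")
```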

Floating Point Arithmetic

We will use the IBM/370 representation and briefly describe the floating point
operations.

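1) Addition (positive operands)
$X = f_x \times b^{e_x}$, $Y = f_y \times b^{e_y}$, $0 \le f_x, f_y < 1$
Assuming $e_x \ge e_y$, the sum $Z$ is given by

$Z = X + Y = \left( f_x + f_y \times b^{-(e_x - e_y)} \right) \times b^{e_x}$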

An algorithm for addition can be given as:
1. $k \leftarrow e_x - e_y$
2. If $k \ge 0$, shift $f_y$ right by $k$ places;
   else shift $f_x$ right by $|k|$ places.
3. $f = f_x + \text{shifted } f_y$ (if $k \ge 0$)
   or $f = f_y + \text{shifted } f_x$ (if $k < 0$)
4. $e = \max(e_x, e_y)$
5. Normalize the result and adjust $e$.

Example
$X = 0.5_{10} = 0.8_{16}$
IBM/370 representation: 40800000
$Y = 1.75_{10} = 1.C_{16}$ → 411C0000
$e_y > e_x$
$e_y - e_x = (41 - 40)_{16} = 1_{16}$
Hence, shift $f_x$ right by 1 hex digit:

fy:           1C0000
shifted fx: + 080000
              ------
              240000 (hex)

Result = 41240000, which represents $0.24_{16} \times 16^1 = 2.4_{16} = 2.25_{10}$
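
The addition steps can be written out as a Python sketch operating on 32-bit words held as integers. The name `ibm_add` is hypothetical, both operands are assumed positive and normalized, digits shifted out are simply dropped, and exponent overflow is not checked.

```python
def ibm_add(x, y):
    """Add two positive IBM/370 single precision words (illustrative sketch)."""
    ex, fx = (x >> 24) & 0x7F, x & 0xFFFFFF   # excess-64 exponent, 24-bit fraction
    ey, fy = (y >> 24) & 0x7F, y & 0xFFFFFF
    k = ex - ey
    if k >= 0:
        fy >>= 4 * k          # align: one hex digit is 4 bits
    else:
        fx >>= 4 * -k
    f = fx + fy
    e = max(ex, ey)
    if f > 0xFFFFFF:          # carry out of the fraction: renormalize
        f >>= 4
        e += 1
    while f and f < 0x100000: # leading hex digit is zero: shift left
        f <<= 4
        e -= 1
    return (e << 24) | f

print(f"{ibm_add(0x40800000, 0x411C0000):08X}")   # 41240000
```

Adding 0.5 to itself exercises the carry branch: the fraction sum overflows 24 bits and the result is renormalized to 41100000, which is $0.1_{16} \times 16^1 = 1.0$.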

2) Subtraction
Y = 411C0000
X = 40800000
Perform $D = Y - X$
Exponent of $D$ = 41 = max(41, 40)
$e_y - e_x = 1$
Hence, shift $f_x$ right by 1 hex digit:

fy:           1C0000
shifted fx: - 080000
              ------
              140000

Therefore, D = 41140000 → $0.14_{16} \times 16^1 = 1.4_{16} = 1.25_{10}$
Note: $1.75_{10} - 0.5_{10} = 1.25_{10}$

In case of $X - Y$:
Since $X < Y$, perform $Y - X$ as previously, and set the sign bit of $D$ to 1.
Hence D: C1140000 → $-0.14_{16} \times 16^1 = -1.4_{16} = -1.25_{10}$
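
A sketch of subtraction for two positive operands follows. Instead of comparing the operands first, it aligns the fractions, subtracts, and takes the sign from the difference, which gives the same result as swapping the operands and setting the sign bit. The name `ibm_sub` is hypothetical and shifted-out digits are dropped.

```python
def ibm_sub(x, y):
    """Compute x - y for two positive IBM/370 single precision words (sketch)."""
    ex, fx = (x >> 24) & 0x7F, x & 0xFFFFFF
    ey, fy = (y >> 24) & 0x7F, y & 0xFFFFFF
    e = max(ex, ey)
    fx >>= 4 * (e - ex)       # align the operand with the smaller exponent
    fy >>= 4 * (e - ey)
    diff = fx - fy
    if diff == 0:
        return 0              # true zero
    sign = 1 if diff < 0 else 0
    f = abs(diff)
    while f < 0x100000:       # post-normalize: leading hex digit must be non-zero
        f <<= 4
        e -= 1
    return (sign << 31) | (e << 24) | f

Y, X = 0x411C0000, 0x40800000
print(f"{ibm_sub(Y, X):08X}")   # 41140000
print(f"{ibm_sub(X, Y):08X}")   # C1140000
```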

3) Multiplication (positive operands)
$X = f_x \times b^{e_x}$
$Y = f_y \times b^{e_y}$
$P = X \cdot Y = f_x \times f_y \times b^{e_x + e_y}$

If exponents are in "excess-n" form, then the exponent of $P$ must be adjusted as $e_x + e_y - n$.

Example:
X: 40800000 ($0.5_{10}$)
Y: 40800000
Then the exponent of $P = (e_x + e_y - 40)_{16} = 40_{16}$
$f_x \times f_y = 400000_{16}$ because
$(0.8 \times 0.8)_{16} = 0.40_{16}$
Therefore, P = 40400000 → $0.4 \times 16^0 = 0.4_{16} = 0.25_{10}$

Multiplication may require post normalization of the fraction, as in the following
example:

$X = 0.125_{10} = 0.2_{16}$
X: 40200000
Let Y: 40200000
Exponent of product $P = (40 + 40 - 40)_{16} = 40_{16}$
$f_x \times f_y = 0.040000_{16}$ since
$(0.2 \times 0.2)_{16} = 0.04_{16}$
Therefore, P: 40040000 (unnormalized) → 3F400000 (normalized)
Thus, $P$ has the value $0.4 \times 16^{-1} = 0.04_{16} = 0.015625_{10}$,
which is the correct result.
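
Both multiplication examples can be reproduced with a short sketch. The name `ibm_mul` is hypothetical; the sketch assumes positive normalized operands, normalizes the full 48-bit product before truncating it to 24 bits, and does not check for exponent overflow or underflow.

```python
def ibm_mul(x, y):
    """Multiply two positive IBM/370 single precision words (illustrative sketch)."""
    ex, fx = (x >> 24) & 0x7F, x & 0xFFFFFF
    ey, fy = (y >> 24) & 0x7F, y & 0xFFFFFF
    e = ex + ey - 64              # remove the excess that was counted twice
    p = fx * fy                   # 48-bit product of the two 24-bit fractions
    if p == 0:
        return 0
    while p < (1 << 44):          # leading hex digit of the product is zero
        p <<= 4
        e -= 1
    f = p >> 24                   # keep the top six hex digits
    return (e << 24) | f

print(f"{ibm_mul(0x40800000, 0x40800000):08X}")   # 40400000
print(f"{ibm_mul(0x40200000, 0x40200000):08X}")   # 3F400000
```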

4) Division (positive operands)
$X = f_x \times b^{e_x}$
$Y = f_y \times b^{e_y}$
Quotient $Z = \dfrac{X}{Y} = \dfrac{f_x}{f_y} \times b^{e_x - e_y}$
Example: Using IBM/370 representation.
$X = 5.0_{10} = 5.0_{16} = 0.5 \times 16^1$
Hence, X: 41500000
$Y = 2.5_{10} = 2.8_{16} = 0.28 \times 16^1$
Y: 41280000

Since the exponents are in excess-64 form, the difference of exponents $(e_x - e_y)$
cancels out the excess value. This must be restored. Hence,
$e = e_x - e_y + 40 = 41 - 41 + 40 = 40_{16}$
$\dfrac{f_x}{f_y} = \left( \dfrac{50}{28} \right)_{16} = 2.0_{16} = 0.2 \times 16^1$

Therefore, re-adjust exponent $e$ to $40 + 1 = 41$.
Z: 41200000 → $0.2 \times 16^1 = 2_{16} = 2_{10}$
The same process works if the quotient is $< 1$.
If $Z = \dfrac{Y}{X}$, then the exponent of $Z = 41 - 41 + 40 = 40_{16}$
$\dfrac{f_y}{f_x} = \left( \dfrac{28}{50} \right)_{16} = 0.8_{16}$

Hence, Z: 40800000 → $0.8_{16} = 0.5_{10}$
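
Division can be sketched the same way. The name `ibm_div` is hypothetical; the sketch assumes positive normalized operands with a non-zero divisor, truncates the quotient, and does not check the exponent range. For normalized operands the fraction quotient lies between 1/16 and 16, so at most one right shift is needed.

```python
def ibm_div(x, y):
    """Divide two positive IBM/370 single precision words, x / y (sketch)."""
    ex, fx = (x >> 24) & 0x7F, x & 0xFFFFFF
    ey, fy = (y >> 24) & 0x7F, y & 0xFFFFFF
    e = ex - ey + 64              # restore the excess cancelled by the subtraction
    q = (fx << 24) // fy          # fraction quotient scaled to six hex digits
    if q > 0xFFFFFF:              # quotient >= 1: shift right one hex digit
        q >>= 4
        e += 1
    return (e << 24) | q

X, Y = 0x41500000, 0x41280000     # 5.0 and 2.5
print(f"{ibm_div(X, Y):08X}")     # 41200000
print(f"{ibm_div(Y, X):08X}")     # 40800000
```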

Note: You should try some floating point arithmetic using the IEEE representation.
