Posted on 2/26/2010. Public domain.
Floating Point Representation
D. Banerji
Department of Computing & Information Science
University of Guelph

Floating point is used for representing real numbers, those containing both an integer part and a fraction part. There are four parts to a floating-point representation:

  - Sign of the number (convention: 0 for positive, 1 for negative)
  - Fraction (or mantissa)
  - Base (or radix)
  - Signed exponent

In general, a number $X$ can be expressed as

  $X = (-1)^s \times f \times b^e$,

where $s \in \{0, 1\}$; $f$ is the fraction, $0 \le f < 1$; $b$ is the base; and $e$ is the exponent.

For example, in the decimal system, $X = 123.45$ can be expressed as $X = 0.12345 \times 10^3$. This is not a unique representation. In fact, there are infinitely many representations of this number, for example $X = 0.012345 \times 10^4 = 0.0012345 \times 10^5$, etc.

In order to have a standard, unique representation (for arithmetic circuits), we use a normalized representation, where the first digit of the fraction is always non-zero. This also helps represent the fraction with greater precision: the leading 0s of the fraction are removed, so more trailing non-zero digits of the fraction fit in the fixed word size used for the floating-point representation. Thus, $0.12345 \times 10^3$ is a normalized representation.

Note:

  - More digits allowed in the fraction implies greater precision.
  - A larger allowed value for the exponent means a larger range of representation.

Computers have a fixed word length and, therefore, floating-point numbers have a fixed size. Within this fixed size, fraction bits (and, hence, precision) can be traded off against exponent bits (and, hence, range).

Before discussing machine floating-point representation, let us compare integer and floating-point representation for the same word size. Integer representation and arithmetic are exact as long as the result is within the range of representation. There is no loss of precision in integer arithmetic. On a machine, it is also faster because fewer steps are involved in integer operations.
Hence, for speed and precision, it is the preferred mode of arithmetic. But it is limited to integer operands and results.

Floating-point representation is necessary when real values are involved. It can give us a larger range of representation (depending on the number of exponent bits). But precision is affected because not all real values can be represented exactly (e.g., $0.3_{10}$ does not have an exact representation in base 2).

To compare integer and floating-point representations, consider a 16-bit word size.

Integer representation:

  Bits:   15 | 14 ... 0
  Field:  s  | value

Range: about $-2^{15}$ to $+2^{15}$ (smallest to largest values); in 2's complement, exactly $-2^{15}$ to $2^{15} - 1$.

Floating-point representation:

  Bits:   15 | 14 ... 9 | 8 ... 0
  Field:  s  | exponent | fraction

Exponent: 6 bits ($-31$ to $+31$ in sign-magnitude notation, or $-32$ to $+31$ in 2's complement).
Fraction (max) $= 0.111111111_2$ ($\approx 1$, but $< 1$).
Hence, the range of representation is approximately $-1 \times 2^{31}$ to $1 \times 2^{31}$, considering sign-magnitude notation for the exponent.

Thus, we have a larger range, but lower precision. We essentially have only 9 bits in the fraction (bits 0 to 8) to represent the magnitude of a number, as opposed to 15 bits in the integer representation. We have traded off precision for a larger range of representation. We could, for example, increase precision by 1 more bit by stealing 1 bit from the exponent field; this would, however, halve the range of exponents (from $\pm 31$ to $\pm 15$), shrinking the largest representable magnitude from about $2^{31}$ to about $2^{15}$. These are some of the rather difficult choices that a designer or computer architect must make.

IBM 370 Floating Point Representation

The single-precision format uses 32 bits, with the fields laid out as follows:

  Bits:   0 | 1 ... 7  | 8 ... 31
  Field:  s | exponent | fraction

For floating-point operations, IBM uses base 16 (hex) arithmetic.

Example: $X = 123.45_{10}$

Integer part, by repeated division by 16:

  123 / 16 = 7, remainder 11 (B)
    7 / 16 = 0, remainder 7

Reading the remainders from last to first, $123_{10} = 7B_{16}$.

Fraction part, by repeated multiplication by 16:

  0.45 x 16 = 7.20    hex fraction digits so far: 0.7
  0.20 x 16 = 3.2     hex fraction digits so far: 0.73
  0.2  x 16 = 3.2     hex fraction digits so far: 0.733

Therefore, $123.45_{10} \approx 7B.7333_{16} = 0.7B7333 \times 16^2$.

  s = 0 (positive number)
  f = 7B7333

The exponent is in excess-64 notation. It is concatenated with the sign bit and then written as 2 hex digits.
Excess-64 exponent $= 64 + 2 = 66_{10} = 42_{16}$ (expressed as 7 bits). Sign bit 0 followed by the exponent bits 100 0010 gives 0100 0010, i.e. $42_{16}$.

Therefore, the hex representation for $123.45_{10}$ is: 427B7333

Another example: $0.5_{10} = 0.8_{16} = 0.8 \times 16^0$
Excess-64 exponent $= 64_{10} = 40_{16}$
Therefore, we have 40800000 as the representation for $0.5_{10}$.
And $-0.5_{10}$? C0800000 (the same word with the sign bit set).

Range of representation

The exponent is 7 bits, so its stored value runs from 0 to 127:

  Stored values 0, 1, 2, ..., 63:       negative exponents in excess-64 form ($-64$ to $-1$)
  Stored values 64, 65, 66, ..., 127:   non-negative exponents in excess-64 form (0 to 63)

Maximum fraction $= 0.FFFFFF_{16} \approx 1$ (but $< 1$).
Therefore, the range is approximately $-16^{63}$ to $16^{63}$.
In terms of magnitude, very small numbers can also be represented, scaled by as little as $16^{-64}$.

This representation gains range at the cost of precision. In normalized hex representation, we can lose up to 3 bits of precision compared to normalized binary representation. To see this, consider a normalized hex fraction of the form $0.1XXXXX_{16}$, where $0 \le X \le F$. In binary, this is $0.0001\,xxxx\ldots$

The 3 leading zeros in normalized hex cause a loss of precision of 3 bits compared to normalized binary, where a normalized fraction would have the form $0.1xxx\ldots$ However, with the same number of exponent bits, normalized binary would have a range of only $-2^{63}$ to $+2^{63}$. Thus, the IBM representation provides greater range at the expense of precision. IBM addresses this problem by defining a double-precision representation, using 64 bits, of which 56 bits are used for the fraction; however, up to 3 bits of precision are still lost compared to a double-precision binary representation.

IEEE Floating Point Representation

  - This addresses both precision and range.
  - It defines 4 floating-point formats.
  - It uses base 2 representation (normalized).
  - Normalized mantissa values are $1.000\ldots0_2$ to $1.111\ldots1_2$.

Thus, the leading bit is always 1 and, hence, can always be 'assumed' to be present. Therefore, the leading bit is absent from the representation, but the arithmetic circuits must 'remember' that it exists. This trick allows us to gain one extra bit of precision.
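To see the implicit leading bit at work, the following minimal Python sketch unpacks a value stored in 32-bit IEEE single precision and rebuilds it by putting the hidden 1 back. It assumes the usual single-precision layout (1 sign bit, 8 exponent bits in excess-127 form, 23 stored fraction bits) and handles normalized numbers only; the function names `decode_single` and `rebuild` are illustrative, not part of any standard API.

```python
import struct

def decode_single(x):
    """Split x (rounded to IEEE single precision) into its three stored fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored (excess-127) exponent, 8 bits
    fraction = bits & 0x7FFFFF       # the 23 stored fraction bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Restore the hidden leading 1 and recompute the value (normalized numbers only)."""
    mantissa = 1 + fraction / 2**23  # 1.fraction in binary
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

s, e, f = decode_single(1.5)         # 1.5 = 1.1 (binary) x 2^0
print(s, e, format(f, "023b"))       # only the bits after the leading 1 are stored
print(rebuild(s, e, f))
```

For 1.5 the stored fraction field holds a single 1 bit followed by zeros; the integer part of the mantissa never appears in the word, yet `rebuild` recovers the value exactly.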
Single-Precision Representation

  Field:  s     | Exponent | Fraction
  Width:  1 bit | 8 bits   | 23 bits

The mantissa is really 24 bits long, with the leading 1 bit treated as a phantom bit. The exponent is in excess-127 form.

Double Precision

  Field:  s     | Exponent | Fraction
  Width:  1 bit | 11 bits  | 52 bits

The exponent is in excess-1023 form.

The single-precision representation for $10.0_{10}$ would be:

  $10.0_{10} = 1010.00_2 = 1.01000_2 \times 2^3$

The leading bit is not stored since it is always there.
Exponent in excess-127 form $= 127 + 3 = 130_{10} = 1000\,0010_2$
Sign bit = 0
Hence, the 32-bit representation is: 0100 0001 0010 0000 ... 0 $= 41200000_{16}$
$-10.0_{10}$ would be: $C1200000_{16}$

Zero is represented as: sign (0 or 1), exponent of all zero bits, mantissa of all zero bits (i.e., as $00000000_{16}$ or $80000000_{16}$). Thus, we can have a positive or a negative zero.

Special Representations in IEEE Format

  $+\infty$ : 0 11111111 000...0
  $-\infty$ : 1 11111111 000...0
  $+1$ : 0 01111111 000...0
  $-1$ : 1 01111111 000...0

Floating Point Arithmetic

We will use the IBM/370 representation and briefly describe the floating-point operations.

1) Addition

  $X = f_x \times b^{e_x}$, $Y = f_y \times b^{e_y}$, $0 \le f_x, f_y < 1$

Assuming $e_x \ge e_y$, the sum $Z$ is given by

  $Z = X + Y = (f_x + f_y \times b^{-(e_x - e_y)}) \times b^{e_x}$

An algorithm for addition can be given as:

  1. $k \leftarrow e_x - e_y$
  2. If $k \ge 0$, shift $f_y$ right by $k$ places; else shift $f_x$ right by $|k|$ places.
  3. $f = f_x + \text{shifted } f_y$ (if $k \ge 0$) or $f = f_y + \text{shifted } f_x$ (if $k < 0$)
  4. $e = \max(e_x, e_y)$
  5. Normalize the result and adjust $e$.

Example

  $X = 0.5_{10} = 0.8_{16}$, IBM/370 representation: 40800000
  $Y = 1.75_{10} = 1.C_{16}$, IBM/370 representation: 411C0000

Here $e_y > e_x$, with $e_y - e_x = (41 - 40)_{16} = 1_{16}$. Hence, shift $f_x$ right by 1 hex digit:

  fy:           1C0000
  + shifted fx: 080000
  --------------------
                240000 (hex)

Result = 41240000, which represents $0.24 \times 16^1 = 2.4_{16} = 2.25_{10}$.

2) Subtraction

Basically the same process as addition.
  Y = 411C0000
  X = 40800000

Perform $D = Y - X$.
Exponent of D = 41 = max(41, 40), and $e_y - e_x = 1$. Hence, shift $f_x$ right by 1 hex digit:

  fy:           1C0000
  - shifted fx: 080000
  --------------------
                140000

Therefore, D = 41140000, which represents $0.14 \times 16^1 = 1.4_{16} = 1.25_{10}$.
Note: $1.75_{10} - 0.5_{10} = 1.25_{10}$.

In the case of $X - Y$: since $X < Y$, perform $Y - X$ as before, and set the sign bit of D to 1.
Hence D = C1140000, which represents $-0.14 \times 16^1 = -1.4_{16} = -1.25_{10}$.

3) Multiplication (positive operands)

  $X = f_x \times b^{e_x}$, $Y = f_y \times b^{e_y}$
  $P = X \cdot Y = f_x f_y \times b^{e_x + e_y}$

If the exponents are in "excess-n" form, then the exponent of P must be adjusted to $e_x + e_y - n$.

Example:

  X: 40800000 ($0.5_{10}$)
  Y: 40800000

Then the exponent of P $= (e_x + e_y - 40)_{16} = 40_{16}$, and
$f_x \times f_y = 400000_{16}$ because $(0.8 \times 0.8)_{16} = 0.40_{16}$.
Therefore, P $= 40400000_{16}$, which represents $0.4 \times 16^0 = 0.4_{16} = 0.25_{10}$.

Multiplication may require post-normalization of the fraction, as in the following example:

  $X = 0.125_{10} = 0.2_{16}$
  X: 40200000
  Let Y: 40200000

Exponent of the product P $= (40 + 40 - 40)_{16} = 40_{16}$
$f_x \times f_y = 0.040000$, since $(0.2 \times 0.2)_{16} = 0.04_{16}$.
Therefore, P: 40040000 (unnormalized), which becomes 3F400000 (normalized).
Thus, P has the value $0.4 \times 16^{-1} = 0.04_{16} = 0.015625_{10}$, which is the correct result.

4) Division (positive operands)

  $X = f_x \times b^{e_x}$, $Y = f_y \times b^{e_y}$
  Quotient $Z = \frac{X}{Y} = \frac{f_x}{f_y} \times b^{e_x - e_y}$

Example, using the IBM/370 representation:

  $X = 5.0_{10} = 5.0_{16} = 0.5 \times 16^1$. Hence, X: 41500000
  $Y = 2.5_{10} = 2.8_{16} = 0.28 \times 16^1$. Hence, Y: 41280000

Since the exponents are in excess-64 form, the difference of exponents $(e_x - e_y)$ cancels out the excess value. This must be restored. Hence,

  $e = e_x - e_y + 40 = 41 - 41 + 40 = 40_{16}$
  $\frac{f_x}{f_y} = \left(\frac{50}{28}\right)_{16} = 2.0_{16} = 0.2 \times 16^1$

Therefore, re-adjust the exponent $e$ to $40 + 1 = 41$.
Z: 41200000, which represents $0.2 \times 16^1 = 2_{16} = 2_{10}$.

The same process works if the quotient is $< 1$. If $Z = \frac{Y}{X}$, then

  exponent of Z $= 41 - 41 + 40 = 40_{16}$
  $\frac{f_y}{f_x} = \left(\frac{28}{50}\right)_{16} = 0.8_{16}$

Hence, Z: 40800000, which represents $0.8_{16} = 0.5_{10}$.

Note: You should try some floating-point arithmetic using the IEEE representation or the representation shown in the textbook. This will help you understand the process better.
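As a starting point for such experiments, here is a minimal Python sketch of the IBM/370-style single-precision format used in the examples above (sign bit, 7-bit excess-64 exponent, six hex fraction digits). The names `ibm_encode`, `ibm_decode` and `ibm_add` are illustrative. The sketch makes some simplifying assumptions: the encoder truncates the fraction rather than rounding, the adder handles positive operands only, and exponent overflow and underflow are ignored.

```python
from fractions import Fraction

BIAS = 64      # excess-64 exponent
DIGITS = 6     # six hex digits = 24 fraction bits

def ibm_encode(value):
    """Encode a number as a 32-bit IBM/370-style word (fraction truncated)."""
    value = Fraction(value)
    if value == 0:
        return 0
    sign = 1 if value < 0 else 0
    mag, exp = abs(value), 0
    while mag >= 1:                  # scale down until the fraction is < 1
        mag /= 16
        exp += 1
    while mag < Fraction(1, 16):     # scale up until the first hex digit is non-zero
        mag *= 16
        exp -= 1
    frac = int(mag * 16**DIGITS)     # keep six hex digits, drop the rest
    return (sign << 31) | ((exp + BIAS) << 24) | frac

def ibm_decode(word):
    """Return the exact value of a 32-bit IBM/370-style word as a Fraction."""
    sign = -1 if word >> 31 else 1
    exp = ((word >> 24) & 0x7F) - BIAS
    frac = Fraction(word & 0xFFFFFF, 16**DIGITS)
    return sign * frac * Fraction(16) ** exp

def ibm_add(x, y):
    """Add two positive words following the five-step addition algorithm."""
    ex, fx = (x >> 24) & 0x7F, x & 0xFFFFFF
    ey, fy = (y >> 24) & 0x7F, y & 0xFFFFFF
    k = ex - ey
    if k >= 0:
        fy >>= 4 * k                 # shift right by k hex digits
    else:
        fx >>= 4 * -k
    f, e = fx + fy, max(ex, ey)
    if f > 0xFFFFFF:                 # carry out of the fraction: renormalize
        f >>= 4
        e += 1
    return (e << 24) | f

print(f"{ibm_encode('123.45'):08X}")             # 427B7333
print(f"{ibm_add(0x40800000, 0x411C0000):08X}")  # 41240000
print(float(ibm_decode(0x41240000)))             # 2.25
```

Passing the decimal value as a string (`'123.45'`) lets `Fraction` hold it exactly, so the hex digits come from the decimal number itself rather than from its nearest binary double. Subtraction, multiplication and division can be added along the same lines and checked against the worked examples.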