# Number Representation: Fixed and Floating Point


```					               Number Representation
Fixed and Floating Point
•   No Method Capable of Representing ALL Real
Numbers Using Finite Register Lengths
•   Must Use Approximations to Represent Values
•   Concentrate on Two Forms:
– Fixed Point
– Floating Point
– Others Include:
    •   Rational Number Systems – use ratios of integers
    •   Logarithmic Number Systems – use signs and logarithms of values
Fixed Versus Floating Point

•   In Fixed Point, Any Two Consecutive Values
Differ by 1 unit in the last place (ulp)
– Equal Spacing Between Numbers
•   Floating Point Values Use Two Multi-Bit Words
– Mantissa
– Exponent
•   Both Forms Must be Capable of Representing
Signed Quantities
•   Fixed Point Values CAN be Used to Represent
Fractional Quantities
Floating Point Characteristics
• Total Number of Representations = Total Bit Strings
– For n-bit Register we have 2^n
• Range of Value is Larger than Fixed Point
• Precision of Value is Smaller
• Distance Between Two Consecutive Values Increases with Magnitude
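• Spacing Example (a sketch, assuming the IEEE 32-bit float format with a 23-bit mantissa):
– Values in [1, 2):          consecutive values differ by 2^-23
– Values in [2, 4):          consecutive values differ by 2^-22
– Values in [2^23, 2^24):    consecutive values differ by 1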
Floating Point

s      e                     m

s – Sign Bit (signed magnitude)
e – Exponent (stored in biased form)
m – Mantissa (significand or fraction); m in [0, 1), m_MAX = 1 - ulp

Value = (-1)^s × 1.m × 2^(e - BIAS)     (the leading 1 is the hidden bit)

float – BIAS = 127 (32 bits: 1 for s, 8 for e, 23 for m)
double – BIAS = 1023 (64 bits: 1 for s, 11 for e, 52 for m)

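Sign of Exponent is Complement of its MSb (stored e = BIAS means exponent 0)
For a k-bit exponent with bias 2^(k-1), adding/subtracting the bias is just complementation of the MSb;
the IEEE bias of 2^(k-1) - 1 needs an additional adjustment by 1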
Floating Point Example

double = 00000000 bfe80000

Little Endian – MSW has Higher Address

s        e                                       m
1   011 1111 1110   1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

s = 1; e = 1022; m = 0.5

Value = (-1)11.5 2(1022-1023)

Value = -(1.5)(0.5) = -0.75
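
The same value as a 32-bit float (a sketch following the same steps, assuming BIAS = 127):

float = bf400000

s   e           m
1   0111 1110   100 0000 0000 0000 0000 0000

s = 1; e = 126; m = 0.5

Value = (-1)^1 × 1.5 × 2^(126 - 127) = -(1.5)(0.5) = -0.75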
Floating Point Normalization
• Redundant Representations are Possible!

0.110101 2101  0.01101 2110  0.01101 2111

• Hidden Bit Helps
• Out of All Possible Representations, Choose the One With a Nonzero Most Significant Digit in the Significand
• This is Normalization
• After Performing Arithmetic, Renormalization May
Need to be Accomplished
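• Renormalization Example (a sketch with 4-bit binary significands in the 0.1xxx normalized form, exponents in decimal; values chosen for illustration):
– 0.1101 × 2^3 + 0.1011 × 2^3 = 1.1000 × 2^3     (sum falls outside the normalized range)
– Shift right one place and increment the exponent: 0.1100 × 2^4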
Floating Point Special Numbers

Value v when exponent e and fraction f are
special values (IEEE standard)
Note: NaN = Not a Number
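
A sketch of the standard IEEE 754 encodings, where e_MAX denotes an exponent field of all 1s (notation assumed here):

e = 0,        f = 0      v = (-1)^s × 0                        (signed zero)
e = 0,        f ≠ 0      v = (-1)^s × 0.f × 2^(1 - BIAS)       (denormalized)
0 < e < e_MAX            v = (-1)^s × 1.f × 2^(e - BIAS)       (normalized)
e = e_MAX,    f = 0      v = (-1)^s × ∞
e = e_MAX,    f ≠ 0      v = NaN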
IEEE/ANSI 754/854 Standard
Denormalized Numbers

Denormals
Operations – Internal Precision
Floating Point Multiplication/Division
Conversions and Roundings
Exceptions
Rounding Schemes

Signed Magnitude   Two’s Complement
Round to Nearest (Signed Magnitude)
Round to Nearest Even/Odd

Round to Nearest Even   Round to Nearest Odd (R*)
Jamming/von Neumann Rounding
ROM Rounding
Rounding
Rounding Examples

Round Towards +∞      Downward Directed Rounding
Floating Point Operations
Operand Packing/Unpacking
Other Key Parts of FP Add/Sub Unit
Pre-Shifting
Four-stage Combinational Shifter

Pre-shifts Operand by 0 to 15 Bits
Leading Zeros/Ones – Counting vs. Prediction
Guard Digits

What is the smallest number of extra digits
needed for rounding? post-normalization?

• Multiplication – Double Length Result
• Add/Sub w/ differing exp. – Can have Double
Length Result

• FP Unit Provides a Single-Length Result
Significand Ranges
• Assume Significand M ∈ (0, 1 - ulp]
• Then Normalized M (radix β; β = 2 for binary) ranges as:

M_min = 1/β               M_max = 1 - ulp

• Multiplication: prod = M1 × M2

1/β^2 ≤ prod ≤ 1 - 2·ulp + ulp^2 < 1

• For postnormalization need at most one shift
left to get:

1/β ≤ prod < 1
Significand Ranges (cont)
• Division: quot = M1 / M2

1/β < quot ≤ β(1 - ulp) < β

• Need at most one shift right to get:

1/β ≤ quot < 1

• Conclusion:
– 1 Extra Digit Needed for Postnormalization
– 1 Extra Digit Needed for Round-to-Nearest
• 2 Extra Digits Needed
– G - guard
– R - round
“Sticky Bit” in std754

• Round-to-Nearest-Even Requires 1
Extra Bit
– The “sticky bit”, S

• Turns out to be Logical-OR of Other