Your Federal Quarterly Tax Payments are due April 15th

# Floating-Point by guf14004

VIEWS: 59 PAGES: 20

• pg 1
```									                          Floating-Point

 Floating-point notation is a method for storing
approximations of real numbers over large ranges. Computers
use values in floating-point notation primarily for scientific
and engineering applications.

Converting numbers with “radix points” from various bases
into decimal:

ex.

(11011.0101)2     = 1 * 24 + 1 * 23 + 1 * 21 + 1 *20
+ 1 * 2-2 + 1 * 2-4

= 27.3125

ex.

(1B.5)16 =      1 * 161 + 11 * 160 + 5 * 16-1
=      16 + 11 + .3125
=      27.3125
Converting from Decimal to other bases:

 The integer part is converted in the normal method (repeated
division by 2)

 The fractional part is multiplied by the radix of the new
base. The integer portion of the resulting number becomes the
next digit in the fraction’s expansion, and the process is
repeated with the fraction portion replacing the old fraction.

ex.

(4.25)10 =

4 = 100
.25 =

Calculation        Integer        Fraction
0.25 * 2           0                     .5
0.5 * 2               1                       .0

so
.25 =     01

4.25 =   100.01
ex.

0.3 =     (0.0100 1100 1100 .......)2

Calculation             Integer         Fraction

0.3 * 2 = 0.6            0                         .6
0.6 * 2 = 1.2            1                         .2
0.2 * 2 = 0.4            0                  .4
0.4 * 2 = 0.8                       0
.8
0.8 * 2 = 1.6            1                              .6
0.6 * 2 = 1.2            1                         .2
0.2 * 2 = 0.4            0                  .4
0.4 * 2 = 0.8                       0
.8
0.8 * 2 = 1.6            1                              .6
0.6 * 2 = 1.2            1                         .2
.                   .                  .
.                   .                  .
Floating-Point Representation

 m  re

where
m - mantissa ( 0  m  1)
e - exponent

ex.

0.1101 23

 VAX allows 4 different formats for floating-point numbers,
all of which use base 2. These formats are:

1. Single precision ( F-Floating)
2. Double precision ( D-Floating)
3. G-Floating
4. H-Floating

 The most common one is the F-Floating.
Normalized Floating-Point Numbers

 A floating-point number is in normalized form if the first
digit after the radix point is nonzero, and the integer portion
of the number is 0

ex.

0.3454102

0.101123

 IEEE 754 floating-point numbers are always normalized

m       since the digit after the radix point is 1

   1/2  m  1
 Why use normalized form?

1. The leading digit after the radix point is always 1, so it can be
omitted and could be assumed hence, we can express b+1
significant bits with b bits.

2. Increased accuracy

0.110123 = 0.00110125

using normalized form makes the representation unique.
Biased Exponent

 Both the mantissa and the exponent are signed values(they
can be negative or positive). This means that two sign bits are
needed.

 In IEEE 754 standards, the mantissa is expressed in the
sign-magnitude form, and the exponent is expressed in the
biased form. This requires only one sign bit ( for the
mantissa)

 The floating-point representation uses n bits for the exponent
so 2n different values can be represented. In the biased
exponent representation one half of these 2n values are used
to represent negative numbers, and the other half are used to
represent positive numbers.

 Instead of storing the actual exponent, the biased exponent is
stored as follows:

Biased exponent = true exponent + 2(b - 1)

where b is the number of bits used for the exponent field.
ex.

In the VAX F-Floating data type there are 8 bits in the exponent
field b = 8, the biased is 128 ( 28-1 )

so the exponent = true exponent + 127

Bit Pattern         Value in Decimal           True Exponent
00000000                0                      special case
00000001                1                           -127
00000010                2                           -126
.                  .                                .
.                  .                                .
.                  .                                .
01111111                127                            -1
10000000                128                             0
10000001                129                             1
.                  .                                .
.                  .                                .
11111111                255                            127

 The bit pattern is what the machine stores for the true
exponent. (true + 128)

 On the VAX the Biased 00000000 is used as follows

- if the mantissa sign bit is 0 it represent 0.0
- if the mantissa sign bit is 1 not used
 With 8 bits we represent -127 to 127, which is similar to the
two’s complement representation. However with two’s
complement the negative numbers are larger that positive
numbers.

 To get the actual exponent from the biased exponent, the
machine will subtract the biased ( in the above example 128).
VAX Floating-Point Representations

1. Single Precision (F-Floating)
It requires 32-bit
15   14                         7 6           0
1st
word    S         b.exponent       m1
m2

where S = sign bit
0    positive
1    negative

m = .1m1m2
b.exponent = biased exponent
= true exponent + 128

so the number represented using F-Floating would be

.1m1m2  2p

where p = b.exponent - 128
ex.

0000 0000 0000 0000 1 100 00010      1010000

m2              S   b.exp          m1

b.exponent = 10000010
= 13010

so p = 130 - 128 = 2 (true exponent)

.1m1m2  2p

S = 1 (negative number)

 - .11010000  22
 - 11.01 = -3.2510
ex.

What floating-point number is represented by the following?

( 00004080 )16

0000 0000 0000 0000 0 10000001         0000000

m2            S    b.exp         m1

b.exponent = 10000001
= 12910
so p = 129 - 128 = 1( true exponent)

.1m1m2  2p

S = 0 (positive)
.10000000  21
= 1.010

ex.
Show the floating-point representation (F-type) of the following:

5/8 =     0.625

0.625     =    0.1012 (normalized)

.1m1m2  2p

S = 0 (positive)
m1 = 0100000
m2 = 0000 0000 0000 0000
b.exp = 0 + 128
= 128  1000 0000

0000 0000 0000 0000 0 10000000 0100000

 0000402016

2. Double Precision (D-Floating)
It requires 64-bit
15   14                          7 6              0
1st
word     S           b.exponent      m1
m2
m3
m4

where S = sign bit
0    positive
1    negative

m = .1m1m2m3m4
b.exponent = biased exponent
= true exponent + 128

so the number represented using D-Floating would be

.1m1m2m3m4  2p

where p = b.exponent - 128

 The range of values represented by this type is the same as in
F type, however the number of significant digits is increased.
3. G-Floating

It requires 64-bit
15   14                                4 3   0
1st
word     S         b.exponent         m1
m2
m3
m4

where S = sign bit
0    positive
1    negative

m = .1m1m2m3m4
b.exponent = biased exponent
= true exponent + 210

so the number represented using G-Floating would be

.1m1m2m3m4  2p

where p = b.exponent - 1024

4. H-Floating

It requires 8 words
15   14                                      0
1st
word    S               b.exponent
m1
.
.
.
m7

where S = sign bit
0    positive
1    negative

m = .1m1m2m3m4m5m6m7
b.exponent = biased exponent
= true exponent + 214

so the number represented using H-Floating would be

.1m1m2m3m4  2p

where p = b.exponent - 214

Allocating Memory for Floating-point Values
Initialized memory

.x_FLOATING         list_of_floating_point_constants

where x = F, D, G, H

or use
.FLOAT
.DOUBLE

ex.

X:    .FLOAT         3.5
Y:    .DOUBLE        4.121

- The first directive allocates 4 bytes (F type) and initializes
to 3.5 represented in F-Floating type.

- The second directive allocates 8 bytes ( D type) and
initializes to 4.121 represented in D-Floating type.

Uninitialized Storage
.BLKF    .BLKD          .BLKG          .BLKH

All allocate as many bytes as the type specified requires and
initialize to zero.

ex.

REAL:    .BLKF     3

Allocates 3 F type (3 longwords) and initializes them to zero

Floating-Point Operations
 All the arithmetic instructions, special purpose instructions,
convert instructions, move instructions, compare instruction,
and test instruction operate on floating point types in the
normal way as before

SUBx2     op1, op2
SUBx3     op1, op2, op3

MULx2 op1, op2
MULx3 op1, op2, op3

DIVx2     op1, op2
DIVx3     op1, op2, op3

CLRx      op
MOVx      source, dest
MOVAx     source, dest
MNEGx     source, dest

TSTx      op
CMPx      op1, op2

where x = F, D, G, H

Floating-Point Convert Instructions
CVTxy         source, dest

x, y = F, D, G, H, B, W, L
xy
xy  DG
xy  GD

 Convert Round to Longword

CVTRxL        source, dest

x = F, D, G, H

 Convert Truncate to Longword

CVTxL         source, dest

x = F, D, G, H

```
To top