Your Federal Quarterly Tax Payments are due April 15th Get Help Now >>

Floating-Point by guf14004

VIEWS: 59 PAGES: 20

									                          Floating-Point


 Floating-point notation is a method for storing
  approximations of real numbers over large ranges. Computers
  use values in floating-point notation primarily for scientific
  and engineering applications.



Converting numbers with “radix points” from various bases
into decimal:


ex.

      (11011.0101)2     = 1 * 24 + 1 * 23 + 1 * 21 + 1 *20
                             + 1 * 2-2 + 1 * 2-4

                         = 27.3125



ex.

      (1B.5)16 =      1 * 161 + 11 * 160 + 5 * 16-1
               =      16 + 11 + .3125
               =      27.3125
Converting from Decimal to other bases:


 The integer part is converted in the normal method (repeated
  division by 2)

 The fractional part is multiplied by the radix of the new
  base. The integer portion of the resulting number becomes the
  next digit in the fraction’s expansion, and the process is
  repeated with the fraction portion replacing the old fraction.


ex.

      (4.25)10 =

4 = 100
.25 =

          Calculation        Integer        Fraction
          0.25 * 2           0                     .5
          0.5 * 2               1                       .0

so
.25 =     01


      4.25 =   100.01
ex.


      0.3 =     (0.0100 1100 1100 .......)2



      Calculation             Integer         Fraction

      0.3 * 2 = 0.6            0                         .6
      0.6 * 2 = 1.2            1                         .2
      0.2 * 2 = 0.4            0                  .4
      0.4 * 2 = 0.8                       0
      .8
      0.8 * 2 = 1.6            1                              .6
      0.6 * 2 = 1.2            1                         .2
      0.2 * 2 = 0.4            0                  .4
      0.4 * 2 = 0.8                       0
      .8
      0.8 * 2 = 1.6            1                              .6
      0.6 * 2 = 1.2            1                         .2
           .                   .                  .
           .                   .                  .
Floating-Point Representation



               m  re

where
         m - mantissa ( 0  m  1)
         r - Radix (base)
         e - exponent


ex.

         0.1101 23


 VAX allows 4 different formats for floating-point numbers,
  all of which use base 2. These formats are:

1. Single precision ( F-Floating)
2. Double precision ( D-Floating)
3. G-Floating
4. H-Floating


 The most common one is the F-Floating.
Normalized Floating-Point Numbers


 A floating-point number is in normalized form if the first
  digit after the radix point is nonzero, and the integer portion
  of the number is 0

ex.

          0.3454102

          0.101123




 IEEE 754 floating-point numbers are always normalized

          m       since the digit after the radix point is 1

         1/2  m  1
 Why use normalized form?

1. The leading digit after the radix point is always 1, so it can be
   omitted and could be assumed hence, we can express b+1
   significant bits with b bits.


2. Increased accuracy

     0.110123 = 0.00110125


using normalized form makes the representation unique.
Biased Exponent


 Both the mantissa and the exponent are signed values(they
  can be negative or positive). This means that two sign bits are
  needed.

 In IEEE 754 standards, the mantissa is expressed in the
  sign-magnitude form, and the exponent is expressed in the
  biased form. This requires only one sign bit ( for the
  mantissa)

 The floating-point representation uses n bits for the exponent
  so 2n different values can be represented. In the biased
  exponent representation one half of these 2n values are used
  to represent negative numbers, and the other half are used to
  represent positive numbers.

 Instead of storing the actual exponent, the biased exponent is
  stored as follows:

          Biased exponent = true exponent + 2(b - 1)


where b is the number of bits used for the exponent field.
ex.

In the VAX F-Floating data type there are 8 bits in the exponent
field b = 8, the biased is 128 ( 28-1 )

so the exponent = true exponent + 127

Bit Pattern         Value in Decimal           True Exponent
00000000                0                      special case
00000001                1                           -127
00000010                2                           -126
     .                  .                                .
     .                  .                                .
     .                  .                                .
01111111                127                            -1
10000000                128                             0
10000001                129                             1
     .                  .                                .
     .                  .                                .
11111111                255                            127


 The bit pattern is what the machine stores for the true
  exponent. (true + 128)

 On the VAX the Biased 00000000 is used as follows

          - if the mantissa sign bit is 0 it represent 0.0
          - if the mantissa sign bit is 1 not used
 With 8 bits we represent -127 to 127, which is similar to the
  two’s complement representation. However with two’s
  complement the negative numbers are larger that positive
  numbers.

 To get the actual exponent from the biased exponent, the
  machine will subtract the biased ( in the above example 128).
VAX Floating-Point Representations


1. Single Precision (F-Floating)
It requires 32-bit
        15   14                         7 6           0
1st
word    S         b.exponent       m1
                    m2



where S = sign bit
         0    positive
         1    negative

m = .1m1m2
b.exponent = biased exponent
           = true exponent + 128

so the number represented using F-Floating would be

         .1m1m2  2p


where p = b.exponent - 128
ex.

0000 0000 0000 0000 1 100 00010      1010000

      m2              S   b.exp          m1


b.exponent = 10000010
                = 13010

  so p = 130 - 128 = 2 (true exponent)


           .1m1m2  2p


S = 1 (negative number)


 - .11010000  22
 - 11.01 = -3.2510
ex.

What floating-point number is represented by the following?

( 00004080 )16

0000 0000 0000 0000 0 10000001         0000000

      m2            S    b.exp         m1


b.exponent = 10000001
            = 12910
so p = 129 - 128 = 1( true exponent)


           .1m1m2  2p

S = 0 (positive)
.10000000  21
 = 1.010




ex.
Show the floating-point representation (F-type) of the following:


5/8 =     0.625

0.625     =    0.1012 (normalized)

     .1m1m2  2p


S = 0 (positive)
m1 = 0100000
m2 = 0000 0000 0000 0000
 b.exp = 0 + 128
     = 128  1000 0000


0000 0000 0000 0000 0 10000000 0100000


 0000402016




2. Double Precision (D-Floating)
It requires 64-bit
         15   14                          7 6              0
1st
word     S           b.exponent      m1
                       m2
                       m3
                       m4



where S = sign bit
         0    positive
         1    negative

m = .1m1m2m3m4
b.exponent = biased exponent
           = true exponent + 128

so the number represented using D-Floating would be

          .1m1m2m3m4  2p

where p = b.exponent - 128


 The range of values represented by this type is the same as in
  F type, however the number of significant digits is increased.
3. G-Floating

It requires 64-bit
         15   14                                4 3   0
1st
word     S         b.exponent         m1
                     m2
                     m3
                     m4



where S = sign bit
         0    positive
         1    negative

m = .1m1m2m3m4
b.exponent = biased exponent
           = true exponent + 210

so the number represented using G-Floating would be

          .1m1m2m3m4  2p

where p = b.exponent - 1024




4. H-Floating

It requires 8 words
         15   14                                      0
1st
word    S               b.exponent
                   m1
                   .
                   .
                   .
                   m7




where S = sign bit
         0    positive
         1    negative

m = .1m1m2m3m4m5m6m7
b.exponent = biased exponent
           = true exponent + 214

so the number represented using H-Floating would be

         .1m1m2m3m4  2p

where p = b.exponent - 214


Allocating Memory for Floating-point Values
Initialized memory


      .x_FLOATING         list_of_floating_point_constants

where x = F, D, G, H

or use
          .FLOAT
          .DOUBLE


ex.

X:    .FLOAT         3.5
Y:    .DOUBLE        4.121


- The first directive allocates 4 bytes (F type) and initializes
   to 3.5 represented in F-Floating type.

- The second directive allocates 8 bytes ( D type) and
   initializes to 4.121 represented in D-Floating type.




Uninitialized Storage
      .BLKF    .BLKD          .BLKG          .BLKH


All allocate as many bytes as the type specified requires and
initialize to zero.



ex.

      REAL:    .BLKF     3

Allocates 3 F type (3 longwords) and initializes them to zero




Floating-Point Operations
 All the arithmetic instructions, special purpose instructions,
  convert instructions, move instructions, compare instruction,
  and test instruction operate on floating point types in the
  normal way as before

ADDx2 op1, op2
ADDx3 op1, op2, op3

SUBx2     op1, op2
SUBx3     op1, op2, op3

MULx2 op1, op2
MULx3 op1, op2, op3

DIVx2     op1, op2
DIVx3     op1, op2, op3

CLRx      op
MOVx      source, dest
MOVAx     source, dest
MNEGx     source, dest

TSTx      op
CMPx      op1, op2

where x = F, D, G, H

Floating-Point Convert Instructions
    CVTxy         source, dest

    x, y = F, D, G, H, B, W, L
    xy
    xy  DG
    xy  GD


 Convert Round to Longword


    CVTRxL        source, dest

x = F, D, G, H


 Convert Truncate to Longword


    CVTxL         source, dest

x = F, D, G, H

								
To top