					CSE 246: Computer Arithmetic
Algorithms and Hardware Design
Fall 2006
Lecture 8: Division



      Instructor:
      Prof. Chung-Kuan Cheng
Topics:


     Radix-4 SRT Division
     Division by a Constant
     Division by a Repeated Multiplication




Project Update
 Come in to speak briefly about the
  final project
     Status Update
     2:30 – 3:00 p.m.
     Tuesday or Thursday




Radix-4 SRT Division
 $4s_{j-1} = q_j d + s_j$, where
     $q_j$ is in $[-2, 2]$ and $s_{j-1}$ is in $[-hd, +hd]$
     $h$ is less than or equal to 2/3
     Therefore, $s_{j-1}$ is in $[-2d/3, 2d/3]$
     And $4s_{j-1}$ is in $[-8d/3, 8d/3]$
           $s$ shifts to the left by 2 bits
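
The recurrence above can be sketched in Python with exact rational arithmetic. This is only an illustration, not the hardware selection logic: the function name `radix4_srt` is hypothetical, and the digit is chosen by rounding $4s_{j-1}/d$ to the nearest integer and clamping it to $[-2, 2]$, an assumption that stands in for the truncated table lookup a real SRT divider would use.

```python
from fractions import Fraction


def radix4_srt(z, d, steps):
    """Run s_j = 4*s_{j-1} - q_j*d for `steps` digits q_j in [-2, 2].

    Returns (digits, s) where s is the final partial remainder.
    Assumes |z| <= 2d/3 so the first shifted remainder is in range.
    """
    s, d = Fraction(z), Fraction(d)
    bound = Fraction(2, 3) * d
    if abs(s) > bound:
        raise ValueError("need |z| <= 2d/3")
    digits = []
    for _ in range(steps):
        shifted = 4 * s                            # shift left by 2 bits
        q = max(-2, min(2, round(shifted / d)))    # assumed selection rule
        s = shifted - q * d
        assert abs(s) <= bound                     # remainder stays in [-2d/3, 2d/3]
        digits.append(q)
    return digits, s


z, d = Fraction(3, 10), Fraction(7, 10)
digits, s = radix4_srt(z, d, 8)
quotient = sum(Fraction(q, 4 ** (j + 1)) for j, q in enumerate(digits))
print(digits, float(quotient), float(z / d))
```

The digits form a signed radix-4 fraction, and the identity $z = d \sum_j q_j 4^{-j} + 4^{-n} s_n$ holds exactly after $n$ steps.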




Radix-4 SRT Division
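[Figure: p-d plot for radix-4 SRT division. Vertical axis: $4s_{j-1}$, marked 0.0 to 11.0 in binary. Horizontal axis: $d$, marked .1, .101, .110, .111, 1.00. Boundary lines: $-2d/3$, $d/3$, $2d/3$, $4d/3$, $5d/3$, $8d/3$, separating the regions $q_j = 0$, $q_j = 1$, $q_j = 2$. Anything above $8d/3$ goes against our assumption and is therefore the infeasible region.]

 In the overlap regions of $q_j$ either adjacent digit may be chosen
  and the recursion still holds. The gap defines the
  precision needed for carry-save addition.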
Radix-4 SRT Division
 The value of $q_j$ determines the range of $4s_{j-1}$
  it governs: $[(q_j - 2/3)d, (q_j + 2/3)d]$

 For example, $q_j = 1$ (in units of $d$)
     1 + 2/3 = 5/3
     1 – 2/3 = 1/3
     The range is $d/3$ to $5d/3$
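
The same calculation for every digit can be written as a short sketch. The helper names `digit_range` and `overlap` are hypothetical, and all values are in units of $d$ with $h = 2/3$ assumed.

```python
from fractions import Fraction

H = Fraction(2, 3)  # h from the slides, assumed to be exactly 2/3


def digit_range(q):
    """Range of 4*s_{j-1} (in units of d) for which digit q keeps |s_j| <= h*d."""
    return (q - H, q + H)


def overlap(q_low, q_high):
    """Interval (in units of d) where both digits are valid, or None if disjoint."""
    lo = digit_range(q_high)[0]
    hi = digit_range(q_low)[1]
    return (lo, hi) if lo <= hi else None


for q in range(-2, 3):
    print(q, digit_range(q))
print("q=1 and q=2 overlap on", overlap(1, 2))
```

Adjacent digits overlap on an interval of width $d/3$, which is the freedom the selection logic can exploit.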




Division by a Constant
 Multiplication takes O(log n) time, but division
  is linear in n, which is much slower
     Try to convert division to multiplication


 Property: given an odd number $d$,
   there exists $m$ such that $d \cdot m = 2^n - 1$

 Ex.
     d = 3,  m = 5:    $3 \cdot 5 = 2^4 - 1$
     d = 7,  m = 9:    $7 \cdot 9 = 2^6 - 1$
     d = 11, m = 93:   $11 \cdot 93 = 2^{10} - 1$
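
A short sketch can check the property and the three examples. The function names `multiplier` and `smallest_exponent` are hypothetical. Note that the slide's exponents for d = 3 and d = 7 are valid but not the smallest possible ones.

```python
def multiplier(d, n):
    """Return m with d*m == 2**n - 1, or None if d does not divide 2**n - 1."""
    m, rem = divmod(2 ** n - 1, d)
    return m if rem == 0 else None


def smallest_exponent(d):
    """Smallest n >= 1 such that d divides 2**n - 1 (d positive and odd)."""
    if d <= 0 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    n, power = 1, 2 % d
    while power != 1 % d:          # stop when 2**n is congruent to 1 mod d
        power = (power * 2) % d
        n += 1
    return n


for d, n in [(3, 4), (7, 6), (11, 10)]:
    print(d, n, multiplier(d, n), "smallest n:", smallest_exponent(d))
```

The loop always terminates because 2 is invertible modulo any odd d, so some power of 2 is congruent to 1.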
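Division by a Constant
 $1/d = m/(2^n - 1) = \frac{m}{2^n} \cdot \frac{1}{1 - 2^{-n}} = \frac{m}{2^n}(1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n})\cdots$
 $1/(1 - r) = 1 + r + r^2 + r^3 + \cdots = (1 + r)(1 + r^2)(1 + r^4)(1 + r^8)\cdots$
 Example
     $z/7 = zm/(2^n - 1)$, with $m = 9$, $n = 6$
     $= \frac{9z}{2^6} \cdot \frac{1}{1 - 2^{-6}} = \frac{9z}{2^6}(1 + 2^{-6})(1 + 2^{-12})(1 + 2^{-24})\cdots$
     log(n/6) operations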
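
How quickly the product converges can be checked with exact fractions. This is a sketch only: the function name `reciprocal_by_factors` is hypothetical, and a hardware version would use shifts and adds on fixed-width words rather than rational numbers.

```python
from fractions import Fraction


def reciprocal_by_factors(m, n, k):
    """Approximate 1/d = m/(2**n - 1) using the first k factors (1 + 2**(-n * 2**i))."""
    r = Fraction(1, 2 ** n)
    approx = Fraction(m, 2 ** n)
    for i in range(k):
        approx *= 1 + r ** (2 ** i)   # each factor is one shift and one add
    return approx


for k in range(4):
    error = Fraction(1, 7) - reciprocal_by_factors(9, 6, k)
    print(k, float(error))
```

With k factors the relative error is $2^{-n 2^k}$, so each extra factor doubles the number of correct bits.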
Division by Reciprocation
 Find $1/d$ with iteration
 Newton-Raphson algorithm
  $x_{i+1} = x_i - f(x_i)/f'(x_i)$
 Set $f(x) = 1/x - d$, $(1/2 \le d < 1)$
  We have $f'(x) = -1/x^2$
 Thus $x_{i+1} = x_i(2 - x_i d)$
 Let $e_i = 1/d - x_i$
  We have $e_{i+1} = 1/d - x_{i+1} = 1/d - x_i(2 - x_i d)$
                 $= d(1/d - x_i)^2 = d e_i^2$
 The convergence rate is quadratic.
 For k iterations, it takes 2k multiplications

Division by Reciprocation
   $z/d = 3/0.7$
   $x_0 = 4(\sqrt{3} - 1) - 2d = 2.9282 - 2d = 1.5282$
   $e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286$
   $x_1 = x_0(2 - x_0 d) = 1.4216233$
   $e_1 = 1/d - x_1 = 1/0.7 - 1.4216233 = 0.0069481$
   $x_2 = x_1(2 - x_1 d) = 1.4285376$
   $e_2 = 1/d - x_2 = 1/0.7 - 1.4285376 = 0.0000338$
   $x_3 = x_2(2 - x_2 d) = 1.4285714278$
   $e_3 = 1/d - x_3 = 1/0.7 - 1.4285714278 = 0.0000000008$
   The convergence rate is quadratic.
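
The iteration is easy to reproduce in a few lines. This is a minimal sketch in double-precision floats; the function name `newton_reciprocal` is hypothetical, and the starting value is the one used in the example, $x_0 = 4(\sqrt{3} - 1) - 2d$, computed without rounding, so the later digits can differ slightly from a hand calculation that starts from 1.5282.

```python
from math import sqrt


def newton_reciprocal(d, iterations):
    """Return [x_0, x_1, ..., x_k] for x_{i+1} = x_i * (2 - x_i * d)."""
    x = 4 * (sqrt(3) - 1) - 2 * d    # linear initial estimate of 1/d
    history = [x]
    for _ in range(iterations):
        x = x * (2 - x * d)          # two multiplications per iteration
        history.append(x)
    return history


d = 0.7
for i, x in enumerate(newton_reciprocal(d, 3)):
    print(i, x, 1 / d - x)
print("z/d =", 3 * newton_reciprocal(d, 3)[-1])
```

After the first step every error is non-negative, because $e_{i+1} = d e_i^2$, so the iterates approach $1/d$ from below.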


Division by Recursive Multiplication
 $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$   eq(a)
 Let $1/2 \le d < 1$
 It takes 2k multiplications for eq(a)
 We also need k operations to find $x_i$




Division by a Repeated Multiplication
 $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$
 Let $1/2 \le d < 1$
 Set $d_0 = d$, $x_k = 2 - d_k$
  1. $d_1 = d x_0 = d(2 - d) = 1 - (1 - d)^2$
  2. $d_{k+1} = d_k x_k = d_k(2 - d_k) = 1 - (1 - d_k)^2$
  3. $1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^{2^{k+1}}$

  quadratic convergence
 For k-bit operands, we need 2m-1
  multiplications
     m 2's complement operations
     m = ceiling(log2 k), with log2 m extra bits for
      precision

Division by a Repeated Multiplication
 $q = z/d = 3/0.7 = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$
 $d_0 = d = 0.7$, $x_k = 2 - d_k$, $d_{k+1} = d_k x_k$
  1. $x_0 = 2 - d_0 = 1.3$,
     $d_1 = d_0 x_0 = 0.7 \times 1.3 = 0.91$
  2. $x_1 = 2 - d_1 = 1.09$,
     $d_2 = d_1 x_1 = 0.91 \times 1.09 = 0.9919$
  3. $x_2 = 2 - d_2 = 1.0081$,
     $d_3 = d_2 x_2 = 0.9919 \times 1.0081 = 0.9999343$
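
The example can be carried through to the quotient with a short sketch. The function name `divide_by_repeated_multiplication` is hypothetical, and exact fractions are used so that the intermediate values can be compared digit for digit; a real unit would truncate each product.

```python
from fractions import Fraction


def divide_by_repeated_multiplication(z, d, k):
    """Multiply z and d by x_i = 2 - d_i for k steps; returns (q_k, d_k).

    As d_k converges to 1, q_k = z * x_0 * ... * x_{k-1} converges to z/d.
    """
    q, dk = Fraction(z), Fraction(d)
    for _ in range(k):
        xk = 2 - dk      # 2's complement of d_k
        q *= xk          # numerator update
        dk *= xk         # denominator update (not needed on the last step)
    return q, dk


for k in range(1, 4):
    q, dk = divide_by_repeated_multiplication(3, Fraction(7, 10), k)
    print(k, float(dk), float(q))
```

Since $q_k = (z/d) d_k$, the relative error of the quotient equals $1 - d_k = (1 - d)^{2^k}$.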



Division Methods
 Iteration
 Memory
 Arithmetic




Division – Iteration effort
 Pencil-and-paper method ($A = QB + 2^{-n}R$ and $R < B$):
    1 bit of partial quotient per iteration, n iterations
 $A = 0.1001$, $B = 0.1010$; $Q = A/B$
    $Q_i$: partial quotient
    $R_i$: partial remainder
    $R_i = R_{i-1} - B \times Q_i$

                 0.1 1 1 0
      1010 ) 1001             R0 = A
              1010            Q1 = 0.1
              1000            R1
               1010           Q2 = 0.01
               0110           R2
                1010          Q3 = 0.001
                0010          R3
                 0000         Q4 = 0.0000
                 0100         R4

    Q = Q1 + Q2 + Q3 + Q4 = 0.1110
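
A bit-serial version of the pencil-and-paper method can be sketched as follows. The function name `long_division` is hypothetical, and the operands are passed as the integers formed by their n fraction bits, an assumption made to keep the example exact.

```python
def long_division(a, b, n):
    """Divide A = a/2**n by B = b/2**n, one quotient bit per iteration.

    Returns (q, r) with a * 2**n == q * b + r and 0 <= r < b, which is
    A = Q*B + 2**-n * R for Q = q/2**n and R = r/2**n.
    """
    if not 0 <= a < b:
        raise ValueError("need 0 <= a < b so that the quotient is a fraction")
    q, r = 0, a
    for _ in range(n):
        r <<= 1            # shift the partial remainder left by one bit
        q <<= 1
        if r >= b:         # trial subtraction succeeds: quotient bit is 1
            r -= b
            q |= 1
    return q, r


q, r = long_division(0b1001, 0b1010, 4)
print(f"Q = 0.{q:04b}, R = 0.{r:04b}")
```

Each iteration costs one comparison and at most one subtraction, so an n-bit quotient needs n iterations.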


Division – Memory effort
 A lookup table is the simplest way to
  obtain multiple partial quotient bits in
  each iteration.
 SRT method: a lookup table stores
  m-bit partial quotients decided by m
  bits of partial remainder and m bits of
  divisor.
           Table size: $2^{2m} \times m$ bits
 The SRT method is limited by the memory
  wall.
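
A quick calculation shows how fast the table grows with m. The function name `srt_table_bits` is hypothetical, and the formula is the one stated above: $2^{2m}$ entries of m bits each.

```python
def srt_table_bits(m):
    """Size in bits of a table indexed by m remainder bits and m divisor bits,
    storing an m-bit partial quotient per entry."""
    return 2 ** (2 * m) * m


for m in (2, 4, 8, 16):
    bits = srt_table_bits(m)
    print(f"m = {m:2d}: {bits} bits = {bits / 8 / 1024:.3f} KiB")
```

Doubling m squares the number of entries, which is why the table cannot simply be widened to retire more quotient bits per iteration.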
Division – Arithmetic effort
 Partial quotient is calculated by arithmetic
  functions.
 Prescaling: with $E \approx 1/d$,
     $z/d = (z \times E)/(d \times E) = z'/d'$
     $Q_i = R_i \times E$
 Taylor expansion: with $d = d_h + d_l$,
     $E \approx 1/d = 1/d_h - d_l (1/d_h)^2 + d_l^2 (1/d_h)^3 - \cdots$
     $Q_i = R_i \times E$
 Series expansion: with $d = 1 - X$,
     $E = 1/d = 1 + X + X^2 + X^3 + \cdots = (1 + X)(1 + X^2)(1 + X^4)\cdots$
     $Q_i = R_i \times E$
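
The two expansions can be compared numerically with exact fractions. This is a sketch under stated assumptions: the function names are hypothetical, $d_h$ and $d_l$ are taken to be the high and low parts of the divisor with $d = d_h + d_l$, and the sample divisor 0.1100,1011 is split after its first four bits.

```python
from fractions import Fraction


def taylor_reciprocal(d_h, d_l, terms):
    """Partial sum 1/d_h - d_l/d_h**2 + d_l**2/d_h**3 - ... with `terms` terms."""
    inv = 1 / Fraction(d_h)
    return sum((-Fraction(d_l)) ** k * inv ** (k + 1) for k in range(terms))


def series_reciprocal(d, factors):
    """Product (1 + X)(1 + X**2)(1 + X**4)... with X = 1 - d."""
    x = 1 - Fraction(d)
    e = Fraction(1)
    for i in range(factors):
        e *= 1 + x ** (2 ** i)
    return e


d_h, d_l = Fraction(0b1100, 2 ** 4), Fraction(0b1011, 2 ** 8)
d = d_h + d_l
for n in range(1, 4):
    print(n,
          float(1 / d - taylor_reciprocal(d_h, d_l, n)),
          float(1 / d - series_reciprocal(d, n)))
```

The Taylor error shrinks by a factor of $d_l/d_h$ per term, while the series error is squared with every factor, so the series wins once $X$ is small.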
Division – Solution space

[Figure: solution space of division methods, spanned by Iteration Effort, Memory Effort, and Arithmetic Effort. Methods placed in it: Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion. Other labels: "Low area", "Low latency", "Memory Wall", "Our target".]

    Modern FPGAs contain plenty of
     memory and built-in multipliers,
     which enable high-performance
     dividers.
Division – PST algorithm
 Utilize the power of series expansion,
  but it needs a good starting point.
     $d' = 1 - X$, $E = 1/d' \approx 1 + X$
     $d' \times E = (1 - X)(1 + X) = 1 - X^2$
 Prescaling provides a scaled divisor
  close to 1.
     $E \approx 1/d$, $z/d = (z \times E)/(d \times E) = z'/d'$
 0-order Taylor expansion iterates to
  reach the final quotient: $Q_i = R_i \times E$




Division – PST algorithm

 Algorithm
     $E_0 = \mathrm{Table}(d(m)) \approx 1/d$
     $z_1 = z \times E_0$; $d_1 = d \times E_0$
     $E_1 = (2 - d_1) \approx \mathrm{INV}(d_1(2m))$
     $Q_i = R_{i-1} \times E_1$, with $R_0 = z_1$
     $R_i = R_{i-1} - Q_i \times d_1$
     $Q = Q + Q_i$

 Example
     z = 0.1011,0110
     d = 0.1100,1011
     d(m) = 0.1100, so $E_0$ = 1.0011
     $z_1 = z \times E_0$ = 0.1101,1000,0010
     $d_1 = d \times E_0$ = 0.1111,0001,0001
     $E_1 = \mathrm{INV}(d_1(2m))$ = 1.0000,1110
     $Q_1 = z_1 \times E_1$ = 0.1110,0011
     $R_1 = z_1 - Q_1 \times d_1$ = 0.0000,0010,0101,1110,1101
     $Q_2 = R_1 \times E_1$ = 0.1001,1111
     $R_2 = R_1 - Q_2 \times d_1$ = 0.0000,0001,1111,1011,0001

     Q = 0.1110,0011 +
         0.0000,0010,0111,11 =
         0.1110,0101,0111,11
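
The worked example can be replayed with exact fractions to check each binary value. Several details here are assumptions read off the example rather than stated by the slides: $E_0$ is passed in instead of being looked up, INV is taken to be the bitwise complement of the 2m-bit fraction (that is, $2 - d_1(2m) - 2^{-2m}$), the first remainder is $R_0 = z_1$, and the partial quotients are truncated to 8 and 14 fraction bits. The names `trunc` and `pst_divide` are hypothetical.

```python
from fractions import Fraction
from math import floor


def trunc(x, bits):
    """Truncate a non-negative fraction to `bits` fraction bits."""
    return Fraction(floor(x * 2 ** bits), 2 ** bits)


def pst_divide(z, d, e0, m, q_bits):
    """Prescale by e0, then accumulate partial quotients Q_i = trunc(R_{i-1} * E1).

    q_bits lists the fraction-bit position at which each Q_i is truncated.
    Returns (Q, R) with z * e0 == Q * (d * e0) + R.
    """
    z1, d1 = z * e0, d * e0
    e1 = 2 - trunc(d1, 2 * m) - Fraction(1, 2 ** (2 * m))  # assumed INV
    q, r = Fraction(0), z1
    for bits in q_bits:
        qi = trunc(r * e1, bits)
        r -= qi * d1
        q += qi
    return q, r


z = Fraction(0b10110110, 2 ** 8)
d = Fraction(0b11001011, 2 ** 8)
e0 = Fraction(0b10011, 2 ** 4)
q, r = pst_divide(z, d, e0, 4, [8, 14])
print(f"Q = 0.{q.numerator * 2 ** 14 // q.denominator:014b}", float(q), float(z / d))
```

The invariant $z_1 = Q d_1 + R$ holds after every step, so the remaining error in $Q$ is exactly $R/d_1$.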
Division – FPGA Implementation
 PST algorithm is suitable for high-
  performance division unit design in
  FPGAs

  Design                Fmax (Period)            ALUTs   Memory Bits   DSP Blocks   Power (Dynamic+Static)     Throughput
  IP Core (no DSP)      50.16 MHz (19.935 ns)    1203    84            0            381 mW (52 mW + 329 mW)    50.16 Mdiv/s
  PST (DSP)             72.8 MHz (13.737 ns)     213     768           28           350 mW (23 mW + 327 mW)    24.3 Mdiv/s
  PST (no DSP)          73.20 MHz (13.661 ns)    1437    768           0            378 mW (50 mW + 328 mW)    24.4 Mdiv/s
  PST-pipelined (DSP)   74.15 MHz (13.486 ns)    261     768           40           344 mW (17 mW + 327 mW)    74.15 Mdiv/s
  PSTp (no DSP)         76.05 MHz                1940    768           0            359 mW (31 mW + 328 mW)    76.05 Mdiv/s

  32-bit division with 5-cycle latency

				