```					CSE 246: Computer Arithmetic
Algorithms and Hardware Design
Fall 2006
Lecture 8: Division

Instructor:
Prof. Chung-Kuan Cheng
Topics:

- Division by a Constant
- Division by a Repeated Multiplication

Project Update
- Come in to speak briefly about the final project
- Status Update
- 2:30 – 3:00 p.m.
- Tuesday or Thursday

- 4 s_{j-1} = q_j d + s_j, where
- q_j is in [-2, 2] and s_{j-1} is in [-hd, +hd]
- h is less than or equal to 2/3
- Therefore, s_{j-1} is in [-2d/3, 2d/3]
- And 4 s_{j-1} is in [-8d/3, 8d/3]
- Multiplying by 4 shifts s to the left by 2 bits

[Figure: p-d plot for the radix-4 recurrence. Vertical axis: 4 s_{j-1}, with binary tick marks from 0.0 to 11.0. Horizontal axis: d, with tick marks from .1 to 1.00. The lines -2d/3, d/3, 2d/3, 4d/3, 5d/3 and 8d/3 bound the regions q_j = 0, q_j = 1 and q_j = 2. Anything above 8d/3 goes against our assumption and is therefore the infeasible region.]

- The overlap regions of q_j denote a choice still allowing for recursion. The gap defines the

- The value of q_j determines the range it governs:
  (q_j - 2/3) d <= 4 s_{j-1} <= (q_j + 2/3) d
- For example, q_j = 1
- (1 + 2/3) d = 5d/3
- (1 - 2/3) d = d/3
- The range is d/3 to 5d/3

Division by a Constant
- Multiplication takes O(log n) time but division is linear, which is much slower
- Try to convert division to multiplication

- Property: given an odd number d, there exists an m such that d × m = 2^n - 1

- Examples:
  d = 3,  m = 5:    3 × 5  = 2^4 - 1
  d = 7,  m = 9:    7 × 9  = 2^6 - 1
  d = 11, m = 93:  11 × 93 = 2^10 - 1

Division by a Constant
- 1/d = m/(2^n - 1)
      = (m/2^n) × 1/(1 - 2^-n)
      = (m/2^n) (1 + 2^-n)(1 + 2^-2n)(1 + 2^-4n)…
- 1/(1 - r) = 1 + r + r^2 + r^3 + …
            = (1 + r)(1 + r^2)(1 + r^4)(1 + r^8)…
- Example
- z/7 = zm/(2^n - 1), m = 9, n = 6
      = (9z/2^6) × 1/(1 - 2^-6)
      = (9z/2^6) (1 + 2^-6)(1 + 2^-12)(1 + 2^-24)
- log(n/6) operations

Division by Reciprocation
- Find 1/d with iteration
- Newton-Raphson algorithm: x_{i+1} = x_i - f(x_i)/f'(x_i)
- Set f(x) = 1/x - d, (1/2 <= d < 1). We have f'(x) = -1/x^2
- Thus x_{i+1} = x_i (2 - x_i d)
- Let e_i = 1/d - x_i. We have
  e_{i+1} = 1/d - x_{i+1} = 1/d - x_i (2 - x_i d) = d (1/d - x_i)^2 = d e_i^2
- The convergence rate is quadratic.
- For k iterations, it takes 2k multiplications

Division by Reciprocation
- z/d = 3/0.7
- x_0 = 4(3^(1/2) - 1) - 2d = 2.9282 - 2d = 1.5282
- e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286
- x_1 = x_0 (2 - x_0 d) = 1.4216233
- e_1 = 1/d - x_1 = 1/0.7 - 1.4216233 = 0.0069481
- x_2 = x_1 (2 - x_1 d) = 1.4285376
- e_2 = 1/d - x_2 = 1/0.7 - 1.4285376 = 0.0000338
- x_3 = x_2 (2 - x_2 d) = 1.4285714
- e_3 = 1/d - x_3 = d e_2^2, about 0.0000000008
- The convergence rate is quadratic: each error is d times the square of the previous one.

Division by Recursive Multiplication
- q = z/d = (z/d) (x_0/x_0) (x_1/x_1) … (x_{k-1}/x_{k-1})        eq(a)
- Let 1/2 <= d < 1
- It takes 2k multiplications for eq(a)
- We also need k operations to find x_i

Division by a Repeated Multiplication
- q = z/d = (z/d) (x_0/x_0) (x_1/x_1) … (x_{k-1}/x_{k-1})
- Let 1/2 <= d < 1
- Set d_0 = d, x_k = 2 - d_k
1. d_1 = d x_0 = d (2 - d) = 1 - (1 - d)^2
2. d_{k+1} = d_k x_k = d_k (2 - d_k) = 1 - (1 - d_k)^2
3. 1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^(2^(k+1))

- For k-bit operands, we need 2m - 1 multiplications and m 2's complement operations
- m = ceiling(log2 k), with log2 m extra bits for precision

Division by a Repeated Multiplication
- q = z/d = 3/0.7 = (z/d) (x_0/x_0) (x_1/x_1) … (x_{k-1}/x_{k-1})
- d_0 = d = 0.7, x_k = 2 - d_k, d_{k+1} = d_k x_k
1. x_0 = 2 - d_0 = 1.3,
   d_1 = d_0 x_0 = 0.7 × 1.3 = 0.91
2. x_1 = 2 - d_1 = 1.09,
   d_2 = d_1 x_1 = 0.91 × 1.09 = 0.9919
3. x_2 = 2 - d_2 = 1.0081,
   d_3 = d_2 x_2 = 0.9919 × 1.0081 = 0.99993439

Division Methods
- Iteration
- Memory
- Arithmetic

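Division – Iteration effort
- Pencil and paper method: 1 bit of partial quotient per iteration, n iterations
  (A = QB + 2^-n R and R < B)
- A = 0.1001, B = 0.1010; Q = A / B
- Q_i: partial quotient
- R_i: partial remainder
- R_{i+1} = R_i - B × Q_i

Long division of 1001 by 1010 (each step appends a 0 to the previous remainder):
  R_0 = A = 1001
  10010 - 1010 = 1000      Q_1 = 0.1       R_1 = 1000
  10000 - 1010 = 0110      Q_2 = 0.01      R_2 = 0110
  01100 - 1010 = 0010      Q_3 = 0.001     R_3 = 0010
  00100 < 1010             Q_4 = 0.0000    R_4 = 0100
  Q = Q_1 + Q_2 + Q_3 + Q_4 = 0.1110, R = 0.0100
  Check: 0.1110 × 0.1010 + 2^-4 × 0.0100 = 0.1001 = A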

Division – Memory effort
- A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
- SRT method: a lookup table stores m-bit partial quotients decided by m bits of partial remainder and m bits of divisor.
  Table size: 2^(2m) × m bits
- The SRT method is limited by the memory wall.

Division – Arithmetic effort
- Partial quotient is calculated by arithmetic functions.
- Prescaling:
    E ≈ 1/d,   z/d = (z × E)/(d × E) = z'/d'
    Q_i ≈ R_i' = R_i × E
- Taylor expansion, with d = d_h + d_l:
    E ≈ 1/d = 1/d_h - d_l (1/d_h)^2 + d_l^2 (1/d_h)^3 - …
    Q_i = R_i × E
- Series expansion:
    d = 1 - X
    E = 1/d = 1 + X + X^2 + X^3 + … = (1 + X)(1 + X^2)(1 + X^4) …
    Q_i = R_i × E

Division – Solution space
[Figure: solution space for division methods, with three axes: memory effort, iteration effort and arithmetic effort. Labeled methods: SRT, pencil-and-paper, prescaling, series expansion and Taylor expansion. Other labels: memory wall, low latency, low area and "our target".]

- Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.

Division – PST algorithm
- Utilize the power of series expansion, but need a good start point.
    d' = 1 - X,   E = 1 + X ≈ 1/d'
    d' × E = (1 - X)(1 + X) = 1 - X^2
- Prescaling provides a scaled divisor close to 1.
    E ≈ 1/d,   z/d = (z × E)/(d × E) = z'/d'

- 0-order Taylor expansion iterates to reach the final quotient: Q_i = R_i × E
Division – PST algorithm

Algorithm:
  E_0 = Table(d(m)) ≈ 1/d
  z_1 = z × E_0;  d_1 = d × E_0
  E_1 = (2 - d_1) ≈ INV(d_1(2m))
  Q_i = R_{i-1} × E_1, with R_0 = z_1
  R_i = R_{i-1} - Q_i × d_1
  Q = Q + Q_i

Example:
  z = 0.1011,0110
  d = 0.1100,1011
  d(m) = 0.1100, so E_0 = 1.0011
  z_1 = z × E_0 = 0.1101,1000,0010
  d_1 = d × E_0 = 0.1111,0001,0001
  E_1 = INV(d_1(2m)) = 1.0000,1110
  Q_1 = z_1 × E_1 = 0.1110,0011
  R_1 = z_1 - Q_1 × d_1 = 0.0000,0010,0101,1110,1101
  Q_2 = R_1 × E_1 = 0.1001,1111
  R_2 = R_1 - Q_2 × d_1 = 0.0000,0001,1111,1011,0001
  (Q_2 and R_2 are shown scaled by 2^6; Q_2 enters the sum shifted right by 6 bits.)
  Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11

Division – FPGA Implementation
- PST algorithm is suitable for high-
- 32-bit division with 5-cycle latency

FPGAs                 Fmax (period)           ALUTs  Memory bits  DSP blocks  Power (dynamic + static)  Throughput
IP Core (no DSP)      50.16 MHz (19.935 ns)   1203   84           0           381 mW (52 mW + 329 mW)   50.16 Mdiv/s
PST (DSP)             72.8 MHz (13.737 ns)    213    768          28          350 mW (23 mW + 327 mW)   24.3 Mdiv/s
PST (no DSP)          73.20 MHz (13.661 ns)   1437   768          0           378 mW (50 mW + 328 mW)   24.4 Mdiv/s
PST-pipelined (DSP)   74.15 MHz (13.486 ns)   261    768          40          344 mW (17 mW + 327 mW)   74.15 Mdiv/s
PSTp (no DSP)         76.05 MHz               1940   768          0           359 mW (31 mW + 328 mW)   76.05 Mdiv/s

```
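
The "Division by a Constant" slides can be sketched in Python roughly as follows. This is an illustration under assumptions, not code from the lecture: the names `find_multiplier` and `divide_by_constant` are hypothetical, exact `Fraction` arithmetic stands in for fixed-point hardware, and the search returns the smallest n, so for d = 7 it yields m = 1, n = 3 where the slides use m = 9, n = 6. Both choices satisfy d × m = 2^n - 1.

```python
from fractions import Fraction


def find_multiplier(d):
    """Return (m, n) with d * m == 2**n - 1, using the smallest such n.

    Hypothetical helper: d must be odd and at least 3, as in the slides.
    """
    if d < 3 or d % 2 == 0:
        raise ValueError("d must be an odd integer >= 3")
    n = 1
    # 2 has finite multiplicative order modulo an odd d, so this loop ends.
    while (2**n - 1) % d != 0:
        n += 1
    return (2**n - 1) // d, n


def divide_by_constant(z, d, factors=3):
    """Approximate z/d as (z*m / 2^n) * (1 + 2^-n)(1 + 2^-2n)(1 + 2^-4n)...

    Each factor costs one shift and one add; `factors` is how many are used.
    """
    m, n = find_multiplier(d)
    q = Fraction(z * m, 2**n)
    shift = n
    for _ in range(factors):
        q += q / 2**shift   # multiply by (1 + 2^-shift)
        shift *= 2          # exponents n, 2n, 4n, ...
    return q


if __name__ == "__main__":
    print(find_multiplier(11))
    print(float(divide_by_constant(100, 7)), 100 / 7)
```

Each extra factor doubles the number of correct bits, which is why only a logarithmic number of shift-and-add steps is needed.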
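
The reciprocation and repeated-multiplication recurrences from the lecture can be tried out with a short Python sketch. It is only an illustration under assumptions: the function names are hypothetical, ordinary floating point replaces the fixed-point datapath, and the starting value x_0 = 4(sqrt(3) - 1) - 2d is the one used in the 3/0.7 example.

```python
import math


def newton_reciprocal(d, iterations=3):
    """Approximate 1/d for 1/2 <= d < 1 with x_{i+1} = x_i * (2 - x_i * d)."""
    x = 4 * (math.sqrt(3) - 1) - 2 * d   # initial estimate from the lecture example
    for _ in range(iterations):
        x = x * (2 - x * d)              # two multiplications per iteration
    return x


def repeated_multiplication(z, d, iterations=4):
    """Compute z/d by scaling numerator and denominator by x_k = 2 - d_k.

    The denominator obeys 1 - d_{k+1} = (1 - d_k)^2, so it converges to 1
    and the numerator converges to the quotient.
    """
    for _ in range(iterations):
        x = 2 - d
        z, d = z * x, d * x
    return z


if __name__ == "__main__":
    print(3 * newton_reciprocal(0.7))       # z times the approximated 1/d
    print(repeated_multiplication(3, 0.7))  # same quotient, no reciprocal needed
```

Both loops converge quadratically. The repeated-multiplication form has the practical advantage that its two multiplications per step are independent of each other, whereas the two in the Newton-Raphson step must be done in sequence.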