3) Static(Complementary) CMOS
Shared by: b4k9USl
Posted: 5/19/2012
6 ALU Blocks and Control
Contents
1. Adder
2. Multiplier
3. Datapath Generation
6.1
1. Adder
Full Adder
Boolean equation
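$CARRY = A \cdot B + B \cdot C + C \cdot A = A \cdot B + C \cdot (A + B)$
$SUM = A \cdot B \cdot C + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot C = A \cdot B \cdot C + \overline{CARRY} \cdot (A + B + C)$
[Figure: SUM (odd parity) formed from $A \cdot B \cdot C$, $\overline{CARRY}$ and $A + B + C$]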
6.2
Which is better?
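Boolean Equation 1:
$CARRY = A \cdot B + C \cdot (A + B)$
$SUM = A \cdot B \cdot C + \overline{CARRY} \cdot (A + B + C)$
Boolean Equation 2:
$CARRY = A \cdot B \cdot C + \overline{SUM} \cdot (A + B + C)$
$SUM = A \cdot B \cdot C + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot C$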
CARRY evaluation is more urgent since CARRY is in the critical
path
[Figure: Ripple Carry Adder. A chain of ADDER cells with inputs A0 B0, A1 B1, A2 B2, ..., An Bn and sum outputs S0, S1, S2, ..., Sn; the carry ripples from C0 through C1, C2, ..., Cn]
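The ripple carry structure can be sketched in software as follows. This is a minimal illustration, not a circuit model: the function names (`full_adder`, `ripple_carry_add`, `to_bits`, `from_bits`) and the little-endian bit-list convention are assumptions made for the example. The cell uses Boolean Equation 1, in which CARRY is computed first and SUM is derived from it.

```python
def full_adder(a, b, c):
    """One-bit full adder following Boolean Equation 1 (CARRY first, then SUM)."""
    carry = (a & b) | (c & (a | b))
    s = (a & b & c) | ((1 - carry) & (a | b | c))
    return s, carry


def ripple_carry_add(a_bits, b_bits, c0=0):
    """Add two little-endian bit lists; the carry ripples cell by cell."""
    carry = c0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(x, n):
    """Little-endian list of the n low bits of x (hypothetical helper)."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits (hypothetical helper)."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Because each cell waits for the carry of the previous one, the loop body mirrors the critical path noted above: the carry expression is the only dependency between iterations.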
6.3
Alternating Complementary Form
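[Figure: CARRY and SUM gates with inputs A, B, C, drawn once for odd stages and once for even stages, with signal polarity alternating from stage to stage]
At odd stages:
$CARRY = A \cdot B + C \cdot (A + B)$
$SUM = A \cdot B \cdot C + \overline{CARRY} \cdot (A + B + C)$
At even stages:
$CARRY = (A + B) \cdot (C + A \cdot B)$
$SUM = (A + B + C) \cdot (\overline{CARRY} + A \cdot B \cdot C)$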
6.4
Alternating Complementary Form
6.5
Dynamic Serial Adder
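$CARRY(t+1) = A(t+1) \cdot B(t+1) + C(t) \cdot [A(t+1) + B(t+1)]$
$SUM(t+1) = A(t+1) \cdot B(t+1) \cdot C(t+1) + \overline{CARRY(t+1)} \cdot [A(t+1) + B(t+1) + C(t+1)]$
[Figure: serial adder. Operand bits $a_{n-1} \dots a_0$ and $b_{n-1} \dots b_0$ are shifted into inputs A and B; the SUM output delivers $s_{n-1} \dots s_0$; CARRY is fed back to input C through a D flip-flop with R/S inputs, driven by CLOCK]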
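A bit-serial adder of this kind can be sketched as a loop over clock cycles. The function name `serial_add` and the assumption that the carry flip-flop is reset to 0 before the first bit are illustrative choices, not taken from the slides.

```python
def serial_add(a_bits, b_bits):
    """Bit-serial addition, least significant bit first.

    One full adder is reused every cycle; the carry is held in a
    single state variable that plays the role of the flip-flop.
    """
    carry = 0  # assumed reset state of the carry flip-flop
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a | b))  # stored for the next cycle
    return sum_bits, carry
```

For example, with LSB-first inputs `[1, 1, 0, 0]` (3) and `[1, 0, 1, 0]` (5), the loop yields `[0, 0, 0, 1]` (8) with a final carry of 0.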
6.6
Dynamic Configuration
[Figure: dynamic CARRY gate and SUM gate with inputs A, B, C, clocked by CK, each with an optional precharge device; the carry output C (CARRY) is latched through a Set/Reset circuit with inputs R and S]
$CARRY = A \cdot B + C \cdot [A + B]$
6.7
Full Adder Truth Table
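 #   A B C   CARRY  SUM
 0   0 0 0     0     0
 1   0 0 1     0     1
 2   0 1 0     0     1
 3   0 1 1     1     0
 4   1 0 0     0     1
 5   1 0 1     1     0
 6   1 1 0     1     0
 7   1 1 1     1     1
Mutually complementary input rows: 0 and 7, 1 and 6, 2 and 5, 3 and 4
FC on-terms: 3, 5, 6, 7
FS on-terms: 1, 2, 4, 7
Conjugate Symmetry
$SUM = F_S(A, B, C)$, $\overline{SUM} = F_S(\bar{A}, \bar{B}, \bar{C})$
$CARRY = F_C(A, B, C)$, $\overline{CARRY} = F_C(\bar{A}, \bar{B}, \bar{C})$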
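The conjugate symmetry of the truth table (complementing all inputs complements both outputs) can be checked exhaustively. The sketch below assumes the usual majority and parity forms for FC and FS; the helper name `is_self_dual` is introduced only for this example.

```python
from itertools import product


def fc(a, b, c):
    """Carry function: majority of the three inputs."""
    return (a & b) | (b & c) | (c & a)


def fs(a, b, c):
    """Sum function: odd parity of the three inputs."""
    return a ^ b ^ c


def is_self_dual(f):
    """True if complementing every input always complements the output."""
    return all(
        f(1 - a, 1 - b, 1 - c) == 1 - f(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )
```

This property is what allows one gate topology to serve both polarities when true and complemented signals alternate between stages.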
6.8
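Another Configuration of Carry & Sum Logic
$CARRY(t+1) = F_C(A, B, C) = A \cdot B + B \cdot C + C \cdot A = A \cdot B + C \cdot (A + B)$
$SUM(t+1) = F_S(A, B, C) = A \cdot B \cdot C + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot C = A \cdot B \cdot C + \overline{CARRY} \cdot (A + B + C)$
[Figure: CARRY STAGE and SUM STAGE. The carry stage contains "1"-PROPAGATE networks (A, B in parallel, gated by C) and "1"-GENERATE networks (A, B in series); the sum stage combines A, B, C with the CARRY signal to produce SUM]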
6.9
Dynamic full adder using
np CMOS logic style
6.10
Layout of the dynamic full adder
6.11
Looking at the FA Truth Table
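 A B C   CARRY  SUM
 0 0 0     0     0
 0 0 1     0     1
 0 1 0     0     1
 0 1 1     1     0
 1 0 0     0     1
 1 0 1     1     0
 1 1 0     1     0
 1 1 1     1     1
CARRY = C when $A \oplus B = 1$; CARRY = A (or B) when $A \oplus B = 0$
SUM = $\bar{C}$ when $A \oplus B = 1$; SUM = C when $A \oplus B = 0$
$CARRY = P \cdot C + \bar{P} \cdot B$, where $P = A \oplus B$
$SUM = P \cdot \bar{C} + \bar{P} \cdot C$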
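Reading the truth table this way turns the full adder into two selections controlled by the propagate signal. The following sketch (function name `full_adder_mux` is an assumption for illustration) expresses exactly that selection, with P taken as A XOR B.

```python
def full_adder_mux(a, b, c):
    """Full adder written as two 2-way selections controlled by P = A xor B."""
    p = a ^ b
    carry = c if p else b        # P = 1: pass C; P = 0: pass B (equal to A)
    s = (1 - c) if p else c      # P = 1: pass not-C; P = 0: pass C
    return s, carry
```

Each `if`/`else` corresponds to one 2:1 multiplexer, which is why this form maps naturally onto transmission gates.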
6.12
Transmission Gate Implementation
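[Figure: transmission-gate full adder with inputs A, B, C; the internal signal $P = A \oplus B$ and its complement control the transmission gates that produce SUM and CARRY]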
6.13
CLA (Carry Lookahead Adder)
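[Figure: carry lookahead structure. Each bit n produces Gn and Pn from An and Bn; the carries C1, C2, C3, C4 are derived from C0, P1 ... P4 and G1 ... G4]
$C_i = G_i + P_i \cdot C_{i-1}$, where $G_i = A_i \cdot B_i$ and $P_i = A_i \oplus B_i$
$\quad = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \dots + P_i P_{i-1} \cdots P_2 P_1 C_0$
$S_i = C_{i-1} \oplus P_i$
Practical when the number of inputs is at most 4 (# of inputs ≤ 4)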
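The expanded lookahead expression can be evaluated directly, without rippling, as in the sketch below. It assumes $G_i = A_i \cdot B_i$ and $P_i = A_i \oplus B_i$; the function name `cla_add` is hypothetical, and the lists are 0-based, so `carries[i]` is the carry into bit i and `carries[i + 1]` the carry out of it (the slides number stages from 1).

```python
def cla_add(a_bits, b_bits, c0=0):
    """Carry lookahead addition on little-endian bit lists.

    Every carry is computed from the G and P signals and C0 only,
    using the fully expanded sum-of-products form.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = [c0]
    for i in range(len(g)):
        c = 0
        prefix = 1  # running product P_i * P_(i-1) * ...
        for j in range(i, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & c0  # last term: all propagates times C0
        carries.append(c)
    sums = [p[i] ^ carries[i] for i in range(len(p))]
    return sums, carries
```

The inner loop grows with the bit position, which reflects the fan-in growth that limits this scheme to a few bits per block.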
6.14
Carry bypass structure - basic concept
6.15
(N = 16)-bit carry bypass adder (each stage: M bits)
$t_p = t_{setup} + M \cdot t_{carry} + (N/M - 1) \cdot t_{bypass} + M \cdot t_{carry} + t_{sum}$
$t_{setup}$: time to create the G and P signals
$t_{carry}$: propagation delay through a single bit
$t_{bypass}$: propagation delay through the MUX
$t_{sum}$: time to generate the sum
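To see how the stage size M trades ripple time against bypass time, the delay expression can be evaluated for several values of M. The unit delays below (all equal to 1.0) are placeholder assumptions chosen only to make the trend visible; real values depend on the technology.

```python
def bypass_delay(n, m, t_setup=1.0, t_carry=1.0, t_bypass=1.0, t_sum=1.0):
    """Worst-case delay of an n-bit carry bypass adder with m-bit stages.

    Implements tp = tsetup + M*tcarry + (N/M - 1)*tbypass + M*tcarry + tsum.
    """
    return t_setup + m * t_carry + (n / m - 1) * t_bypass + m * t_carry + t_sum


# Delay of a 16-bit adder for several stage sizes (unit delays assumed).
delays = {m: bypass_delay(16, m) for m in (1, 2, 4, 8, 16)}
```

With these unit delays the 16-bit case gives 13.0 for M = 2 and M = 4, and 34.0 for M = 16, so small-to-moderate stages are favored.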
6.16
Combining 4 Domino Carry Lookahead Blocks
Manchester Carry Chain (4-bit)
[Figure: 4-bit Manchester carry chain clocked by CK, with generate inputs G1 ... G4, propagate inputs P1 ... P4 and carry nodes C0, C1, C2, C3, C4; also drawn as a MANCHESTER CARRY CHAIN block from C0 to C4]
$C_1 = G_1 + P_1 \cdot C_0$
Limited to 4 stages: in the worst case there are 6 series transistors to ground.
6.17
Improving Worst Case Carry Prop. Time
[Figure: MANCHESTER CARRY CHAIN block from C0 to C4, with an additional clocked (CK) path from C0 to C4 controlled by P1, P2, P3, P4]
6.18
Manchester CC Adder Floorplan
[Figure: floorplan of four bit slices (BIT 1 to BIT 4). Each slice takes Ai and Bi, contains a GP GENERATE block, two MANCHESTER CARRY CHAIN blocks and a SUM GENERATE block, and outputs Si; the carry runs from C0 at the bottom to C4 at the top]
Dual CC Scheme
One carry chain is for carry propagation.
The other is for off-loading the 1st carry chain from the SUM block.
6.19
CSA (Carry Select Adder)
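Realization of MUX with restoring logic
[Figure: bits A4 ~ A7, B4 ~ B7 are added twice, once with carry-in 1 giving $S_4^1 \sim S_7^1$ and $C_8^1$, and once with carry-in 0 giving $S_4^0 \sim S_7^0$ and $C_8^0$; the block A0 ~ A3, B0 ~ B3 with C0 gives S0 ~ S3 and C4, which drives the carry selection for S4 ~ S7 and C8]
Carry Selection
$C_8 = C_4 \cdot C_8^1 + \overline{C_4} \cdot C_8^0$
$\quad = C_4 \cdot C_8^1 + \overline{C_4} \cdot C_8^0 \cdot C_8^1$ (since always $C_8^0 \cdot \overline{C_8^1} = 0$)
$\quad = C_8^1 \cdot (C_4 + \overline{C_4} \cdot C_8^0)$
$\quad = C_8^1 \cdot (C_4 + C_8^0)$
Note) Realization of MUX with pass-transistor gates
[Figure: pass-transistor selection chain in which C4 selects between $C_8^0$ and $C_8^1$ to give C8, and C8 selects between $C_{12}^0$ and $C_{12}^1$ to give C12]
Signal level along the chain: Vdd, Vdd - Vt, Vdd - 2Vt
Threshold voltage loss per stage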
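The simplified carry selection rests on one observation: a block that produces a carry with carry-in 0 also produces one with carry-in 1. The sketch below checks this, and the equivalence of the multiplexer form and the simplified form, for every 4-bit operand pair. All function names are assumptions for illustration.

```python
from itertools import product


def block_carry(a, b, cin, width=4):
    """Carry out of a width-bit block for the given carry-in."""
    return (a + b + cin) >> width


def select_mux(c4, c8_0, c8_1):
    """Plain 2:1 multiplexer form: C8 = C4*C8^1 + not(C4)*C8^0."""
    return (c4 & c8_1) | ((1 - c4) & c8_0)


def select_simplified(c4, c8_0, c8_1):
    """Simplified form: C8 = C8^1 * (C4 + C8^0)."""
    return c8_1 & (c4 | c8_0)


def forms_agree():
    """Exhaustive check over all 4-bit operands and both values of C4."""
    for a, b, c4 in product(range(16), range(16), (0, 1)):
        c8_0 = block_carry(a, b, 0)
        c8_1 = block_carry(a, b, 1)
        if c8_0 & (1 - c8_1):  # would violate C8^0 * not(C8^1) = 0
            return False
        if select_mux(c4, c8_0, c8_1) != select_simplified(c4, c8_0, c8_1):
            return False
    return True
```

The simplified form needs no complemented C4, which is what makes a restoring AND-OR gate a convenient replacement for the multiplexer.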
6.20
CSA (Carry Select Adder)
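For carry propagation, use restoring logic in an alternating pattern.
[Figure: carry select chain. The block A0 ~ A3, B0 ~ B3 with C0 gives S0 ~ S3 and C4; C4 selects between $C_8^0$ and $C_8^1$ to give C8, which in turn selects between $C_{12}^0$ and $C_{12}^1$]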
Number of bits for each stage
ex1) 32-bit case : 4, 4, 5, 6, 7, 6 ( or 4, 4, 5, 6, 6, 7)
ex2) 64-bit case : 4, 4, 5, 6, 7, 8, 9, 10
6.21
Minimization of Carry Propagation Path Delay
Carry select scheme: prepare the result for each case (Cin = 1 and Cin = 0).
Simplify the carry selection using the relationship between $C_i^0$ and $C_i^1$.
Take complemented carries, alternating between even and odd stages.
Adjust each block size in consideration of the delay of the carry select logic, so that the carry propagation delay within each block equals the carry propagation delay to the block.
e.g. for a 32-bit path: 4, 4, 5, 6, 6, 7
6.22
16-bit Linear CSA(Carry Select Adder)
$t_{add} = t_{setup} + M \cdot t_{carry} + (N/M) \cdot t_{mux} + t_{sum}$
M: number of bits per stage
N: total number of bits
6.23
Square Root CSA
$t_{add} = t_{setup} + M \cdot t_{carry} + \sqrt{2N} \cdot t_{mux} + t_{sum}$
[Figure: square root carry select adder with 9 stages]
$N = M + (M+1) + \dots + (M+P-1)$
$\quad = MP + P(P-1)/2 = P^2/2 + P(M - 1/2)$
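The difference between the linear and square-root organizations can be made concrete by evaluating both delay expressions. This sketch assumes the multiplexer term of the square-root adder is sqrt(2N), and uses unit delays of 1.0 as placeholders; the function names are hypothetical.

```python
import math


def linear_csa_delay(n, m, t_setup=1.0, t_carry=1.0, t_mux=1.0, t_sum=1.0):
    """Linear carry select adder: N/M multiplexers in the carry path."""
    return t_setup + m * t_carry + (n / m) * t_mux + t_sum


def sqrt_csa_delay(n, m, t_setup=1.0, t_carry=1.0, t_mux=1.0, t_sum=1.0):
    """Square-root carry select adder: about sqrt(2N) multiplexers."""
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum


def sqrt_csa_bits(m, p):
    """Total width of P stages of sizes M, M+1, ..., M+P-1."""
    return m * p + p * (p - 1) // 2
```

As a usage check, `sqrt_csa_bits(2, 5)` equals 2 + 3 + 4 + 5 + 6 = 20, matching the closed form for N.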
6.24
Propagation Delay of Linear and Square Root CSA and linear RCA
6.25
Carry Skip Adder
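A compromise between the Ripple Carry Adder and the CLA adder
$P_{0,3} = p_0 \cdot p_1 \cdot p_2 \cdot p_3$
$G_{0,3} = g_3 + g_2 p_3 + g_1 p_3 p_2 + g_0 p_3 p_2 p_1$
[Figure: 16-bit carry skip adder in four 4-bit blocks with inputs $a_{15} b_{15} \dots a_0 b_0$; block signals $G_{12,15}$, $P_{12,15}$, $G_{8,11}$, $P_{8,11}$, $G_{4,7}$, $P_{4,7}$; carries $c_{16}$, $c_{12}$, $c_8$, $c_4$, $c_0$]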
6.26
The $p_i$ and $g_i$ are computed from $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$.
Initially, c4, c8 and c12 are cleared.
After 4 clock cycles (at T0+4Tc), the G-values are available as the cout of each block, computed assuming ci = 0 (the P-values are also available by then).
At the same time (at T0+4Tc), the true cout of the first stage, c4, is obtained.
Assuming the delay of each AOI gate is Tc, the true values of c8, c12 and c16 are obtained one, two and three clock cycles later, respectively.
The sum and cout of the last block are obtained at (T0+4Tc+2Tc+4Tc).
6.27
Comparison of Carry Select & Carry Skip Adder
A 32-bit Carry Select Adder
Speed ≈ 8·k2 (where k2 = multiplexer delay)
Area ≈ 2 × Area_RCA
 Stage #      1  2  3  4  5  6
 bits/stage   4  4  5  6  7  6    (32 bits)
 inc. delay   4  1  1  1  1  1    (9·k2; k2 = delay due to 1-bit addition or MUX)
A 32-bit Carry Skip Adder
Speed ≈ 12·k2 (where k2 = multiplexer delay)
Area ≈ Area_RCA + Area_P-logic
 Stage #      1  2  3  4  5  6
 bits/stage   4  5  6  7  8  2
 inc. delay   4  1  1  1  1  2    (10·k2)
6.28
Conditional Sum Adder
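[Figure: conditional sum adder for bits A0 B0, A1 B1, A2 B2. Each bit position produces both conditional results, $S_i^1$, $C_{i+1}^1$ (carry-in 1) and $S_i^0$, $C_{i+1}^0$ (carry-in 0); multiplexers (MPX) driven by C0 and then C1 select S0, S1, S2 and C3]
Triple 2-input MUX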
6.29
Carry Lookahead Tree Adder
The previous CLA implementation is not very adequate because of fan-in and fan-out problems and irregularity, despite the small number (5) of logic levels.
Make it regular, using $\log_2 n$ logic levels.
$g_i = a_i \cdot b_i$
$p_i = a_i \oplus b_i$
$G_{i,k} = G_{j+1,k} + P_{j+1,k} \cdot G_{i,j}$
$P_{i,k} = P_{i,j} \cdot P_{j+1,k}$
[Figure: first part of the tree for 4 bits. Each pair $a_i$, $b_i$ produces $g_i$, $p_i$; adjacent pairs combine into $G_{0,1}$, $P_{0,1}$ and $G_{2,3}$, $P_{2,3}$, which combine into $G_{0,3}$, $P_{0,3}$]
[ 1st Part ]
6.30
Carry Lookahead Tree Adder
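$C_{j+1} = G_{i,j} + P_{i,j} \cdot C_i$
[Figure: second part of the tree. Starting from C0, the group signals ($G_{0,1}$, $P_{0,1}$) and bit signals ($g_0$, $p_0$, $g_2$, $p_2$) generate C1, C2 and C3]
[ 2nd Part ]
$S_i = a_i \oplus b_i \oplus c_i$
$p_i = a_i \oplus b_i$
$g_i = a_i \cdot b_i$
$c_{i+1} = g_i + p_i \cdot c_i$
[Figure: complete tree for 4 bits, combining the first and second parts and producing S0, S1, S2, S3 from $a_i$, $b_i$ and C0]
[ Complete CLA Tree Adder ]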
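The group generate and propagate recurrences can be sketched as a small recursive program. It assumes $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$, and the names `combine`, `group_gp` and `tree_add` are hypothetical. For clarity the sketch recomputes group terms for every carry; a hardware tree would share them between outputs.

```python
def combine(low, high):
    """Merge adjacent groups: low = (G_ij, P_ij), high = (G_j+1,k, P_j+1,k)."""
    g_lo, p_lo = low
    g_hi, p_hi = high
    return (g_hi | (p_hi & g_lo), p_lo & p_hi)


def group_gp(gp, i, k):
    """(G_ik, P_ik) obtained by recursive halving, about log2 levels deep."""
    if i == k:
        return gp[i]
    j = (i + k) // 2
    return combine(group_gp(gp, i, j), group_gp(gp, j + 1, k))


def tree_add(a_bits, b_bits, c0=0):
    """Addition of little-endian bit lists using group G/P signals."""
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    n = len(gp)
    carries = [c0]
    for j in range(n):
        g, p = group_gp(gp, 0, j)
        carries.append(g | (p & c0))  # C_(j+1) = G_0j + P_0j * C_0
    sums = [gp[i][1] ^ carries[i] for i in range(n)]
    return sums, carries[n]
```

The `combine` operator is associative, which is what permits the balanced grouping and the logarithmic depth.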
6.31
Carry Save Adder
Carry Propagate Adders:
Ripple Carry Adder
Carry Lookahead Adder
CSA (Conditional Sum Adder)
CSA (Carry Select Adder)
CSA (Carry Skip Adder)
Separate class:
CSA (Carry Save Adder)
6.32
Carry Save Adder
Carry Save Adder is used wherever a large number of operands have to be added.
[Figure: rows of full adders (F.A) form the CSA stages. Each full adder takes an operand bit together with the previous cycle's sum and the previous cycle's carry ($a_i$, $b_i$, $c_i$) and produces Carry and Sum, held in flip-flops (F/F); a final row of full adders forms the CPA]
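The idea of deferring carry propagation until the very end can be sketched with integer bit operations. The function names below are assumptions for illustration, and operands are taken to be non-negative integers.

```python
def carry_save_step(x, y, z):
    """3:2 compression: three words in, a sum word and a carry word out.

    Every bit position is an independent full adder, so no carry
    ripples across the word in this step.
    """
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (z & x)) << 1
    return s, c


def multi_operand_add(operands):
    """Accumulate many operands in carry-save form, then resolve once."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_step(s, c, op)
    return s + c  # single carry-propagate addition (the CPA)
```

Only the final `s + c` involves a carry chain, which is why the cost of the CPA is paid once regardless of the number of operands.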
6.33
2. Multiplier
Multiplication procedure by Pencil-and-Paper Method

            0 0 1 1    multiplicand
          x 1 0 1 0    multiplier
            0 0 0 0
          0 0 1 1
        0 0 0 0
    + 0 0 1 1
      0 0 1 1 1 1 0

Multiplication procedure by Add-and-Shift Algorithm

  Initial        0 0 0 0
  Add          + 0 0 0 0
                 0 0 0 0
  Shift          0 0 0 0 | 0
  Add          + 0 0 1 1
                 0 0 1 1 | 0
  Shift          0 0 0 1 | 1 0
  Add          + 0 0 0 0
                 0 0 0 1 | 1 0
  Shift          0 0 0 0 | 1 1 0
  Add          + 0 0 1 1
                 0 0 1 1 | 1 1 0
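The add-and-shift procedure can be sketched as a loop over the multiplier bits. The register convention here is an assumption made for the example: additions go into the upper half of a double-width register and a shift follows every step, which gives the same result as the worked example. The function name `add_and_shift` is hypothetical.

```python
def add_and_shift(multiplicand, multiplier, n=4):
    """Unsigned n-bit multiplication by repeated add and right shift."""
    product = 0  # double-width register; additions enter the upper half
    for i in range(n):
        if (multiplier >> i) & 1:
            product += multiplicand << n  # add when the multiplier bit is 1
        product >>= 1                     # shift the partial result right
    return product
```

For the example operands, `add_and_shift(0b0011, 0b1010)` passes through 0, 24, 12 and ends at 30, which is `0b0011110`.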
6.34
The Serial-Parallel Multiplier
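If $A = (a_n, a_{n-1}, \dots, a_0)$ and $B = (b_n, b_{n-1}, \dots, b_0)$,
the product $A \cdot B$ is expressed as
$A \cdot B = A \cdot 2^n \cdot b_n + A \cdot 2^{n-1} \cdot b_{n-1} + \dots + A \cdot 2^0 \cdot b_0$
[Figure: serial-parallel multiplier. The bits of A ($a_3$, $a_2$, $a_1$, $a_0$) are applied in parallel, the bits of B ($b_0$, $b_1$, $b_2$) enter serially through delay elements (D), and a row of full adders (F.A) separated by delay elements produces the serial Output]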
6.35
4x4 array multiplier
6.36
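[Figure: array multiplier with its dimensions marked as N (4) and M (3)]
$t_{mult} = [(M-1) + (N-1)] \cdot t_{carry} + (N-1) \cdot t_{sum} + t_{and}$
Both $t_{carry}$ and $t_{sum}$ are important.
The sum and carry generation times need to be similar.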
6.37
Carry-save Multiplier(CSM)
Rectangular floorplan of CSM
6.38
The Modified Booth Algorithm (cont’)
Booth Encoder Table
 b2k+1  b2k  b2k-1   multiplied by
   0     0     0        0
   0     0     1       +x
   0     1     0       +x
   0     1     1       +2x
   1     0     0       -2x
   1     0     1       -x
   1     1     0       -x
   1     1     1        0
Booth Encoder
[Figure: encoder with inputs $b_{2k-1}$, $b_{2k}$, $b_{2k+1}$ and outputs A, 2A and negative]
$A = b_{2k} \oplus b_{2k-1}$
negative $= b_{2k+1}$
6.39
Booth Multiplication Example
 A = 01 00 01  (17)
 X = 11 01 11  (-9), recoded as -A, +2A, -A

 Operation       Value
 Initial           0 00 00 00
 Add -A          +   10 11 11
                     10 11 11
 2-bit Shift         11 10 11 11
 Add 2A          +   10 00 10
                     01 11 01 11
 2-bit Shift         00 01 11 01 11
 Add -A          +   10 11 11
                     11 01 10 01 11   (-153)
6.40
The Modified Booth Algorithm
Let's consider a number $B = (b_{n-1}, b_{n-2}, \dots, b_1, b_0)$ written in 2's complement:
$B = -b_{n-1} \cdot 2^{n-1} + \sum_{k=0}^{n-2} b_k \cdot 2^k$
B may be rewritten as follows:
$B = \sum_{k=0}^{n/2-1} (b_{2k-1} + b_{2k} - 2 b_{2k+1}) \cdot 2^{2k}$ (assume $b_{-1} = 0$)
Example:
$(b_{-1} + b_0 - 2 b_1) \cdot 2^0 + (b_1 + b_2 - 2 b_3) \cdot 2^2 + (b_3 + b_4 - 2 b_5) \cdot 2^4$
$= b_{-1} + b_0 \cdot 2^0 + b_1 \cdot 2^1 + b_2 \cdot 2^2 + b_3 \cdot 2^3 + b_4 \cdot 2^4 - b_5 \cdot 2^5$
In this equation, each term in brackets takes a value in the set {-2, -1, 0, 1, 2}.
An n-bit multiplier generates exactly n/2 partial products.
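The recoding can be sketched in a few lines. The function names `booth_digits` and `booth_multiply` are assumptions for this example; the bit extraction relies on Python's arithmetic right shift, so negative integers yield their two's-complement bits.

```python
def booth_digits(x, n):
    """Radix-4 Booth digits of an n-bit two's-complement value (n even).

    Digit k is b(2k-1) + b(2k) - 2*b(2k+1), with b(-1) = 0, and lies
    in {-2, -1, 0, 1, 2}. The least significant digit comes first.
    """
    bits = [(x >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # b(-1) = 0
    for k in range(n // 2):
        b_lo, b_hi = bits[2 * k], bits[2 * k + 1]
        digits.append(prev + b_lo - 2 * b_hi)
        prev = b_hi
    return digits


def booth_multiply(a, x, n):
    """Sum of the n/2 partial products digit * a * 4**k."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_digits(x, n)))
```

For the 6-bit multiplier -9 (binary 11 01 11), `booth_digits(-9, 6)` gives `[-1, 2, -1]`, that is -A, +2A, -A from the least significant group upward, and `booth_multiply(17, -9, 6)` gives -153.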
6.41
Parallel Multiplier
A multiplier has two basic operations:
The generation of partial products
The summation of partial products
A parallel multiplier avoids the overhead caused by controlling these two operations separately, which speeds up the multiplication.
The gain in speed is obtained at the expense of extra hardware.
A parallel multiplier can be implemented so as to support a high rate of pipelining.
6.42
The Braun Multiplier
A straightforward implementation

                       a3    a2    a1    a0
                       b3    b2    b1    b0
                     a3b0  a2b0  a1b0  a0b0
               a3b1  a2b1  a1b1  a0b1
         a3b2  a2b2  a1b2  a0b2
   a3b3  a2b3  a1b3  a0b3
     P6    P5    P4    P3    P2    P1    P0

[Figure: basic cell with inputs "one bit of the new partial product ($a_i \cdot b_j$)", "one bit of the previous partial product" and "carry in"]
In the first four rows there is no horizontal carry propagation (using carry-save adder).
6.43
The Braun Multiplier (cont’)
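[Figure: 4x4 Braun multiplier array. Columns are driven by $a_3$, $a_2$, $a_1$, $a_0$ and rows by $b_0$, $b_1$, $b_2$, $b_3$; the first row has 0 inputs and produces $p_0$; each following row of three full adders (F.A) produces $p_1$, $p_2$, $p_3$; a final row of full adders with a 0 carry-in produces $p_7$, $p_6$, $p_5$, $p_4$]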
6.44
Baugh-Wooley Multiplier
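Modified in order to allow multiplication of signed numbers
Consider two numbers A and B in 2's complement ($a_{n-1}$: sign bit in 2's complement):
$A = (a_{n-1} \dots a_0) = -a_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
$B = (b_{n-1} \dots b_0) = -b_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} b_i 2^i$
$A = \sum_{i=0}^{n-2} a_i 2^i$, when $a_{n-1} = 0$
$A = -2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$, when $a_{n-1} = 1$
The product $A \cdot B$ is
$A \cdot B = a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - a_{n-1} \sum_{i=0}^{n-2} b_i 2^{n+i-1} - b_{n-1} \sum_{i=0}^{n-2} a_i 2^{n+i-1}$
$= a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + a_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{b}_i 2^{n+i-1} \right) + b_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{a}_i 2^{n+i-1} \right)$
$= -2^{2n-1} + (\bar{a}_{n-1} + \bar{b}_{n-1} + a_{n-1} b_{n-1}) \cdot 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + (a_{n-1} + b_{n-1}) \cdot 2^{n-1} + b_{n-1} \sum_{i=0}^{n-2} \bar{a}_i 2^{n+i-1} + a_{n-1} \sum_{i=0}^{n-2} \bar{b}_i 2^{n+i-1}$
because: $-(b_{n-1} + a_{n-1}) \cdot 2^{2n-2} = -2^{2n-1} + (\bar{a}_{n-1} + \bar{b}_{n-1}) \cdot 2^{2n-2}$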
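The final form of the product, in which every subtracted partial product has been replaced by complemented bits plus constant terms, can be checked numerically. The sketch below assumes that final form consists of the constant $-2^{2n-1}$, the sign-bit term at weight $2^{2n-2}$, the unsigned bit products, the term $(a_{n-1} + b_{n-1}) 2^{n-1}$, and the two sums over complemented bits gated by the opposite sign bit. The function name `baugh_wooley` is hypothetical.

```python
def baugh_wooley(a, b, n):
    """Product of two n-bit two's-complement values via the rewritten form.

    Apart from the constant -2**(2n-1), every term is a non-negative
    bit product, so the array needs only adders.
    """
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> i) & 1 for i in range(n)]
    an, bn = abit[n - 1], bbit[n - 1]

    total = -(1 << (2 * n - 1))
    total += ((1 - an) + (1 - bn) + an * bn) << (2 * n - 2)
    total += sum(
        (abit[i] * bbit[j]) << (i + j)
        for i in range(n - 1)
        for j in range(n - 1)
    )
    total += (an + bn) << (n - 1)
    total += bn * sum((1 - abit[i]) << (n + i - 1) for i in range(n - 1))
    total += an * sum((1 - bbit[i]) << (n + i - 1) for i in range(n - 1))
    return total
```

In a 2n-bit result the constant $-2^{2n-1}$ is equivalent, modulo $2^{2n}$, to adding a 1 at the most significant position.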
6.45
Baugh-Wooley Multiplier (cont’)
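[Figure: 4x4 Baugh-Wooley multiplier array. Columns are driven by $a_3$, $a_2$, $a_1$, $a_0$ and rows by $b_0$, $b_1$, $b_2$, $b_3$; rows of full adders (F.A) produce $p_0$, $p_1$, $p_2$; the $b_3$ row has four full adders with additional inputs $a_3$, $b_3$ and a constant 1; a final row of five full adders produces $p_7$, $p_6$, $p_5$, $p_4$, $p_3$]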
6.46
Wallace Tree Multipliers
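Full adder vs Wallace tree
[Figure: a full adder takes three inputs of weight $2^0$ and produces outputs of weight $2^1$ and $2^0$; a Wallace tree takes n inputs of weight $2^0$ and produces outputs labeled $2^n$, ..., $2^1$, $2^0$]
Useful whenever a large number of operands are to be added.
Completion time in the Braun or Baugh-Wooley multiplier:
Using a ripple carry adder: proportional to twice the number of bits n
Using Wallace trees: proportional to $\log_2(n)$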
6.47
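Recursive Decomposition of the Multiplication
Partitioning the two operands:
$A = 2^p \cdot A_H + A_L$
$B = 2^p \cdot B_H + B_L$
$A \cdot B = 2^{2p} \cdot A_H \cdot B_H + 2^p \cdot (A_H \cdot B_L + A_L \cdot B_H) + A_L \cdot B_L$
The four terms ($A_H \cdot B_H$, $A_H \cdot B_L$, $A_L \cdot B_H$, $A_L \cdot B_L$) are computed using 4 p-bit multipliers.
The results are collected through a Wallace tree.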
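The decomposition can be sketched for unsigned operands as follows. The function name `split_multiply` is an assumption, and the ordinary `+` here stands in for the Wallace tree that collects the four aligned terms in hardware.

```python
def split_multiply(a, b, p):
    """Multiply two unsigned 2p-bit values from four p-bit products."""
    mask = (1 << p) - 1
    a_h, a_l = a >> p, a & mask
    b_h, b_l = b >> p, b & mask

    hh = a_h * b_h  # weight 2**(2p)
    hl = a_h * b_l  # weight 2**p
    lh = a_l * b_h  # weight 2**p
    ll = a_l * b_l  # weight 2**0

    return (hh << (2 * p)) + ((hl + lh) << p) + ll
```

Applying the same split to each p-bit product gives the recursive form; the four sub-multiplications are independent and can run in parallel.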
6.48
Recursive Decomposition of the Multiplication
[Figure: operands split into AH, AL and BH, BL; the four partial products AL x BL, AL x BH, AH x BL and AH x BH are placed according to their weights and summed through blocks labeled 4 x W3 followed by an Adder]
Aligning the four partial products
6.49
Booth’s Algorithm Array Multiplication
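Another approach to the design of a parallel multiplier for two's complement operands
The basic cell in row i performs an add, subtract or transfer-only operation.
CASS (Controlled Add/Subtract/Shift) Cell
[Figure: CASS cell with inputs a, $P_{in}$ (partial product), $c_{in}$ and control lines H and D; outputs $P_{out}$ and $c_{out}$]
$P_{out} = P_{in} \oplus (a \cdot H) \oplus (c_{in} \cdot H)$
If H = 0, $P_{out} = P_{in}$ (transfer)
If H = 1, $P_{out} = P_{in} \oplus a \oplus c_{in}$ (sum)
$c_{out} = (P_{in} \oplus D) \cdot (a + c_{in}) + a \cdot c_{in}$
If D = 0, $c_{out} = P_{in} \cdot (a + c_{in}) + a \cdot c_{in}$ (add)
If D = 1, $c_{out} = \overline{P_{in}} \cdot (a + c_{in}) + a \cdot c_{in}$ (subtract)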
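The behavior of one CASS cell can be sketched as a pure function of its five one-bit inputs. The function name `cass` and the argument order are assumptions for illustration; the expressions take $P_{out}$ as the XOR of $P_{in}$ with the H-gated a and $c_{in}$, and $c_{out}$ as $(P_{in} \oplus D)(a + c_{in}) + a \cdot c_{in}$.

```python
def cass(p_in, a, c_in, h, d):
    """Controlled Add/Subtract/Shift cell.

    h = 0: transfer only (p_out = p_in).
    h = 1, d = 0: full-adder step, c_out is a carry.
    h = 1, d = 1: full-subtractor step, c_out is a borrow.
    """
    p_out = p_in ^ (a & h) ^ (c_in & h)
    c_out = ((p_in ^ d) & (a | c_in)) | (a & c_in)
    return p_out, c_out
```

With `h = 1` and `d = 1` the cell satisfies `p_in - a - c_in == p_out - 2 * c_out`, so the same hardware serves the subtract rows of the array.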
6.50
Booth's Algorithm Array Multiplication (cont')
[Figure: array of CASS cells with column inputs $a_3$, $a_2$, $a_1$, $a_0$ and 0 inputs at the top; each row is driven by a CTRL block that generates H and D from $x_3$, $x_2$, $x_1$, $x_0$ in turn; rows grow from 4 to 7 cells and the bottom row produces P6, P5, P4, P3, P2, P1, P0]
 Xi  Xi-1   Operation   H   D
 0    0     Shift       0   d
 1    1     Shift       0   d
 1    0     Subtract    1   1
 0    1     Add         1   0
(d: don't care)
$H = X_i \oplus X_{i-1}$
$D = X_i$
6.51
Generalized block diagram of an array multiplier
6.52
Q. Why use an array multiplier if it requires as many addition steps?
A1) An array multiplier is a combinational circuit, in which the signals flow without being clocked.
Multi-pass array multiplier: normally uses a clock, but the cycle time for passing through k arrays is < kTc.
6.53
A2) Some speed-up schemes are possible,
e.g. E/O array, Wallace tree
Even-Odd Array
6.54
Wallace-tree Multiplier
6.55
6 x 6 Wallace-tree Multiplier Example
Delay is on the order of $\log_{3/2} n$ (n: width of the Wallace tree)
e.g. For 32-bit, the number of adders necessary for each stage is
32 - 22 - 16 - 12 - 8 - 6 - 4 - 3 - 2
Total delay = 9 x adder delay
6.56
6.57
3. Datapath Generation
Datapath and its elements in bit-slice organization
MEMORY
INPUT-OUTPUT
CONTROL
DATAPATH
6.58
Two layout strategies for bit-slice datapath
6.59
Layout of 4-bit DP using layout strategy II (feedthrough)
6.60
1-D placement vs. 2-D placement
6.61
1-D placement vs. 2-D placement(Cont’)
6.62
Datapath Layout Flow
Circuit design:
RTL description
Floorplan: block ordering, bus track assignment
Schematic drawing: transistor sizing
Layout:
Cell drawing: leaf cell layout
Layout assemble: leaf cell integration (routing)
DRC / LVS: design rule check, layout vs. schematic
Back-annotation: simulation with the exact capacitance
Result: datapath layout
6.63
Datapath Design Case (ACCENT HK386)
real mode support of x86
instruction set
enhanced (pipelined)
datapath
problems & practices of
general DP layout
6.64
Datapath structure
3 major blocks:
ALU, register file (32-bit)
barrel shifter (40-bit)
segment/effective address (32-bit)
[Figure: block arrangement showing Segment, EA; Barrel Shifter; ALU; Register File]
6.65
Track capacity
metal1 VSS VDD TRACK(6) metal2
Control, Clock
Power
N-well P-well
6 vertical wires/track in metal 1
metal3 reserved for P & G routing
6.66
Segment,EA BSH ALU
Power Grid
From bottom & left(chip edges)
Considering IR drop
RF
6.67
Cell Structure
Initial cell template decision
70 80 Nwell in the left
Pwell in the right
N-well P-well
data flow vertical
control flow horizontal
Similar cell structure as VTI
Cell width
– 80 for PMOS
– 70 for NMOS
25 10 35 45 10 25
6.68
Cell Structure
A power line passes through every cell
power line width
@ 10 (2 contacts)
power line location
@ 25 to the inside
from the boundary
6.69
Accent Cell Layout Flow
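Block Spec.: capacitances are assumed at first and used for simulation; transistor sizing is finished quickly.
Schematic: since the capacitances are not accurate, optimization is considered unnecessary, and meeting the spec is taken to be enough.
SPICE: accurate capacitances are available only after the whole design is assembled, so work stalls for a long time.
If the layout is modified after assembly, everything has to be assembled again, which is an enormous amount of manual labor.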
6.70
Cell Design(I)
Using 45 degree line for cell design
Control flow
Data flow
6.71
Cell Design(II)
needless effort to reduce cell size
ugly poly; current crowding
Data flow
6.72
Critical path used for transistor sizing in relevant datapath element
6.73
Assemble Data flow
•Track assignment needs to be done before the cell layout
(not after).
6.74
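The Value of Grades
College grades and success in society
show little correlation,
and this is actually not surprising.
The factors behind success in society and the criteria
for college grades are very different.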
6.75
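The Crossroads of Starting a Business
To start a business, you must first make two things clear.
First, either target the domestic market only,
or target the world market only;
choose just one of the two.
Challenging the world market is hard,
but if you succeed, the domestic market follows automatically.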
6.76