Sequential Logic: Pipelining

[Cartoon: "Between 411 problem sets, I haven't had a minute to do laundry." "Now that's what I call dirty laundry."]

Comp 411 – Fall 2009                       11/16/2009                                      L16 – Pipelining
Forget 411… Let's Solve a "Relevant Problem"

INPUT: dirty laundry
  Device: Washer
  Function: Fill, Agitate, Spin
  WasherPD = 30 mins

OUTPUT: clean laundry
  Device: Dryer
  Function: Heat, Spin
  DryerPD = 60 mins
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart. (Sorry Mom, but you were wrong.)

Step 1: wash (30 mins)
Step 2: dry (60 mins)

Total = WasherPD + DryerPD
      = 90 mins
Here's how they do laundry at Duke, the "combinational" way: Steps 1 through N, each load is washed and dried completely before the next load starts.

(Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)

Total = N*(WasherPD + DryerPD)
      = N*90 mins
Doing N Loads… the UNC way

UNC students "pipeline" the laundry process: load i+1 goes into the washer while load i is in the dryer. That's why we wait!

Actually, it's more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we're mostly interested in the steady-state behavior, where we assume we have an infinite supply of inputs.

Total = N * Max(WasherPD, DryerPD)
      = N*60 mins
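As a sanity check, the two laundry totals can be written as a few lines of Python. This is an illustration, not part of the original slides; the delays are the WasherPD/DryerPD values given above.

```python
# Stage delays from the slides, in minutes.
WASHER_PD = 30
DRYER_PD = 60

def serial_total(n):
    """The 'Duke' way: each load finishes completely before the next starts."""
    return n * (WASHER_PD + DRYER_PD)

def pipelined_total(n):
    """The 'UNC' way: load i+1 washes while load i dries.
    The first wash is the startup transient; after that, one load
    finishes every DryerPD minutes (the slower stage)."""
    return WASHER_PD + n * DRYER_PD

print(serial_total(4))     # 4 loads, one at a time: 360 mins
print(pipelined_total(4))  # 4 loads, pipelined: 270 mins
```

For large N the pipelined total approaches N*60, matching the slide's steady-state approximation.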
Recall Our Performance Measures

Latency:
The delay from when an input is established until the output associated with that input becomes valid.

  Duke Laundry =  90 mins
  UNC Laundry  = 120 mins  (assuming the wash is started as soon as possible and waits, wet, in the washer until the dryer is available)

Throughput:
The rate at which inputs or outputs are processed.

  Duke Laundry = 1/90 outputs/min
  UNC Laundry  = 1/60 outputs/min

Even though we increase latency, it takes less time overall to process N loads.
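These latency and throughput numbers follow directly from the stage delays. A small Python sketch (illustrative only, using the slide's 30/60-minute delays):

```python
WASHER_PD, DRYER_PD = 30, 60  # minutes, from the slides

# Throughput: in steady state, the slowest stage sets the rate.
duke_throughput = 1 / (WASHER_PD + DRYER_PD)    # 1/90 outputs/min
unc_throughput = 1 / max(WASHER_PD, DRYER_PD)   # 1/60 outputs/min

# Latency: a pipelined load sits (wet) in the washer until the dryer
# frees up, so it spends one full dryer-period in each stage.
duke_latency = WASHER_PD + DRYER_PD             # 90 mins
unc_latency = 2 * max(WASHER_PD, DRYER_PD)      # 120 mins
```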
Okay, Back to Circuits…

[Circuit: input X feeds F and G; their outputs feed H, which produces P(X).]

For combinational logic:
  latency = tPD
  throughput = 1/tPD

We can't get the answer faster, but are we making effective use of our hardware at all times? F & G are "idle", just holding their outputs stable while H performs its computation.
Pipelined Circuits

Use registers to hold H's input stable! Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

                      latency    throughput
  unpipelined           45         1/45
  2-stage pipeline      50         1/25
                      (worse)    (better)

Pipelining uses registers to improve the throughput of combinational circuits.
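Assuming the 15/20/25 ns delays above, the latency/throughput table can be reproduced with a short Python sketch (illustrative, not from the slides):

```python
# Propagation delays from the slide, in ns.
T_F, T_G, T_H = 15, 20, 25

# Unpipelined: the critical path goes through the slower of F/G, then H.
unpipelined_latency = max(T_F, T_G) + T_H       # 45 ns
unpipelined_throughput = 1 / unpipelined_latency

# 2-stage pipeline with ideal registers (ts = tpd = 0): the clock period
# must cover the slowest stage, and latency is stages * clock period.
clock_period = max(max(T_F, T_G), T_H)          # 25 ns
pipelined_latency = 2 * clock_period            # 50 ns (worse)
pipelined_throughput = 1 / clock_period         # 1/25  (better)
```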
Pipeline Diagrams

[Same circuit as before: F (15 ns) and G (20 ns) feed H (25 ns), with pipeline registers.]

                Clock cycle
                i       i+1      i+2      i+3      …
  Input         Xi      Xi+1     Xi+2     Xi+3     …
  F Reg                 F(Xi)    F(Xi+1)  F(Xi+2)  …
  G Reg                 G(Xi)    G(Xi+1)  G(Xi+2)  …
  H Reg                          H(Xi)    H(Xi+1)  …

This is an example of parallelism: at any instant we are computing 2 results.

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
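A pipeline diagram like the one above can be generated mechanically. The sketch below is illustrative (the helper name `pipeline_rows` is invented here), and for brevity it shows only the F and H register rows:

```python
def pipeline_rows(inputs, stages):
    """Build a pipeline diagram: one row per stage register, one column
    per clock cycle.  Input j is captured by stage register s at the end
    of cycle j + s, so it appears in that row during cycle j + s + 1,
    and each result moves diagonally, one stage per clock."""
    n_cycles = len(inputs) + len(stages)
    rows = {stage: ["-"] * n_cycles for stage in stages}
    for j, x in enumerate(inputs):
        for s, stage in enumerate(stages):
            rows[stage][j + s + 1] = f"{stage[0]}({x})"
    return rows

rows = pipeline_rows(["Xi", "Xi+1", "Xi+2"], ["F Reg", "H Reg"])
for stage, cells in rows.items():
    print(f"{stage:>6} | " + " | ".join(f"{c:>8}" for c in cells))
```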
Pipelining Summary

– Higher throughput than a combinational system
– Different parts of the logic work on different parts of the problem…
– Generally increases latency
– Only as good as the *weakest* link (often called the pipeline's BOTTLENECK); this bottleneck is the only problem

Isn't there a way around this "weak link" problem?
How do UNC students REALLY do Laundry?

They work around the bottleneck. First, they find a place with twice as many dryers as washers, and alternate loads between the two dryers.

Throughput = 1/30 outputs/min
Latency = 90 mins
Better Yet… Parallelism

We can combine interleaving and pipelining with parallelism: run the washer/dryer pipelines side by side.

Throughput = 1/15 outputs/min
Latency = 90 mins
"Classroom Computer"

There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on… Assuming we want to pipeline the grading, how do we time the passing of papers between rows?

[Psets flow in through Row 1 → Row 2 → Row 3 → Row 4 → Row 5 → Row 6.]
Controls for "Classroom Computer"

Globally Timed:
  Synchronous: Teacher picks a time interval long enough for the worst-case problem. Everyone passes psets at the end of the interval.
  Asynchronous: Teacher picks a variable time interval long enough for the current set of problems. Everyone passes psets at the end of the interval.

Locally Timed:
  Synchronous: Students raise hands when they finish grading the current problem. Teacher checks every 10 secs; when all hands are raised, everyone passes psets to the row behind. Variant: students can pass when all students in a "column" have hands raised.
  Asynchronous: Students grade the current problem, wait for the student in the next row to be free, and then pass the pset back.
Control Structure Taxonomy

Globally Timed:
  Synchronous: A centralized clocked FSM generates all control signals. Easy to design, but the fixed-size interval can be wasteful (no data-dependencies in timing).
  Asynchronous: A central control unit tailors the current time slice to the current tasks. Large systems lead to very complicated timing generators… just say no!

Locally Timed:
  Synchronous: Start and Finish signals generated by each major subsystem, synchronously with the global clock. The best way to build large systems that have independent components.
  Asynchronous: Each subsystem takes an asynchronous Start and generates an asynchronous Finish (perhaps using a local clock). The "next big idea" for the last several decades: a lot of design work to do in general, but the extra work is worth it in special cases.
Review of CPU Performance

MIPS = Millions of Instructions/Second

  MIPS = Freq / CPI
    Freq = Clock Frequency, MHz
    CPI = Clocks per Instruction

To Increase MIPS:
1. DECREASE CPI.
   - RISC simplicity reduces CPI to 1.0.
   - CPI below 1.0? State-of-the-art multiple instruction issue.
2. INCREASE Freq.
   - Freq limited by delay along longest combinational path; hence
   - PIPELINING is the key to improving performance.
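The formula above is easy to encode; this is an illustrative sketch (the helper name `mips` is made up here):

```python
def mips(freq_mhz, cpi):
    """MIPS = Freq / CPI, with Freq in MHz and CPI in clocks/instruction."""
    return freq_mhz / cpi

# A 100 MHz RISC machine at CPI = 1.0 delivers 100 MIPS.  Halving CPI
# (multiple issue) or doubling Freq (pipelining) doubles MIPS.
print(mips(100, 1.0))
print(mips(100, 0.5))
print(mips(200, 1.0))
```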
miniMIPS Timing

[Timing diagram, one clock period from CLK edge to CLK edge: New PC → PC+4 / Fetch Inst. → Control Logic → +OFFSET, ASEL mux, BSEL mux → ALU → Fetch data → PCSEL/WASEL/WDSEL muxes → PC setup, RF setup, Mem setup.]

The diagram illustrates the data flow of miniMIPS. Wanted: the longest path.

Complications:
• some apparent paths aren't "possible"
• functional units have variable execution times (eg, ALU)
• the time axis is not to scale (eg, tPD,MEM is very big!)
Where Are the Bottlenecks?

[miniMIPS datapath diagram: PC and instruction memory, register file (RA1/RA2/WA/WD ports), sign extension, ALU, data memory, and the PCSEL, WASEL, ASEL, BSEL, and WDSEL muxes driven by the control logic.]

Pipelining goal: break LONG combinational paths, putting the memories and the ALU in separate stages.
Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include the slowest components (mems, regfile, ALU).
APPROACH: structure the processor as a 5-stage pipeline:

  IF      Instruction Fetch stage: maintains the PC, fetches one instruction per cycle, and passes it to the next stage.
  ID/RF   Instruction Decode/Register File stage: decodes control lines and selects source operands.
  ALU     ALU stage: performs the specified operation and passes the result to the next stage.
  MEM     Memory stage: if it's a lw, uses the ALU result as an address and passes mem data (or the ALU result if not lw) to the next stage.
  WB      Write-Back stage: writes the result back into the register file.
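The ideal stage occupancy can be sketched as follows. This is an illustration with an invented `schedule` helper; it assumes no stalls or hazards, so instruction i is simply in stage s during cycle i + s:

```python
STAGES = ["IF", "ID/RF", "ALU", "MEM", "WB"]

def schedule(program):
    """Ideal 5-stage pipeline (no stalls or hazards): instruction i
    occupies stage s during cycle i + s, so once the pipe is full,
    one instruction completes per cycle (CPI ~ 1.0)."""
    occupancy = {}
    for i, instr in enumerate(program):
        for s, stage in enumerate(STAGES):
            occupancy[(i + s, stage)] = instr
    return occupancy

occ = schedule(["lw", "add", "sw"])
# e.g. during cycle 3, "lw" is in MEM, "add" is in ALU, "sw" is in ID/RF
```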
miniMIPS Timing

Different instructions use different parts of the data path:
1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS.

[Timing diagram for beq $1, $2, 40 / lw $3, 30($0) / jal 20000 / sw $2, 20($4).]

Component delays:
  6 nS   Instruction Fetch
  2 nS   Instruction Decode
  2 nS   Register Prop Delay
  5 nS   ALU Operation
  4 nS   Branch Target
  6 nS   Data Access
  1 nS   Register Setup

This is an example of an "Asynchronous Globally-Timed" control strategy (see Lecture 18). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation and, in the end, slower systems, since it is not very compatible with pipelining!
Uniform miniMIPS Timing

With a fixed clock period, we have to allow for the worst case: 1 instr EVERY 20 nS.

[Timing diagram for the same instruction sequence and component delays as before.]

By accounting for the "worst case" path (i.e. allowing time for each possible combination of operations) we can implement a "Synchronous Globally-Timed" control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining! (But isn't the net effect just a slower CPU?)
Step 1: A 2-Stage Pipeline

[Datapath diagram: the IF stage contains the PC, PCSEL mux, and instruction memory; pipeline registers PC^EXE and IR^EXE separate it from the EXE stage, which contains the register file, sign extension, ALU, data memory, and the write-back muxes.]

IR stands for "Instruction Register". The superscript "EXE" denotes the pipeline stage in which the PC and IR are used.
2-Stage Pipe Timing

Improves performance by increasing instruction throughput. The ideal speedup is the number of pipeline stages in the pipeline.

[Timing diagram for beq $1, $2, 40 / lw $3, 30($0) / jal 20000 / sw $2, 20($4), with the same component delays as before. During this, and all subsequent clocks, two instructions are in various stages of execution.]

By partitioning each instruction cycle into a "fetch" stage and an "execute" stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores.

Latency? 2 clock periods = 2*14 nS.
Throughput? 1 instr per 14 nS.
Further improves performance, with a slight increase in control complexity. Some 1st-generation (pre-cache) RISC processors used this approach. This design is very similar to the multicycle CPU described in section 5.5 of the text, but with pipelining.

[Timing diagram for add $4, $5, $6 / beq $1, $2, 40 / lw $3, 30($0) / jal 20000 / sw $2, 20($4), with the same component delays as before. Clock: 8 nS!]

The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster? Not likely. In practice, as many as 30% of instructions access memory. Thus, the effective speedup is:

  speedup = old clock period / (new clock period * (0.7 + 2*0.3))
          = 20 / (8 * 1.3) ≈ 1.92
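The speedup arithmetic can be checked directly. An illustrative sketch, using the 20 ns and 8 ns clock periods and the 30% memory-access fraction from the slide:

```python
OLD_CLOCK = 20      # ns: uniform, worst-case single-cycle design
NEW_CLOCK = 8       # ns: this pipelined variant
MEM_FRACTION = 0.3  # ~30% of instructions access memory (2 cycles each)

# Average cycles per instruction with 2-cycle loads/stores.
avg_cycles = (1 - MEM_FRACTION) * 1 + MEM_FRACTION * 2   # 1.3

speedup = OLD_CLOCK / (NEW_CLOCK * avg_cycles)
print(round(speedup, 2))  # prints 1.92
```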
2-Stage Pipelined Operation

Consider a sequence:
  ...
  xor    $t2,$t1,$t2
  sltiu  $t3,$t2,1
  srl    $t2,$t2,1
  ...

Recall "Pipeline Diagrams" from an earlier slide. Executed on our 2-stage pipeline:

          TIME (cycles)
          i      i+1    i+2    i+3    i+4
  IF      addi   xor    sltiu  srl    ...
  EXE            addi   xor    sltiu  srl

It can't be this easy!?
Step 2: 4-Stage miniMIPS

[Datapath diagram: the single-cycle datapath cut into four stages by pipeline registers. Instruction Fetch (PC, instruction memory, +4) feeds PC^REG/IR^REG; the Register File stage reads RD1/RD2 and sign-extends immediates, feeding PC^ALU/IR^ALU/WD^ALU; the ALU stage computes and feeds PC^MEM/IR^MEM/WD^MEM; Write Back accesses data memory and drives WDSEL/WASEL into the register file's write port.]

This treats the register file as two separate devices: a combinational READ early in the pipe and a WRITE at the end of the pipe (NB: it is the SAME register file in both places!).

What other information do we have to pass down the pipeline? The PC and the instruction fields (for decoding).

What sort of improvement should we expect in cycle time?
4-Stage miniMIPS Operation

Consider a sequence:
  ...
  sll    $t1,$t1,2
  andi   $t2,$t2,15
  sub    $t3,$0,$t3
  ...

Executed on our 4-stage pipeline:

          TIME (cycles)
          i      i+1    i+2    i+3    i+4    i+5
  IF      addi   sll    andi   sub    ...
  RF             addi   sll    andi   sub    ...
  ALU                   addi   sll    andi   sub