History of Pipelining
• Introduced in IBM 7030 (Stretch computer)
• CDC 6600 used pipelining and multiple
functional units
• RISC processors in the 1980s were pipelined, in an
effort to reach an IPC of 1
• i486 was the first pipelined CISC processor
• Pipelined VAX from Digital
• Pipelined Motorola 68000K
• Current Trend – deep pipelines
Pipeline Illustrated:
[Figure: combinational logic of n gate delays between latches (L) gives BW = ~(1/n); split into two stages of n/2 gate delays each, BW = ~(2/n); split into three stages of n/3 gate delays each, BW = ~(3/n).]
Pipeline Partitioning
Divide functionality into k-stages, k-fold
speedup?
Latches
Clock skew
Uniform sub-computations
Earle Latch
Pipeline Partitioning
$K_{opt} = \sqrt{\dfrac{G\,T}{L\,S}}$
G = cost of the non-pipelined design
L = cost of each latch
K = number of stages
T = latency of the non-pipelined design
S = latency increase due to a latch (i.e. T/K + S is the new clock period)
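The formula can be checked numerically. The sketch below assumes the usual cost/performance model behind it, which the slide does not spell out: hardware cost G + K·L multiplied by clock period T/K + S. The numeric values are hypothetical and chosen only so the result is a round number.

```python
import math

def optimal_stage_count(G, L, T, S):
    """Stage count that minimizes (G + K*L) * (T/K + S).

    Setting the derivative with respect to K to zero gives
    K = sqrt(G*T / (L*S)).
    """
    return math.sqrt((G * T) / (L * S))

def cost_performance(K, G, L, T, S):
    """Assumed model: hardware cost times clock period."""
    cost = G + K * L          # base logic plus K sets of latches
    period = T / K + S        # per-stage logic delay plus latch overhead
    return cost * period

# Hypothetical values, not taken from the slides.
G, L, T, S = 100.0, 5.0, 400.0, 20.0
k_opt = optimal_stage_count(G, L, T, S)
print("K_opt =", k_opt)
print("cost*period at K_opt:", cost_performance(k_opt, G, L, T, S))
```

Evaluating `cost_performance` at neighbouring stage counts shows the curve rising on both sides of `k_opt`, which is why neither very shallow nor very deep pipelines are optimal under this model.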
Non-Pipelined FP Multiplier
Pipelined FP Multiplier
Non-pipelined chip count =175
Non pipelined delay = 400 ns
Non pipelined clk = 2.5 MHz
Assume latching delay = 17 ns
Set-up time = 5 ns
Max stage delay = 150 ns
Minimum clk period = 150 + 17 + 5 = 172 ns
Pipelined clk = 5.8 MHz
Latency for each mult = 3 × 172 = 516 ns (instead of 400 ns)
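The arithmetic above can be reproduced with a short script. This is a sketch under one assumption: the stage count of 3 is inferred from 516 ns / 172 ns and is not stated explicitly on the slide.

```python
def pipelined_timing(max_stage_delay_ns, latch_delay_ns, setup_ns, num_stages):
    """Return (clock period in ns, clock in MHz, per-operation latency in ns)."""
    period_ns = max_stage_delay_ns + latch_delay_ns + setup_ns
    clock_mhz = 1000.0 / period_ns
    latency_ns = num_stages * period_ns
    return period_ns, clock_mhz, latency_ns

NON_PIPELINED_DELAY_NS = 400

# num_stages=3 is an inferred assumption (516 / 172).
period_ns, clock_mhz, latency_ns = pipelined_timing(150, 17, 5, num_stages=3)
throughput_gain = NON_PIPELINED_DELAY_NS / period_ns

print(f"clock period  : {period_ns} ns")
print(f"clock rate    : {clock_mhz:.1f} MHz")
print(f"latency       : {latency_ns} ns")
print(f"throughput x  : {throughput_gain:.2f}")
```

The result shows the trade-off in the example: each multiply takes longer, but results are delivered a little more than twice as often.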
CPU Example
• Suppose 2 ns for memory access, 2 ns for ALU
operation, and 1 ns for register file read or
write; compute the instruction rate
• Nonpipelined Execution:
– lw: IF + Read Reg + ALU + Memory + Write Reg
= 2 + 1 + 2 + 2 + 1 = 8 ns
– add: IF + Read Reg + ALU + Write Reg
= 2 + 1 + 2 + 1 = 6 ns
• Pipelined Execution:
– Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns per stage,
so one instruction completes every 2 ns once the pipeline is full
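A minimal sketch of the same calculation, using the stage times given in the example. The dictionary names are illustrative only, and the pipelined rate assumes a full pipeline with no hazards.

```python
# Stage times in ns, from the example.
STAGE_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}

# Stages each instruction type actually uses.
USES = {
    "lw":  ["IF", "Read Reg", "ALU", "Memory", "Write Reg"],
    "add": ["IF", "Read Reg", "ALU", "Write Reg"],
}

def nonpipelined_ns(instr):
    """Time for one instruction when its stages run back to back."""
    return sum(STAGE_NS[s] for s in USES[instr])

def pipelined_cycle_ns():
    """Clock period of the pipeline: the slowest stage sets the pace."""
    return max(STAGE_NS.values())

lw_ns = nonpipelined_ns("lw")
add_ns = nonpipelined_ns("add")
cycle_ns = pipelined_cycle_ns()

# Instructions per microsecond (ideal pipeline, no stalls).
pipelined_rate = 1000 / cycle_ns
print(lw_ns, add_ns, cycle_ns, pipelined_rate)
```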
Problems for Computers
• Limits to pipelining: Hazards prevent next
instruction from executing during its
designated clock cycle
– Data hazards: Instruction depends on result of
prior instruction still in the pipeline (missing
sock)
– Structural hazards: 2 instructions need the same
resource at the same time
– Control hazards: branches and other instructions
that change the PC stall the pipeline until the
hazard is resolved, leaving “bubbles” in the pipeline
Structural Hazard #1: Single Memory (1/2)
[Figure: pipeline diagram, time in clock cycles across and instruction order down (Load, Instr 1, Instr 2, Instr 3, Instr 4); each instruction passes through I$, Reg, ALU, D$, Reg.]
Read same memory twice in same clock cycle
Structural Hazard #1: Single Memory (2/2)
• Solution:
– have both an L1 Instruction Cache and an L1
Data Cache
– need more complex hardware to control when
both caches miss
Structural Hazard #2: Registers (1/2)
[Figure: pipeline diagram, time in clock cycles across and instruction order down (sw, Instr 1, Instr 2, Instr 3, Instr 4); each instruction passes through I$, Reg, ALU, D$, Reg.]
Can’t read and write to registers simultaneously
Structural Hazard #2: Registers (2/2)
• Fact: Register access is VERY fast: takes less
than half the time of ALU stage
• Solution: introduce convention
– always Write to Registers during first half of
each clock cycle
– always Read from Registers during second half
of each clock cycle
– Result: can perform Read and Write during
same clock cycle
Things to Remember
• Optimal Pipeline
– Each stage is executing part of an instruction each
clock cycle.
– One instruction finishes during each clock cycle.
– On average, execute far more quickly.
• What makes this work?
– Similarities between instructions allow us to use
same stages for all instructions (generally).
– Each stage takes about the same amount of time as
all others: little wasted time.
MIPS ISA Handout
Data Hazards (1/2)
• Consider the following sequence of
instructions
add $t0, $t1, $t2
sub $t4, $t0, $t3
and $t5, $t0, $t6
or  $t7, $t0, $t8
xor $t9, $t0, $t10
Data Hazards (2/2)
Dependencies backwards in time are hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB), time in clock cycles across and instruction order down: add $t0,$t1,$t2; sub $t4,$t0,$t3; and $t5,$t0,$t6; or $t7,$t0,$t8; xor $t9,$t0,$t10.]
Data Hazard Solution: Forwarding
• Forward result from one stage to another
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for add $t0,$t1,$t2; sub $t4,$t0,$t3; and $t5,$t0,$t6; or $t7,$t0,$t8; xor $t9,$t0,$t10, with forwarding paths from the add.]
“sub” and “and” could use forwarding
Data Hazard: Loads (1/4)
• Dependencies backwards in time are
hazards
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for lw $t0,0($t1) followed by sub $t3,$t0,$t2.]
• Can’t solve with forwarding
• Must stall instruction dependent on
load, then forward (more hardware)
Data Hazard: Loads (2/4)
• Hardware must stall pipeline
• Called “interlock”
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for lw $t0,0($t1); sub $t3,$t0,$t2; and $t5,$t0,$t4; or $t7,$t0,$t6, with one bubble inserted in each of the three instructions after the lw.]
Data Hazard: Loads (3/4)
• Stall is equivalent to nop
[Figure: pipeline diagram for lw $t0,0($t1); nop (a bubble in every stage); sub $t3,$t0,$t2; and $t5,$t0,$t4; or $t7,$t0,$t6.]
Data Hazard: Loads (4/4)
• Instruction slot after a load is called “load
delay slot”
• If that instruction uses the result of the load,
then the hardware interlock will stall it for one
cycle.
• If the compiler puts an unrelated instruction in
that slot, then no stall
• Letting the hardware stall the instruction in the
delay slot is equivalent to putting a nop in the
slot (except the latter uses more code space)
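The load delay slot rule can be illustrated with a small stall counter. This is a simplified sketch, assuming forwarding everywhere except the load-use case; the tuple encoding of instructions and the unrelated `add $t8,$t9,$s0` used to fill the slot are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle interlock stalls.

    program is a list of (opcode, destination, sources) tuples.
    A stall occurs when the instruction right after a lw reads
    the register that the lw writes.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls

# The sequence from the slides: sub sits in the load delay slot.
unscheduled = [
    ("lw",  "$t0", ("$t1",)),
    ("sub", "$t3", ("$t0", "$t2")),
    ("and", "$t5", ("$t0", "$t4")),
    ("or",  "$t7", ("$t0", "$t6")),
]

# Same code with a hypothetical unrelated instruction moved into the slot.
scheduled = [
    ("lw",  "$t0", ("$t1",)),
    ("add", "$t8", ("$t9", "$s0")),
    ("sub", "$t3", ("$t0", "$t2")),
    ("and", "$t5", ("$t0", "$t4")),
    ("or",  "$t7", ("$t0", "$t6")),
]

print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

In the unscheduled version the hardware inserts the bubble; in the scheduled version the slot does useful work and no stall is needed.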
Control Hazard: Branching (1/5)
[Figure: pipeline diagram, time in clock cycles across and instruction order down (beq, Instr 1, Instr 2, Instr 3, Instr 4); each instruction passes through I$, Reg, ALU, D$, Reg.]
Where do we do the compare for the branch?
Control Hazard: Branching (2/5)
• We put branch decision-making hardware in
ALU stage
– therefore two more instructions after the branch
will always be fetched, whether or not the
branch is taken
• Desired functionality of a branch
– if we do not take the branch, don’t waste any
time and continue executing normally
– if we take the branch, don’t execute any
instructions after the branch, just go to the
desired label
Control Hazard: Branching (3/5)
• Initial Solution: Stall until decision is made
– insert “no-op” instructions: those that
accomplish nothing, just take time
– Drawback: branches take 3 clock cycles each
(assuming comparator is put in ALU stage)
Control Hazard: Branching (4/5)
• Optimization #1:
– move asynchronous comparator up to Stage 2
– as soon as instruction is decoded (Opcode
identifies it as a branch), immediately make a
decision and set the value of the PC (if
necessary)
– Benefit: since branch is complete in Stage 2,
only one unnecessary instruction is fetched, so
only one no-op is needed
– Side Note: This means that branches are idle in
Stages 3, 4 and 5.
Control Hazard: Branching (5/5)
• Insert a single no-op (bubble)
[Figure: pipeline diagram, time in clock cycles across and instruction order down: add, beq, bubble, lw; each instruction passes through I$, Reg, ALU, D$, Reg.]
• Impact: 2 clock cycles per branch instruction: slow
Quiz
Assume 1 instr/clock, delayed branch, 5 stage pipeline,
forwarding, interlock on unresolved load hazards (after
10^3 loops, so pipeline full)
Loop: lw $t0, 0($s1)
      addu $t0, $t0, $s2
      sw $t0, 0($s1)
      addiu $s1, $s1, -4
      bne $s1, $zero, Loop
      nop
• How many pipeline stages (clock cycles) per loop
iteration to execute this code?
1 2 3 4 5 6 7 8 9 10
Quiz Answer
• Assume 1 instr/clock, delayed branch, 5 stage
pipeline, forwarding, interlock on unresolved
load hazards. 10^3 iterations, so pipeline full.
Loop: 1. lw $t0, 0($s1)
      2. (data hazard so stall)
      3. addu $t0, $t0, $s2
      4. sw $t0, 0($s1)
      5. addiu $s1, $s1, -4
      6. bne $s1, $zero, Loop
      7. nop (delayed branch so exec. nop)
• How many pipeline stages (clock cycles) per
loop iteration to execute this code?
1 2 3 4 5 6 7 8 9 10
Answer: 7
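The count can be reproduced mechanically. The sketch below is a simplified model under the quiz's stated assumptions (one instruction per clock, forwarding, a one-cycle interlock only for load-use, and the delay-slot nop counted as an executed instruction); the tuple encoding is illustrative.

```python
# (opcode, destination register or None, source registers)
LOOP_BODY = [
    ("lw",    "$t0", ("$s1",)),
    ("addu",  "$t0", ("$t0", "$s2")),
    ("sw",    None,  ("$t0", "$s1")),
    ("addiu", "$s1", ("$s1",)),
    ("bne",   None,  ("$s1", "$zero")),
    ("nop",   None,  ()),
]

def cycles_per_iteration(body):
    """Steady-state cycles per loop iteration.

    One cycle per instruction, plus one stall whenever the
    instruction after a lw reads the loaded register. The
    successor of the last instruction is the first one again.
    """
    n = len(body)
    stalls = 0
    for i, (op, dest, _) in enumerate(body):
        successor_sources = body[(i + 1) % n][2]
        if op == "lw" and dest in successor_sources:
            stalls += 1
    return n + stalls

cycles = cycles_per_iteration(LOOP_BODY)
print(cycles)
```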
Pipelining Idealism
• Uniform Suboperations
The operation to be pipelined can be evenly partitioned
into uniform-latency suboperations
• Repetition of Identical Operations
The same operations are to be performed repeatedly on a
large number of different inputs
• Repetition of Independent Operations
All the repetitions of the same operation are mutually
independent, i.e. no data dependence
and no resource conflicts
Good Examples: automobile assembly line
floating-point multiplier
instruction pipeline???
Instruction Pipeline Design
• Uniform Suboperations ...
balance pipeline stages
- stage quantization to yield balanced stages
- minimize internal fragmentation (some waiting stages)
• Identical operations ...
unifying instruction types
- coalescing instruction types into one “multi-function” pipe
- minimize external fragmentation (some idling stages)
• Independent operations ...
resolve data and resource hazards
- inter-instruction dependency detection and resolution
- minimize performance loss
The Generic Instruction Cycle
• The “computation” to be pipelined
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand(s) Fetch (OF)
4. Instruction Execution (EX)
5. Operand Store (OS)
6. Update Program Counter (PC)
The GENERIC Instruction Pipeline (GNR)
Based on Obvious Subcomputations:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)
Balancing Pipeline Stages
IF: TIF = 6 units
ID: TID = 2 units
OF: TOF = 9 units
EX: TEX = 5 units
OS: TOS = 9 units
• Without pipelining
Tcyc = TIF + TID + TOF + TEX + TOS = 31
• Pipelined
Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9
Speedup = 31 / 9 ≈ 3.4
Can we do better in terms of either performance or
efficiency?
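A short sketch of the numbers above, which also makes the imbalance visible. The "idle" and "utilization" figures are an illustrative addition, computed as the time each stage waits on the slowest one.

```python
# Stage latencies in units, from the slide.
STAGE_UNITS = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(STAGE_UNITS.values())   # all stages back to back
t_cycle = max(STAGE_UNITS.values())         # slowest stage sets the clock
speedup = t_unpipelined / t_cycle

# Time each stage sits idle per cycle because it finishes early.
idle = {name: t_cycle - t for name, t in STAGE_UNITS.items()}

# Fraction of stage-time doing useful work.
utilization = t_unpipelined / (len(STAGE_UNITS) * t_cycle)

print(f"speedup     = {speedup:.2f}")
print(f"idle units  = {idle}")
print(f"utilization = {utilization:.2f}")
```

With five stages the ideal speedup would be 5; the gap comes entirely from the unbalanced IF, ID and EX stages.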
Balancing Pipeline Stages
• Two Methods for Stage Quantization:
– Merging of multiple subcomputations into one.
– Subdividing a subcomputation into multiple
subcomputations.
• Current Trends:
– Deeper pipelines (more and more stages).
– Multiple different subpipelines.
– Pipelining of memory access (tricky).
Granularity of Pipeline Stages
Coarser-Grained Machine Cycle: 4 machine cyc / instruction cyc
  1      IF + ID    TIF&ID = 8 units
  2      OF         TOF = 9 units
  3      EX         TEX = 5 units
  4      OS         TOS = 9 units
Finer-Grained Machine Cycle: 11 machine cyc / instruction cyc, Tcyc = 3 units
  1-2    IF (including delay)
  3      ID (including delay)
  4-6    OF
  7-8    EX (EX1, EX2)
  9-11   OS (including delay)
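Both partitionings follow from the same rule: each subcomputation needs as many machine cycles as it takes to cover its latency. A minimal sketch, assuming the stage latencies used earlier in the deck (6, 2, 9, 5, 9 units); the function name is illustrative.

```python
import math

def quantize(stage_units, t_cycle):
    """Machine cycles needed by each subcomputation at a given cycle time."""
    return {name: math.ceil(t / t_cycle) for name, t in stage_units.items()}

# Finer-grained: subdivide every subcomputation into 3-unit cycles.
fine = quantize({"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}, t_cycle=3)

# Coarser-grained: merge IF and ID, then use a 9-unit cycle.
coarse = quantize({"IF+ID": 8, "OF": 9, "EX": 5, "OS": 9}, t_cycle=9)

fine_cycles = sum(fine.values())
coarse_cycles = sum(coarse.values())

# Total time for one instruction to traverse each pipeline.
fine_latency = fine_cycles * 3
coarse_latency = coarse_cycles * 9

print(fine, fine_cycles, fine_latency)
print(coarse, coarse_cycles, coarse_latency)
```

Against the 31 units of real work per instruction, the finer partitioning wastes less time in delay slots while also clocking three times faster.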
Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all the stages
• Memory accessing ports needed to support all the stages
[Figure: the 4-stage coarser-grained and 11-stage finer-grained pipelines from the previous slide.]
Pipeline Examples
Generic   MIPS R2000/R3000   AMDAHL 470V/7
IF        1  IF              1  PC GEN
                             2  Cache Read
                             3  Cache Read
ID        2  RD              4  Decode
OF                           5  Read REG
                             6  Add GEN
                             7  Cache Read
                             8  Cache Read
EX        3  ALU             9  EX 1
                             10 EX 2
OS        4  MEM             11 Check Result
          5  WB              12 Write Result
Coalescing Resource Requirements
• Procedure:
1. Analyze the sequence of register transfers
required by each instruction type.
2. Find commonality across instruction types and
merge them to share the same pipeline stage.
3. If there exists flexibility, shift or reorder some
register transfers to facilitate further merging.
Unifying instructions to 6-stage pipeline
The 6-stage TYPICAL (TYP) pipeline:
      ALU         LOAD         STORE        BRANCH
IF:   I-CACHE     I-CACHE      I-CACHE      I-CACHE
      PC          PC           PC           PC
ID:   DECODE      DECODE       DECODE       DECODE
OF:   RD. REG.    RD. REG.     RD. REG.     RD. REG.
EX:   ALU OP.     ADDR. GEN.
                  RD. MEM.
OS:   WR. REG.    WR. REG.     ADDR. GEN.   ADDR. GEN.
                               WR. MEM.     WR. PC
TYP stages: IF (1), ID (2), RD (3), ALU (4), MEM (5), WB (6)
Physical Organization of 6-stage
pipeline
Pipeline Interface to Memory
Pipeline Interface to Register File
Minimizing Pipeline Stalls
• Dependences lead to Pipeline Hazards
Occurrence of Hazards
Necessary Conditions
Penalties Due to RAW hazards
Leading Inst i             ALU             Load            Branch
Trailing Inst j            ALU, L/S, Br.   ALU, L/S, Br.   ALU, L/S, Br.
Hazard register            Int. Reg. (Ri)  Int. Reg. (Ri)  PC
Register WRITE stage (i)   WB (stage 6)    WB (stage 6)    MEM (stage 5)
Register READ stage (j)    RD (stage 3)    RD (stage 3)    IF (stage 1)
RAW distance or penalty    3 cycles        3 cycles        4 cycles
Incorporation of Forwarding Paths in TYP pipeline
Penalties with Forwarding Paths
Leading Inst i (producer)   ALU             Load            Branch
Trailing Inst j (consumer)  ALU, L/S, Br.   ALU, L/S, Br.   ALU, L/S, Br.
Hazard register             Int. Reg. (Ri)  Int. Reg. (Ri)  PC
Value Produced stage (i)    ALU (stage 4)   MEM (stage 5)   MEM (stage 5)
Value Consumed stage (j)    ALU (stage 4)   ALU (stage 4)   IF (stage 1)
Forward from outputs of     ALU, MEM, WB    MEM, WB         MEM
Forward to input of         ALU             ALU             IF
RAW distance or penalty     0 cycles        1 cycle         4 cycles
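In both penalty tables the worst-case penalty is the distance between the stage where the leading instruction makes the value available and the stage where the trailing instruction needs it. A minimal sketch of that rule, using the TYP stage numbers from the tables; the function and dictionary names are illustrative.

```python
def raw_penalty(produce_stage, consume_stage):
    """Worst-case stall cycles for a back-to-back dependent pair."""
    return max(0, produce_stage - consume_stage)

# (stage where the value becomes available, stage where it is needed)
WITHOUT_FORWARDING = {
    "ALU":    (6, 3),   # written in WB, read in RD
    "Load":   (6, 3),   # written in WB, read in RD
    "Branch": (5, 1),   # PC written in MEM, read in IF
}
WITH_FORWARDING = {
    "ALU":    (4, 4),   # ALU output forwarded to ALU input
    "Load":   (5, 4),   # MEM output forwarded to ALU input
    "Branch": (5, 1),   # target still known only in MEM
}

no_fwd = {k: raw_penalty(*v) for k, v in WITHOUT_FORWARDING.items()}
fwd = {k: raw_penalty(*v) for k, v in WITH_FORWARDING.items()}
print(no_fwd)
print(fwd)
```

Forwarding removes the ALU penalty and reduces the load penalty to one cycle, but leaves the branch penalty unchanged.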
Forwarding Paths for leading ALU instruction
Pipeline Interlocks for leading ALU instruction
Forwarding for leading Load
Pipeline Interlocks for ALU,Load
Pipeline Interlocks for Branch
Historical Trivia
• First MIPS design did not interlock and stall
on load-use data hazard
• Real reason for name behind MIPS:
Microprocessor without
Interlocked
Pipeline
Stages
– Word Play on acronym for
Millions of Instructions Per Second,
also called MIPS
Load Delay Slot (MIPS R2000)
      t0    t1    t2    t3    t4    t5
i:    IF    ID    RD    ALU   MEM   WB
j:          IF    ID    RD    ALU   MEM   WB
k:                IF    ID    RD    ALU   MEM   WB
h: Rk ← --
……
i: Rk ← MEM[ - ]
j: -- ← Rk
k: -- ← Rk
- The effect of a “delayed” Load is not visible to the
instructions in its delay slots.
Which (Rk) do we really mean?
RISC Pipeline Example
Real Pipelined Processor Example: MIPS R2000
Stage name   Phase   Function performed
1. IF        1       Translate virtual instr. addr. using TLB
             2       Access I-cache using physical address
2. RD        1       Return instr. from I-cache, check tags & parity
             2       Read reg. file; if a branch, generate target addr.
3. ALU       1       Start ALU op.; if a branch, check br. condition
             2       Finish ALU op.; if a load/store, translate virtual addr.
4. MEM       1       Access D-cache
             2       Return data from D-cache, check tags & parity
5. WB        1       Write register file
             2       ---
Intel i486 5-Stage “CISC” Pipeline
Stage name                Function performed
1. Instruction Fetch      Fetch instruction from the 32-byte prefetch queue
                          (prefetch unit fills and flushes prefetch queue)
2. Instruction Decode-1   Translate instr. into control signals or microcode addr.
                          Initiate addr. generation and memory access
3. Instruction Decode-2   Access microcode memory
                          Output microinstruction to execution unit
4. Execute                Execute ALU and memory accessing operations
5. Register Write-back    Write back results to register
IBM’s Experience on Pipelined Processors
[Agerwala and Cocke 1987]
Attributes and Assumptions:
• Memory Bandwidth
– at least one word/cycle to fetch 1 instruction/cycle from I-cache
– 40% of instructions are load/store, require access to D-cache
• Code Characteristics (dynamic)
– loads - 25%
– stores - 15%
– ALU/Reg-Reg - 40%
– branches - 20%:
» 1/3 unconditional (always taken)
» 1/3 conditional taken
» 1/3 conditional not taken
More Statistics and Assumptions
• Cache Performance
– hit ratio of 100% is assumed in the experiments
– cache latency: I-cache = i; D-cache = d; default: i = d = 1 cycle
• Load and Branch Scheduling
– loads:
• 25% cannot be scheduled
• 65% can be moved back 2 instructions; 10% can be moved back 1 instruction (1 delay slot)
– branches:
• unconditional - 100% schedulable (fill 1 delay slot)
• conditional - 50% schedulable (fill 1 delay slot)
CPI Calculations I
Instruction mix: 25% loads, 15% stores, 40% ALU, 20% branches
• No cache bypass of reg. file, no scheduling of
loads or branches
– Load Penalty: 2 cycles
– Branch Penalty: 2 cycles
– Total CPI: 1 + 0.25*2 + 0.2*0.66*2 = 1 + 0.5 + 0.27 =
1.77 CPI (assume branch predicted not taken, so the penalty
applies only to the 66% of branches that are taken)
• Bypass reg file for loads
– Load Penalty: 1 cycle
– Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
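The two totals above follow one formula: base CPI of 1 plus each instruction class's frequency times its penalty. A minimal sketch, using exact thirds where the slide rounds to 0.66; the constant and function names are illustrative.

```python
LOAD_FRACTION = 0.25
BRANCH_FRACTION = 0.20
TAKEN_FRACTION = 2 / 3     # unconditional + conditional taken

def cpi(load_penalty, branch_penalty):
    """Base CPI of 1 plus load and taken-branch stall cycles."""
    load_overhead = LOAD_FRACTION * load_penalty
    branch_overhead = BRANCH_FRACTION * TAKEN_FRACTION * branch_penalty
    return 1 + load_overhead + branch_overhead

cpi_no_bypass = cpi(load_penalty=2, branch_penalty=2)
cpi_bypass = cpi(load_penalty=1, branch_penalty=2)
print(f"{cpi_no_bypass:.2f} {cpi_bypass:.2f}")
```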
CPI Calculations II
• Bypass, scheduling of loads and branches
– Load Penalty:
75% can be moved back 1 => no penalty
remaining 25% => 1 cycle penalty
Load overhead=0.25*0.25*1=0.0625
– Branch Penalty:
1/3 Uncond. 100% schedulable => 1 cycle
1/3 Cond. Not Taken, if biased for NT => no penalty
1/3 Cond. Taken
50% schedulable => 1 cycle
50% unschedulable => 2 cycles
Branch overhead=0.2*[0.33*1+0.33*0.5*1 +0.33*0.5*2] = 0.167
– Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
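The scheduling case can be checked the same way. This sketch uses exact thirds where the slide rounds to 0.33, so intermediate digits differ slightly while the rounded total agrees; the variable names are illustrative.

```python
LOAD_FRACTION = 0.25
BRANCH_FRACTION = 0.20

# Loads: only the 25% that cannot be scheduled pay 1 cycle.
load_overhead = LOAD_FRACTION * 0.25 * 1

# Branches, by type (each one third of all branches).
uncond = (1 / 3) * 1                       # scheduled, 1 cycle
cond_not_taken = (1 / 3) * 0               # biased for not taken
cond_taken = (1 / 3) * (0.5 * 1 + 0.5 * 2) # half scheduled, half not
branch_overhead = BRANCH_FRACTION * (uncond + cond_not_taken + cond_taken)

total_cpi = 1 + load_overhead + branch_overhead
print(f"{load_overhead:.4f} {branch_overhead:.3f} {total_cpi:.2f}")
```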
CPI Calculations III
• Parallel target address generation to reduce penalty from 2 cycles to 1 cycle
– 90% of branches can be coded as PC relative (instead of register indirect)
i.e. target address can be computed without register access
– A separate adder can compute (PC+offset) during reg read stage
– Branch Penalty:
PC-relative addressing Schedulable Branch penalty
YES (90%) YES (50%) 0 cycle
YES (90%) NO (50%) 1 cycle
NO (10%) YES (50%) 1 cycle
NO (10%) NO (50%) 2 cycles
– Unconditional: 0.2*0.33*0.1*1 = 0.0066 CPI
– Conditional: 0.2*0.66*{[0.9*0.5*1]+[0.1*0.5*1]+[0.1*0.5*2]} = 0.079 CPI
– Total CPI: 1 + 0.063 + 0.087 = 1.15 CPI = 0.87 IPC
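The branch table maps directly onto a weighted sum. The sketch below follows the slide's own formulas (unconditional branches are all schedulable; the conditional term is applied to two thirds of branches) and uses exact thirds instead of 0.33 and 0.66; the names are illustrative.

```python
BRANCH_FRACTION = 0.20
LOAD_OVERHEAD = 0.25 * 0.25 * 1   # carried over from the scheduling case

# (fraction PC-relative or not, fraction schedulable or not, penalty in cycles)
CONDITIONAL_CASES = [
    (0.9, 0.5, 0),   # PC-relative, schedulable
    (0.9, 0.5, 1),   # PC-relative, not schedulable
    (0.1, 0.5, 1),   # register indirect, schedulable
    (0.1, 0.5, 2),   # register indirect, not schedulable
]

# Unconditional branches are 100% schedulable: only the 10% that
# are not PC-relative pay a cycle.
uncond = BRANCH_FRACTION * (1 / 3) * 0.1 * 1

cond = BRANCH_FRACTION * (2 / 3) * sum(
    pc_rel * sched * penalty for pc_rel, sched, penalty in CONDITIONAL_CASES
)

total_cpi = 1 + LOAD_OVERHEAD + uncond + cond
ipc = 1 / total_cpi
print(f"{uncond:.4f} {cond:.3f} {total_cpi:.2f} {ipc:.2f}")
```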
Deeply Pipelined Processors
Problem Set 2
• 2.4, 2.8, 2.18, 2.21