History of Pipelining

Shared by: HC11120718410
posted:
12/7/2011
History of Pipelining

• Introduced in IBM 7030 (Stretch computer)
• CDC 6600 used pipelining and multiple functional units
• RISC processors in the 1980s were pipelined, in an effort to reach an IPC of 1
• Intel i486 was the first pipelined CISC processor
• Pipelined VAX from Digital
• Pipelined Motorola 68K
• Current Trend – deep pipelines

Pipeline Illustrated:

[Figure: combinational logic followed by a latch (L), shown unpipelined and split into 2 and 3 stages]

1 stage: n gate delays per stage, BW = ~(1/n)
2 stages: n/2 gate delays per stage, BW = ~(2/n)
3 stages: n/3 gate delays per stage, BW = ~(3/n)

Pipeline Partitioning



Divide functionality into k stages: k-fold speedup?

Latches

Clock skew

Uniform sub-computations

Earle Latch

Pipeline Partitioning

$k_{opt} = \sqrt{\dfrac{G \cdot T}{L \cdot S}}$

G = cost of the non-pipelined design
L = cost of each latch
k = number of stages
T = latency of the non-pipelined design
S = latency increase due to each latch (i.e. T/k + S is the new clock period)
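
A minimal sketch of where the formula can come from, assuming a cost model that the slide does not spell out: a k-stage design costs G + k·L and runs with clock period T/k + S, and we minimize cost divided by throughput. The numeric values below are hypothetical and chosen only for illustration.

```python
import math

def cost_performance(k, G, L, T, S):
    """Cost/performance ratio of a k-stage pipeline.

    Assumed model (not stated on the slide): cost = G + k*L and
    throughput = 1 / (T/k + S), so the ratio is (G + k*L) * (T/k + S).
    """
    return (G + k * L) * (T / k + S)

def k_opt(G, L, T, S):
    """Stage count that minimizes the ratio: sqrt(G*T / (L*S))."""
    return math.sqrt((G * T) / (L * S))

# Hypothetical values, for illustration only.
G, L, T, S = 175.0, 10.0, 400.0, 22.0
best_real = k_opt(G, L, T, S)

# Brute-force check over integer stage counts.
best_int = min(range(1, 50), key=lambda k: cost_performance(k, G, L, T, S))
print(f"k_opt = {best_real:.2f}, best integer k = {best_int}")
```

Setting the derivative of (G + kL)(T/k + S) with respect to k to zero gives -GT/k² + LS = 0, which is the square-root expression above; the brute-force search lands on the neighbouring integer.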

Non-Pipelined FP Multiplier

Pipelined FP Multiplier

Non-pipelined chip count = 175
Non-pipelined delay = 400 ns
Non-pipelined clk = 2.5 MHz

Assume latching delay = 17 ns
Setup time = 5 ns
Max stage delay = 150 ns
Minimum clk period = 150 + 17 + 5 = 172 ns
Pipelined clk = 5.8 MHz

Latency for each mult = 3 × 172 ns = 516 ns (instead of 400 ns)
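
The arithmetic above can be reproduced with a short sketch. The stage count of 3 is an assumption inferred from 516 ns / 172 ns; the function and variable names are illustrative only.

```python
def min_clock_period_ns(max_stage_delay_ns, latch_delay_ns, setup_ns):
    """Minimum clock period: slowest stage plus latch delay and setup time."""
    return max_stage_delay_ns + latch_delay_ns + setup_ns

def freq_mhz(period_ns):
    """Clock frequency in MHz for a period given in nanoseconds."""
    return 1000.0 / period_ns

NON_PIPELINED_DELAY_NS = 400
STAGES = 3  # assumption: inferred from 516 ns / 172 ns

period_ns = min_clock_period_ns(150, 17, 5)
latency_ns = STAGES * period_ns
throughput_gain = NON_PIPELINED_DELAY_NS / period_ns

print(f"non-pipelined clk: {freq_mhz(NON_PIPELINED_DELAY_NS):.1f} MHz")
print(f"pipelined clk:     {freq_mhz(period_ns):.1f} MHz")
print(f"latency per mult:  {latency_ns} ns, throughput gain: {throughput_gain:.2f}x")
```

The point of the example is that throughput improves (about 2.3x) even though the latency of a single multiply gets worse.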

CPU Example

• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute the instruction rate
• Nonpipelined Execution:
– lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
– add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
• Pipelined Execution:
– Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
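
A small sketch of the calculation above, using the stage latencies from the slide. The dictionary layout and function names are illustrative, not part of the original material.

```python
# Stage latencies in nanoseconds, as given on the slide.
STAGE_NS = {"IF": 2, "ReadReg": 1, "ALU": 2, "Memory": 2, "WriteReg": 1}

# Which steps each instruction type uses when executed without pipelining.
USES = {
    "lw": ["IF", "ReadReg", "ALU", "Memory", "WriteReg"],
    "add": ["IF", "ReadReg", "ALU", "WriteReg"],
}

def nonpipelined_ns(instr):
    """Total time for one instruction when the steps run back to back."""
    return sum(STAGE_NS[step] for step in USES[instr])

def pipelined_cycle_ns():
    """The pipelined clock is set by the slowest stage."""
    return max(STAGE_NS.values())

def instr_per_us(ns_per_instr):
    """Instruction rate in instructions per microsecond."""
    return 1000.0 / ns_per_instr

print("lw :", nonpipelined_ns("lw"), "ns")
print("add:", nonpipelined_ns("add"), "ns")
print("pipelined:", pipelined_cycle_ns(), "ns per instruction at steady state")
print("rate:", instr_per_us(pipelined_cycle_ns()), "instr/us")
```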

Problems for Computers

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
– Structural hazards: 2 instructions need the same resource at the same time
– Control hazards: branches and other instructions that change the PC stall the pipeline until the hazard is resolved, leaving “bubbles” in the pipeline

Structural Hazard #1: Single Memory (1/2)

[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through I$, Reg, ALU, D$, Reg]

Read same memory twice in same clock cycle

Structural Hazard #1: Single Memory (2/2)

• Solution:
– have both an L1 Instruction Cache and an L1 Data Cache
– need more complex hardware to control when both caches miss
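
The single-memory conflict can be illustrated with a toy model. This is a sketch under simplifying assumptions: a 5-stage pipeline with no stalls, where instruction i (0-based) occupies stage s in cycle i + s, and only loads and stores touch memory in the MEM stage. The program list is a hypothetical stand-in for "Load, Instr 1 ... Instr 4".

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(instrs, unified_memory=True):
    """Return (cycle, fetching_instr, data_access_instr) conflicts.

    Assumes instruction i occupies stage s in cycle i + s and that no
    stalls are modelled. With separate instruction and data caches there
    is no conflict between a fetch and a data access.
    """
    if not unified_memory:
        return []
    conflicts = []
    for i, op in enumerate(instrs):
        if op in ("lw", "sw"):
            cycle = i + STAGES.index("MEM")
            fetcher = cycle - STAGES.index("IF")  # instruction in IF that cycle
            if fetcher < len(instrs):
                conflicts.append((cycle, fetcher, i))
    return conflicts

program = ["lw", "add", "sub", "and", "or"]  # hypothetical instruction mix
print(memory_conflicts(program))
print(memory_conflicts(program, unified_memory=False))
```

With one memory, the load's data access collides with the fetch of the instruction three slots behind it; with split caches the conflict list is empty.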

Structural Hazard #2: Registers (1/2)

[Figure: pipeline diagram, time (clock cycles) across, instruction order down: sw, Instr 1, Instr 2, Instr 3, Instr 4, each passing through I$, Reg, ALU, D$, Reg]

Can’t read and write to registers simultaneously

Structural Hazard #2: Registers (2/2)

• Fact: Register access is VERY fast: takes less than half the time of ALU stage
• Solution: introduce convention
– always Write to Registers during first half of each clock cycle
– always Read from Registers during second half of each clock cycle
– Result: can perform Read and Write during same clock cycle

Things to Remember

• Optimal Pipeline
– Each stage is executing part of an instruction each clock cycle.
– One instruction finishes during each clock cycle.
– On average, execute far more quickly.
• What makes this work?
– Similarities between instructions allow us to use same stages for all instructions (generally).
– Each stage takes about the same amount of time as all others: little wasted time.

MIPS ISA Handout

Data Hazards (1/2)

• Consider the following sequence of instructions

add $t0, $t1, $t2
sub $t4, $t0, $t3
and $t5, $t0, $t6
or  $t7, $t0, $t8
xor $t9, $t0, $t10

Data Hazards (2/2)

Dependencies backwards in time are hazards

[Figure: pipeline diagram, time (clock cycles) across, instruction order down, stages IF, ID/RF, EX, MEM, WB (I$, Reg, ALU, D$, Reg)]

add $t0,$t1,$t2
sub $t4,$t0,$t3
and $t5,$t0,$t6
or  $t7,$t0,$t8
xor $t9,$t0,$t10

Data Hazard Solution: Forwarding

• Forward result from one stage to another

[Figure: pipeline diagram, stages IF, ID/RF, EX, MEM, WB, for the sequence add $t0,$t1,$t2; sub $t4,$t0,$t3; and $t5,$t0,$t6; or $t7,$t0,$t8; xor $t9,$t0,$t10]

“sub” and “and” could use forwarding
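
A rough sketch of how the dependent instructions in this sequence can be identified. It assumes the 5-stage pipeline from these slides with the write-first-half / read-second-half register convention, so a consumer at distance 3 or more from its producer reads the register file safely. The parser and function names are illustrative, and the check ignores the case where an intervening instruction overwrites the register.

```python
import re

def parse(line):
    """Split 'add $t0,$t1,$t2' into (op, dest, [sources])."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    return op, regs[0], regs[1:]

def needs_forwarding(program):
    """List (producer, consumer, register) pairs that read too early.

    Producer i writes the register file in cycle i + 4 (WB, first half);
    consumer j reads it in cycle j + 1 (ID/RF, second half). The read is
    too early when j + 1 < i + 4, i.e. when j - i is 1 or 2.
    """
    parsed = [parse(line) for line in program]
    hazards = []
    for i, (producer, dest, _) in enumerate(parsed):
        for j in range(i + 1, min(i + 3, len(parsed))):
            consumer, _, sources = parsed[j]
            if dest in sources:
                hazards.append((producer, consumer, dest))
    return hazards

program = [
    "add $t0, $t1, $t2",
    "sub $t4, $t0, $t3",
    "and $t5, $t0, $t6",
    "or  $t7, $t0, $t8",
    "xor $t9, $t0, $t10",
]
print(needs_forwarding(program))
```

Under these assumptions only the two instructions right after `add` are flagged, which matches the note that “sub” and “and” are the ones that benefit from forwarding.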

Data Hazard: Loads (1/4)

• Dependencies backwards in time are hazards

[Figure: pipeline diagram, stages IF, ID/RF, EX, MEM, WB, for lw $t0,0($t1) followed by sub $t3,$t0,$t2]

• Can’t solve with forwarding
• Must stall instruction dependent on load, then forward (more hardware)

Data Hazard: Loads (2/4)

• Hardware must stall pipeline
• Called “interlock”

[Figure: pipeline diagram, stages IF, ID/RF, EX, MEM, WB, for lw $t0, 0($t1); sub $t3,$t0,$t2; and $t5,$t0,$t4; or $t7,$t0,$t6, with a bubble inserted in each instruction after the load]

Data Hazard: Loads (3/4)

• Stall is equivalent to nop

[Figure: pipeline diagram for lw $t0, 0($t1); nop (a bubble in every stage); sub $t3,$t0,$t2; and $t5,$t0,$t4; or $t7,$t0,$t6]

Data Hazard: Loads (4/4)

• Instruction slot after a load is called “load delay slot”
• If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.
• If the compiler puts an unrelated instruction in that slot, then no stall
• Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)

Control Hazard: Branching (1/5)

[Figure: pipeline diagram, time (clock cycles) across, instruction order down: beq, Instr 1, Instr 2, Instr 3, Instr 4, each passing through I$, Reg, ALU, D$, Reg]

Where do we do the compare for the branch?

Control Hazard: Branching (2/5)

• We put branch decision-making hardware in ALU stage
– therefore two more instructions after the branch will always be fetched, whether or not the branch is taken
• Desired functionality of a branch
– if we do not take the branch, don’t waste any time and continue executing normally
– if we take the branch, don’t execute any instructions after the branch, just go to the desired label

Control Hazard: Branching (3/5)

• Initial Solution: Stall until decision is made
– insert “no-op” instructions: those that accomplish nothing, just take time
– Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)

Control Hazard: Branching (4/5)

• Optimization #1:
– move asynchronous comparator up to Stage 2
– as soon as instruction is decoded (Opcode identifies it as a branch), immediately make a decision and set the value of the PC (if necessary)
– Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
– Side Note: This means that branches are idle in Stages 3, 4 and 5.

Control Hazard: Branching (5/5)

• Insert a single no-op (bubble)

[Figure: pipeline diagram, time (clock cycles) across, instruction order down: add, beq, lw (delayed by one bubble)]

• Impact: 2 clock cycles per branch instruction ⇒ slow

Quiz



Assume 1 instr/clock, delayed branch, 5-stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 loops, so pipeline full)

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

1 2 3 4 5 6 7 8 9 10

Quiz Answer

• Assume 1 instr/clock, delayed branch, 5-stage pipeline, forwarding, interlock on unresolved load hazards. 10^3 iterations, so pipeline full.

Loop: 1. lw    $t0, 0($s1)
      2. (data hazard, so stall)
      3. addu  $t0, $t0, $s2
      4. sw    $t0, 0($s1)
      5. addiu $s1, $s1, -4
      6. bne   $s1, $zero, Loop
      7. nop (delayed branch, so exec. nop)

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

Answer: 7 (of the choices 1 2 3 4 5 6 7 8 9 10)
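
The count can be reproduced with a tiny model. This is a sketch under the quiz's assumptions (one instruction per clock at steady state, full forwarding, the delay slot filled by the nop) plus one more assumption of this sketch: the only stall is a single cycle when the instruction directly after a load reads the loaded register. The tuple encoding of the loop is illustrative.

```python
def cycles_per_iteration(loop):
    """Count issue cycles for one loop iteration.

    loop is a list of (op, dest, sources). Each instruction takes one
    cycle; a load followed immediately by a consumer of its result
    costs one extra interlock cycle.
    """
    cycles = 0
    for idx, (op, dest, _) in enumerate(loop):
        cycles += 1
        if op == "lw" and idx + 1 < len(loop):
            next_sources = loop[idx + 1][2]
            if dest in next_sources:
                cycles += 1  # load-use stall
    return cycles

LOOP = [
    ("lw", "$t0", ["$s1"]),
    ("addu", "$t0", ["$t0", "$s2"]),
    ("sw", None, ["$t0", "$s1"]),
    ("addiu", "$s1", ["$s1"]),
    ("bne", None, ["$s1", "$zero"]),
    ("nop", None, []),
]
print(cycles_per_iteration(LOOP))
```

Six instructions plus one load-use stall give the seven cycles listed in the answer.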

Pipelining Idealism

• Uniform Suboperations
The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations
The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of Independent Operations
All the repetitions of the same operation are mutually independent, i.e. no data dependence and no resource conflicts

Good Examples: automobile assembly line, floating-point multiplier, instruction pipeline???

Instruction Pipeline Design

• Uniform Suboperations ... ⇒ balance pipeline stages
- stage quantization to yield balanced stages
- minimize internal fragmentation (some waiting stages)
• Identical operations ... ⇒ unifying instruction types
- coalescing instruction types into one “multi-function” pipe
- minimize external fragmentation (some idling stages)
• Independent operations ... ⇒ resolve data and resource hazards
- inter-instruction dependency detection and resolution
- minimize performance loss

The Generic Instruction Cycle

• The “computation” to be pipelined



1. Instruction Fetch (IF)

2. Instruction Decode (ID)

3. Operand(s) Fetch (OF)

4. Instruction Execution (EX)

5. Operand Store (OS)

6. Update Program Counter (PC)

The GENERIC Instruction Pipeline (GNR)

Based on Obvious Subcomputations:

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)

Balancing Pipeline Stages

Stage latencies:
IF: $T_{IF} = 6$ units
ID: $T_{ID} = 2$ units
OF: $T_{OF} = 9$ units
EX: $T_{EX} = 5$ units
OS: $T_{OS} = 9$ units

• Without pipelining
$T_{cyc} \ge T_{IF} + T_{ID} + T_{OF} + T_{EX} + T_{OS} = 31$
• Pipelined
$T_{cyc} \ge \max\{T_{IF}, T_{ID}, T_{OF}, T_{EX}, T_{OS}\} = 9$

Speedup = 31 / 9

Can we do better in terms of either performance or efficiency?
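
The numbers on this slide can be checked with a few lines. The "efficiency" figure below uses one possible definition, the fraction of total stage time spent on useful work; that definition is an assumption of this sketch, not something the slide states.

```python
# Stage latencies in time units, as listed on the slide.
LATENCY = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(LATENCY.values())   # one instruction, no overlap
t_cyc = max(LATENCY.values())           # clock set by the slowest stage
speedup = t_unpipelined / t_cyc

# Assumed definition: useful work divided by (stages * clock period).
efficiency = t_unpipelined / (len(LATENCY) * t_cyc)

print(f"unpipelined: {t_unpipelined} units, pipelined cycle: {t_cyc} units")
print(f"speedup: {speedup:.2f}, efficiency: {efficiency:.2f}")
```

The ID stage needs 2 units but is clocked at 9, which is the kind of imbalance that stage quantization is meant to reduce.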

Balancing Pipeline Stages

• Two Methods for Stage Quantization:
– Merging of multiple subcomputations into one.
– Subdividing a subcomputation into multiple subcomputations.

• Current Trends:
– Deeper pipelines (more and more stages).
– Multiplicity of different (sub)pipelines.
– Pipelining of memory access (tricky).

Granularity of Pipeline Stages

Coarser-Grained Machine Cycle: 4 machine cyc / instruction cyc

Machine cycle   Subcomputation(s)   Latency
1               IF, ID              T_IF&ID = 8 units
2               OF                  T_OF = 9 units
3               EX                  T_EX = 5 units
4               OS                  T_OS = 9 units

Finer-Grained Machine Cycle: 11 machine cyc / instruction cyc (Tcyc = 3 units)

Machine cycle   Subcomputation
1-2             IF
3               ID
4-6             OF
7-8             EX (EX1, EX2)
9-11            OS
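
A short sketch of stage quantization, assuming each subcomputation is given the smallest whole number of machine cycles that covers its latency. The function name and the fragmentation measure are illustrative.

```python
import math

def quantize(latencies, t_cyc):
    """Machine cycles needed per subcomputation at clock period t_cyc."""
    return {name: math.ceil(t / t_cyc) for name, t in latencies.items()}

def internal_fragmentation(latencies, t_cyc):
    """Time units lost to rounding each subcomputation up to whole cycles."""
    cycles = quantize(latencies, t_cyc)
    return sum(cycles.values()) * t_cyc - sum(latencies.values())

coarse_latencies = {"IF+ID": 8, "OF": 9, "EX": 5, "OS": 9}
fine_latencies = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

coarse = quantize(coarse_latencies, 9)
fine = quantize(fine_latencies, 3)

print("coarse:", coarse, "total", sum(coarse.values()))
print("fine:  ", fine, "total", sum(fine.values()))
print("waste: ", internal_fragmentation(coarse_latencies, 9),
      "vs", internal_fragmentation(fine_latencies, 3))
```

With a 3-unit cycle the same work spreads over 11 machine cycles and wastes fewer time units per instruction than the 4-cycle, 9-unit design.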

Hardware Requirements

• Logic needed for each pipeline stage
• Register file ports needed to support all the stages
• Memory accessing ports needed to support all the stages

[Figure: the 4-stage (coarser-grained) and 11-stage (finer-grained) pipelines from the previous slide]

Pipeline Examples

MIPS R2000/R3000 (5 stages)

Generic stage   MIPS stage
IF              1. IF
ID, OF          2. RD
EX              3. ALU
OS              4. MEM
                5. WB

AMDAHL 470V/7 (12 stages)

Generic stage   Amdahl stage
IF              1. PC GEN
                2. Cache Read
                3. Cache Read
ID              4. Decode
OF              5. Read REG
                6. Add GEN
                7. Cache Read
                8. Cache Read
EX              9. EX 1
                10. EX 2
OS              11. Check Result
                12. Write Result

Coalescing Resource Requirements

• Procedure:
1. Analyze the sequence of register transfers required by each instruction type.
2. Find commonality across instruction types and merge them to share the same pipeline stage.
3. If there exists flexibility, shift or reorder some register transfers to facilitate further merging.

Unifying instructions to 6-stage pipeline

The 6-stage TYPICAL (TYP) pipeline:

Generic step   ALU           LOAD          STORE         BRANCH
IF:            I-CACHE, PC   I-CACHE, PC   I-CACHE, PC   I-CACHE, PC
ID:            DECODE        DECODE        DECODE        DECODE
OF:            RD. REG.      RD. REG.,     RD. REG.      RD. REG.
                             ADDR. GEN.,
                             RD. MEM.
EX:            ALU OP.
OS:            WR. REG.      WR. REG.      ADDR. GEN.,   ADDR. GEN.,
                                           WR. MEM.      WR. PC

TYP stages: 1. IF, 2. ID, 3. RD, 4. ALU, 5. MEM, 6. WB

Physical Organization of 6-stage pipeline

Pipeline Interface to Memory

Pipeline Interface to Register File

Minimizing Pipeline Stalls

• Dependences lead to Pipeline Hazards

Occurrence of Hazards

Necessary Conditions

Penalties Due to RAW hazards

Leading instr (i)             ALU              Load             Branch
Trailing instr (j)            ALU, L/S, Br.    ALU, L/S, Br.    ALU, L/S, Br.
Hazard register               Int. Reg. (Ri)   Int. Reg. (Ri)   PC
Register WRITE stage (i)      WB (stage 6)     WB (stage 6)     MEM (stage 5)
Register READ stage (j)       RD (stage 3)     RD (stage 3)     IF (stage 1)
RAW distance or penalty       3 cycles         3 cycles         4 cycles

Incorporation of Forwarding Paths in TYP pipeline

Penalties with Forwarding Paths

Leading instr (i) (producer)   ALU              Load             Branch
Trailing instr (j) (consumer)  ALU, L/S, Br.    ALU, L/S, Br.    ALU, L/S, Br.
Hazard register                Int. Reg. (Ri)   Int. Reg. (Ri)   PC
Value produced stage (i)       ALU (stage 4)    MEM (stage 5)    MEM (stage 5)
Value consumed stage (j)       ALU (stage 4)    ALU (stage 4)    IF (stage 1)
Forward from outputs of        ALU, MEM, WB     MEM, WB          MEM
Forward to input of            ALU              ALU              IF
RAW distance or penalty        0 cycles         1 cycle          4 cycles
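
Both penalty tables follow one pattern: the penalty equals the stage number where the value becomes available minus the stage number where the trailing instruction needs it. A minimal sketch of that pattern, using the TYP stage numbering from these slides; the function and dictionary names are illustrative.

```python
# TYP pipeline stage numbers from the slides.
STAGE = {"IF": 1, "ID": 2, "RD": 3, "ALU": 4, "MEM": 5, "WB": 6}

def raw_penalty(produced_in, consumed_in):
    """Worst-case stall cycles for a back-to-back dependent pair."""
    return STAGE[produced_in] - STAGE[consumed_in]

# Without forwarding: value available after register write, needed at RD.
without_forwarding = {
    "ALU": raw_penalty("WB", "RD"),
    "Load": raw_penalty("WB", "RD"),
    "Branch": raw_penalty("MEM", "IF"),
}

# With forwarding: value taken from the stage output that produces it.
with_forwarding = {
    "ALU": raw_penalty("ALU", "ALU"),
    "Load": raw_penalty("MEM", "ALU"),
    "Branch": raw_penalty("MEM", "IF"),
}

print(without_forwarding)
print(with_forwarding)
```

Forwarding removes the ALU penalty and cuts the load penalty to one cycle, while the branch penalty is unchanged because the PC is still resolved in MEM and consumed in IF.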

Forwarding Paths for leading ALU instruction

Pipeline Interlocks for leading ALU instruction

Forwarding for leading Load

Pipeline Interlocks for ALU,Load

Pipeline Interlocks for Branch

Historical Trivia

• First MIPS design did not interlock and stall on load-use data hazard
• Real reason for name behind MIPS: Microprocessor without Interlocked Pipeline Stages
– Word play on acronym for Millions of Instructions Per Second, also called MIPS

Load Delay Slot (MIPS R2000)

     t0   t1   t2   t3   t4   t5
i:   IF   ID   RD   ALU  MEM  WB
j:        IF   ID   RD   ALU  MEM  WB
k:             IF   ID   RD   ALU  MEM  WB

h: Rk ← --
   ...
i: Rk ← MEM[ - ]
j: -- ← Rk
k: -- ← Rk

Which (Rk) do we really mean?

- The effect of a “delayed” Load is not visible to the instructions in its delay slots.

RISC Pipeline Example

Real Pipelined Processor

Example: MIPS R2000

Stage name   Phase   Function performed
1. IF        1       Translate virtual instr. addr. using TLB
             2       Access I-cache using physical address
2. RD        1       Return instr. from I-cache, check tags & parity
             2       Read reg. file; if a branch, generate target addr.
3. ALU       1       Start ALU op.; if a branch, check br. condition
             2       Finish ALU op.; if a load/store, translate virtual addr.
4. MEM       1       Access D-cache
             2       Return data from D-cache, check tags & parity
5. WB        1       Write register file
             2       ---

Intel i486 5-Stage “CISC” Pipeline

Stage name                 Function performed
1. Instruction Fetch       Fetch instruction from the 32-byte prefetch queue
                           (prefetch unit fills and flushes prefetch queue)
2. Instruction Decode-1    Translate instr. into control signals or microcode addr.
                           Initiate addr. generation and memory access
3. Instruction Decode-2    Access microcode memory
                           Output microinstruction to execution unit
4. Execute                 Execute ALU and memory accessing operations
5. Register Write-back     Write back results to register

IBM’s Experience on Pipelined Processors

[Agerwala and Cocke 1987]

Attributes and Assumptions:

• Memory Bandwidth

– at least one word/cycle to fetch 1 instruction/cycle from I-cache
– 40% of instructions are load/store, require access to D-cache
• Code Characteristics (dynamic)
– loads - 25%
– stores - 15%
– ALU/Reg-Reg - 40%
– branches - 20%
» 1/3 unconditional (always taken)
» 1/3 conditional taken
» 1/3 conditional not taken

More Statistics and Assumptions

• Cache Performance

– hit ratio of 100% is assumed in the experiments
– cache latency: I-cache = i; D-cache = d; default: i = d = 1 cycle
• Load and Branch Scheduling
– loads:
• 25% cannot be scheduled
• 65% can be moved back 2 inst.; 10% can be moved back 1 inst. (1 delay slot)
– branches:
• unconditional - 100% schedulable (fill 1 delay slot)
• conditional - 50% schedulable (fill 1 delay slot)

CPI Calculations I

Instruction mix: 25% L, 15% S, 40% ALU, 20% Br

• No cache bypass of reg. file, no scheduling of loads or branches
– Load Penalty: 2 cycles
– Branch Penalty: 2 cycles
– Total CPI: 1 + 0.25*2 + 0.2*0.66*2 = 1 + 0.5 + 0.27 = 1.77 CPI (assume br not taken, penalty only for 66% branches)

• Bypass reg file for loads
– Load Penalty: 1 cycle
– Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
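
The two totals above can be reproduced with a short sketch. It uses the exact fraction 2/3 where the slide writes 0.66, which is why the results round to the slide's figures; the function name and parameters are illustrative.

```python
# Dynamic instruction mix from the slides.
MIX = {"load": 0.25, "store": 0.15, "alu": 0.40, "branch": 0.20}

def cpi(load_penalty, branch_penalty, penalized_branch_fraction=2 / 3):
    """Base CPI of 1 plus average stall cycles per instruction.

    Every load pays load_penalty; only taken branches (2/3 of all
    branches, predicting not taken) pay branch_penalty.
    """
    load_stalls = MIX["load"] * load_penalty
    branch_stalls = MIX["branch"] * penalized_branch_fraction * branch_penalty
    return 1 + load_stalls + branch_stalls

cpi_no_bypass = cpi(load_penalty=2, branch_penalty=2)
cpi_bypass = cpi(load_penalty=1, branch_penalty=2)

print(f"no bypass: {cpi_no_bypass:.2f} CPI")
print(f"bypass:    {cpi_bypass:.2f} CPI")
```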

CPI Calculations II

• Bypass, scheduling of loads and branches

– Load Penalty:

75% can be moved back 1 => no penalty

remaining 25% => 1 cycle penalty

Load overhead=0.25*0.25*1=0.0625

– Branch Penalty:

1/3 Uncond. 100% schedulable => 1 cycle

1/3 Cond. Not Taken, if biased for NT => no penalty

1/3 Cond. Taken

50% schedulable => 1 cycle

50% unschedulable => 2 cycles

Branch overhead=0.2*[0.33*1+0.33*0.5*1 +0.33*0.5*2] = 0.167

– Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
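
A sketch of the same calculation with scheduling taken into account, again using exact thirds in place of 0.33. Variable names are illustrative.

```python
LOAD_FRACTION = 0.25
BRANCH_FRACTION = 0.20

# Loads: 75% are scheduled away from their consumer; the remaining 25%
# pay one cycle each.
load_overhead = LOAD_FRACTION * 0.25 * 1

# Branches, one third each:
#   unconditional, 100% schedulable        -> 1 cycle
#   conditional not taken, biased for NT   -> 0 cycles
#   conditional taken, 50% schedulable     -> 1 cycle, else 2 cycles
avg_branch_penalty = (
    (1 / 3) * 1
    + (1 / 3) * 0
    + (1 / 3) * (0.5 * 1 + 0.5 * 2)
)
branch_overhead = BRANCH_FRACTION * avg_branch_penalty

total_cpi = 1 + load_overhead + branch_overhead
print(f"load {load_overhead:.4f} + branch {branch_overhead:.3f} -> {total_cpi:.2f} CPI")
```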

CPI Calculations III

• Parallel target address generation to reduce penalty from 2 to 1 cycle
– 90% of branches can be coded as PC-relative (instead of register indirect), i.e. target address can be computed without register access
– A separate adder can compute (PC+offset) during reg read stage
– Branch Penalty:

PC-relative addressing   Schedulable   Branch penalty
YES (90%)                YES (50%)     0 cycles
YES (90%)                NO (50%)      1 cycle
NO (10%)                 YES (50%)     1 cycle
NO (10%)                 NO (50%)      2 cycles

– Unconditional: 0.2*0.33*0.1*1 = 0.0066 CPI
– Conditional: 0.2*0.66*{[0.9*0.5*1]+[0.1*0.5*1]+[0.1*0.5*2]} = 0.079 CPI
– Total CPI: 1 + 0.063 + 0.087 = 1.15 CPI = 0.87 IPC
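
The final figures can be checked by weighting the penalty table directly. This sketch follows the slide's own formulas (conditional branches taken as 2/3 of all branches, unconditional as 1/3 and fully schedulable) and uses exact fractions instead of 0.33 and 0.66; names are illustrative.

```python
BRANCH_FRACTION = 0.20
LOAD_OVERHEAD = 0.25 * 0.25 * 1  # carried over from the previous calculation

# (P(PC-relative case), P(schedulable case), penalty in cycles)
PENALTY_TABLE = [
    (0.9, 0.5, 0),  # PC-relative, schedulable
    (0.9, 0.5, 1),  # PC-relative, not schedulable
    (0.1, 0.5, 1),  # register indirect, schedulable
    (0.1, 0.5, 2),  # register indirect, not schedulable
]

# Conditional branches: average over the whole table.
cond_penalty = sum(p_rel * p_sched * cycles
                   for p_rel, p_sched, cycles in PENALTY_TABLE)
cond_overhead = BRANCH_FRACTION * (2 / 3) * cond_penalty

# Unconditional branches are 100% schedulable, so only the 10% that are
# not PC-relative pay one cycle.
uncond_overhead = BRANCH_FRACTION * (1 / 3) * 0.1 * 1

total_cpi = 1 + LOAD_OVERHEAD + cond_overhead + uncond_overhead
ipc = 1 / total_cpi
print(f"CPI = {total_cpi:.2f}, IPC = {ipc:.2f}")
```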

Deeply Pipelined Processors

Problem Set 2

• 2.4, 2.8, 2.18, 2.21

