# CSCE 212 Computer Architecture CSCE 513


```
CSCE 513 Computer Architecture

Lecture 8
Tomasulo’s Algorithm

Topics
   • Dynamic Scheduling Review
   • Tomasulo’s
       • structure
       • Examples
       • Algorithm details
   • Speculation

September 13, 2011
Overview
Last Time
   • Stalls in Diagrams revisited
   • Scoreboard Review
   • Dealing with Control Hazards in the 5-stage pipeline:
       • Static, dynamic branch prediction, branch history table

New
   • Control Hazards: Lecture 7 slides 24-34
       • Correlating and tournament branch predictors
   • Data Hazards Review
   • Tomasulo Overview, examples
   • Tomasulo’s Algorithm details

References
   • Chapter 2 sections 2.3 (branch prediction), 2.4-2.5 (Tomasulo’s)
–2–      Test 1 Thursday September 29 – two weeks+        CSCE 513 Fall 2011
Chapter 2 – Instruction Level Parallelism
• Data Hazards review
• Assumptions on latencies of floating point operations
• Data Hazards
• Loop unrolling
• Control Hazards
• Static Branch Prediction
• Dynamic Branch Prediction

Review of Data Hazards
Assume instruction i comes before instruction j
• Instruction j is data dependent on i if
•   i produces a result that j uses, or
•   j depends on k and k depends on i

• Name dependence – two instructions use the same register or memory
  location, but no data flows between them
• Antidependence – j writes a register that i reads
• Output dependence – both i and j write the same register
• Hazards
•   RAW – j tries to read before i has written
•   WAW – both i and j write but j writes first
•   WAR – j writes over an operand for i before i reads it

Chapter 2 – Latency Assumptions

Instruction producing value   Instruction using value   Latency in cycles   Stall cycles between
FP ALU Op                     FP ALU Op                 4                   3
FP ALU Op                     Store Double              3                   2
Load Double                   FP ALU Op                 1                   1
Load Double                   Store Double              1                   0
Integer Op                    Integer Op                1                   0

Loop Example Analysis

Loop Statically Scheduled

Loop Unrolled not scheduled

Loop unrolled and Scheduled

Branch Prediction Errors/Stalls
• Standard 5-stage pipeline – branch decision made during EX

• Extra hardware to make the decision in ID

• Support for Branch Conditions
•   Classical 5-stage   jnz   R1, loop
•   MIPS                bne   R1, R2, loop
•   IA32                jl    loop

Reducing Branch Costs with Prediction
Static branch prediction
• Observe branch statistics from a program suite or a specific program
• Pass the result to the compiler as a flag, or rewrite the code
•   If (x < y) then …A     mispredicts 60%, so rewrite as
•   .          else …B
•   If (x >= y) then …B    mispredicts 40%
•   .           else …A

Figure 2.3

Figure 2.3 Misprediction rates for SPEC92

Static Branch Prediction
• Predict Branch Not Taken (BNT)
• Predict backward branches taken, forward branches not taken (BBNF)
• Predict Branch based on profiling the program

From Lecture 7: slides 27-32
Slide 27 - Perfect for Loops (misprediction)
Slide 28 - 2 Bit Branch predictor Fig 2.4
Slide 29 - Figure 2.5 2-bit predictor accuracy
Slide 30 - 2-bit versus infinite buffer
Slide 31 - Correlating Branch Predictors
Slide 32 – (m,n) predictors

Dynamic Branch Prediction
Dynamic?

Branch Prediction Buffers – branch history table
• Table indexed by low order bits of the address of the branch
• Remembers where we branched last time (saves actual targets)
• IDEA – predict we will go the same way as we did last time

2 Bit Branch predictor Fig 2.4

N-bit predictors

2 Bit Saturating Counter Branch predictor
http://en.wikipedia.org/wiki/File:Branch_prediction_2bit_saturating_counter.gif

Figure 2.5 2-bit predictor accuracy

2-bit 4K versus infinite buffer

Correlating Branch Predictors
if (a == 2)
    a = 0;
if (b == 2)
    b = 0;
if (a != b) {   /* a != b is false whenever both assignments above executed */
    …
}

(m, n) predictors
The last m branches are used to choose
one of 2^m n-bit predictors

Tournament Branch Predictors

Figure 2.8 Comparison of Branch Predictors
H&P 2007 Elsevier, Inc

PopQuiz Review of Data Hazards
Loop:                   RAW
L.D        F0, 0(R1)
BNE     R1, R2, Loop    WAW
SUB.D      F6, F4, F2
MULT.D     F4, F6, F8

Tomasulo’s Overview
IBM 360 family:
How do you design a supercomputer with the same ISA
as a relatively cheap business machine?
This was before the invention of caches.

Key ideas
• register renaming
• out-of-order execution

Figure 2.9 Tomasulo

Tomasulo’s
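Multiple Reservation Stations for each Unit
• Op – operation to perform on the source operands
• Qj, Qk – reservation stations that will produce the source operands
  (0 means the value is already in Vj or Vk)
• Vj, Vk – values of the source operands
• A – immediate, then effective address, for loads and stores
• Busy – this station and its functional unit are occupied
Register File
• Qi – reservation station whose result will be written to this register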

Data Flow
Data flow: actual flow of data values among
instructions that produce results and those that
consume them
   • branches make flow dynamic, determine which
     instruction is supplier of data
Example:
      DADDU       R1,R2,R3
      BEQZ        R4,L
      DSUBU       R1,R5,R6
L:    …
      OR          R7,R1,R8
OR depends on DADDU or DSUBU?
Must preserve data flow on execution
(from CS252 S06 Lec7 ILP)

Register Renaming

DIV.D   F0,F2,F4
S.D     F6,0(R1)
SUB.D   F8,F10,F14
MUL.D   F6,F10,F8

Example page 98
1. L.D      F6, 32(R2)
2. L.D      F2, 44(R3)
3. MUL.D    F0, F2, F4
4. SUB.D    F8, F2, F6
5. DIV.D    F10, F0, F6
Cleverly chosen example (default input to simulator)
http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html

Figure 2.10 – Example which Cycle?

Figure 2.11

Figure 2.12.a Detailed Algorithm

Figure 2.12.b Detailed Algorithm

Figure 2.12.c Detailed Algorithm

Tomasulo Loop Example
Loop: L.D         F0, 0(R1)
      MUL.D       F4, F0, F2
      S.D         F4, 0(R1)
      DADDIU      R1, R1, -8
      BNE         R1, R2, Loop

Can’t be done on simulator! Can’t input DADDIU or BNE.

Figure 2.13 - Two active Iterations of loop

Observations on Tomasulo’s Alg
1. Tomasulo designed for the IBM 360/91
   http://www.columbia.edu/acis/history/36091.html

2. Does not require compiler to do all of the work
   Changes to hardware do not require changes to compiler

3. Designed before caches, but out-of-order execution (OoOE) really
   helps with cache misses
4. Dynamic scheduling is required for “speculation”

Homework Set 3
1. (Semi-review) A processor has a clock frequency of
5 GHz and is running a program that executes 5
billion instructions from start to finish. The
instruction mix of this program is 20% branches,
20% loads, 10% stores, and 50% ALU. The average
IPC is 1 for branches, 0.5 for loads, 1 for stores, and
2 for ALU instructions. What is the total execution
time for this program on this processor?

Homework Set 3: Problem 2
2. You are considering two possible enhancements for
the processor described in Problem 1.
One enhancement is a better memory organization,
which would improve the average IPC for load
instructions from 0.5 to 1.
The other enhancement is a new multiply-and-add
instruction that would reduce the number of ALU
instructions by 20% while still maintaining the
average IPC of 2 for the remaining ALU instructions.
Unfortunately, there is room on the processor chip
for only one of these two enhancements, so you
must choose the enhancement that provides better
overall performance. Which one would you choose?

HW 3: Tomasulo Problem 3
Show the first 8 steps of the execution of the code from
problem A.1 using the scheme from Tomasulo’s algorithm.

HW3: 4. Branch prediction problem
a. Show the state diagram for a 3-bit branch predictor.
b. Explain how the following instructions would be
handled with a branch prediction buffer with 64
buffer entries.
    … 0000 0100     BEQZ R2, skip
    … 0000 1000     BEQZ R3, next
    … 0000 1100     BNE R4, loop
Is there any conflict?

c. What distinguishes a tournament predictor from a
correlating branch predictor?


```
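The branch history table and the 2-bit saturating counter from the Dynamic Branch Prediction slides can be sketched in Python. This is a minimal sketch, not the course's or the textbook's implementation: the names `TwoBitPredictor` and `count_mispredictions`, the 4096-entry default (borrowed from the "2-bit 4K" slide title), the weakly-not-taken initial state, and the assumption of 4-byte instructions are all illustrative choices.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    Table size and initial state are assumptions made for this sketch.
    """

    def __init__(self, entries=4096, initial_state=1):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.entries = entries
        self.table = [initial_state] * entries

    def index(self, pc):
        # Assume 4-byte instructions: drop the two byte-offset bits,
        # then keep the low-order bits of the branch address.
        return (pc >> 2) & (self.entries - 1)

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome, saturating."""
        i = self.index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Predict, compare, then train on each outcome; return the miss count."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken 9 times and then not taken, run 3 times in a row.
    outcomes = ([True] * 9 + [False]) * 3
    print(count_mispredictions(TwoBitPredictor(), 0x40, outcomes))
```

With these assumptions the loop branch is mispredicted once on first entry and once at each loop exit, because a single not-taken outcome moves the counter only from strongly taken to weakly taken. Two branches whose addresses map to the same index share one counter, which is the aliasing that a finite buffer introduces.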
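The code on the Register Renaming slide can be checked mechanically against the hazard definitions from the Review of Data Hazards slide. The Python sketch below lists the register hazards between instruction pairs, then renames every destination to a fresh tag, which stands in for the reservation station numbers that Tomasulo’s scheme uses. The names `Instr`, `hazards`, and `rename` and the `T0, T1, …` tags are hypothetical, and the check is deliberately simplified: registers only, no memory dependences, and every earlier/later pair is compared without regard to intervening redefinitions.

```python
from collections import namedtuple

# dest is None for instructions that write no register (for example a store).
Instr = namedtuple("Instr", ["op", "dest", "srcs"])

# The four instructions from the Register Renaming slide.
PROGRAM = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("S.D", None, ("F6", "R1")),
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]


def hazards(program):
    """Return a set of (kind, i, j, register) with instruction i before j."""
    found = set()
    for i, first in enumerate(program):
        for j in range(i + 1, len(program)):
            later = program[j]
            if first.dest is not None and first.dest in later.srcs:
                found.add(("RAW", i, j, first.dest))   # j reads what i writes
            if later.dest is not None and later.dest in first.srcs:
                found.add(("WAR", i, j, later.dest))   # j writes what i reads
            if first.dest is not None and first.dest == later.dest:
                found.add(("WAW", i, j, first.dest))   # both write the same
    return found


def rename(program, prefix="T"):
    """Give each destination a fresh name; sources read the latest name."""
    latest = {}      # architectural register -> most recent fresh name
    renamed = []
    for ins in program:
        srcs = tuple(latest.get(reg, reg) for reg in ins.srcs)
        dest = None
        if ins.dest is not None:
            dest = f"{prefix}{len(latest_names(renamed))}"
            latest[ins.dest] = dest
        renamed.append(Instr(ins.op, dest, srcs))
    return renamed


def latest_names(renamed):
    """Fresh names handed out so far (used to number the next one)."""
    return [ins.dest for ins in renamed if ins.dest is not None]


if __name__ == "__main__":
    print(sorted(hazards(PROGRAM)))
    print(sorted(hazards(rename(PROGRAM))))
```

On the slide's code the check reports a WAR hazard on F6 between S.D and MUL.D and a RAW hazard on F8 between SUB.D and MUL.D. After renaming, only the RAW hazard remains, which is the true data dependence that the hardware must still honor.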