					CSCE 513 Computer Architecture

                 Lecture 8
            Tomasulo’s Algorithm

            Topics
                    Dynamic Scheduling Review
                    Tomasulo’s
                       structure
                       Examples
                       Algorithm details
                    Speculation

            Readings: Chapter 2: 2.4-2.6
September 13, 2011
 Overview
Last Time
         Stalls in Diagrams revisited
         Scoreboard Review
         Dealing with Control Hazards in the 5-stage:
            Static, dynamic branch prediction, branch history table

New
         Control Hazards: Lecture 7 slides 24-34
            Correlating and tournament branch predictors
         Data Hazards Review
         Tomasulo Overview, examples
         Tomasulo’s Algorithm details

References
         Chapter 2, sections 2.3 (branch prediction) and 2.4-2.5
          (Tomasulo’s algorithm)
–2–      Test 1 Thursday September 29 – two weeks+        CSCE 513 Fall 2011
Chapter 2 – Instruction Level Parallelism
 • Data Hazards review
 • Assumptions on latencies of floating point operations
 • Data Hazards
 • Loop unrolling
 • Control Hazards
 • Static Branch Prediction
 • Dynamic Branch Prediction




Review of Data Hazards
 Assume instruction i comes before instruction j
 • Instruction j is data dependent on i if either
      •   i produces a result that j uses, or
      •   j depends on k and k depends on i

 • Name dependence – two instructions use the same
   register (name) with no flow of data between them
 • Antidependence when j writes a register that i reads
 • Output dependence when both i and j write the same
   register
 • Hazards
      •   RAW – j tries to read before i has written
      •   WAW – both i and j write but j writes first
      •   WAR – j overwrites an operand of i before i reads it
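The three hazard classes can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch, not part of the lecture: the `(dest, sources)` tuple encoding and the function name `hazards` are assumptions made for illustration, and it reports potential hazards (dependences) without modeling pipeline timing.

```python
def hazards(i, j):
    """Classify the potential hazards between an earlier instruction i and a later one j.

    Each instruction is a (dest, sources) pair; dest is None when the
    instruction writes no register (a store or a branch).
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = set()
    if i_dest is not None and i_dest in j_srcs:
        found.add("RAW")   # j reads what i writes (true data dependence)
    if j_dest is not None and j_dest in i_srcs:
        found.add("WAR")   # j writes what i reads (antidependence)
    if i_dest is not None and i_dest == j_dest:
        found.add("WAW")   # both write the same register (output dependence)
    return found


div_d = ("F0", ("F2", "F4"))    # DIV.D F0, F2, F4
add_d = ("F6", ("F0", "F8"))    # ADD.D F6, F0, F8
sub_d = ("F8", ("F10", "F14"))  # SUB.D F8, F10, F14
mul_d = ("F6", ("F10", "F8"))   # MUL.D F6, F10, F8

print(hazards(div_d, add_d))    # {'RAW'}
print(hazards(add_d, sub_d))    # {'WAR'}
print(hazards(add_d, mul_d))    # {'WAW'}
```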
Chapter 2 – Latency Assumptions

 Instruction       Instruction       Latency      Stalls between
 producing value   using value       (cycles)     (cycles)

 FP ALU Op         FP ALU Op            4              3
 FP ALU Op         Store Double         3              2
 Load Double       FP ALU Op            1              1
 Load Double       Store Double         1              0
 Integer Op        Integer Op           1              0

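The stall column can be applied mechanically to a short instruction sequence. This is a rough sketch under stated assumptions: an in-order, single-issue pipeline; the kind names (`fp_alu`, `load`, `store`, `int`) and the function `issue_cycles` are hypothetical; producer/consumer pairs missing from the table are assumed to need no stall; and the three-instruction fragment is chosen only for illustration.

```python
# Stall cycles between a producer and a dependent consumer (stall column of the table).
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def issue_cycles(program):
    """Return the issue cycle of each instruction on an in-order, single-issue pipeline.

    program is a list of (kind, dest, sources); pairs missing from STALLS
    are assumed to need no stall.
    """
    cycles = []
    last_writer = {}                    # register -> index of its latest producer
    for kind, dest, sources in program:
        earliest = cycles[-1] + 1 if cycles else 1
        for reg in sources:
            if reg in last_writer:
                p = last_writer[reg]
                gap = STALLS.get((program[p][0], kind), 0)
                earliest = max(earliest, cycles[p] + gap + 1)
        cycles.append(earliest)
        if dest is not None:
            last_writer[dest] = len(cycles) - 1
    return cycles


fragment = [
    ("load", "F0", ("R1",)),            # L.D   F0, 0(R1)
    ("fp_alu", "F4", ("F0", "F2")),     # ADD.D F4, F0, F2
    ("store", None, ("F4", "R1")),      # S.D   F4, 0(R1)
]
print(issue_cycles(fragment))           # [1, 3, 6]
```

The gaps in the result (cycle 2, then cycles 4 and 5) are the stalls: one after the load and two before the store.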
Loop Example Analysis

Loop Statically Scheduled

Loop Unrolled not scheduled

Loop unrolled and Scheduled

 Branch Prediction Errors/Stalls
• Standard 5-stage pipeline – branch decision made during EX

• Extra hardware to make the decision in ID

• Support for Branch Conditions
     •   Classical 5-stage   jnz   R1, loop
     •   MIPS                bne   R1, R2, loop
     •   IA32                jl    loop     (tests condition codes set by an earlier cmp)

Reducing Branch Costs with Prediction
   Static branch prediction
   • Observe branch statistics from a program suite or a
     specific program
   • Pass a hint (flag) to the compiler or rewrite the code
         •   if (x < y) then …A else …B      mispredicts 60%, so rewrite as
         •   if (x >= y) then …B else …A     mispredicts 40%

   Figure 2.3

 Figure 2.3 Misprediction rates for SPEC92




 Static Branch Prediction
   • Predict Branch Not Taken (BNT)
   • Predict Branch Taken (BT)
   • Predict Branch Backwards, Not Forward (BBNF): backward branches taken, forward not taken
   • Predict Branch based on profiling the program
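
The static schemes can be compared on a recorded branch trace. This is only a sketch: the trace below is made up for illustration (one loop-closing backward branch and one rarely taken forward branch), and the function names and the `(pc, target, taken)` encoding are assumptions, not anything from the text.

```python
def predict_not_taken(pc, target):
    return False


def predict_taken(pc, target):
    return True


def predict_backward_taken(pc, target):
    return target < pc   # backward branches (typically loops) predicted taken


def misprediction_rate(predictor, trace):
    """trace is a list of (pc, target, taken) records."""
    wrong = sum(1 for pc, target, taken in trace if predictor(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a backward loop branch taken 9 times out of 10,
# and a forward branch taken 2 times out of 10.
trace = ([(0x20, 0x10, True)] * 9 + [(0x20, 0x10, False)]
         + [(0x14, 0x1C, False)] * 8 + [(0x14, 0x1C, True)] * 2)

for predictor in (predict_not_taken, predict_taken, predict_backward_taken):
    print(predictor.__name__, misprediction_rate(predictor, trace))
```

On this particular trace the backward/forward heuristic mispredicts 3 of 20 branches, against 11 for always-not-taken and 9 for always-taken; real programs will differ, which is why profiling is listed as a separate option.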




 From Lecture 7: slides 27-32
   Slide 27 - Perfect for Loops (misprediction)
   Slide 28 - 2 Bit Branch predictor Fig 2.4
   Slide 29 - Figure 2.5 2-bit predictor accuracy
   Slide 30 - 2-bit versus infinite buffer
   Slide 31 - Correlating Branch Predictors
   Slide 32 – (m,n) predictors




 Dynamic Branch Prediction
   Dynamic?


   Branch Prediction Buffers – branch history table
   • Table indexed by low order bits of the address of the
     branch
   • Remembers whether the branch was taken last time (a
     branch-target buffer would also save the actual target)
   • IDEA – predict we will go the same way as we did last
     time
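
A one-bit version of this idea can be sketched as follows. The class name, the 16-entry table size, the initial not-taken state, and the assumption of 4-byte instructions (so the two lowest address bits are dropped before indexing) are all illustrative choices, not details from the lecture.

```python
class OneBitBHT:
    """Branch history table: one 'taken last time' bit per entry."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)   # assume not taken at start

    def _index(self, pc):
        return (pc >> 2) & self.mask   # low-order bits of the word address

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken


def mispredictions(predictor, trace):
    """trace is a list of (pc, taken); predict first, then record the outcome."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch taken three times and then falling through, run three times.
loop_branch = [(0x40, True)] * 3 + [(0x40, False)]
print(mispredictions(OneBitBHT(), loop_branch * 3))   # 6
```

The count of two mispredictions per pass through the loop (once on exit, once on re-entry) is the weakness of the one-bit scheme on loop branches.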



 2 Bit Branch predictor Fig 2.4




         N-bit predictors
2 Bit Saturating Counter Branch predictor
  http://en.wikipedia.org/wiki/File:Branch_prediction_2bit_saturating_counter.gif

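The saturating counter is small enough to sketch directly. Counter values 0 and 1 predict not taken, 2 and 3 predict taken, and each outcome moves the counter one step toward that outcome. The class name, the 16-entry table, the initial state of 1 (weakly not taken), and the 4-byte instruction alignment are assumptions for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order branch address bits."""

    def __init__(self, index_bits=4, initial=1):
        self.mask = (1 << index_bits) - 1
        self.counters = [initial] * (1 << index_bits)

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # 2 or 3 means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)   # saturate at 3
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)   # saturate at 0


def mispredictions(predictor, trace):
    """trace is a list of (pc, taken); predict first, then record the outcome."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch taken three times and then falling through, run three times.
loop_branch = [(0x40, True)] * 3 + [(0x40, False)]
print(mispredictions(TwoBitPredictor(), loop_branch * 3))   # 4
```

After the first pass warms the counter up, only the loop exit is mispredicted on each later pass, because a single not-taken outcome no longer flips the prediction.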
 Figure 2.5 2-bit predictor accuracy

 2-bit 4K versus infinite buffer

 Correlating Branch Predictors
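   if (a == 2)
           a = 0;
   if (b == 2)
           b = 0;
   if (a != b) {
           …
   }
   When the first two conditions both hold, a and b are both 0, so the
   third condition (a != b) must be false: its outcome is correlated
   with the two earlier branches.
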
 (m, n) predictors
   The outcomes of the last m branches are used to select
   one of 2^m n-bit predictors for the current branch
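
An (m, n) predictor can be sketched as a table with 2^m n-bit counters per entry, where a global m-bit history register picks the counter. The class name, the table size, the all-zero initial state, and the alternating test trace are assumptions for illustration; setting m = 0 gives a plain n-bit predictor with no history.

```python
class CorrelatingPredictor:
    """(m, n) predictor: global m-bit history selects one of 2**m n-bit counters."""

    def __init__(self, m=2, n=2, index_bits=4):
        self.m, self.n = m, n
        self.history = 0                          # last m outcomes, newest in bit 0
        self.max_count = (1 << n) - 1
        self.mask = (1 << index_bits) - 1
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.table[(pc >> 2) & self.mask]

    def predict(self, pc):
        # Taken when the selected counter is in its upper half.
        return self._row(pc)[self.history] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, trace):
    """trace is a list of (pc, taken); predict first, then record the outcome."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A branch that strictly alternates taken / not taken, 40 executions.
alternating = [(0x40, True), (0x40, False)] * 20
print(mispredictions(CorrelatingPredictor(m=2, n=2), alternating))   # 3
print(mispredictions(CorrelatingPredictor(m=0, n=2), alternating))   # 20
```

With two bits of history the pattern is learned after a short warm-up, while the history-free 2-bit counter keeps mispredicting every taken instance of this branch.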




 Tournament Branch Predictors




 Figure 2.8 Comparison of Branch Predictors
           H&P 2007 Elsevier, Inc

 PopQuiz Review of Data Hazards
  Loop:                   RAW
  L.D        F0, 0(R1)
  ADD.D      F4, F0, F6
  BNE     R1, R2, Loop    WAW
  SUB.D      F6, F4, F2
  MUL.D      F4, F6, F8
  ADD.D      F8, F2, F6   WAR


 Tomasulo’s Overview
   IBM 360 family –
   How do you design a supercomputer with the same ISA
     as a relatively cheap business machine?
   This was before the invention of the cache.


   Key ideas
         register renaming
         out of order execution




    Figure 2.9 Tomasulo




 Tomasulo’s
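   Multiple Reservation Stations for each Unit
   • Op – operation to perform on the source operands
   • Qj, Qk – reservation stations that will produce the source
     operands (empty when the value is already in Vj or Vk)
   • Vj, Vk – values of the source operands
   • A – immediate, then effective address, for a load or store
   • Busy – the station and its functional unit are occupied
   Register File
   • Qi – reservation station whose result will be written to this register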
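
These fields map naturally onto a small record type. The sketch below shows only the issue step for a two-operand floating-point instruction: each source is captured either as a value (V) or as the tag of the station that will produce it (Q), and the destination register's Qi is pointed at the issuing station. The names `ReservationStation` and `issue`, the use of `None` for an empty Q field, and the register contents are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str                    # tag broadcast with the result, e.g. "Mult1"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations still producing them
    qk: Optional[str] = None
    a: Optional[int] = None      # address field, used by loads and stores


def issue(rs: ReservationStation, op: str, dest: str, src_j: str, src_k: str,
          regs: Dict[str, float], qi: Dict[str, str]) -> None:
    """Issue a two-operand FP instruction into a free station (sketch only)."""
    rs.busy, rs.op = True, op
    # A source with a pending producer is captured as a tag, otherwise as a value.
    rs.qj = qi.get(src_j)
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = qi.get(src_k)
    rs.vk = None if rs.qk else regs[src_k]
    qi[dest] = rs.name           # later readers of dest now wait on this station


regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 6.0, "F10": 0.0}
qi = {"F2": "Load2"}             # F2 is still being loaded by station Load2
mult1, mult2 = ReservationStation("Mult1"), ReservationStation("Mult2")
issue(mult1, "MUL.D", "F0", "F2", "F4", regs, qi)
issue(mult2, "DIV.D", "F10", "F0", "F6", regs, qi)
print(mult1.qj, mult1.vk)        # Load2 4.0
print(mult2.qj, mult2.vk)        # Mult1 6.0
```

The second instruction never looks at register F0 itself: it records that its operand will arrive tagged Mult1, which is how the tags act as renamed registers.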



 Data Flow
  Data flow: actual flow of data values among
    instructions that produce results and those that
    consume them
            branches make flow dynamic, determine which
             instruction is supplier of data
  Example:
         DADDU       R1,R2,R3
         BEQZ        R4,L
         DSUBU       R1,R5,R6
         L: …
         OR          R7,R1,R8
  OR depends on DADDU or DSUBU?
    Must preserve data flow on execution

 Register Renaming

  DIV.D   F0, F2, F4
  ADD.D   F6, F0, F8
  S.D     F6, 0(R1)
  SUB.D   F8, F10, F14
  MUL.D   F6, F10, F8
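
In this sequence, SUB.D has a WAR hazard with ADD.D on F8 and MUL.D has a WAW hazard with ADD.D on F6; renaming removes both. The sketch below is one simple way to do it in software, assuming an unlimited supply of temporaries named T0, T1, …; it renames every destination, which is more than strictly needed, and the `(op, dest, sources)` encoding and the function name `rename` are hypothetical.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name and redirect later readers to it.

    program is a list of (op, dest, sources); dest is None for stores.
    """
    fresh = (f"T{n}" for n in count())
    current = {}                 # architectural register -> latest renamed version
    renamed = []
    for op, dest, sources in program:
        new_sources = tuple(current.get(r, r) for r in sources)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
for line in rename(program):
    print(line)
```

After renaming, SUB.D writes T2 while ADD.D still reads the original F8, and MUL.D writes T3 rather than the T1 written by ADD.D, so only the true (RAW) dependences remain.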




 Example page 98
   1. L.D      F6, 32(R2)
   2. L.D      F2, 44(R3)
   3. MUL.D    F0, F2, F4
   4. SUB.D    F8, F2, F6
   5. DIV.D    F10, F0, F6
   6. ADD.D    F6, F8, F2
   Cleverly chosen example (default input to simulator)
   http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html
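
For readers without the applet, the timing of the six instructions can be approximated with a few lines of code. This is a heavily simplified sketch, not the textbook figure and not the applet's model: it assumes one instruction issues per cycle, reservation stations and functional units are unlimited, a result broadcast in cycle c lets consumers start executing in cycle c + 1, and the latencies in `LATENCY` are made-up values.

```python
# Hypothetical execution latencies in cycles (assumptions, not from the text).
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}


def simulate(program):
    """Return (op, issue, exec_done, write_result) cycles for each instruction."""
    ready = {}   # register -> cycle in which its pending value is broadcast
    rows = []
    for n, (op, dest, sources) in enumerate(program, start=1):
        issue = n                                      # in-order, one per cycle
        # Operand availability is captured at issue time, like Q tags.
        operands = max((ready.get(r, 0) for r in sources), default=0)
        start = max(issue, operands) + 1
        done = start + LATENCY[op] - 1
        write = done + 1
        ready[dest] = write                            # later readers wait on this
        rows.append((op, issue, done, write))
    return rows


program = [
    ("L.D", "F6", ("R2",)),
    ("L.D", "F2", ("R3",)),
    ("MUL.D", "F0", ("F2", "F4")),
    ("SUB.D", "F8", ("F2", "F6")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6", ("F8", "F2")),
]
for row in simulate(program):
    print(row)
```

Under these assumptions ADD.D writes F6 in cycle 11 while DIV.D, which reads the older F6, does not finish until cycle 57. Because DIV.D captured its operand source at issue, the early write is harmless, which is the WAR hazard this example is built to show.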


 Figure 2.10 – Example: which cycle?

 Figure 2.11

 Figure 2.12.a Detailed Algorithm

 Figure 2.12.b Detailed Algorithm

 Figure 2.12.c Detailed Algorithm

 Tomasulo Loop Example
   Loop: L.D         F0, 0(R1)
         MUL.D       F4, F0, F2
         S.D         F4, 0(R1)
         DADDIU      R1, R1, -8
         BNE         R1, R2, Loop




   Can’t be done on simulator! Can’t input DADDIU or BNE.



 Figure 2.13 - Two active Iterations of loop




 Observations on Tomasulo’s Alg
   1. Tomasulo designed the algorithm for the IBM 360/91
            http://www.columbia.edu/acis/history/36091.html

   2. Does not require compiler to do all of the work
            Changes to hardware do not require changes to compiler
             (adding another multiplier)

   3. Designed before caches, but OoOE really helps with
      cache misses
   4. Dynamic scheduling is required for “speculation”




 Homework Set 3
   1. (Semi-review) A processor has a clock frequency of
      5GHz and is running a program that executes 5
      billion instructions from start to finish. The
      instruction mix of this program is 20% branches,
      20% loads, 10% stores, and 50% ALU. The average
      IPC is 1 for branches, 0.5 for loads, 1 for stores, and
      2 for ALU instructions. What is the total execution
      time for this program on this processor?




 Homework Set 3: Problem 2
   2. You are considering two possible enhancements for
      the processor described in Problem 1.
   One enhancement is a better memory organization,
     which would improve the average IPC for load
     instructions from 0.5 to 1.
   The other enhancement is a new multiply-and-add
     instruction that would reduce the number of ALU
     instructions by 20% while still maintaining the
     average IPC of 2 for the remaining ALU instructions.
     Unfortunately, there is room on the processor chip
     for only one of these two enhancements, so you
     must choose the enhancement that provides better
     overall performance. Which one would you choose?
 HW 3: Tomasulo Problem 3
   Show the first 8 steps of the execution of the code from
     problem A.1 using the scheme from Tomasulo’s alg.




 HW3: 4. Branch prediction problem
   a. Show the state diagram for a 3-bit branch predictor.
   b. Explain how the following instructions would be
      handled with a branch prediction buffer with 64
      buffer entries.
             Addresses       Branch Instruction
             … 0000 0100     BEQZ R2, skip
             … 0000 1000     BEQZ R3, next
             … 0000 1100     BNE R4, loop
         Is there any conflict?

   c. What distinguishes a tournament predictor from a
      correlating branch predictor?

