Advanced Computer Architecture
Chapter 4: Advanced Pipelining

Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
            Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP




                        Chapter Overview
                 Technique                                Reduces           Section
Loop Unrolling                                Control Stalls                4.1

Basic Pipeline Scheduling                     RAW Stalls                    4.1

Dynamic Scheduling with Scoreboarding         RAW stalls                    4.2

Dynamic Scheduling with Register Renaming     WAR and WAW stalls            4.2

Dynamic Branch Prediction                     Control Stalls                4.3

Issue Multiple Instructions per Cycle         Ideal CPI                     4.4

Compiler Dependence Analysis                  Ideal CPI & data stalls       4.5


Software pipelining and trace scheduling      Ideal CPI & data stalls       4.5

Speculation                                   All data & control stalls     4.6

Dynamic memory disambiguation                 RAW stalls involving memory   4.2, 4.6




Instruction Level Parallelism

4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

ILP is the principle that there are many instructions in code that don't
depend on each other. That means it's possible to execute those
instructions in parallel.

This is easier said than done. Issues include:
• Building compilers to analyze the code,
• Building hardware to be even smarter than that code.

This section looks at some of the problems to be solved.



Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
                       Terminology
Basic Block - a straight-line sequence of instructions between entry points and
  branches, with only one entry and one exit. Typically about 6 instructions long.

Loop-Level Parallelism - parallelism that exists within a loop, among its
  iterations. Such parallelism can cross loop iterations.

Loop Unrolling - replicating the loop body several times so that either the
  compiler or the hardware can exploit the parallelism inherent in the loop.




Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling

Simple Loop and its Assembler Equivalent

This is a clean and simple example!

 for (i=1; i<=1000; i++)
     x(i) = x(i) + s;

 Loop:   LD     F0,0(R1)     ;F0=vector element
         ADDD   F4,F0,F2     ;add scalar from F2
         SD     0(R1),F4     ;store result
         SUBI   R1,R1,8      ;decrement pointer 8bytes (DW)
         BNEZ   R1,Loop      ;branch R1!=zero
         NOP                 ;delayed branch slot




Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
                        FP Loop Hazards
    Loop:   LD          F0,0(R1)         ;F0=vector element
            ADDD        F4,F0,F2         ;add scalar in F2
            SD          0(R1),F4         ;store result
            SUBI        R1,R1,8          ;decrement pointer 8B (DW)
            BNEZ        R1,Loop          ;branch R1!=zero
            NOP                          ;delayed branch slot

         Instruction         Instruction              Latency in
         producing result    using result             clock cycles
         FP ALU op           Another FP ALU op        3
         FP ALU op           Store double             2
         Load double         FP ALU op                1
         Load double         Store double             0
         Integer op          Integer op               0

Where are the stalls?

Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling

                 FP Loop Showing Stalls
    1 Loop:    LD      F0,0(R1)   ;F0=vector element
    2          stall
    3          ADDD    F4,F0,F2   ;add scalar in F2
    4          stall
    5          stall
    6          SD      0(R1),F4   ;store result
    7          SUBI    R1,R1,8    ;decrement pointer 8Byte (DW)
    8          stall
    9          BNEZ    R1,Loop    ;branch R1!=zero
    10         stall              ;delayed branch slot
          Instruction        Instruction              Latency in
          producing result   using result             clock cycles
          FP ALU op          Another FP ALU op        3
          FP ALU op          Store double             2
          Load double        FP ALU op                1
          Load double        Store double             0
          Integer op         Integer op               0

10 clocks: rewrite the code to minimize stalls?

Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
     Scheduled FP Loop Minimizing Stalls
    1 Loop:   LD      F0,0(R1)
    2         SUBI    R1,R1,8
    3         ADDD    F4,F0,F2
    4         stall               ;stall because SD can't proceed yet
    5         BNEZ    R1,Loop     ;delayed branch
    6         SD      8(R1),F4    ;altered when moved past SUBI

    Swap BNEZ and SD by changing the address of SD.
      Instruction        Instruction           Latency in
      producing result   using result          clock cycles
      FP ALU op          Another FP ALU op     3
      FP ALU op          Store double          2
      Load double        FP ALU op             1

Now 6 clocks: unroll the loop 4 times to make it faster.

Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
  Unroll Loop Four Times (straightforward way)
     1 Loop:   LD      F0,0(R1)
     2         stall
     3         ADDD    F4,F0,F2
     4         stall
     5         stall
     6         SD      0(R1),F4
     7         LD      F6,-8(R1)
     8         stall
     9         ADDD    F8,F6,F2
    10         stall
    11         stall
    12         SD      -8(R1),F8
    13         LD      F10,-16(R1)
    14         stall
    15         ADDD    F12,F10,F2
    16         stall
    17         stall
    18         SD      -16(R1),F12
    19         LD      F14,-24(R1)
    20         stall
    21         ADDD    F16,F14,F2
    22         stall
    23         stall
    24         SD      -24(R1),F16
    25         SUBI    R1,R1,#32
    26         BNEZ    R1,LOOP
    27         stall
    28         NOP
15 + 4 x (1+2) +1 = 28 clock cycles, or 7 per iteration
 Assumes the number of loop iterations is a multiple of 4
    Rewrite loop to minimize stalls.
Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
 Unrolled Loop That Minimizes Stalls
 1 Loop:   LD       F0,0(R1)
 2         LD       F6,-8(R1)
 3         LD       F10,-16(R1)
 4         LD       F14,-24(R1)
 5         ADDD     F4,F0,F2
 6         ADDD     F8,F6,F2
 7         ADDD     F12,F10,F2
 8         ADDD     F16,F14,F2
 9         SD       0(R1),F4
10         SD       -8(R1),F8
11         SD       -16(R1),F12
12         SUBI     R1,R1,#32
13         BNEZ     R1,LOOP
14         SD       8(R1),F16         ; 8-32 = -24

No stalls! 14 clock cycles, or 3.5 per iteration.

What assumptions were made when the code was moved?
  – OK to move the store past SUBI even though it changes the register offset.
  – OK to move loads before stores: do we get the right data?
  – When is it safe for the compiler to make such changes?
Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
    Summary of Loop Unrolling Example
•    Determine that it was legal to move the SD after the SUBI and BNEZ,
     and find the amount to adjust the SD offset.
•    Determine that unrolling the loop would be useful by finding that the
     loop iterations were independent, except for the loop maintenance
     code.

•    Use different registers to avoid unnecessary constraints that would
     be forced by using the same registers for different computations.
•    Eliminate the extra tests and branches and adjust the loop
     maintenance code.

•    Determine that the loads and stores in the unrolled loop can be
     interchanged by observing that the loads and stores from different
     iterations are independent. This requires analyzing the memory
     addresses and finding that they do not refer to the same address.
•    Schedule the code, preserving any dependences needed to yield the
     same result as the original code.
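
To make these steps concrete, here is a rough C-level sketch (not from the original slides) of what the 4-way unrolled loop computes. The function name, the explicit element count n, and the temporaries are illustrative assumptions; n is assumed to be a multiple of 4.

/* Sketch: the loop  for (i=1; i<=1000; i++) x(i) = x(i) + s;  unrolled by 4.
 * Assumes n is a multiple of 4.  Distinct temporaries t0..t3 play the role
 * of the distinct FP registers (F4, F8, F12, F16) in the unrolled DLX code. */
void add_scalar_unrolled(double *x, double s, int n)
{
    for (int i = 0; i < n; i += 4) {
        double t0 = x[i]     + s;   /* four independent adds ...           */
        double t1 = x[i + 1] + s;
        double t2 = x[i + 2] + s;
        double t3 = x[i + 3] + s;
        x[i]     = t0;              /* ... then the grouped stores          */
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }                               /* one loop test/branch per 4 elements  */
}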
Instruction Level Parallelism: Dependencies
Compiler Perspectives on Code Movement
 The compiler is concerned about dependencies in the program; whether a
   dependence causes an actual HW hazard depends on the given pipeline.
 • Tries to schedule code to avoid hazards.
 • Looks for data dependencies (RAW if a hazard for HW):
     – Instruction i produces a result used by instruction j, or
     – Instruction j is data dependent on instruction k, and instruction k is data
       dependent on instruction i.
 • If dependent, they can't execute in parallel.
 • Easy to determine for registers (fixed names)
 • Hard for memory:
     – Does 100(R4) = 20(R6)?
     – From different loop iterations, does 20(R6) = 20(R6)?
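
As a small illustration (a hypothetical function, not from the slides) of why this is hard for memory: the compiler usually cannot prove that two pointer-based accesses never touch the same address.

/* Sketch: may the two statements be reordered?  Only if the compiler can
 * prove a + i and b + j never name the same location -- the C analogue of
 * "does 100(R4) = 20(R6)?".  With arbitrary pointers it usually cannot. */
void update(double *a, double *b, int i, int j)
{
    a[i] = a[i] + 1.0;   /* store to a[i]                         */
    b[j] = b[j] * 2.0;   /* load and store of b[j]: same address? */
}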



Instruction Level Parallelism: Data Dependencies

Compiler Perspectives on Code Movement

                                              Where are the data
                                               dependencies?
  1 Loop:   LD     F0,0(R1)
  2         ADDD   F4,F0,F2
  3         SUBI   R1,R1,8
  4         BNEZ   R1,Loop       ;delayed branch
  5         SD     8(R1),F4      ;altered when move past SUBI




Instruction Level Parallelism: Name Dependencies

Compiler Perspectives on Code Movement

•    Another kind of dependence is called a name dependence:
     two instructions use the same name (register or memory location) but don't
     exchange data
•     Anti-dependence (WAR if a hazard for HW)
      – Instruction j writes a register or memory location that instruction i reads from
         and instruction i is executed first
•     Output dependence (WAW if a hazard for HW)
      – Instruction i and instruction j write the same register or memory location;
         ordering between instructions must be preserved.



Instruction Level Parallelism: Name Dependencies
Compiler Perspectives on Code Movement
  Where are the name dependencies?
  (No data is passed in F0, but F0 can't be reused in instruction 4.)

  1 Loop:   LD     F0,0(R1)
  2         ADDD   F4,F0,F2
  3         SD     0(R1),F4
  4         LD     F0,-8(R1)
  5         ADDD   F4,F0,F2
  6         SD     -8(R1),F4
  7         LD     F0,-16(R1)
  8         ADDD   F4,F0,F2
  9         SD     -16(R1),F4
  10        LD     F0,-24(R1)
  11        ADDD   F4,F0,F2
  12        SD     -24(R1),F4
  13        SUBI   R1,R1,#32
  14        BNEZ   R1,LOOP
  15        NOP
  How can we remove these dependencies?
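
One way, sketched below in C rather than DLX assembly, is renaming: give each value its own register or temporary so that only true data dependences remain. The function names and temporaries here are illustrative, not from the slides.

/* Before renaming: the single temporary t is reused, creating WAR/WAW
 * (name) dependences between the two groups even though no data flows
 * through t from one group to the next. */
void sum_reused(double *x, double s)
{
    double t;
    t = x[0] + s;  x[0] = t;
    t = x[1] + s;  x[1] = t;     /* must wait until the old t is dead */
}

/* After renaming: each value has its own name, so the two adds are
 * independent and can be executed in parallel. */
void sum_renamed(double *x, double s)
{
    double t0 = x[0] + s;
    double t1 = x[1] + s;
    x[0] = t0;
    x[1] = t1;
}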
Instruction Level Parallelism: Name Dependencies
Compiler Perspectives on Code Movement
 •   Again Name Dependencies are Hard for Memory Accesses
      – Does 100(R4) = 20(R6)?
      – From different loop iterations, does 20(R6) = 20(R6)?
 •   Our example required the compiler to know that if R1 doesn't change then:

     0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

     There were no dependencies between some loads and stores so they
     could be moved around each other




Instruction Level Parallelism: Control Dependencies
    Compiler Perspectives on Code Movement



•    Final kind of dependence called control dependence
•    Example
           if p1 {S1;};
           if p2 {S2;};
     S1 is control dependent on p1 and S2 is control dependent on p2 but not
     on p1.




Instruction Level Parallelism: Control Dependencies
Compiler Perspectives on Code Movement

•   Two (obvious) constraints on control dependences:
     – An instruction that is control dependent on a branch cannot be moved
       before the branch so that its execution is no longer controlled by the
       branch.

     – An instruction that is not control dependent on a branch cannot be
       moved to after the branch so that its execution is controlled by the
       branch.

•   Control dependencies can be relaxed to gain parallelism; we get the same
    effect if we preserve the order of exceptions (e.g., an address in a register
    is checked by a branch before use) and the data flow (a value in a register
    that depends on the branch).
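
A small, hypothetical C example of the exception-ordering point: the load is control dependent on the test, and hoisting it above the branch could fault even though the original program never would.

/* The load *p is control dependent on the test.  Moving it above the
 * branch would let a NULL pointer be dereferenced, changing exception
 * behavior, so the compiler (or hardware) may only do this speculatively
 * if it can guarantee the same observable behavior. */
int read_guarded(int *p)
{
    int x = 0;
    if (p != 0)
        x = *p;      /* control dependent on (p != 0) */
    return x;
}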


Instruction Level Parallelism: Control Dependencies
Compiler Perspectives on Code Movement
   1 Loop:   LD     F0,0(R1)
   2         ADDD   F4,F0,F2
   3         SD     0(R1),F4
   4         SUBI   R1,R1,8             Where are the control
   5         BEQZ   R1,exit               dependencies?
   6         LD     F0,0(R1)
   7         ADDD   F4,F0,F2
   8         SD     0(R1),F4
   9         SUBI   R1,R1,8
   10        BEQZ   R1,exit
   11        LD     F0,0(R1)
   12        ADDD   F4,F0,F2
   13        SD     0(R1),F4
   14        SUBI   R1,R1,8
   15        BEQZ   R1,exit
  ....
Instruction Level Parallelism: Loop Level Parallelism
               When Safe to Unroll Loop?
• Example: Where are data dependencies?
  (A,B,C distinct & non-overlapping)
                                     for (i=1; i<=100; i=i+1) {
                                                A[i+1] = A[i] + C[i];   /* S1 */
                                                B[i+1] = B[i] + A[i+1]; /* S2 */
                                     }

    1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
    2. S1 uses a value computed by S1 in an earlier iteration, since
       iteration i computes A[i+1] which is read in iteration i+1. The same
       is true of S2 for B[i] and B[i+1].
        This is a "loop-carried dependence" between iterations.

• Implies that the iterations are dependent and can't be executed in parallel.

• Note that this was not the case for our prior example; each iteration there
  was independent.

Instruction Level Parallelism: Loop Level Parallelism
            When Safe to Unroll Loop?
• Example: Where are data dependencies?
  (A,B,C,D distinct & non-overlapping)
                                      for (i=1; i<=100; i=i+1) {
                                                 A[i] = A[i] + B[i];     /* S1 */
                                                 B[i+1] = C[i] + D[i];   /* S2 */
                                      }


   1.   No dependence from S1 to S2. If there were, then there would be a
        cycle in the dependencies and the loop would not be parallel. Since
        this other dependence is absent, interchanging the two statements
        will not affect the execution of S2.
   2.   On the first iteration of the loop, statement S1 depends on the value
        of B[1] computed prior to initiating the loop.

Instruction Level Parallelism: Loop Level Parallelism

       Now Safe to Unroll Loop? (p. 240)

OLD:
           for (i=1; i<=100; i=i+1) {
                A[i] = A[i] + B[i];     /* S1 */
                B[i+1] = C[i] + D[i];   /* S2 */
           }
           No circular dependencies, but the loop carries a dependence on B.

NEW:
           A[1] = A[1] + B[1];
           for (i=1; i<=99; i=i+1) {
                B[i+1] = C[i] + D[i];
                A[i+1] = A[i+1] + B[i+1];
           }
           B[101] = C[100] + D[100];
           The loop-carried dependence has been eliminated.
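
As a sanity check (a hypothetical test driver, assuming A, B, C, and D are distinct arrays as stated), the OLD and NEW forms can be compared directly:

#include <stdio.h>

/* Self-check that the OLD and NEW loop forms compute the same result,
 * assuming A, B, C, D are distinct arrays (indices 1..101 are used). */
int main(void)
{
    double A1[102], B1[102], A2[102], B2[102], C[102], D[102];
    for (int i = 0; i <= 101; i++) {                 /* arbitrary test data */
        A1[i] = A2[i] = i * 0.5;
        B1[i] = B2[i] = i * 2.0;
        C[i] = i + 1.0;
        D[i] = 3.0 - i;
    }

    /* OLD: loop-carried dependence on B (S2 feeds S1 of the next iteration) */
    for (int i = 1; i <= 100; i++) {
        A1[i] = A1[i] + B1[i];        /* S1 */
        B1[i + 1] = C[i] + D[i];      /* S2 */
    }

    /* NEW: dependence removed by peeling the first S1 and the last S2 */
    A2[1] = A2[1] + B2[1];
    for (int i = 1; i <= 99; i++) {
        B2[i + 1] = C[i] + D[i];
        A2[i + 1] = A2[i + 1] + B2[i + 1];
    }
    B2[101] = C[100] + D[100];

    int same = 1;
    for (int i = 1; i <= 101; i++)
        same &= (A1[i] == A2[i]) && (B1[i] == B2[i]);
    printf("loops agree: %s\n", same ? "yes" : "no");
    return 0;
}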

Dynamic Scheduling

4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Dynamic scheduling is when the hardware rearranges the order of
instruction execution to reduce stalls.

Advantages:
• Dependencies unknown at compile time can be handled by the hardware.
• Code compiled for one type of pipeline can be run efficiently on another.

Disadvantages:
• The hardware is much more complex.




Dynamic Scheduling: The Idea
    HW Schemes: Instruction Parallelism
•   Why in HW at run time?
     – Works when can’t know real dependence at compile time
     – Compiler simpler
     – Code for one machine runs well on another
•   Key Idea: Allow instructions behind stall to proceed.
•   Key Idea: Instructions executing in parallel. There are multiple
    execution units, so use them.

         DIVD     F0,F2,F4
         ADDD     F10,F0,F8
         SUBD     F12,F8,F14
     – Enables out-of-order execution => out-of-order completion


Dynamic Scheduling: The Idea
    HW Schemes: Instruction Parallelism
•   Out-of-order execution divides ID stage:
     1. Issue—decode instructions, check for structural hazards
     2. Read operands—wait until no data hazards, then read operands
•   Scoreboards allow instruction to execute whenever 1 & 2 hold, not
    waiting for prior instructions.
•   A scoreboard is a "data structure" that provides the information
    necessary for all pieces of the processor to work together.
•   We will use in-order issue, out-of-order execution, out-of-order
    commit (also called completion).
•   First used in the CDC 6600. Our example is modified here for DLX.
•   The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units.
•   DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.


Dynamic Scheduling: Using A Scoreboard

               Scoreboard Implications
•   Out-of-order completion => WAR, WAW hazards?
•   Solutions for WAR
     – Queue both the operation and copies of its operands
     – Read registers only during Read Operands stage
•   For WAW, must detect hazard: stall until other completes
•   Need to have multiple instructions in execution phase => multiple
    execution units or pipelined execution units
•   Scoreboard keeps track of dependencies and the state of operations
•   Scoreboard replaces ID, EX, WB with 4 stages




Dynamic Scheduling: Using A Scoreboard

   Four Stages of Scoreboard Control
 1. Issue —decode instructions & check for structural hazards (ID1)
     If a functional unit for the instruction is free and no other active
        instruction has the same destination register (WAW), the
        scoreboard issues the instruction to the functional unit and
        updates its internal data structure.
     If a structural or WAW hazard exists, then the instruction issue
        stalls, and no further instructions will issue until these hazards
        are cleared.




Dynamic Scheduling: Using A Scoreboard

  Four Stages of Scoreboard Control
 2.   Read operands —wait until no data hazards, then read
       operands (ID2)

      A source operand is available if no earlier issued active
          instruction is going to write it, i.e., if no currently active
          functional unit has that register as its destination.
      When the source operands are available, the scoreboard tells
          the functional unit to proceed to read the operands from
          the registers and begin execution. The scoreboard
          resolves RAW hazards dynamically in this step, and
          instructions may be sent into execution out of order.



Dynamic Scheduling: Using A Scoreboard
   Four Stages of Scoreboard Control
 3. Execution —operate on operands (EX)
       The functional unit begins execution upon receiving
       operands. When the result is ready, it notifies the
       scoreboard that it has completed execution.

 4. Write result —finish execution (WB)
       Once the scoreboard is aware that the functional unit has
       completed execution, the scoreboard checks for WAR
       hazards. If none, it writes results. If WAR, then it stalls the
       instruction.
       Example:
                  DIVD    F0,F2,F4
                  ADDD F10,F0,F8
                  SUBD F8,F8,F14
       Scoreboard would stall SUBD until ADDD reads operands

Dynamic Scheduling: Using A Scoreboard
            Three Parts of the Scoreboard

1. Instruction status—which of the 4 steps the instruction is in.

2. Functional unit status—indicates the state of the functional unit (FU).
   Nine fields for each functional unit:
         Busy—indicates whether the unit is busy or not
         Op—operation to perform in the unit (e.g., + or –)
         Fi—destination register
         Fj, Fk—source-register numbers
         Qj, Qk—functional units producing source registers Fj, Fk
         Rj, Rk—flags indicating when Fj, Fk are ready

3. Register result status—indicates which functional unit will write each
   register, if one exists. Blank when no pending instructions will write that
   register.

Dynamic Scheduling: Using A Scoreboard
   Detailed Scoreboard Pipeline Control
 Instruction status     Wait until                              Bookkeeping

 Issue                  Not Busy(FU) and not Result(D)          Busy(FU)←yes; Op(FU)←op; Fi(FU)←`D';
                                                                Fj(FU)←`S1'; Fk(FU)←`S2'; Qj←Result(`S1');
                                                                Qk←Result(`S2'); Rj←not Qj; Rk←not Qk;
                                                                Result(`D')←FU

 Read operands          Rj and Rk                               Rj←No; Rk←No

 Execution complete     Functional unit done                    (none)

 Write result           ∀f((Fj(f)≠Fi(FU) or Rj(f)=No)           ∀f(if Qj(f)=FU then Rj(f)←Yes);
                        and (Fk(f)≠Fi(FU) or Rk(f)=No))         ∀f(if Qk(f)=FU then Rk(f)←Yes);
                                                                Result(Fi(FU))←0; Busy(FU)←No
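
A rough C sketch of how the scoreboard's tables and the Issue / Write-result tests above might be modeled in software; this is an illustration, not the CDC 6600 or DLX hardware, and names such as FUStatus, can_issue, and can_write_result are assumptions.

#include <stdbool.h>

#define NUM_FUS   5      /* Integer, Mult1, Mult2, Add, Divide            */
#define NUM_REGS  32     /* F0..F31                                        */

typedef struct {         /* one row of the functional unit status table    */
    bool busy;           /* Busy                                           */
    int  op;             /* Op                                             */
    int  fi, fj, fk;     /* destination and source registers               */
    int  qj, qk;         /* FUs producing Fj, Fk (-1 if none)              */
    bool rj, rk;         /* source operands ready and not yet read         */
} FUStatus;

static FUStatus fu[NUM_FUS];
static int result[NUM_REGS];     /* register result status: FU that will   */
                                 /* write each register, -1 if none        */

/* Issue: the FU must be free and no active instruction may already have   */
/* the same destination register (structural and WAW checks).              */
static bool can_issue(int u, int dest)
{
    return !fu[u].busy && result[dest] == -1;
}

/* Write result: stall while any other FU still needs to read the old      */
/* value of our destination register (the WAR check from the table above). */
static bool can_write_result(int u)
{
    for (int f = 0; f < NUM_FUS; f++) {
        if (f == u || !fu[f].busy)
            continue;
        if ((fu[f].fj == fu[u].fi && fu[f].rj) ||
            (fu[f].fk == fu[u].fi && fu[f].rk))
            return false;
    }
    return true;
}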

Dynamic Scheduling: Using A Scoreboard
               Scoreboard Example
  This is the sample code we'll be working with in the example:

  LD              F6, 34(R2)
  LD              F2, 45(R3)
  MULTD           F0, F2, F4
  SUBD            F8, F6, F2
  DIVD            F10, F0, F6
  ADDD            F6, F8, F2

  What are the hazards in this code?
                                                  Latencies (clock cycles):
                                                  LD              1
                                                  MULT            10
                                                  SUBD            2
                                                  DIVD            40
                                                  ADDD            2

Dynamic Scheduling: Using A Scoreboard
                           Scoreboard Example

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6      34+ R2
LD     F2      45+ R3
MULTD  F0      F2     F4
SUBD F8        F6     F2
DIVD F10 F0           F6
ADDD F6        F8     F2
Functional unit status                    dest    S1       S2   FU for j FU for k Fj?   Fk?
       Time Name           Busy    Op     Fi      Fj       Fk   Qj       Qk       Rj    Rk
               Integer     No
               Mult1       No
               Mult2       No
               Add         No
               Divide      No
Register result status
Clock                      F0      F2     F4      F6       F8   F10     F12       ...   F30
                     FU
Dynamic Scheduling: Using A Scoreboard
               Scoreboard Example Cycle 1
Note: Issue LD #1. The entries show the clock cycle in which each step occurred.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1
LD     F2     45+ R3
MULTD  F0     F2     F4
SUBD   F8     F6     F2
DIVD F10 F0          F6
ADDD F6       F8     F2
Functional unit status                    dest    S1    S2    FU for j FU for k Fj?   Fk?
       Time Name          Busy    Op      Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     Yes     Load    F6            R2                            Yes
              Mult1       No
              Mult2       No
              Add         No
              Divide      No
Register result status
Clock                     F0      F2      F4      F6 F8 F10           F12       ...   F30
    1               FU                            Integer

Dynamic Scheduling: Using A Scoreboard
               Scoreboard Example Cycle 2
Note: LD #2 can't issue since the integer unit is busy. MULTD can't issue because we require in-order issue.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2
LD     F2     45+ R3
MULTD  F0     F2     F4
SUBD F8       F6     F2
DIVD F10 F0          F6
ADDD F6       F8     F2
Functional unit status                  dest    S1    S2    FU for j FU for k Fj?   Fk?
       Time Name          Busy   Op     Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     Yes    Load   F6            R2                            Yes
              Mult1       No
              Mult2       No
              Add         No
              Divide      No
Register result status
Clock                     F0     F2     F4      F6 F8 F10           F12       ...   F30
    2               FU                          Integer

Dynamic Scheduling: Using A Scoreboard
             Scoreboard Example Cycle 3
Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3
LD     F2     45+ R3
MULTD  F0     F2     F4
SUBD F8       F6     F2
DIVD F10 F0          F6
ADDD F6       F8     F2
Functional unit status                  dest    S1    S2    FU for j FU for k Fj?   Fk?
       Time Name          Busy   Op     Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     Yes    Load   F6            R2                            Yes
              Mult1       No
              Mult2       No
              Add         No
              Divide      No
Register result status
Clock                     F0     F2     F4      F6 F8 F10           F12       ...   F30
    3               FU                          Integer

Dynamic Scheduling: Using A Scoreboard
             Scoreboard Example Cycle 4
Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
LD     F2     45+ R3
MULTD  F0     F2     F4
SUBD F8       F6     F2
DIVD F10 F0          F6
ADDD F6       F8     F2
Functional unit status                  dest    S1    S2    FU for j FU for k Fj?   Fk?
       Time Name          Busy   Op     Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     Yes    Load   F6            R2                            Yes
              Mult1       No
              Mult2       No
              Add         No
              Divide      No
Register result status
Clock                     F0     F2     F4      F6 F8 F10           F12       ...   F30
    4               FU                          Integer

Dynamic Scheduling: Using A Scoreboard
             Scoreboard Example Cycle 5
Note: Issue LD #2 since the integer unit is now free.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
LD     F2     45+ R3         5
MULTD  F0     F2     F4
SUBD F8       F6     F2
DIVD F10 F0          F6
ADDD F6       F8     F2
Functional unit status                     dest   S1   S2   FU for j FU for k Fj?   Fk?
       Time Name          Busy   Op        Fi     Fj   Fk   Qj       Qk       Rj    Rk
              Integer     Yes    Load      F2          R3                           Yes
              Mult1       No
              Mult2       No
              Add         No
              Divide      No
Register result status
Clock                     F0     F2        F4     F6 F8 F10         F12       ...   F30
    5               FU           Integer

Dynamic Scheduling: Using A Scoreboard
          Scoreboard Example Cycle 6
Note: Issue MULTD.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
 LD     F6     34+ R2         1     2      3       4
 LD     F2     45+ R3         5     6
 MULTD  F0     F2     F4      6
 SUBD F8       F6     F2
 DIVD F10 F0          F6
 ADDD F6       F8     F2
 Functional unit status                    dest   S1   S2   FU for j FU for k Fj?   Fk?
        Time Name          Busy    Op      Fi     Fj   Fk   Qj       Qk       Rj    Rk
               Integer     Yes     Load    F2          R3                           Yes
               Mult1       Yes     Mult    F0     F2   F4   Integer          No     Yes
               Mult2       No
               Add         No
               Divide      No
 Register result status
 Clock                     F0      F2      F4     F6 F8 F10           F12     ...   F30
     6               FU    Mult1 Integer

Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 7
Note: MULTD can't read its operands (F2) because LD #2 hasn't finished.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
 LD     F2     45+ R3         5     6      7
 MULTD  F0     F2     F4      6
 SUBD F8       F6     F2      7
 DIVD F10 F0          F6
 ADDD F6       F8     F2
 Functional unit status                    dest   S1   S2    FU for j FU for k Fj?     Fk?
        Time Name          Busy   Op       Fi     Fj   Fk    Qj       Qk       Rj      Rk
               Integer     Yes    Load     F2          R3                              Yes
               Mult1       Yes    Mult     F0     F2   F4    Integer             No    Yes
               Mult2       No
               Add         Yes    Sub      F8     F6   F2              Integer   Yes   No
               Divide      No
 Register result status
 Clock                     F0     F2       F4     F6 F8 F10            F12       ...   F30
     7               FU    Mult1 Integer               Add

Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 8a

Note: DIVD issues. MULTD and SUBD are both waiting for F2.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
LD     F2     45+ R3         5     6      7
MULTD  F0     F2     F4      6
SUBD F8       F6     F2      7
DIVD F10 F0          F6      8
ADDD F6       F8     F2
Functional unit status                    dest   S1   S2    FU for j FU for k Fj?     Fk?
       Time Name          Busy   Op       Fi     Fj   Fk    Qj       Qk       Rj      Rk
              Integer     Yes    Load     F2          R3                              Yes
              Mult1       Yes    Mult     F0     F2   F4    Integer             No    Yes
              Mult2       No
              Add         Yes    Sub      F8     F6   F2              Integer   Yes   No
              Divide      Yes    Div      F10    F0   F6    Mult1               No    Yes
Register result status
Clock                     F0     F2       F4     F6 F8 F10            F12       ...   F30
    8               FU    Mult1 Integer               Add Divide
Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 8b

Note: LD #2 writes F2.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
LD     F2     45+ R3         5     6      7       8
MULTD  F0     F2     F4      6
SUBD F8       F6     F2      7
DIVD F10 F0          F6      8
ADDD F6       F8     F2
Functional unit status                   dest    S1   S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy    Op     Fi      Fj   Fk      Qj       Qk       Rj    Rk
              Integer     No
              Mult1       Yes     Mult   F0      F2   F4                       Yes    Yes
              Mult2       No
              Add         Yes     Sub    F8      F6   F2                       Yes    Yes
              Divide      Yes     Div    F10     F0   F6      Mult1            No     Yes
Register result status
Clock                     F0      F2     F4      F6 F8 F10            F12       ...   F30
    8               FU    Mult1                       Add Divide
Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 9
Note: Now MULTD and SUBD can both read F2. How can both instructions do this at the same time?

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2      3       4
LD     F2     45+ R3         5     6      7       8
MULTD  F0     F2     F4      6     9
SUBD   F8     F6     F2      7     9
 DIVD F10 F0          F6      8
 ADDD F6       F8     F2
 Functional unit status                   dest   S1   S2     FU for j FU for k Fj?   Fk?
        Time Name          Busy    Op     Fi     Fj   Fk     Qj       Qk       Rj    Rk
               Integer     No
           10 Mult1        Yes     Mult   F0     F2   F4                      Yes    Yes
               Mult2       No
             2 Add         Yes     Sub    F8     F6   F2                      Yes    Yes
               Divide      Yes     Div    F10    F0   F6     Mult1            No     Yes
 Register result status
 Clock                     F0      F2     F4     F6 F8 F10           F12       ...   F30
     9               FU    Mult1                      Add Divide



Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 11
Note: ADDD can't start because the add unit is busy.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2       3      4
LD     F2     45+ R3         5     6       7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9      11
DIVD F10 F0          F6      8
ADDD F6       F8     F2
Functional unit status                   dest   S1   S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy    Op     Fi     Fj   Fk      Qj       Qk       Rj    Rk
              Integer     No
            8 Mult1       Yes     Mult   F0     F2   F4                       Yes    Yes
              Mult2       No
            0 Add         Yes     Sub    F8     F6   F2                       Yes    Yes
              Divide      Yes     Div    F10    F0   F6      Mult1            No     Yes
Register result status
Clock                     F0      F2     F4     F6 F8 F10            F12       ...   F30
  11                FU    Mult1                      Add Divide

Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 12

Note: SUBD finishes. DIVD is waiting for F0.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2       3       4
LD     F2     45+ R3         5     6       7       8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9      11      12
DIVD F10 F0          F6      8
ADDD F6       F8     F2
Functional unit status                   dest   S1   S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy    Op     Fi     Fj   Fk      Qj       Qk       Rj    Rk
              Integer     No
            7 Mult1       Yes     Mult   F0     F2   F4                       Yes    Yes
              Mult2       No
              Add         No
              Divide      Yes     Div    F10    F0   F6      Mult1            No     Yes
Register result status
Clock                     F0      F2     F4     F6 F8 F10             F12      ...   F30
  12                FU    Mult1                              Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 13

Note: ADDD issues.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13
Functional unit status                  dest    S1 S2       FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     No
            6 Mult1       Yes Mult      F0      F2    F4                     Yes    Yes
              Mult2       No
              Add         Yes Add       F6      F8    F2                     Yes    Yes
              Divide      Yes Div       F10     F0    F6    Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10            F12      ...   F30
  13                FU    Mult1                 Add         Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 14
Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14
Functional unit status                  dest    S1 S2       FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     No
            5 Mult1       Yes Mult      F0      F2    F4                     Yes    Yes
              Mult2       No
            2 Add         Yes Add       F6      F8    F2                     Yes    Yes
              Divide      Yes Div       F10     F0    F6    Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10            F12      ...   F30
  14                FU    Mult1                 Add         Divide

Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 15

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14
Functional unit status                  dest    S1 S2        FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk     Qj       Qk       Rj    Rk
              Integer     No
            4 Mult1       Yes Mult      F0      F2    F4                      Yes    Yes
              Mult2       No
            1 Add         Yes Add       F6      F8    F2                      Yes    Yes
              Divide      Yes Div       F10     F0    F6     Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10             F12      ...   F30
  15                FU    Mult1                 Add          Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 16

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2       FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     No
            3 Mult1       Yes Mult      F0      F2    F4                     Yes    Yes
              Mult2       No
            0 Add         Yes Add       F6      F8    F2                     Yes    Yes
              Divide      Yes Div       F10     F0    F6    Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10            F12      ...   F30
  16                FU    Mult1                 Add         Divide
Dynamic Scheduling: Using A Scoreboard
            Scoreboard Example Cycle 17

Note: ADDD can't write because of DIVD. This is a WAR hazard!

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2        FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk     Qj       Qk       Rj    Rk
              Integer     No
            2 Mult1       Yes Mult      F0      F2    F4                      Yes    Yes
              Mult2       No
              Add         Yes Add       F6      F8    F2                      Yes    Yes
              Divide      Yes Div       F10     F0    F6     Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10             F12      ...   F30
  17                FU    Mult1                 Add          Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 18
Note: Nothing happens!!

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2       FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     No
            1 Mult1       Yes Mult      F0      F2    F4                     Yes    Yes
              Mult2       No
              Add         Yes Add       F6      F8    F2                     Yes    Yes
              Divide      Yes Div       F10     F0    F6    Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10             F12     ...   F30
  18                FU    Mult1                 Add         Divide

Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 19

Note: MULTD completes execution.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9       19
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2       FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk    Qj       Qk       Rj    Rk
              Integer     No
            0 Mult1       Yes Mult      F0      F2    F4                     Yes    Yes
              Mult2       No
              Add         Yes Add       F6      F8    F2                     Yes    Yes
              Divide      Yes Div       F10     F0    F6    Mult1            No     Yes
Register result status
Clock                     F0      F2    F4      F6 F8 F10            F12      ...   F30
  19                FU    Mult1                 Add         Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 20

Note: MULTD writes its result.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9       19     20
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk   Qj       Qk       Rj    Rk
              Integer     No
              Mult1       No
              Mult2       No
              Add         Yes Add       F6      F8    F2                    Yes    Yes
              Divide      Yes Div       F10     F0    F6                    Yes    Yes
Register result status
Clock                     F0    F2      F4      F6 F8 F10           F12      ...   F30
  20                FU                          Add        Divide
Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 21
Note: DIVD loads its operands.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9       19     20
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8    21
ADDD F6       F8     F2     13    14       16
Functional unit status                  dest    S1 S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk   Qj       Qk       Rj    Rk
              Integer     No
              Mult1       No
              Mult2       No
              Add         Yes Add       F6      F8    F2                    Yes    Yes
              Divide      Yes Div       F10     F0    F6                    Yes    Yes
Register result status
Clock                     F0    F2      F4      F6 F8 F10           F12      ...   F30
  21                FU                          Add        Divide

Dynamic Scheduling: Using A Scoreboard
           Scoreboard Example Cycle 22

Note: Now ADDD can write since the WAR hazard is removed.

Instruction status
Instruction     j     k     Issue   Read operands   Exec. complete   Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9       19     20
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8    21
ADDD F6       F8     F2     13    14       16     22
Functional unit status                  dest    S1 S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk   Qj       Qk       Rj    Rk
              Integer     No
              Mult1       No
              Mult2       No
              Add         No
          40 Divide       Yes Div       F10     F0    F6                    Yes    Yes
Register result status
Clock                     F0    F2      F4      F6 F8 F10           F12      ...   F30
  22                FU                                     Divide
                                 Chap. 4 - Pipelining II                             57
                                                      Using A Scoreboard
Dynamic Scheduling
           Scoreboard Example Cycle 61
Instruction status                                     DIVD completes execution
Instruction     j     k   Issue  Read operands  Exec complete  Write result
LD     F6     34+ R2         1     2        3      4
LD     F2     45+ R3         5     6        7      8
MULTD  F0     F2     F4      6     9       19     20
SUBD F8       F6     F2      7     9       11     12
DIVD F10 F0          F6      8    21       61
ADDD F6       F8     F2     13    14       16     22
Functional unit status                  dest    S1 S2      FU for j FU for k Fj?   Fk?
       Time Name          Busy Op       Fi      Fj    Fk   Qj       Qk       Rj    Rk
              Integer     No
              Mult1       No
              Mult2       No
              Add         No
            0 Divide      Yes Div       F10     F0    F6                    Yes    Yes
Register result status
Clock                     F0    F2      F4      F6 F8 F10           F12      ...   F30
  61                FU                                     Divide

                                 Chap. 4 - Pipelining II                                 58
                                                       Using A Scoreboard
Dynamic Scheduling
           Scoreboard Example Cycle 62
Instruction status                                                       DONE!!
Instruction     j     k   Issue  Read operands  Exec complete  Write result
 LD     F6     34+ R2         1     2        3      4
 LD     F2     45+ R3         5     6        7      8
 MULTD  F0     F2     F4      6     9       19     20
 SUBD F8       F6     F2      7     9       11     12
 DIVD F10 F0          F6      8    21       61     62
 ADDD F6       F8     F2     13    14       16     22
 Functional unit status                  dest    S1 S2      FU for j FU for k Fj?   Fk?
        Time Name          Busy Op       Fi      Fj    Fk   Qj       Qk       Rj    Rk
               Integer     No
               Mult1       No
               Mult2       No
               Add         No
             0 Divide      No
 Register result status
 Clock                     F0    F2      F4      F6 F8 F10          F12       ...   F30
   62                FU

                                 Chap. 4 - Pipelining II                              59
                                                   Using A Scoreboard
Dynamic Scheduling
            Another Dynamic Algorithm:
               Tomasulo Algorithm
 •   For IBM 360/91 about 3 years after CDC 6600 (1966)
 •   Goal: High Performance without special compilers
 •   Differences between IBM 360 & CDC 6600 ISA
      – IBM has only 2 register specifiers / instruction vs. 3 in CDC 6600
      – IBM has 4 FP registers vs. 8 in CDC 6600
 •   Why study? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II,
     PowerPC 604, …




                              Chap. 4 - Pipelining II                        60
                                              Using A Scoreboard
Dynamic Scheduling
    Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs.
  centralized in scoreboard;
    – FU buffers called "reservation stations"; have pending operands
• Registers in instructions replaced by values or pointers to
  reservation stations(RS); called register renaming ;
   – avoids WAR, WAW hazards
   – More reservation stations than registers, so can do optimizations
     compilers can’t
• Results to FU from RS, not through registers, over Common
  Data Bus that broadcasts results to all FUs
• Load and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing
  FP ops beyond basic block in FP queue

                          Chap. 4 - Pipelining II                        61
 Dynamic Scheduling                   Using A Scoreboard
             Tomasulo Organization
[Block diagram: the FP Op Queue and the FP Registers feed the Load Buffers,
 the Store Buffers, and the FP Add / FP Mul reservation stations; the Common
 Data Bus broadcasts results from the functional units back to the reservation
 stations, the buffers, and the registers.]

                  Chap. 4 - Pipelining II                  62
                                             Using A Scoreboard
Dynamic Scheduling
      Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
     – Store buffers have V field, result to be stored
Qj, Qk—Reservation stations producing source registers (value to be
    written)
     – Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
     – Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy

Register result status—Indicates which functional unit will write each
  register, if one exists. Blank when no pending instructions that will
  write that register.
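
To make the bookkeeping concrete, here is a minimal C sketch of a reservation
station and the register result status just described. The field names mirror
the slide (Op, Vj/Vk, Qj/Qk, Busy); the table sizes are illustrative
assumptions, not the 360/91's actual parameters.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_RS    8      /* illustrative number of reservation stations */
    #define NUM_REGS 32      /* illustrative number of FP registers         */

    typedef enum { OP_NONE, OP_ADD, OP_SUB, OP_MUL, OP_DIV } Op;

    /* One reservation station: holds the operation and either the operand
     * values (Vj/Vk) or the tags of the stations that will produce them
     * (Qj/Qk, where 0 means "value already available in Vj/Vk"). */
    typedef struct {
        bool   busy;
        Op     op;
        double Vj, Vk;    /* operand values, valid when the matching Q is 0 */
        int    Qj, Qk;    /* producing RS tag + 1, or 0 if operand is ready */
    } ReservationStation;

    /* Register result status: which RS (tag + 1) will write each register;
     * 0 means no pending instruction will write it. */
    typedef struct {
        double value[NUM_REGS];
        int    Qi[NUM_REGS];
    } RegisterFile;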



                         Chap. 4 - Pipelining II                    63
                                               Using A Scoreboard
Dynamic Scheduling

   Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
       If reservation station free (no structural hazard),
       control issues instruction & sends operands (renames registers).
2. Execution—operate on operands (EX)
       When both operands ready then execute;
        if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
       Write on Common Data Bus to all awaiting units;
       mark reservation station available
• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
    – 64 bits of data + 4 bits of Functional Unit source address
    – Write if matches expected Functional Unit (produces result)
    – Does the broadcast
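
The following sketch, reusing the hypothetical structures from the sketch on
the previous slide, walks the three stages just listed for a single
functional-unit class: issue renames the source registers into V/Q fields,
execution may start when both Q tags are clear, and write-result broadcasts a
(tag, value) pair CDB-style to every waiting unit. It is a simplified model,
not the full 360/91 algorithm.

    /* Issue: rename sources; grab a value if ready, else record the tag. */
    bool issue(ReservationStation rs[], RegisterFile *rf,
               Op op, int rd, int rs1, int rs2)
    {
        for (int r = 0; r < NUM_RS; r++) {
            if (rs[r].busy) continue;              /* structural hazard check */
            rs[r].busy = true;  rs[r].op = op;
            if (rf->Qi[rs1]) { rs[r].Qj = rf->Qi[rs1]; }
            else             { rs[r].Vj = rf->value[rs1]; rs[r].Qj = 0; }
            if (rf->Qi[rs2]) { rs[r].Qk = rf->Qi[rs2]; }
            else             { rs[r].Vk = rf->value[rs2]; rs[r].Qk = 0; }
            rf->Qi[rd] = r + 1;                    /* register renaming */
            return true;
        }
        return false;                              /* no free RS: stall issue */
    }

    /* Execute: an RS may start once both operands are ready (Qj = Qk = 0). */
    bool ready_to_execute(const ReservationStation *r)
    {
        return r->busy && r->Qj == 0 && r->Qk == 0;
    }

    /* Write result: broadcast (tag, value) on the CDB to all waiting units. */
    void write_result(ReservationStation rs[], RegisterFile *rf,
                      int tag, double value)
    {
        for (int r = 0; r < NUM_RS; r++) {
            if (rs[r].Qj == tag) { rs[r].Vj = value; rs[r].Qj = 0; }
            if (rs[r].Qk == tag) { rs[r].Vk = value; rs[r].Qk = 0; }
        }
        for (int i = 0; i < NUM_REGS; i++)
            if (rf->Qi[i] == tag) { rf->value[i] = value; rf->Qi[i] = 0; }
        rs[tag - 1].busy = false;                  /* station becomes free */
    }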

                           Chap. 4 - Pipelining II                        64
                                                                       Using A Scoreboard
    Dynamic Scheduling
                          Tomasulo Example Cycle 0
Instruction status                    Execution   Write
Instruction       j       k   Issue   complete    Result                             Busy   Address
LD     F6      34+       R2                                               Load1      No
LD     F2      45+       R3                                               Load2      No
MULTD  F0      F2        F4                                               Load3      No
SUBD F8        F6        F2
DIVD F10 F0              F6
ADDD F6        F8        F2
Reservation Stations                  S1          S2        RS for j      RS for k
       Time Name         Busy Op      Vj          Vk        Qj            Qk
             0 Add1      No
             0 Add2      No
             0 Add3      No
             0 Mult1     No
             0 Mult2     No
Register result status
Clock                         F0      F2          F4        F6            F8         F10    F12 ...   F30
    0                    FU




                                              Chap. 4 - Pipelining II                                 65
                                               Using A Scoreboard
Dynamic Scheduling
                    Review: Tomasulo

•   Prevents Register as bottleneck
•   Avoids WAR, WAW hazards of Scoreboard
•   Allows loop unrolling in HW
•   Not limited to basic blocks (provided branch prediction)
•   Lasting Contributions
     – Dynamic scheduling
     – Register renaming
     – Load/store disambiguation
•   360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA
    8000; Intel Pentium Pro




                           Chap. 4 - Pipelining II                  66
     Dynamic Hardware
        Prediction
4.1 Instruction Level Parallelism:
    Concepts and Challenges                  Dynamic Branch Prediction is the ability
                                                of the hardware to make an educated
4.2 Overcoming Data Hazards
    with Dynamic Scheduling
                                                guess about which way a branch will
                                                go - will the branch be taken or not.
4.3 Reducing Branch Penalties
    with Dynamic Hardware
    Prediction                               The hardware can look for clues based
                                                on the instructions, or it can use past
4.4 Taking Advantage of More
    ILP with Multiple Issue                     history - we will discuss both of
                                                these directions.
4.5 Compiler Support for
    Exploiting ILP
4.6 Hardware Support for
    Extracting more Parallelism
4.7 Studies of ILP




                                     Chap. 4 - Pipelining II                    67
Dynamic Hardware                              Basic Branch Prediction:
                                              Branch Prediction Buffers
   Prediction
           Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC address index a table of 1-bit values
    – Says whether or not branch taken last time
• Problem: in a loop, 1-bit BHT will cause two mis-predictions:
    – End of loop case, when it exits instead of looping as before
    – First time through the loop on the next execution of the code, when it predicts exit
      instead of looping

[Diagram: the low-order bits of the branch address (PC bits 13 - 2) index a
 table of 1-bit prediction entries (entries 0 - 1023 shown).]
                              Chap. 4 - Pipelining II                            68
Dynamic Hardware                        Basic Branch Prediction:
                                        Branch Prediction Buffers
   Prediction
         Dynamic Branch Prediction

• Solution: 2-bit scheme where change prediction only if get
  misprediction twice: (Figure 4.13, p. 264)


  [State diagram: four states -- two "Predict Taken" and two "Predict Not
   Taken".  A taken branch (T) moves the state toward "Predict Taken", a
   not-taken branch (NT) moves it toward "Predict Not Taken", and the
   prediction changes only after two consecutive mispredictions.]
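
A 2-bit predictor of this kind is commonly implemented as a saturating
counter per BHT entry. The sketch below is a minimal C illustration under
that assumption; the table size and the indexing are arbitrary choices for
the example, not a description of any particular machine.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                 /* illustrative table size */

    static uint8_t bht[BHT_ENTRIES];         /* 2-bit counters: 0..3    */

    /* Predict taken when the counter is in one of the two "taken" states. */
    bool predict_taken(uint32_t pc)
    {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* Update: move toward 3 on a taken branch, toward 0 on a not-taken one,
     * so a single misprediction does not flip a well-established prediction. */
    void update(uint32_t pc, bool taken)
    {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }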

                        Chap. 4 - Pipelining II                   69
Dynamic Hardware                        Basic Branch Prediction:
                                        Branch Prediction Buffers
   Prediction

                     BHT Accuracy

• Mispredict because either:
    – Wrong guess for that branch
    – Got the branch history of the wrong branch when indexing the table
• With a 4096-entry table, misprediction rates vary from 1% (nasa7,
  tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• A 4096-entry table is about as good as an infinite table, but it is already a lot of HW




                        Chap. 4 - Pipelining II                   70
 Dynamic Hardware                             Basic Branch Prediction:
                                              Branch Prediction Buffers
    Prediction

                   Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related
to the behavior of the next branch (as well as to that branch's own history).
 – The behavior of recent branches then selects between, say, four
   predictions for the next branch, and only the selected prediction is
   updated (a sketch of such a predictor follows below).

[Diagram: the branch address indexes a set of 2-bit per-branch predictors;
 a 2-bit global branch history selects which predictor supplies the
 prediction.]
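
As a hedged illustration, the sketch below combines a global history register
with per-branch 2-bit counters in the spirit of an (m, 2) correlating
predictor. The table sizes and the way the history and address bits are
combined are assumptions made for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_BITS 2                      /* m = 2 bits of global history  */
    #define ROWS      1024                   /* branches tracked (assumption) */

    static uint8_t  counters[ROWS][1 << HIST_BITS];  /* 2-bit counters        */
    static uint32_t global_hist;                     /* last m branch outcomes */

    bool correlating_predict(uint32_t pc)
    {
        uint32_t row = (pc >> 2) % ROWS;
        return counters[row][global_hist] >= 2;
    }

    void correlating_update(uint32_t pc, bool taken)
    {
        uint8_t *c = &counters[(pc >> 2) % ROWS][global_hist];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
        /* shift the outcome into the global history register */
        global_hist = ((global_hist << 1) | (taken ? 1u : 0u))
                      & ((1u << HIST_BITS) - 1);
    }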
                              Chap. 4 - Pipelining II                       71
                  Dynamic Hardware                                                                                                                Basic Branch Prediction:
                                                                                                                                                  Branch Prediction Buffers
                     Prediction
                                                    Accuracy of Different Schemes
                                                      (Figure 4.21, p. 272)

[Bar chart: frequency of mispredictions for nasa7, matrix300, tomcatv, doducd,
 spice, fpppp, gcc, espresso, eqntott, and li under three schemes -- 4,096
 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and
 1,024 entries of a (2,2) correlating predictor.  The FP codes mispredict
 around 0 - 1%; the integer codes are worse (roughly 18% and 11% for the worst
 cases with the simple 2-bit schemes), and the (2,2) correlating predictor
 cuts the worst cases to roughly 4 - 6%.]

                                 Chap. 4 - Pipelining II                     72
    Dynamic Hardware                                  Basic Branch Prediction:
                                                       Branch Target Buffers
       Prediction
                       Branch Target Buffer
•   Branch Target Buffer (BTB): Use address of branch as index to get prediction AND
    branch address (if taken)
     – Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p.
       273)


[Diagram: on a BTB hit, the buffer supplies the predicted target PC and the
 branch prediction (taken or not taken) for the fetched instruction.]

•   Return-instruction addresses are predicted with a return-address stack
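
A branch-target buffer can be sketched as a small tagged table. The C
fragment below is a minimal direct-mapped illustration of the lookup just
described; the field names and size are assumptions, and the tag match is
the "must check for branch match" step, since a wrong entry would supply the
wrong target.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512                    /* illustrative size */

    typedef struct {
        bool     valid;
        uint32_t branch_pc;                    /* tag: address of the branch   */
        uint32_t predicted_target;             /* target PC if predicted taken */
        bool     predict_taken;
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Returns true on a hit and fills *next_pc with the predicted fetch PC. */
    bool btb_lookup(uint32_t pc, uint32_t *next_pc)
    {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (!e->valid || e->branch_pc != pc)   /* must match: can't use a     */
            return false;                      /* different branch's target   */
        *next_pc = e->predict_taken ? e->predicted_target : pc + 4;
        return true;
    }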
                                   Chap. 4 - Pipelining II                                 73
 Dynamic Hardware                             Basic Branch Prediction:
                                               Branch Target Buffers
    Prediction
 Example
      Instruction in buffer?    Prediction    Actual branch    Penalty cycles
      Yes                       Taken         Taken            0
      Yes                       Taken         Not taken        2
      No                        --            Taken            2

Example on page 274.
Determine the total branch penalty for a BTB using the above
  penalties. Assume also the following:
• Prediction accuracy of 90% (for branches found in the buffer)
• Hit rate in the buffer of 90%
• 60% taken branch frequency.
Branch Penalty = (buffer hit rate X percent incorrect predictions X 2)
   + ((1 - buffer hit rate) X taken branch frequency X 2)
Branch Penalty = (90% X 10% X 2) + (10% X 60% X 2)
Branch Penalty = 0.18 + 0.12 = 0.30 clock cycles
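
The same arithmetic can be packaged as a small helper; this is just the
formula above in C, with the example's numbers used as the usage
illustration.

    #include <stdio.h>

    /* Expected branch penalty in clock cycles, given the BTB hit rate, the
     * prediction accuracy for branches found in the buffer, the taken-branch
     * frequency, and a 2-cycle penalty for each kind of miss (as in the table). */
    double branch_penalty(double hit_rate, double accuracy, double taken_freq)
    {
        double mispredicted_hit = hit_rate * (1.0 - accuracy) * 2.0;
        double missed_taken     = (1.0 - hit_rate) * taken_freq * 2.0;
        return mispredicted_hit + missed_taken;
    }

    int main(void)
    {
        /* 90% hit rate, 90% accuracy, 60% taken branches => 0.30 cycles/branch */
        printf("%.2f\n", branch_penalty(0.90, 0.90, 0.60));
        return 0;
    }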

                             Chap. 4 - Pipelining II                           74
         Multiple Issue
4.1 Instruction Level Parallelism:          Multiple Issue is the ability of the
    Concepts and Challenges                   processor to start more than one
4.2 Overcoming Data Hazards                   instruction in a given cycle.
    with Dynamic Scheduling
4.3 Reducing Branch Penalties               Flavor I:
    with Dynamic Hardware                   Superscalar processors issue varying
    Prediction
                                              number of instructions per clock - can
4.4 Taking Advantage of More                  be either statically scheduled (by the
    ILP with Multiple Issue
                                              compiler) or dynamically scheduled
4.5 Compiler Support for                      (by the hardware).
    Exploiting ILP
4.6 Hardware Support for                    Superscalar has a varying number of
    Extracting more Parallelism
                                              instructions/cycle (1 to 8), scheduled
4.7 Studies of ILP                            by compiler or by HW (Tomasulo).

                                            IBM PowerPC, Sun UltraSparc, DEC
                                              Alpha, HP 8000

                                     Chap. 4 - Pipelining II                    75
Multiple Issue

   Issuing Multiple Instructions/Cycle
Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of
   instructions formatted either as one very large instruction or as a
   fixed packet of smaller instructions.

fixed number of instructions (4-16) scheduled by the compiler; put
    operators into wide templates
     – Joint HP/Intel agreement in 1999/2000
     – Intel Architecture-64 (IA-64) 64-bit address
     – Style: "Explicitly Parallel Instruction Computer (EPIC)"




                     Chap. 4 - Pipelining II                    76
Multiple Issue

    Issuing Multiple Instructions/Cycle
Flavor II - continued:
•   3 Instructions in 128-bit "groups"; field determines if instructions
    dependent or independent
      – Smaller code size than old VLIW, larger than x86/RISC
      – Groups can be linked to show independence > 3 instr
•   64 integer registers + 64 floating point registers
      – Not separate files per functional unit as in old VLIW
•   Hardware checks dependencies
    (interlocks => binary compatibility over time)
•   Predicated execution (select 1 out of 64 1-bit flags)
    => 40% fewer mis-predictions?
•   IA-64 : name of instruction set architecture; EPIC is type
•   Merced is name of first implementation (1999/2000?)

                      Chap. 4 - Pipelining II                     77
                                                 A SuperScalar Version of DLX
         Multiple Issue
        Issuing Multiple Instructions/Cycle
                                                                In our DLX example,
– Fetch 64-bits/clock cycle; Int on left, FP on right
                                                                  we can handle 2
– Can only issue 2nd instruction if 1st instruction issues
                                                                  instructions/cycle:
– More ports for FP registers to do FP load & FP op in a pair
                                                                • Floating Point
                                                                • Anything Else
  Type                 Pipe Stages
  Int. instruction      IF      ID    EX MEM WB
  FP instruction        IF      ID    EX MEM WB
  Int. instruction              IF    ID      EX MEM WB
  FP instruction                IF    ID      EX MEM WB
  Int. instruction                    IF      ID    EX MEM WB
  FP instruction                      IF      ID    EX MEM WB
• 1 cycle load delay causes delay to 3 instructions in Superscalar
   – instruction in right half can't use it, nor instructions in next slot


                                 Chap. 4 - Pipelining II                        78
                                      A SuperScalar Version of DLX
   Multiple Issue
Unrolled Loop Minimizes Stalls for Scalar
     1 Loop:   LD     F0,0(R1)                  Latencies:
     2         LD     F6,-8(R1)                 LD to ADDD: 1 Cycle
     3         LD     F10,-16(R1)               ADDD to SD: 2 Cycles
     4         LD     F14,-24(R1)
     5         ADDD   F4,F0,F2
     6         ADDD   F8,F6,F2
     7         ADDD   F12,F10,F2
     8         ADDD   F16,F14,F2
     9         SD     0(R1),F4
     10        SD     -8(R1),F8
     11        SD     -16(R1),F12
     12        SUBI   R1,R1,#32
     13        BNEZ   R1,LOOP
     14        SD     8(R1),F16         ; 8-32 = -24

     14 clock cycles, or 3.5 per iteration
                      Chap. 4 - Pipelining II                   79
                                      A SuperScalar Version of DLX
   Multiple Issue
      Loop Unrolling in Superscalar
        Integer instruction      FP instruction     Clock cycle
Loop: LD F0,0(R1)                                             1
        LD F6,-8(R1)                                          2
        LD F10,-16(R1)           ADDD F4,F0,F2                3
        LD F14,-24(R1)           ADDD F8,F6,F2                4
        LD F18,-32(R1)           ADDD F12,F10,F2              5
        SD 0(R1),F4              ADDD F16,F14,F2              6
        SD -8(R1),F8             ADDD F20,F18,F2              7
        SD -16(R1),F12                                        8
        SD -24(R1),F16                                        9
        SUBI R1,R1,#40                                       10
        BNEZ R1,LOOP                                         11
        SD     8(R1),F20                                     12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration
                        Chap. 4 - Pipelining II                   80
                                             Multiple Instruction Issue &
      Multiple Issue                            Dynamic Scheduling

     Dynamic Scheduling in Superscalar

Code compiled for the scalar version will run poorly on a superscalar;
   the code may need to vary depending on how wide the superscalar issue is.

Simple approach: separate Tomasulo Control for separate reservation
  stations for Integer FU/Reg and for FP FU/Reg




                           Chap. 4 - Pipelining II                   81
                                           Multiple Instruction Issue &
     Multiple Issue                           Dynamic Scheduling

     Dynamic Scheduling in Superscalar
• How to do instruction issue with two instructions and keep in-order
  instruction issue for Tomasulo?
    – Issue 2X Clock Rate, so that issue remains in order
    – Only FP loads might cause dependency between integer and FP
      issue:
        • Replace load reservation station with a load queue;
          operands must be read in the order they are fetched
        • Load checks addresses in Store Queue to avoid RAW violation
        • Store checks addresses in Load Queue to avoid WAR, WAW
          (see the address-check sketch below)
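
As a minimal sketch of the address checks described above (the queue layout
and sizes are assumptions), a load scans the store queue for an earlier store
to the same address before it may proceed, and a store performs the symmetric
check against the load queue.

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_LEN 16                     /* illustrative queue size */

    typedef struct {
        int      count;
        uint32_t addr[QUEUE_LEN];            /* effective addresses, oldest first */
    } AddrQueue;

    /* RAW check: a load may not bypass an earlier store to the same address. */
    bool load_must_wait(const AddrQueue *store_queue, uint32_t load_addr)
    {
        for (int i = 0; i < store_queue->count; i++)
            if (store_queue->addr[i] == load_addr)
                return true;                 /* conflict: stall (or forward) */
        return false;
    }

    /* WAR/WAW check: a store may not bypass an earlier load or store to the
     * same address; the same scan works against the load queue. */
    bool store_must_wait(const AddrQueue *load_queue, uint32_t store_addr)
    {
        for (int i = 0; i < load_queue->count; i++)
            if (load_queue->addr[i] == store_addr)
                return true;
        return false;
    }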




                         Chap. 4 - Pipelining II                   82
                                     Multiple Instruction Issue &
 Multiple Issue                         Dynamic Scheduling

Performance of Dynamic Superscalar
Iteration   Instruction              Issues   Executes   Writes result
no.                                        (clock-cycle number)
1           LD F0,0(R1)             1            2        4
1           ADDD F4,F0,F2           1            5        8
1           SD 0(R1),F4             2            9
1           SUBI R1,R1,#8           3            4        5
1           BNEZ R1,LOOP            4            5
2           LD F0,0(R1)             5            6        8
2           ADDD F4,F0,F2           5            9       12
2           SD 0(R1),F4             6           13
2           SUBI R1,R1,#8           7            8        9
2           BNEZ R1,LOOP            8            9
- 4 clocks per iteration
Branches, Decrements still take 1 clock cycle
                         Chap. 4 - Pipelining II               83
                                                                 VLIW
       Multiple Issue
                   Loop Unrolling in VLIW
Memory         Memory         FP                       FP          Int. op/    Clock
reference 1    reference 2    operation 1              op. 2       branch
LD F0,0(R1)    LD F6,-8(R1)                                                         1
LD F10,-16(R1) LD F14,-24(R1)                                                       2
LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2            ADDD F8,F6,F2                3
LD F26,-48(R1)                ADDD F12,F10,F2          ADDD F16,F14,F2              4
                              ADDD F20,F18,F2          ADDD F24,F22,F2              5
SD 0(R1),F4    SD -8(R1),F8 ADDD F28,F26,F2                                         6
SD -16(R1),F12 SD -24(R1),F16                                                       7
SD -32(R1),F20 SD -40(R1),F24                                      SUBI R1,R1,#48   8
SD -0(R1),F28                                                      BNEZ R1,LOOP     9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers to effectively use VLIW


                                Chap. 4 - Pipelining II                        84
                                          Limitations With Multiple Issue
      Multiple Issue
          Limits to Multi-Issue Machines
• Inherent limitations of ILP
    – 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
    – Latencies of units => many operations must be scheduled
    – Need about Pipeline Depth x No. Functional Units of independent
      operations to keep machines busy.

• Difficulties in building HW
   – Duplicate Functional Units to get parallel execution
   – Increase ports to Register File (VLIW example needs 6 read and 3
      write ports for the Int. Reg. file & 6 read and 4 write for the FP Reg. file)
   – Increase ports to memory
   – Decoding SS and impact on clock rate, pipeline depth


                          Chap. 4 - Pipelining II                  85
                                         Limitations With Multiple Issue
    Multiple Issue

        Limits to Multi-Issue Machines

• Limitations specific to either SS or VLIW implementation
   – Decode issue in SS
   – VLIW code size: unroll loops + wasted fields in VLIW
   – VLIW lock step => 1 hazard & all instructions stall
   – VLIW & binary compatibility




                         Chap. 4 - Pipelining II                  86
                                                        Limitations With Multiple Issue
       Multiple Issue
                  Multiple Issue Challenges
•   While Integer/FP split is simple for the HW, get CPI of 0.5 only for
    programs with:
     – Exactly 50% FP operations
     – No hazards
•   If more instructions issue at same time, greater difficulty of decode and
    issue
     – Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2
       instructions can issue
•   VLIW: tradeoff instruction space for simple decoding
     – The long instruction word has room for many operations
     – By definition, all the operations the compiler puts in the long instruction word are
       independent => execute in parallel
     – E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
          • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
     – Need compiling technique that schedules across several branches


                                    Chap. 4 - Pipelining II                           87
 Compiler Support For ILP
4.1 Instruction Level Parallelism:
    Concepts and Challenges                  How can compilers be smart?
                                             1. Produce good scheduling of code.
4.2 Overcoming Data Hazards
    with Dynamic Scheduling
                                             2. Determine which loops might contain
                                                 parallelism.
4.3 Reducing Branch Penalties                3. Eliminate name dependencies.
    with Dynamic Hardware
    Prediction
                                             Compilers must be REALLY smart to
4.4 Taking Advantage of More
    ILP with Multiple Issue                    figure out aliases -- pointers in C are
                                               a real problem.
4.5 Compiler Support for
    Exploiting ILP
                                             Techniques lead to:
4.6 Hardware Support for
    Extracting more Parallelism                  Symbolic Loop Unrolling
                                                 Critical Path Scheduling
4.7 Studies of ILP




                                     Chap. 4 - Pipelining II                    88
Compiler Support For ILP                          Symbolic Loop Unrolling



                    Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP
  by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made
  from instructions chosen from different iterations of the original loop
  (Tomasulo in SW)

[Diagram: iterations 0 - 4 of the original loop overlap in time; one
 software-pipelined iteration takes an instruction from each of several
 different original iterations.]

                          Chap. 4 - Pipelining II                                     89
Compiler Support For ILP                       Symbolic Loop Unrolling

              SW Pipelining Example
 Before: Unrolled 3 times      After: Software Pipelined
 1 LD       F0,0(R1)                  LD      F0,0(R1)
 2 ADDD F4,F0,F2                      ADDD    F4,F0,F2
 3 SD       0(R1),F4                  LD      F0,-8(R1)
 4 LD       F6,-8(R1)           1     SD      0(R1),F4;   Stores M[i]
 5 ADDD F8,F6,F2                2     ADDD    F4,F0,F2;   Adds to M[i-1]
 6 SD       -8(R1),F8           3     LD      F0,-16(R1); loads M[i-2]
 7 LD       F10,-16(R1)         4     SUBI    R1,R1,#8
 8 ADDD F12,F10,F2              5     BNEZ    R1,LOOP
 9 SD       -16(R1),F12               SD      0(R1),F4
 10 SUBI R1,R1,#24                    ADDD    F4,F0,F2
 11 BNEZ R1,LOOP                      SD      -8(R1),F4

         SD    IF   ID   EX   Mem   WB
         ADDD       IF   ID   EX    Mem   WB
         LD              IF   ID    EX    Mem   WB
   (SD reads F4 before the following ADDD rewrites it; ADDD reads F0 before
    the following LD rewrites it, so no hazards arise.)
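
The same transformation can be shown at the source level. The C fragment
below is a hedged illustration (it assumes n >= 3 and elides the general
prologue/epilogue handling): each trip through the software-pipelined loop
stores element i, adds for element i-1, and loads element i-2, just as the
DLX code above does.

    #include <stddef.h>

    /* Original loop: x[i] = x[i] + s for i = n-1 .. 0. */
    void add_scalar(double *x, double s, size_t n)
    {
        for (size_t i = n; i-- > 0; )
            x[i] = x[i] + s;
    }

    /* Software-pipelined version (assumes n >= 3). */
    void add_scalar_swp(double *x, double s, size_t n)
    {
        size_t i = n - 1;
        double loaded = x[i];            /* prologue: load for iteration n-1 */
        double summed = loaded + s;      /* prologue: add  for iteration n-1 */
        loaded = x[i - 1];               /* prologue: load for iteration n-2 */

        for (; i >= 2; i--) {
            x[i]   = summed;             /* store for iteration i            */
            summed = loaded + s;         /* add   for iteration i-1          */
            loaded = x[i - 2];           /* load  for iteration i-2          */
        }
        x[1] = summed;                   /* epilogue: finish iterations 1, 0 */
        x[0] = loaded + s;
    }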
                           Chap. 4 - Pipelining II                        90
Compiler Support For ILP                Symbolic Loop Unrolling

           SW Pipelining Example
         Symbolic Loop Unrolling
            – Less code space
            – Overhead paid only once
              vs. each iteration in loop unrolling


[Diagram contrasting Software Pipelining with Loop Unrolling: with software
 pipelining the start-up and wind-down overhead is paid once for the whole
 loop, while with loop unrolling the loop overhead recurs for every unrolled
 copy -- e.g., 100 iterations = 25 loops with 4 unrolled iterations each.]
                      Chap. 4 - Pipelining II                      91
Compiler Support For ILP                      Critical Path Scheduling


                     Trace Scheduling
•   Parallelism across IF branches vs. LOOP branches
•   Two steps:
     – Trace Selection
         • Find likely sequence of basic blocks (trace)
            of (statically predicted or profile predicted)
            long sequence of straight-line code
     – Trace Compaction
         • Squeeze trace into few VLIW instructions
         • Need bookkeeping code in case prediction is wrong
•   Compiler undoes bad guess
    (discards values in registers)
•   Subtle compiler bugs can mean a wrong answer rather than just poorer
    performance, since there are no hardware interlocks


                          Chap. 4 - Pipelining II                   92
    Hardware Support For
        Parallelism
4.1 Instruction Level Parallelism:
    Concepts and Challenges                  Software support of ILP is best when
                                                code is predictable at compile time.
4.2 Overcoming Data Hazards
    with Dynamic Scheduling
                                                    But what if there's no predictability?
4.3 Reducing Branch Penalties                Here we'll talk about hardware
    with Dynamic Hardware
    Prediction                                  techniques. These include:
4.4 Taking Advantage of More
    ILP with Multiple Issue                  •   Conditional     or      Predicated
                                                 Instructions
4.5 Compiler Support for
    Exploiting ILP
                                             •   Hardware Speculation
4.6 Hardware Support for
    Extracting more Parallelism
4.7 Studies of ILP




                                     Chap. 4 - Pipelining II                  93
Hardware Support For                         Nullified Instructions
    Parallelism
Tell the Hardware To Ignore An Instruction
• Avoid branch prediction by turning branches into
  conditionally executed instructions:
  IF (x) then A = B op C else NOP
    – If false, then neither store result nor cause exception
    – Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a
      conditional move.  PA-RISC can annul any following instruction.
    – IA-64: 64 1-bit condition fields allow conditional execution of
      any instruction.
• Drawbacks to conditional instructions:
    – Still takes a clock, even if "annulled"
    – Stalls if the condition is evaluated late
    – Complex conditions reduce effectiveness; the condition becomes
      known late in the pipeline.
This can be a major win because there is no time lost by
  taking a branch!!
                             Chap. 4 - Pipelining II                  94
Hardware Support For                          Nullified Instructions
    Parallelism
Tell the Hardware To Ignore An Instruction
Suppose we have the code:
   if ( VarA == 0 )
         VarS = VarT;

Previous method (branch):
    LD       R1, VarA
    BNEZ     R1, Label
    LD       R2, VarT
    SD       VarS, R2
Label:

Nullified method (compare and nullify the next instruction if not zero):
    LD       R1, VarA
    LD       R2, VarT
    CMPNNZ   R1, #0
    SD       VarS, R2
Label:

Conditional-move method (compare and move if zero):
    LD       R1, VarA
    LD       R2, VarT
    CMOVZ    VarS, R2, R1
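
At the source level the same if-conversion looks like the sketch below.
Whether a compiler actually emits a conditional-move instruction (and so
avoids the branch) depends on the target and optimization level, so the
comments are stated as expectations, not guarantees.

    /* Branching version: the hardware must predict this branch. */
    void copy_if_zero_branch(int var_a, int var_t, int *var_s)
    {
        if (var_a == 0)
            *var_s = var_t;
    }

    /* If-converted version: compute the value unconditionally and select it
     * with a data dependence; a compiler can often turn the select into a
     * conditional move, so no branch prediction is needed. */
    void copy_if_zero_cmov(int var_a, int var_t, int *var_s)
    {
        int old = *var_s;
        *var_s = (var_a == 0) ? var_t : old;
    }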


                           Chap. 4 - Pipelining II                     95
Hardware Support For                            Compiler Speculation
    Parallelism
                 Increasing Parallelism
The theory here is to move an instruction across a branch so as to
  increase the size of a basic block and thus to increase parallelism.

The primary difficulty is in avoiding exceptions. For example, in
   if ( a != 0 ) c = b/a;   moving the divide above the test may cause a
   divide-by-zero error in some cases.

Methods for increasing speculation include:

1. Use a set of status bits (poison bits) associated with the registers;
   they signal that an instruction's result is invalid until some later
   time.
2. The result of an instruction isn't written until it is certain that the
   instruction is no longer speculative.


                           Chap. 4 - Pipelining II                       96
Hardware Support For                            Compiler Speculation
    Parallelism
                                   Original Code:
    Increasing                         LW    R1, 0(R3)     Load A
    Parallelism                        BNEZ R1, L1         Test A
                                       LW    R1, 0(R2)     If Clause
Example on Page 305.                   J     L2            Skip Else
Code for                           L1: ADDI R1, R1, #4     Else Clause
if ( A == 0 )                      L2: SW    0(R3), R1     Store A
  A = B;
else                               Speculated Code:
  A = A + 4;                           LW   R1, 0(R3)      Load A
Assume A is at 0(R3) and              LW    R14, 0(R2)     Spec Load B
   B is at 0(R2)                      BEQZ  R1, L3         Other if Branch
Note here that only ONE                ADDI R14, R1, #4    Else Clause
 side needs to take a              L3: SW   0(R3), R14     Non-Spec Store
       branch!!

                           Chap. 4 - Pipelining II                   97
Hardware Support For                              Compiler Speculation
    Parallelism

    Poison Bits

In the example on the last page, if the LW* (the speculative load of B)
produces an exception, a poison bit is set on that register (R14).  Then, if
a later instruction tries to use the register, the exception is raised at
that point.

Speculated code:
     LW    R1, 0(R3)       Load A
     LW*   R14, 0(R2)      Spec Load B
     BEQZ  R1, L3          Other if Branch
     ADDI  R14, R1, #4     Else Clause
 L3: SW    0(R3), R14      Non-Spec Store
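
A poison bit can be modeled as one extra flag per register. The sketch below
is a toy C simulation of the mechanism; the register-file layout, the way a
fault is detected, and the exception hook are all assumptions made for the
illustration. It defers the exception from the speculative load to the first
real use of the poisoned value.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_REGS 32

    static long regs[NUM_REGS];
    static bool poison[NUM_REGS];            /* one poison bit per register */

    /* Speculative load: on a fault, poison the destination instead of trapping. */
    void speculative_load(int rd, const long *addr)
    {
        if (addr == NULL) {                  /* stand-in for "load would fault" */
            poison[rd] = true;
            return;
        }
        regs[rd] = *addr;
        poison[rd] = false;
    }

    /* Any non-speculative use of a poisoned register raises the exception now. */
    long use_register(int rs)
    {
        if (poison[rs]) {
            fprintf(stderr, "deferred exception on r%d\n", rs);
            exit(1);
        }
        return regs[rs];
    }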




                             Chap. 4 - Pipelining II                   98
     Hardware Support For                            Hardware Speculation
         Parallelism
                    HW support for More ILP
• Need a HW buffer for results of uncommitted instructions: the reorder buffer
   – The reorder buffer can be an operand source
   – Once an operand commits, the result is found in the register file
   – 3 fields: instruction type, destination, value
   – Use the reorder buffer number instead of the reservation station number
   – Discard instructions on mis-predicted branches or on exceptions

[Diagram (Figure 4.34, page 311): the FP Op Queue issues to the reservation
 stations of the FP adders; results go into the Reorder Buffer, which commits
 them in order to the FP registers.]
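
A reorder-buffer entry with the three fields named above, plus an in-order
commit step, can be sketched as follows. The buffer size, the value type,
and the flush policy are assumptions made for the illustration, not the
organization of any particular machine.

    #include <stdbool.h>

    #define ROB_ENTRIES 32
    #define NUM_FP_REGS 32

    typedef enum { ROB_ALU, ROB_BRANCH, ROB_STORE } InstrType;

    typedef struct {
        bool      busy, done, mispredicted;
        InstrType type;                      /* instruction type         */
        int       dest;                      /* destination register     */
        double    value;                     /* result awaiting commit   */
    } ROBEntry;

    static ROBEntry rob[ROB_ENTRIES];
    static double   regfile[NUM_FP_REGS];
    static int      head;                    /* oldest uncommitted entry */

    /* Commit in order: write the oldest finished result to the register file,
     * or flush all younger work when a mis-predicted branch reaches the head. */
    void commit_one(void)
    {
        ROBEntry *e = &rob[head];
        if (!e->busy || !e->done)
            return;                          /* nothing ready to commit  */
        if (e->type == ROB_BRANCH && e->mispredicted) {
            for (int i = 0; i < ROB_ENTRIES; i++)   /* discard speculation */
                rob[i].busy = false;
            head = 0;
            return;
        }
        if (e->type == ROB_ALU)
            regfile[e->dest] = e->value;
        e->busy = false;
        head = (head + 1) % ROB_ENTRIES;
    }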

                                Chap. 4 - Pipelining II                       99
Hardware Support For                         Hardware Speculation
    Parallelism
            HW support for More ILP
How is this used in practice?

Rather than predicting the direction of a branch, execute the
  instructions on both sides!!

We know the target of a branch early on, long before we know whether it
  will be taken or not.

So begin fetching/executing at that new target PC,
but also continue fetching/executing as if the branch were NOT taken.




                        Chap. 4 - Pipelining II                    100
           Studies of ILP
4.1 Instruction Level Parallelism:           •   Conflicting studies of amount of
    Concepts and Challenges
                                                 improvement available
4.2 Overcoming Data Hazards                       – Benchmarks (vectorized FP
    with Dynamic Scheduling
                                                     Fortran vs. integer C programs)
4.3 Reducing Branch Penalties
    with Dynamic Hardware                         – Hardware sophistication
    Prediction                                    – Compiler sophistication
4.4 Taking Advantage of More                 •   How much ILP is available using
    ILP with Multiple Issue                      existing mechanisms with increasing
4.5 Compiler Support for                         HW budgets?
    Exploiting ILP
                                             •   Do we need to invent new HW/SW
4.6 Hardware Support for                         mechanisms to keep on processor
    Extracting more Parallelism                  performance curve?
4.7 Studies of ILP




                                     Chap. 4 - Pipelining II                  101
      Studies of ILP
                       Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
    1. Register renaming–infinite virtual registers and all WAW & WAR
    hazards are avoided
    2. Branch prediction–perfect; no mispredictions
    3. Jump prediction–all jumps perfectly predicted => machine with
    perfect speculation & an unbounded buffer of instructions available
    4. Memory-address alias analysis–addresses are known & a store can
    be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions
    issued per clock cycle




                        Chap. 4 - Pipelining II                  102
                                         Studies of ILP                        Upper Limit to ILP: Ideal
 This is the amount of parallelism when there are no branch mis-predictions
 and we're limited only by data dependencies.       (Figure 4.38, page 319)

[Bar chart: instructions that could theoretically be issued per cycle on the
 ideal machine -- gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7,
 tomcatv 150.1.  Integer programs: 18 - 60; FP programs: 75 - 150.]

                                 Chap. 4 - Pipelining II                    103
      Studies of ILP
                                         Impact of Realistic Branch
                                                 Prediction


What parallelism do we get when we don't allow perfect branch
  prediction, as in the last picture, but assume some realistic model?
Possibilities include:

1. Perfect - all branches are perfectly predicted (the last slide)

2. Selective History Predictor - a complicated but do-able mechanism for
   selection.

3. Standard 2-bit history predictor with 512 2-bit entries.

4. Static prediction based on past history of the program.

5. None - Parallelism is limited to basic block.

                           Chap. 4 - Pipelining II                   104
Studies of ILP                                    Bonus!!

        Selective History Predictor
[Diagram: a two-component (tournament) predictor.  A non-correlating table of
 8096 x 2-bit entries, indexed by the branch address, produces taken/not-taken.
 A correlating predictor of 2048 x 4 x 2-bit entries is indexed by the branch
 address plus 2 bits of global history (counter values 11/10 = taken,
 01/00 = not taken).  An 8K x 2-bit selector table chooses per branch which
 component to use (11/10 = choose the non-correlator, 01/00 = choose the
 correlator).]
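
The selector in this scheme is itself a table of 2-bit counters. A hedged
sketch of the choice logic follows, reusing predict_taken() and
correlating_predict() from the earlier sketches; the indexing details and
the training rule are assumptions for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define SEL_ENTRIES 8192                 /* 8K x 2-bit selector table */

    static uint8_t selector[SEL_ENTRIES];    /* >= 2: prefer the non-correlator */

    /* Component predictors from the earlier sketches. */
    bool predict_taken(uint32_t pc);         /* simple 2-bit BHT                */
    bool correlating_predict(uint32_t pc);   /* 2-bit global-history correlator */

    bool tournament_predict(uint32_t pc)
    {
        uint32_t i = (pc >> 2) % SEL_ENTRIES;
        return (selector[i] >= 2) ? predict_taken(pc) : correlating_predict(pc);
    }

    /* Train the selector toward whichever component predicted correctly. */
    void tournament_update(uint32_t pc, bool taken)
    {
        uint32_t i  = (pc >> 2) % SEL_ENTRIES;
        bool bht_ok = (predict_taken(pc) == taken);
        bool cor_ok = (correlating_predict(pc) == taken);
        if (bht_ok && !cor_ok && selector[i] < 3) selector[i]++;
        if (cor_ok && !bht_ok && selector[i] > 0) selector[i]--;
    }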
                       Chap. 4 - Pipelining II                     105
                                                                                                                   Impact of Realistic
                                            Studies of ILP
                                                                                                                   Branch Prediction
Limiting the type of branch prediction.               (Figure 4.42, Page 325)

[Bar chart: instruction issues per cycle for gcc, espresso, li, fpppp, doducd,
 and tomcatv under five schemes -- Perfect, Selective history predictor,
 Standard 2-bit (512 entries), Static (profile-based), and No prediction.
 With realistic predictors the integer programs reach roughly 6 - 12 issues
 per cycle and the FP programs roughly 15 - 45; with no prediction the
 integer programs fall to about 2.]

                                 Chap. 4 - Pipelining II                    106
                                           Studies of ILP                                                                More Realistic HW:
                                                                                                                          Register Impact
Effect of limiting the number of renaming registers. (Figure 4.44, Page 328)

[Bar chart: instruction issues per cycle for gcc, espresso, li, fpppp, doducd,
 and tomcatv with an infinite number of renaming registers and with 256, 128,
 64, 32, and no extra registers.  With a realistic number of registers the
 integer programs reach roughly 5 - 15 issues per cycle and the FP programs
 roughly 11 - 45; with no renaming registers all programs drop to roughly
 4 - 5.]

                                 Chap. 4 - Pipelining II                    107
                  Studies of ILP: More Realistic HW - Alias Impact
                            (Figure 4.46, Page 330)

What happens when there may be conflicts with memory aliasing?

[Figure: bar chart of instruction issues per cycle (IPC) for gcc, espresso,
 li, fpppp, doducd and tomcatv under four alias-analysis models: Perfect,
 Global/stack perfect (heap references may conflict), Inspection, and None.
 FP programs: 4 - 45 IPC (Fortran, no heap); integer programs: 4 - 9 IPC.]

                        Chap. 4 - Pipelining II                  108
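The four alias models in the figure differ only in how much the hardware or
compiler knows about which memory references can overlap. As a hedged
illustration (a small C sketch, not taken from the slides; the function and
parameter names are invented), compare a loop that must be scheduled
conservatively with one whose pointers are known not to alias:

/* Hedged illustration (not from the slides): why unknown aliasing forces
   conservative scheduling. If the compiler or issue logic cannot prove that
   p and q never overlap, the load of q[i-1] may read the value stored to
   p[i-1] in the previous iteration, so loads cannot be hoisted above earlier
   stores and iterations cannot be overlapped. */
void update(double *p, double *q, int n) {
    for (int i = 1; i < n; i++) {
        p[i] = q[i - 1] * 2.0;   /* store through p                     */
        q[i] = q[i] + 1.0;       /* load/store through q - may alias p  */
    }
}

/* With "perfect" alias information - approximated in C99 by 'restrict',
   which promises the two arrays never overlap - the memory references are
   known to be independent, so the scheduler may reorder and overlap
   iterations, exposing far more ILP. */
void update_no_alias(double * restrict p, double * restrict q, int n) {
    for (int i = 1; i < n; i++) {
        p[i] = q[i - 1] * 2.0;
        q[i] = q[i] + 1.0;
    }
}

The "Global/stack perfect" bars correspond to having this kind of certainty
for global and stack data but not for heap references, which is one reason
the heap-free Fortran FP codes retain far more parallelism (4 - 45 IPC) than
the integer C programs (4 - 9 IPC).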
                       Summary

4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP




                        Chap. 4 - Pipelining II                  109