ILP

					CSC504: ILP

 Lecture 3
Outline

Instruction Level Parallelism (ILP)
Recap: Data Dependencies
Extended MIPS Pipeline and Hazards
Dynamic scheduling with a scoreboard




ILP: Concepts and Challenges

ILP (Instruction Level Parallelism) –
overlap execution of unrelated instructions
Techniques that increase amount of parallelism
exploited among instructions
  reduce impact of data and control hazards
  increase processor ability to exploit parallelism
Pipeline CPI = Ideal pipeline CPI +
Structural stalls + RAW stalls +
WAR stalls + WAW stalls + Control stalls
  Reducing any term on the right-hand side
  lowers CPI and thus increases instruction throughput
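
As a quick numerical illustration of the CPI equation above, the sketch below simply sums the stall terms. The function name `pipeline_cpi` and the stall values are hypothetical, chosen only to show the arithmetic; they are not measurements.

```python
def pipeline_cpi(ideal, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal + structural + raw + war + waw + control

# Hypothetical stall rates in cycles per instruction (assumed values).
before = pipeline_cpi(1.0, raw=0.5, control=0.25)
after = pipeline_cpi(1.0, raw=0.25, control=0.25)   # e.g. after better scheduling
print(before, after)   # 1.75 1.5
```

Halving the RAW term in this made-up case lowers the CPI from 1.75 to 1.5, which is the sense in which each technique targets one term of the sum.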

Two approaches to exploit parallelism

Dynamic techniques
  largely depend on hardware
  to locate the parallelism
Static techniques
  rely on software




Definition: Data Dependencies

Data dependence: instruction j is data dependent on
instruction i if either of the following holds
  Instruction i produces a result used by instruction j, or
  Instruction j is data dependent on instruction k, and
  instruction k is data dependent on instruction i
If dependent, cannot execute in parallel
Try to schedule to avoid hazards
Easy to determine for registers (fixed names)
Hard for memory (“memory disambiguation”):
  Does 100(R4) = 20(R6)?
     The effective address can also change from one execution of
     an instruction to the next.

   Examples of Data Dependencies

Loop:   L.D      F0, 0(R1)      ;   F0 = array element
        ADD.D    F4, F0, F2     ;   add scalar in F2
        S.D      0(R1), F4      ;   store result
        DADDUI   R1, R1, #-8    ;   decrement pointer by 8 bytes
        BNE      R1, R2, Loop   ;   branch if R1 != R2
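
The dependences in this loop can be listed mechanically. The following is a minimal sketch, assuming each instruction is described by hand as a destination register plus its source registers; the names `LOOP`, `direct_dependences` and `transitive_closure` are illustrative. It looks at a single iteration only, and at registers only, so loop-carried and memory dependences are ignored.

```python
# Each entry: (text, destination register or None, source registers).
LOOP = [
    ("L.D    F0, 0(R1)",    "F0", ("R1",)),
    ("ADD.D  F4, F0, F2",   "F4", ("F0", "F2")),
    ("S.D    0(R1), F4",    None, ("R1", "F4")),
    ("DADDUI R1, R1, #-8",  "R1", ("R1",)),
    ("BNE    R1, R2, Loop", None, ("R1", "R2")),
]

def direct_dependences(prog):
    """Pairs (i, j): instruction j reads a register last written by instruction i."""
    edges, last_writer = set(), {}
    for j, (_, dest, srcs) in enumerate(prog):
        for s in srcs:
            if s in last_writer:
                edges.add((last_writer[s], j))
        if dest is not None:
            last_writer[dest] = j
    return edges

def transitive_closure(edges):
    """Add (i, j) whenever j depends on k and k depends on i."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

direct = direct_dependences(LOOP)
print(sorted(direct))                      # [(0, 1), (1, 2), (3, 4)]
print(sorted(transitive_closure(direct)))  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

The extra pair (0, 2) is the second case of the definition: the store depends on the load through the ADD.D.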




Definition: Name Dependencies
Two instructions use same name
(register or memory location) but don’t exchange data
   Antidependence (WAR if a hazard for HW)
   Instruction j writes a register or memory location that instruction i
   reads from and instruction i is executed first
   Output dependence (WAW if a hazard for HW)
   Instruction i and instruction j write the same register or memory
   location; ordering between instructions must be preserved. If
   dependent, can’t execute in parallel
Renaming removes name dependencies (it cannot remove true data dependencies)
Again, name dependencies are hard to detect for memory accesses
   Does 100(R4) = 20(R6)?
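
For registers, the three cases can be told apart by comparing destinations and sources. This is a minimal sketch, assuming each instruction is given as `(destination, sources)`; the function name `classify` is illustrative. It compares one pair only and ignores memory operands, which would need address comparison.

```python
def classify(first, second):
    """first precedes second in program order; each is (dest, sources)."""
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("RAW")   # true data dependence
    if d2 is not None and d2 in s1:
        kinds.add("WAR")   # antidependence
    if d1 is not None and d1 == d2:
        kinds.add("WAW")   # output dependence
    return kinds

load = ("F0", ("R1",))        # L.D   F0,0(R1)
add = ("F4", ("F0", "F2"))    # ADD.D F4,F0,F2

print(classify(load, add))    # {'RAW'}
print(classify(add, load))    # {'WAR'}  a later L.D F0 overwrites what ADD.D reads
print(classify(load, load))   # {'WAW'}  two loads into F0
```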



 Where are the name dependencies?
Loop:  L.D      F0,0(R1)
       ADD.D    F4,F0,F2
       S.D      0(R1),F4     ;drop DSUBUI & BNEZ
       L.D      F0,-8(R1)
       ADD.D    F4,F0,F2
       S.D      -8(R1),F4    ;drop DSUBUI & BNEZ
       L.D      F0,-16(R1)
       ADD.D    F4,F0,F2
       S.D      -16(R1),F4   ;drop DSUBUI & BNEZ
       L.D      F0,-24(R1)
       ADD.D    F4,F0,F2
       S.D      -24(R1),F4
       DSUBUI   R1,R1,#32    ;alter to 4*8
       BNEZ     R1,Loop
       NOP

               How can we remove them?

Where are the name dependencies?
Loop:  L.D      F0,0(R1)
       ADD.D    F4,F0,F2
       S.D      0(R1),F4      ;drop DSUBUI & BNEZ
       L.D      F6,-8(R1)
       ADD.D    F8,F6,F2
       S.D      -8(R1),F8     ;drop DSUBUI & BNEZ
       L.D      F10,-16(R1)
       ADD.D    F12,F10,F2
       S.D      -16(R1),F12   ;drop DSUBUI & BNEZ
       L.D      F14,-24(R1)
       ADD.D    F16,F14,F2
       S.D      -24(R1),F16
       DSUBUI   R1,R1,#32     ;alter to 4*8
       BNEZ     R1,Loop
       NOP

        The original "register renaming"
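
The renaming step can be sketched in a few lines. This is an illustration under stated assumptions, not the compiler's actual algorithm: instructions are written as `(op, destination, sources)`, memory offsets are left out, and the caller supplies a list of registers known to be free (`fresh_names`, a hypothetical parameter).

```python
def rename(prog, fresh_names):
    """Give every redefinition of a register a fresh name."""
    fresh = iter(fresh_names)
    current = {}   # original name -> register that now holds its latest value
    out = []
    for op, dest, srcs in prog:
        # Sources read whichever register holds the latest value.
        srcs = tuple(current.get(s, s) for s in srcs)
        if dest is not None:
            # First definition keeps its name; later ones get a fresh register.
            current[dest] = next(fresh) if dest in current else dest
            dest = current[dest]
        out.append((op, dest, srcs))
    return out

# Two copies of the loop body, both using F0 and F4.
two_iterations = [
    ("L.D",   "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D",   None, ("R1", "F4")),
    ("L.D",   "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D",   None, ("R1", "F4")),
]
renamed = rename(two_iterations, ["F6", "F8"])
for instr in renamed:
    print(instr)
```

With `F6` and `F8` as the free registers, the second copy becomes `L.D F6`, `ADD.D F8,F6,F2`, `S.D ...,F8`, the same choice of names as in the listing above. Only the true dependences inside each copy remain.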

Definition: Control Dependencies
Example: if p1 {S1;}; if p2 {S2;};
S1 is control dependent on p1 and
S2 is control dependent on p2 but not on p1
Two constraints on control dependences:
   An instruction that is control dependent on a branch cannot be
   moved before the branch, where its execution would no longer be
   controlled by the branch
   An instruction that is not control dependent on a branch cannot be
   moved after the branch, where its execution would become
   controlled by the branch
           DADDU R5, R6, R7
           ADD R1, R2, R3
           BEQZ R4, L
           SUB R1, R5, R6
   L:      OR R7, R1, R8
Dynamically Scheduled Pipelines
  Overcoming Data Hazards
  with Dynamic Scheduling

 Why in HW at run time?
    Works when real dependences cannot be known
    at compile time
    Simpler compiler
    Code for one machine runs well on another
 Example
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F12,F8,F12
    SUB.D cannot execute because the dependence of ADD.D on DIV.D
    causes the pipeline to stall; yet SUB.D is not data dependent
    on any instruction in the pipeline!



 Key idea: Allow instructions behind stall to proceed
Overcoming Data Hazards
with Dynamic Scheduling (cont’d)

Enables out-of-order execution =>
out-of-order completion
Out-of-order execution divides ID stage:
  1. Issue—decode instructions,
  check for structural hazards
  2. Read operands—wait until no data hazards,
  then read operands
Scoreboarding –
technique for allowing instructions
to execute out of order when there are sufficient
resources and no data dependencies (CDC 6600,
1963)
Scoreboarding Implications
Out-of-order completion =>
WAR, WAW hazards?
   WAR on F8:                    WAW on F10:
      DIV.D   F0,F2,F4              DIV.D   F0,F2,F4
      ADD.D   F10,F0,F8             ADD.D   F10,F0,F8
      SUB.D   F8,F8,F12             SUB.D   F10,F8,F12
Solutions for WAR
   Queue both the operation and copies of its operands
   Read registers only during Read Operands stage
For WAW, must detect hazard:
stall until other completes
Need to have multiple instructions in execution phase =>
multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies,
state of operations
Scoreboard replaces ID, EX, WB with 4 stages
 Four Stages of Scoreboard Control
ID1: Issue — decode instructions &
check for structural hazards
ID2: Read operands — wait until no data hazards,
then read operands
EX: Execute — operate on operands;
when the result is ready, it notifies the scoreboard that it has
completed execution
WB: Write results — finish execution;
the scoreboard checks for WAR hazards.
If none, it writes results. If WAR, then it stalls the instruction
 DIV.D   F0,F2,F4
 ADD.D   F10,F0,F8
 SUB.D   F8,F8,F12
 Scoreboarding stalls SUB.D in its write result stage until
 ADD.D reads its operands

 Four Stages of Scoreboard Control
1.    Issue—decode instructions & check for structural hazards (ID1)
     If a functional unit for the instruction is free and no other active
     instruction has the same destination register (WAW), the scoreboard
     issues the instruction to the functional unit and updates its internal data
     structure. If a structural or WAW hazard exists, then the instruction issue
     stalls, and no further instructions will issue until these hazards are
     cleared.
2. Read operands—wait until no data hazards, then read
operands (ID2)
     A source operand is available if no earlier issued active instruction is
     going to write it, or if the register containing the operand is being written
     by a currently active functional unit. When the source operands are
     available, the scoreboard tells the functional unit to proceed to read the
     operands from the registers and begin execution. The scoreboard
     resolves RAW hazards dynamically in this step, and instructions may be
     sent into execution out of order.

Four Stages of Scoreboard Control
3.    Execution—operate on operands (EX)
     The functional unit begins execution upon receiving operands.
     When the result is ready, it notifies the scoreboard that it has
     completed execution.
4.    Write result—finish execution (WB)
     Once the scoreboard is aware that the functional unit has
     completed execution, the scoreboard checks for WAR hazards.
     If none, it writes results. If WAR, then it stalls the instruction.
     Example:
                  DIV.D              F0,F2,F4
                  ADD.D              F10,F0,F8
                  SUB.D              F8,F8,F14


     CDC 6600 scoreboard would stall SUB.D until ADD.D
     reads operands
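
The four stages can be exercised with a small timing model. This is a simplified sketch, not the CDC 6600 logic: it assumes one Integer, one Add, one Divide and two Mult units, execution latencies of 1, 2, 10 and 40 cycles, in-order issue of at most one instruction per cycle, and that a stage taken in cycle t only sees events stamped before t. All names (`Instr`, `simulate`, `example_program`) are illustrative.

```python
from dataclasses import dataclass

LATENCY = {"Integer": 1, "Add": 2, "Mult": 10, "Divide": 40}   # assumed
UNITS = {"Integer": 1, "Add": 1, "Mult": 2, "Divide": 1}       # assumed

@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    unit: str
    issue: int = 0   # cycle stamps; 0 means "not yet"
    read: int = 0
    done: int = 0
    write: int = 0

def simulate(prog, max_cycles=1000):
    for t in range(1, max_cycles + 1):
        # Instructions that still hold a unit and a destination in cycle t.
        active = [i for i in prog if i.issue and (i.write == 0 or i.write >= t)]
        # 1. Issue: needs a free unit (structural) and no active writer of dest (WAW).
        nxt = next((i for i in prog if i.issue == 0), None)
        if nxt is not None:
            unit_free = sum(a.unit == nxt.unit for a in active) < UNITS[nxt.unit]
            no_waw = all(a.dest != nxt.dest for a in active)
            if unit_free and no_waw:
                nxt.issue = t
        for k, ins in enumerate(prog):
            earlier = prog[:k]
            if ins.issue and ins.issue < t and ins.read == 0:
                # 2. Read operands: every earlier writer of a source has written (RAW).
                if all(0 < e.write < t for e in earlier if e.dest in ins.srcs):
                    ins.read = t
                    ins.done = t + LATENCY[ins.unit]   # 3. Execution completes here
            elif ins.done and ins.done < t and ins.write == 0:
                # 4. Write result: every earlier reader of dest has read it (WAR).
                if all(0 < e.read < t for e in earlier if ins.dest in e.srcs):
                    ins.write = t
        if all(i.write for i in prog):
            return t
    raise RuntimeError("did not finish within max_cycles")

def example_program():
    return [
        Instr("L.D",   "F6",  ("R2",),      "Integer"),
        Instr("L.D",   "F2",  ("R3",),      "Integer"),
        Instr("MUL.D", "F0",  ("F2", "F4"), "Mult"),
        Instr("SUB.D", "F8",  ("F6", "F2"), "Add"),
        Instr("DIV.D", "F10", ("F0", "F6"), "Divide"),
        Instr("ADD.D", "F6",  ("F8", "F2"), "Add"),
    ]

prog = example_program()
last_cycle = simulate(prog)
for i in prog:
    print(i.name, i.dest, i.issue, i.read, i.done, i.write)
```

Under these assumptions ADD.D finishes executing in cycle 16 but does not write F6 until cycle 22, the cycle after DIV.D has read F6: the WAR stall described in stage 4.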
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
(Capacity = window size)
2. Functional unit status—Indicates the state of the
functional unit (FU). 9 fields for each functional unit
   Busy—Indicates whether the unit is busy or not
   Op—Operation to perform in the unit (e.g., + or –)
   Fi—Destination register
   Fj, Fk—Source-register numbers
   Qj, Qk—Functional units producing source registers Fj, Fk
   Rj, Rk—Flags indicating when Fj, Fk are ready
3. Register result status—Indicates which functional unit will
write each register, if one exists. Blank when no pending
instructions will write that register
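
The functional unit status and register result status tables map naturally onto simple records. The sketch below is one possible layout, assuming the nine fields listed above and a dictionary for the register result status; `FUStatus` and `try_issue` are illustrative names, and only the Issue-stage bookkeeping is shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk (None if no pending writer)
    qk: Optional[str] = None
    rj: bool = False           # True when Fj / Fk are ready to be read
    rk: bool = False

def try_issue(units, reg_result, unit, op, fi, fj, fk):
    """Issue-stage bookkeeping; returns False on a structural or WAW hazard."""
    if units[unit].busy or fi in reg_result:
        return False
    qj, qk = reg_result.get(fj), reg_result.get(fk)
    units[unit] = FUStatus(True, op, fi, fj, fk, qj, qk, qj is None, qk is None)
    reg_result[fi] = unit      # register result status: this unit will write fi
    return True

units = {name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
reg_result = {"F2": "Integer"}   # a load into F2 is still pending
issued = try_issue(units, reg_result, "Mult1", "Mult", "F0", "F2", "F4")
print(issued, units["Mult1"])
```

Issuing `MUL.D F0,F2,F4` while a load into F2 is pending records `Qj = Integer`, `Rj = No`, `Rk = Yes`. A second instruction targeting F0 would be refused as a WAW hazard.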

MIPS with a Scoreboard
[Figure: MIPS with a scoreboard. Labeled blocks: Registers; FP Mult (two units); FP Div; Add1, Add2, Add3; Scoreboard, with Control/Status lines on the left and on the right.]

 Scoreboard Example
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
       FU

     Scoreboard Example: Cycle 1
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   1   FU                        Integer

 Issue 1st L.D!

     Scoreboard Example: Cycle 2
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   2   FU                        Integer

 Issue 2nd L.D? Structural hazard!
 No further instructions will issue!

     Scoreboard Example: Cycle 3
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   3   FU                        Integer

 Issue MUL.D?

     Scoreboard Example: Cycle 4
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   4   FU                        Integer

  Check for WAR hazards!
  If none, write result!

     Scoreboard Example: Cycle 5
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   5   FU          Integer

  Issue 2nd L.D!

     Scoreboard Example: Cycle 6
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6
MUL.D  F0    F2   F4   6
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   6   FU   Mult1  Integer

  Issue MUL.D!

     Scoreboard Example: Cycle 7
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7
MUL.D  F0    F2   F4   6
SUB.D  F8    F6   F2   7
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   No
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   7   FU   Mult1  Integer                Add

  Issue SUB.D!

     Scoreboard Example: Cycle 8
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6
SUB.D  F8    F6   F2   7
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   8   FU   Mult1  Integer                Add  Divide

  Issue DIV.D!

     Scoreboard Example: Cycle 9
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
  10  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
   2  Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
   9   FU   Mult1                         Add  Divide

  Read operands for MUL.D and SUB.D!
  Assume we can feed Mult1 and Add units in the same clock cycle.
  Issue ADD.D? Structural hazard (unit is busy)!

     Scoreboard Example: Cycle 11
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   8  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
   0  Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  11   FU   Mult1                         Add  Divide

  Last cycle of SUB.D execution.

     Scoreboard Example: Cycle 12
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   7  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  12   FU   Mult1                         Add  Divide

  Check WAR on F8. Write F8.

     Scoreboard Example: Cycle 13
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   6  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  13   FU   Mult1                Add           Divide

  Issue ADD.D!

     Scoreboard Example: Cycle 14
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   5  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
   2  Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  14   FU   Mult1                Add           Divide

  Read operands for ADD.D!

     Scoreboard Example: Cycle 15
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   4  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
   1  Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  15   FU   Mult1                Add           Divide

     Scoreboard Example: Cycle 16
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14        16
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   3  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
   0  Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  16   FU   Mult1                Add           Divide

     Scoreboard Example: Cycle 17
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14        16
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   2  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  17   FU   Mult1                Add           Divide

  Why can't ADD.D write F6? WAR hazard: DIV.D has not yet read F6.

     Scoreboard Example: Cycle 19
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9         19
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14        16
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
   0  Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  19   FU   Mult1                Add           Divide

     Scoreboard Example: Cycle 20
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9         19         20
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8
ADD.D  F6    F8   F2   13     14        16
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2   F4   Integer           Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  20   FU   Mult1                Add           Divide

     Scoreboard Example: Cycle 21
Instruction status
                              Read      Execution  Write
Instruction  j    k    Issue  operands  complete   result
L.D    F6    34+  R2   1      2         3          4
L.D    F2    45+  R3   5      6         7          8
MUL.D  F0    F2   F4   6      9         19         20
SUB.D  F8    F6   F2   7      9         11         12
DIV.D  F10   F0   F6   8      21
ADD.D  F6    F8   F2   13     14        16
Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             Yes  Yes
Register result status
Clock       F0     F2       F4   F6       F8   F10     F12  ...  F30
  21   FU                        Add           Divide

     Scoreboard Example: Cycle 22
Instruction status               Read     Exec     Write
Instruction      j     k   Issue operands complete result
L.D      F6    34+ R2        1     2        3       4
L.D      F2    45+ R3        5     6        7       8
MUL.D    F0    F2  F4        6     9        19      20
SUB.D    F8    F6  F2        7     9        11      12
DIV.D    F10   F0  F6        8     21
ADD.D    F6    F8  F2        13    14       16      22
Functional unit status                  dest    S1      S2      FU for j  FU for k  Fj?   Fk?
         Time Name         Busy Op      Fi      Fj      Fk      Qj        Qk        Rj    Rk
              Integer      No
              Mult1        No
              Mult2        No
              Add          Yes  Add     F6      F8      F2                          Yes   Yes
           40 Divide       Yes  Div     F10     F0      F6      Mult1               Yes   Yes
Register result status
Clock                      F0      F2      F4      F6      F8      F10     F12     ...   F30
      22             FU                            Add             Divide


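  Write F6? DIV.D read its operands in cycle 21, so the WAR hazard
  on F6 is gone and ADD.D can write its result in cycle 22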

                                                                                        40
     Scoreboard Example: Cycle 61
Instruction status               Read     Exec     Write
Instruction      j     k   Issue operands complete result
L.D      F6    34+ R2        1     2        3       4
L.D      F2    45+ R3        5     6        7       8
MUL.D    F0    F2  F4        6     9        19      20
SUB.D    F8    F6  F2        7     9        11      12
DIV.D    F10   F0  F6        8     21       61
ADD.D    F6    F8  F2        13    14       16      22
Functional unit status                  dest    S1      S2      FU for j  FU for k  Fj?   Fk?
         Time Name         Busy Op      Fi      Fj      Fk      Qj        Qk        Rj    Rk
              Integer      No
              Mult1        No
              Mult2        No
              Add          No
            0 Divide       Yes  Div     F10     F0      F6      Mult1               Yes   Yes
Register result status
Clock                      F0      F2      F4      F6      F8      F10     F12     ...   F30
      61             FU                                            Divide




                                                                                        41
     Scoreboard Example: Cycle 62
Instruction status               Read     Exec     Write
Instruction      j     k   Issue operands complete result
L.D      F6    34+ R2        1     2        3       4
L.D      F2    45+ R3        5     6        7       8
MUL.D    F0    F2  F4        6     9        19      20
SUB.D    F8    F6  F2        7     9        11      12
DIV.D    F10   F0  F6        8     21       61      62
ADD.D    F6    F8  F2        13    14       16      22
Functional unit status                  dest    S1      S2      FU for j  FU for k  Fj?   Fk?
         Time Name         Busy Op      Fi      Fj      Fk      Qj        Qk        Rj    Rk
              Integer      No
              Mult1        No
              Mult2        No
              Add          No
              Divide       Yes  Div     F10     F0      F6      Mult1               Yes   Yes
Register result status
Clock                      F0      F2      F4      F6      F8      F10     F12     ...   F30
      62             FU                                            Divide




                                                                                        42
Scoreboard Results

For the CDC 6600
  70% improvement for Fortran
  150% improvement for hand-coded assembly
  language
  cost was similar to one of the functional units
     surprisingly low
     bulk of cost was in the extra busses
Still, this was in ancient times
  no caches & no main semiconductor memory
  no software pipelining
  compilers?
So, why is it coming back?
  performance via ILP
                                                    43
Scoreboard Limitations

Amount of parallelism among instructions
  can we find independent instructions to execute
Number of scoreboard entries
  how far ahead the pipeline can look for independent
  instructions (we assume a window does not extend
  beyond a branch)
Number and types of functional units
  avoid structural hazards
Presence of antidependences and output
dependences
  WAR and WAW stalls become more important
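
The WAR stall in the last point can be sketched in a few lines of Python. This is only an illustration: it assumes the usual textbook convention that Rj/Rk are reset to No once a unit has read its operands, and the names `FUStatus` and `can_write_result` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FUStatus:
    name: str
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # True: Fj is ready and has not been read yet
    rk: bool = False           # True: Fk is ready and has not been read yet


def can_write_result(fu: FUStatus, units: List[FUStatus]) -> bool:
    """WAR check: fu may write Fi only if no other busy unit
    still has to read the old value of that register."""
    for other in units:
        if other is fu or not other.busy:
            continue
        if other.fj == fu.fi and other.rj:
            return False
        if other.fk == fu.fi and other.rk:
            return False
    return True


# ADD.D F6,F8,F2 has finished executing; DIV.D F10,F0,F6 has not read F6 yet.
add = FUStatus("Add", True, "F6", "F8", "F2")
div = FUStatus("Divide", True, "F10", "F0", "F6", rj=False, rk=True)
units = [add, div]
print(can_write_result(add, units))   # False: writing F6 now would be a WAR hazard

# Once DIV.D has read its operands, the flags are cleared and ADD.D may write.
div.rj = div.rk = False
print(can_write_result(add, units))   # True
```

The check scans every other busy unit, which is why the cost of the scoreboard grows with the number of functional units it tracks.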
                                                        44
Things to Remember

Pipeline CPI = Ideal pipeline CPI + Structural
stalls + RAW stalls + WAR stalls + WAW stalls
+ Control stalls
Data dependencies
Dynamic scheduling to minimise stalls
Dynamic scheduling with a scoreboard
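
As a quick arithmetic check of the CPI equation above, here is a minimal Python sketch. The stall values are hypothetical numbers chosen for illustration (average stall cycles per instruction), not measurements from any machine, and `pipeline_cpi` is an illustrative name.

```python
def pipeline_cpi(ideal, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI + the stall cycles per instruction of each kind."""
    return ideal + structural + raw + war + waw + control


# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(1.0, structural=0.05, raw=0.30, war=0.02, waw=0.01, control=0.20)
print(f"{cpi:.2f}")   # 1.58

# Removing the WAR and WAW terms (e.g. by renaming) lowers CPI accordingly.
print(f"{pipeline_cpi(1.0, structural=0.05, raw=0.30, control=0.20):.2f}")   # 1.55
```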



                                                 45
Tomasulo’s Algorithm
Used in IBM 360/91 FPU (before caches)
Goal: high FP performance without special compilers
Conditions:
   Small number of floating point registers (4 in 360) prevented
   interesting compiler scheduling of operations
   Long memory accesses and long FP delays
   This led Tomasulo to try to figure out how to get more effective
   registers: renaming in hardware!
Why study a 1966 computer?
The descendants of this have flourished!
   Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604,
   …


                                                                      47
Tomasulo’s Algorithm (cont’d)
Control & buffers distributed with Functional Units (FU)
   FU buffers called “reservation stations” =>
   buffer the operands of instructions waiting to issue
Registers in instructions replaced by values or pointers to
reservation stations (RS) => register renaming
   avoids WAR, WAW hazards
   More reservation stations than registers,
   so can do optimizations compilers can’t
Results to FU from RS, not through registers, over Common
Data Bus that broadcasts results to all FUs
Loads and Stores treated as FUs with RSs as well
Integer instructions can go past branches,
allowing FP ops beyond basic block in FP queue

                                                              48
         Tomasulo-based FPU for MIPS
[Figure: block diagram of the Tomasulo-based FPU. Labelled blocks: From
Instruction Unit, FP Op Queue, FP Registers, From Mem, Load Buffers
(Load1-Load6), Store Buffers (Store1-Store3), To Mem, Reservation Stations
(Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers), and the
Common Data Bus (CDB).]
                                                                              49
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of Source operands
   Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing source registers (value
to be written)
   Note: Qj/Qk=0 => source operand is already available in Vj /Vk
   Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy

Register result status: indicates which functional unit will
write each register, if one exists. Blank when no pending
instruction will write that register.
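
The fields above map naturally onto a small record. The following Python sketch is one possible encoding, not the hardware's: `None` plays the role of Qj/Qk = 0, and the names `ReservationStation` and `issue` plus the register values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand, if available
    vk: Optional[float] = None   # value of second source operand, if available
    qj: Optional[str] = None     # station that will produce Vj (None = value is in Vj)
    qk: Optional[str] = None     # station that will produce Vk (None = value is in Vk)


def issue(rs: ReservationStation, op: str, dest: str, src_j: str, src_k: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Fill a free station: each source is captured as a value (V) or as a tag (Q)."""
    assert not rs.busy, "structural hazard: station is busy"
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j)
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk else regs[src_k]
    reg_status[dest] = rs.name   # the destination is now renamed to this station


# MULTD F0,F2,F4 issued while F2 is still pending from load buffer Load2.
regs = {"F0": 0.0, "F2": 0.0, "F4": 2.5}
reg_status = {"F2": "Load2"}          # register result status table
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
print(mult1.qj, mult1.vk, reg_status["F0"])   # Load2 2.5 Mult1
```

After issue the station no longer refers to F2 or F4 by name: it holds either a value or the tag of a producer, which is the renaming step.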

                                                                    50
 Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
   If reservation station free (no structural hazard),
   control issues instr & sends operands (renames registers)
2. Execute—operate on operands (EX)
   When both operands ready then execute;
   if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
   Write it on Common Data Bus to all awaiting units;
   mark reservation station available
Normal data bus: data + destination (“go to” bus)
Common data bus: data + source (“come from” bus)
   64 bits of data + 4 bits of Functional Unit source address
   A unit writes the value if the source matches the Functional Unit
   it expects to produce the result
   The CDB does the broadcast
Example speed: 2 clocks for Fl. Pt. +, -; 10 clocks for *; 40 clocks for /
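
The "come from" behaviour of the Common Data Bus can be sketched as follows. This is a simplified Python model under stated assumptions: stations are plain dictionaries, `None` stands for an empty Q field, and the function names and operand values are invented for the example.

```python
def broadcast(tag, value, stations, regs, reg_status):
    """Write result: put (source tag, value) on the CDB. Every station and
    register that is waiting for this tag captures the value."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


def ready(rs):
    """Execute may start once both operands are values rather than tags."""
    return rs["Busy"] and rs["Qj"] is None and rs["Qk"] is None


# Two stations are waiting for the result of load buffer Load2.
add1 = {"Name": "Add1", "Busy": True, "Op": "SUBD",
        "Vj": 7.0, "Vk": None, "Qj": None, "Qk": "Load2"}
mult1 = {"Name": "Mult1", "Busy": True, "Op": "MULTD",
         "Vj": None, "Vk": 2.0, "Qj": "Load2", "Qk": None}
regs = {"F2": 0.0}
reg_status = {"F2": "Load2", "F0": "Mult1", "F8": "Add1"}

broadcast("Load2", 3.0, [add1, mult1], regs, reg_status)
print(ready(add1), ready(mult1), regs["F2"], sorted(reg_status))
# True True 3.0 ['F0', 'F8']
```

A single broadcast wakes up both waiting stations and updates the register file, without the producer knowing who its consumers are.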
                                                                    51
           Tomasulo Example
 Instruction status:                   Exec Write
    Instruction        j   k    Issue  Comp Result              Busy Address
    LD          F6   34+   R2                            Load1    No
    LD          F2   45+   R3                            Load2    No
    MULTD       F0    F2   F4                            Load3    No
    SUBD        F8    F6   F2
    DIVD       F10    F0   F6
    ADDD        F6    F8   F2

 Reservation Stations:                 S1   S2      RS    RS
            Time Name Busy       Op     Vj   Vk      Qj    Qk
                 Add1  No
                 Add2  No
                 Add3  No
                 Mult1 No
                 Mult2 No

 Register result status:
    Clock                        F0    F2    F4      F6    F8     F10     F12    ...   F30
        0                   FU

 Notes: the instruction status table lists the instruction stream;
 there are 3 load buffers, 3 FP adder reservation stations and
 2 FP multiplier reservation stations; the Time column is the FU
 count-down; Clock is the clock cycle counter.

                                                                                         52
          Tomasulo Example Cycle 1
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                   Busy Address
   LD          F6   34+   R2    1                          Load1    Yes   34+R2
   LD          F2   45+   R3                               Load2    No
   MULTD F0          F2   F4                               Load3    No
   SUBD        F8    F6   F2
   DIVD       F10    F0   F6
   ADDD        F6    F8   F2

Reservation Stations:                 S1   S2       RS      RS
            Time Name Busy     Op     Vj   Vk       Qj      Qk
                 Add1  No
                 Add2  No
                 Add3  No
                 Mult1 No
                 Mult2 No

Register result status:
  Clock                        F0    F2    F4      F6       F8     F10    F12     ...   F30
      1                   FU                       Load1




                                                                                          53
          Tomasulo Example Cycle 2
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                   Busy Address
   LD          F6   34+   R2    1                          Load1    Yes   34+R2
   LD          F2   45+   R3    2                          Load2    Yes   45+R3
   MULTD F0          F2   F4                               Load3    No
   SUBD        F8    F6   F2
   DIVD       F10    F0   F6
   ADDD        F6    F8   F2

Reservation Stations:                 S1     S2     RS      RS
            Time Name Busy     Op     Vj     Vk     Qj      Qk
                 Add1  No
                 Add2  No
                 Add3  No
                 Mult1 No
                 Mult2 No

Register result status:
  Clock                        F0    F2      F4    F6       F8     F10    F12     ...   F30
      2                   FU         Load2         Load1


          Note: Can have multiple loads outstanding

                                                                                          54
          Tomasulo Example Cycle 3
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                   Busy Address
   LD          F6   34+   R2    1      3                   Load1    Yes   34+R2
   LD          F2   45+   R3    2                          Load2    Yes   45+R3
   MULTD F0          F2   F4    3                          Load3    No
   SUBD        F8    F6   F2
   DIVD       F10    F0   F6
   ADDD        F6    F8   F2

Reservation Stations:                 S1     S2     RS      RS
            Time Name Busy Op         Vj     Vk     Qj      Qk
                 Add1  No
                 Add2  No
                 Add3  No
                 Mult1 Yes MULTD             R(F4) Load2
                 Mult2 No

Register result status:
  Clock                        F0    F2      F4    F6       F8     F10    F12     ...   F30
      3                   FU   Mult1 Load2         Load1

     • Note: register names are removed (“renamed”) in Reservation
       Stations; MULTD issued
     • Load1 completing; what is waiting for Load1?
                                                                                          55
           Tomasulo Example Cycle 4
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1      3     4           Load1    No
   LD          F2   45+   R3    2      4                 Load2    Yes   45+R3
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4
   DIVD       F10    F0   F6
   ADDD        F6    F8   F2

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                 Add1 Yes SUBD M(A1)             Load2
                 Add2  No
                 Add3  No
                 Mult1 Yes MULTD     R(F4) Load2
                 Mult2 No

Register result status:
  Clock                        F0    F2      F4    F6     F8     F10    F12     ...   F30
      4                   FU   Mult1 Load2         M(A1) Add1



          • Load2 completing; what is waiting for Load2?
                                                                                        56
          Tomasulo Example Cycle 5
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2     4      5           Load2    No
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                2 Add1 Yes SUBD M(A1) M(A2)
                  Add2  No
                  Add3  No
               10 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6     F8     F10   F12      ...   F30
      5                   FU   Mult1 M(A2)         M(A1) Add1 Mult2



          • Timers start counting down for Add1, Mult1
                                                                                        57
          Tomasulo Example Cycle 6
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                  Busy Address
   LD          F6   34+   R2    1     3      4            Load1    No
   LD          F2   45+   R3    2     4      5            Load2    No
   MULTD F0          F2   F4    3                         Load3    No
   SUBD        F8    F6   F2    4
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6

Reservation Stations:                 S1     S2    RS      RS
            Time Name Busy Op         Vj     Vk    Qj      Qk
                1 Add1 Yes SUBD M(A1) M(A2)
                  Add2 Yes ADDD         M(A2) Add1
                  Add3  No
                9 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6      F8     F10   F12      ...   F30
      6                   FU   Mult1 M(A2)         Add2   Add1 Mult2



       • Issue ADDD here despite name dependency on F6?

                                                                                         58
          Tomasulo Example Cycle 7
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                  Busy Address
   LD          F6   34+   R2    1     3      4            Load1    No
   LD          F2   45+   R3    2     4      5            Load2    No
   MULTD F0          F2   F4    3                         Load3    No
   SUBD        F8    F6   F2    4     7
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6

Reservation Stations:                 S1     S2    RS      RS
            Time Name Busy Op         Vj     Vk    Qj      Qk
                0 Add1 Yes SUBD M(A1) M(A2)
                  Add2 Yes ADDD         M(A2) Add1
                  Add3  No
                8 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6      F8     F10   F12      ...   F30
      7                   FU   Mult1 M(A2)         Add2   Add1 Mult2



          • Add1 (SUBD) completing; what is waiting for it?
                                                                                         59
          Tomasulo Example Cycle 8
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3      4          Load1    No
   LD          F2   45+   R3    2     4      5          Load2    No
   MULTD F0          F2   F4    3                       Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6

Reservation Stations:                 S1     S2    RS    RS
            Time Name Busy Op         Vj     Vk    Qj    Qk
                  Add1  No
                2 Add2 Yes ADDD (M-M) M(A2)
                  Add3  No
                7 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6    F8     F10   F12      ...   F30
      8                   FU   Mult1 M(A2)         Add2 (M-M) Mult2




                                                                                       60
          Tomasulo Example Cycle 9
Instruction status:  Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3      4          Load1    No
   LD          F2   45+   R3    2     4      5          Load2    No
   MULTD F0          F2   F4    3                       Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6

Reservation Stations:                 S1     S2    RS    RS
            Time Name Busy Op         Vj     Vk    Qj    Qk
                  Add1  No
                1 Add2 Yes ADDD (M-M) M(A2)
                  Add3  No
                6 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6    F8     F10   F12      ...   F30
      9                   FU   Mult1 M(A2)         Add2 (M-M) Mult2




                                                                                       61
           Tomasulo Example Cycle 10
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3      4          Load1    No
   LD          F2   45+   R3    2     4      5          Load2    No
   MULTD F0          F2   F4    3                       Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10

Reservation Stations:                 S1     S2    RS    RS
            Time Name Busy Op         Vj     Vk    Qj    Qk
                  Add1  No
                0 Add2 Yes ADDD (M-M) M(A2)
                  Add3  No
                5 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4    F6    F8     F10   F12      ...   F30
      10                  FU   Mult1 M(A2)         Add2 (M-M) Mult2



           • Add2 (ADDD) completing; what is waiting for it?
                                                                                       62
           Tomasulo Example Cycle 11
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2     4      5           Load2    No
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10     11

Reservation Stations:                 S1     S2     RS    RS
             Time Name Busy Op        Vj     Vk     Qj    Qk
                   Add1  No
                   Add2  No
                   Add3  No
                 4 Mult1 Yes MULTD M(A2) R(F4)
                   Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4     F6    F8     F10   F12      ...   F30
      11                  FU   Mult1 M(A2)        (M-M+M) (M-M) Mult2


           • Write result of ADDD here?
           • All quick instructions complete in this cycle!
                                                                                        63
           Tomasulo Example Cycle 12
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2     4      5           Load2    No
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10     11

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                3 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4     F6    F8     F10   F12      ...   F30
      12                  FU   Mult1 M(A2)        (M-M+M) (M-M) Mult2




                                                                                        64
           Tomasulo Example Cycle 13
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2     4      5           Load2    No
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10     11

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                2 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4     F6    F8     F10   F12      ...   F30
      13                  FU   Mult1 M(A2)        (M-M+M) (M-M) Mult2




                                                                                        65
           Tomasulo Example Cycle 14
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2     4      5           Load2    No
   MULTD F0          F2   F4    3                        Load3    No
   SUBD        F8    F6   F2    4     7      8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10     11

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                1 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4     F6    F8     F10   F12      ...   F30
      14                  FU   Mult1 M(A2)        (M-M+M) (M-M) Mult2




                                                                                        66
           Tomasulo Example Cycle 15
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                 Busy Address
   LD          F6   34+   R2    1     3      4           Load1    No
   LD          F2   45+   R3    2      4     5           Load2    No
   MULTD F0          F2   F4    3     15                 Load3    No
   SUBD        F8    F6   F2    4      7     8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10     11

Reservation Stations:                 S1     S2     RS    RS
            Time Name Busy Op         Vj     Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                0 Mult1 Yes MULTD M(A2) R(F4)
                  Mult2 Yes DIVD        M(A1) Mult1

Register result status:
  Clock                        F0    F2      F4     F6    F8     F10   F12      ...   F30
      15                  FU   Mult1 M(A2)        (M-M+M) (M-M) Mult2



           • Mult1 (MULTD) completing; what is waiting for it?
                                                                                        67
           Tomasulo Example Cycle 16
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3     4           Load1    No
   LD          F2   45+   R3    2      4     5          Load2    No
   MULTD F0          F2   F4    3     15    16          Load3    No
   SUBD        F8    F6   F2    4      7     8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10    11

Reservation Stations:                 S1    S2     RS    RS
            Time Name Busy Op         Vj    Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                  Mult1 No
               40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:
  Clock                        F0    F2     F4     F6    F8     F10   F12      ...   F30
      16                  FU   M*F4 M(A2)        (M-M+M) (M-M) Mult2



           • Just waiting for Mult2 (DIVD) to complete
                                                                                       68
           Tomasulo Example Cycle 55
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3     4           Load1    No
   LD          F2   45+   R3    2      4     5          Load2    No
   MULTD F0          F2   F4    3     15    16          Load3    No
   SUBD        F8    F6   F2    4      7     8
   DIVD       F10    F0   F6    5
   ADDD        F6    F8   F2    6     10    11

Reservation Stations:                 S1    S2     RS    RS
            Time Name Busy Op         Vj    Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                  Mult1 No
                1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:
  Clock                        F0    F2     F4     F6    F8     F10   F12      ...   F30
      55                  FU   M*F4 M(A2)        (M-M+M) (M-M) Mult2




                                                                                       69
           Tomasulo Example Cycle 56
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3     4           Load1    No
   LD          F2   45+   R3    2      4     5          Load2    No
   MULTD F0          F2   F4    3     15    16          Load3    No
   SUBD        F8    F6   F2    4      7     8
   DIVD       F10    F0   F6    5     56
   ADDD        F6    F8   F2    6     10    11

Reservation Stations:                 S1    S2     RS    RS
            Time Name Busy Op         Vj    Vk     Qj    Qk
                  Add1  No
                  Add2  No
                  Add3  No
                  Mult1 No
                0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:
  Clock                        F0    F2     F4     F6    F8     F10   F12      ...   F30
      56                  FU   M*F4 M(A2)        (M-M+M) (M-M) Mult2



 • Mult2 (DIVD) is completing; what is waiting for it?
                                                                                       70
           Tomasulo Example Cycle 57
Instruction status:   Exec Write
   Instruction        j   k    Issue Comp Result                Busy Address
   LD          F6   34+   R2    1     3     4           Load1    No
   LD          F2   45+   R3    2      4     5          Load2    No
   MULTD F0          F2   F4    3     15    16          Load3    No
   SUBD        F8    F6   F2    4      7     8
   DIVD       F10    F0   F6    5     56    57
   ADDD        F6    F8   F2    6     10    11

Reservation Stations:                 S1    S2     RS    RS
            Time Name Busy Op         Vj    Vk     Qj    Qk
                 Add1  No
                 Add2  No
                 Add3  No
                 Mult1 No
                 Mult2 No

Register result status:
  Clock                        F0    F2     F4     F6    F8     F10   F12      ...   F30
      57                  FU   M*F4 M(A2)        (M-M+M) (M-M) Result


 • Once again: In-order issue, out-of-order execution
   and out-of-order completion.
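
The Issue / Exec Comp / Write Result columns of this example can be reproduced with a short calculation. The Python sketch below is a simplification under stated assumptions: one instruction issues per cycle, a reservation station or load buffer is always free, no two results compete for the CDB in the same cycle, and the load latency of 2 clocks is inferred from the table rather than stated on the slides. `schedule` is an illustrative name.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers) in program order.
PROGRAM = [
    ("LD",    "F6",  []),
    ("LD",    "F2",  []),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    last_write = {}   # register -> cycle in which its latest producer writes the CDB
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1
        # Operands are captured at issue (renaming), so only producers that
        # come earlier in program order matter: no WAR or WAW stalls.
        ready = max([issue] + [last_write.get(s, 0) for s in srcs])
        comp = ready + LATENCY[op]
        write = comp + 1
        rows.append((op, dest, issue, comp, write))
        last_write[dest] = write
    return rows


for row in schedule(PROGRAM):
    print(row)
# ('LD', 'F6', 1, 3, 4)
# ('LD', 'F2', 2, 4, 5)
# ('MULTD', 'F0', 3, 15, 16)
# ('SUBD', 'F8', 4, 7, 8)
# ('DIVD', 'F10', 5, 56, 57)
# ('ADDD', 'F6', 6, 10, 11)
```

Note that DIVD takes F6 from the first load (written in cycle 4), not from the later ADDD, which is why ADDD can write F6 in cycle 11 long before DIVD finishes.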
                                                                                       71
Tomasulo Drawbacks

Complexity
  delays of 360/91, MIPS 10000, Alpha 21264,
  IBM PPC 620 in CA:AQA 2/e, but not in silicon!
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
  Each CDB must go to multiple functional units
  ⇒ high capacitance, high wiring density
  Number of functional units that can complete per
  cycle limited to one!
     Multiple CDBs ⇒ more FU logic for parallel assoc stores
Non-precise interrupts!
  We will address this later

                                                               72
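The one-result-per-cycle limit can be illustrated with a small Python sketch. It is only a model under stated assumptions: the function name `schedule_cdb`, the first-come arbitration order, and the example cycle numbers are hypothetical and are not taken from the 360/91 design.

```python
# Sketch: results that finish in the same cycle must take turns on the CDB.
# Arbitration policy (earliest finisher first, ties in listing order) is an assumption.

def schedule_cdb(ready_cycles, num_cdbs=1):
    """Map each unit to the cycle in which it gets a CDB slot."""
    used = {}     # cycle -> number of CDB slots already taken
    writes = {}   # unit name -> cycle of its CDB write
    for name, cycle in sorted(ready_cycles.items(), key=lambda kv: kv[1]):
        while used.get(cycle, 0) >= num_cdbs:
            cycle += 1            # bus busy: the result waits one more cycle
        used[cycle] = used.get(cycle, 0) + 1
        writes[name] = cycle
    return writes


finished = {"Add1": 10, "Mult1": 10, "Load1": 10}   # made-up finish cycles
print(schedule_cdb(finished))               # one CDB: writes are serialized
print(schedule_cdb(finished, num_cdbs=2))   # two CDBs: two writes per cycle
```

With one bus the three results are written in cycles 10, 11 and 12; a second bus shortens this, at the cost of a second set of tag comparators in every reservation station.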
Tomasulo Loop Example
  Loop: LD      F0     0(R1)
        MULTD   F4     F0    F2
        SD      F4     0     R1
        SUBI    R1     R1    #8
        BNEZ    R1     Loop


This time assume Multiply takes 4 clocks
Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
To be clear, will show clocks for SUBI, BNEZ
  Reality: integer instructions run ahead of the
  floating-point instructions
Show 2 iterations
                                                   73
           Loop Example
 Instruction status:                       Exec Write
    ITER Instruction       j     k   Issue Comp Result            Busy Addr      Fu
      1    LD      F0       0   R1                       Load1    No
      1    MULTD   F4      F0   F2                       Load2    No
      1    SD      F4       0   R1                       Load3    No
      2    LD      F0       0   R1                       Store1   No
      2    MULTD   F4      F0   F2                       Store2   No
      2    SD      F4       0   R1                       Store3   No
 Reservation Stations:                S1   S2   RS                Added Store Buffers
    Time   Name Busy       Op   Vj    Vk   Qj   Qk       Code:
           Add1  No                                     LD         F0       0    R1
           Add2  No                                     MULTD      F4      F0    F2
           Add3  No                                     SD         F4       0    R1
           Mult1 No                                     SUBI       R1      R1    #8
           Mult2 No                                     BNEZ       R1     Loop
 Register result status                                                 Instruction Loop
  Clock     R1            F0    F2   F4    F6   F8       F10 F12          ...    F30
      0      80    Fu

                       Value of Register used for address, iteration control
                                                                                           74
          Loop Example Cycle 1
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0    0    R1     1                 Load1    Yes    80
                                                       Load2    No
                                                       Load3    No
                                                       Store1   No
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1   S2   RS
   Time    Name Busy     Op   Vj    Vk   Qj   Qk       Code:
           Add1  No                                   LD         F0    0     R1
           Add2  No                                   MULTD      F4    F0    F2
           Add3  No                                   SD         F4    0     R1
           Mult1 No                                   SUBI       R1    R1    #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4    F6   F8       F10 F12        ...    F30
     1      80    Fu Load1


                                                                                   75
          Loop Example Cycle 2
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    No
                                                       Load3    No
                                                       Store1   No
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0     R1
           Add2  No                                   MULTD      F4    F0    F2
           Add3  No                                   SD         F4    0     R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1    #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     2      80    Fu Load1         Mult1



                                                                                   76
          Loop Example Cycle 3
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    No
                                                       Store1   Yes    80    Mult1
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     3      80    Fu Load1         Mult1

      Implicit renaming sets up data flow graph
                                                                                      77
          Loop Example Cycle 4
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    No
                                                       Store1   Yes    80    Mult1
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     4      80    Fu Load1         Mult1



                                                                                     78
          Loop Example Cycle 5
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    No
                                                       Store1   Yes    80    Mult1
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     5      72    Fu Load1         Mult1



                                                                                     79
          Loop Example Cycle 6
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    Yes    72
     1    SD       F4     0   R1     3                 Load3    No
     2    LD       F0     0   R1     6                 Store1   Yes    80    Mult1
                                                       Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     6      72    Fu Load2         Mult1



                                                                                     80
          Loop Example Cycle 7
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    Yes    72
     1    SD       F4     0   R1     3                 Load3    No
     2    LD       F0     0   R1     6                 Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   No
                                                       Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 Yes Multd         R(F2) Load2        BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     7      72    Fu Load2         Mult2



                                                                                     81
          Loop Example Cycle 8
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1                 Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    Yes    72
     1    SD       F4     0   R1     3                 Load3    No
     2    LD       F0     0   R1     6                 Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 Yes Multd         R(F2) Load2        BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     8      72    Fu Load2         Mult2



                                                                                     82
          Loop Example Cycle 9
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1     9           Load1    Yes    80
     1    MULTD    F4    F0   F2     2                 Load2    Yes    72
     1    SD       F4     0   R1     3                 Load3    No
     2    LD       F0     0   R1     6                 Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load1        SUBI       R1    R1     #8
           Mult2 Yes Multd         R(F2) Load2        BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     9      72    Fu Load2         Mult2



                                                                                     83
          Loop Example Cycle 10
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1     9    10     Load1    No
     1    MULTD    F4    F0   F2     2                 Load2    Yes    72
     1    SD       F4     0   R1     3                 Load3    No
     2    LD       F0     0   R1     6     10          Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
     4     Mult1 Yes Multd M[80] R(F2)                SUBI       R1    R1     #8
           Mult2 Yes Multd       R(F2) Load2          BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...    F30
     10     64    Fu Load2         Mult2



                                                                                     84
          Loop Example Cycle 11
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1     9    10     Load1    No
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10   11     Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
     3     Mult1 Yes Multd M[80] R(F2)                SUBI       R1    R1     #8
     4     Mult2 Yes Multd M[72] R(F2)                BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...    F30
     11     64    Fu Load3         Mult2



                                                                                     85
          Loop Example Cycle 12
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1     9    10     Load1    No
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10   11     Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
     2     Mult1 Yes Multd M[80] R(F2)                SUBI       R1    R1     #8
     3     Mult2 Yes Multd M[72] R(F2)                BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...    F30
     12     64    Fu Load3         Mult2



                                                                                     86
          Loop Example Cycle 13
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1     9    10     Load1    No
     1    MULTD    F4    F0   F2     2                 Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10   11     Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
     1     Mult1 Yes Multd M[80] R(F2)                SUBI       R1    R1     #8
     2     Mult2 Yes Multd M[72] R(F2)                BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...    F30
     13     64    Fu Load3         Mult2



                                                                                     87
          Loop Example Cycle 14
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1      9   10     Load1    No
     1    MULTD    F4    F0   F2     2     14          Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10   11     Store1   Yes    80    Mult1
     2    MULTD    F4    F0   F2     7                 Store2   Yes    72    Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
     0     Mult1 Yes Multd M[80] R(F2)                SUBI       R1    R1     #8
     1     Mult2 Yes Multd M[72] R(F2)                BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...    F30
     14     64    Fu Load3         Mult2



                                                                                     88
          Loop Example Cycle 15
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr     Fu
     1    LD       F0     0   R1     1      9   10     Load1    No
     1    MULTD    F4    F0   F2     2     14   15     Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10   11     Store1   Yes    80    [80]*R2
     2    MULTD    F4    F0   F2     7     15          Store2   Yes    72     Mult2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2   RS
   Time    Name Busy Op     Vj      Vk     Qj   Qk     Code:
           Add1  No                                   LD         F0    0       R1
           Add2  No                                   MULTD      F4    F0      F2
           Add3  No                                   SD         F4    0       R1
           Mult1 No                                   SUBI       R1    R1      #8
     0     Mult2 Yes Multd M[72] R(F2)                BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6   F8     F10 F12        ...     F30
     15     64    Fu Load3         Mult2



                                                                                       89
          Loop Example Cycle 16
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr     Fu
     1    LD       F0     0   R1     1      9    10    Load1    No
     1    MULTD    F4    F0   F2     2     14    15    Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10    11    Store1   Yes    80    [80]*R2
     2    MULTD    F4    F0   F2     7     15    16    Store2   Yes    72    [72]*R2
     2    SD       F4     0   R1     8                 Store3   No
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0       R1
           Add2  No                                   MULTD      F4    F0      F2
           Add3  No                                   SD         F4    0       R1
     4     Mult1 Yes Multd         R(F2) Load3        SUBI       R1    R1      #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...     F30
     16     64    Fu Load3         Mult1



                                                                                       90
          Loop Example Cycle 17
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr     Fu
     1    LD       F0     0   R1     1      9    10    Load1    No
     1    MULTD    F4    F0   F2     2     14    15    Load2    No
     1    SD       F4     0   R1     3                 Load3    Yes    64
     2    LD       F0     0   R1     6     10    11    Store1   Yes    80    [80]*R2
     2    MULTD    F4    F0   F2     7     15    16    Store2   Yes    72    [72]*R2
     2    SD       F4     0   R1     8                 Store3   Yes    64     Mult1
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0       R1
           Add2  No                                   MULTD      F4    F0      F2
           Add3  No                                   SD         F4    0       R1
           Mult1 Yes Multd         R(F2) Load3        SUBI       R1    R1      #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...     F30
     17     64    Fu Load3         Mult1



                                                                                       91
          Loop Example Cycle 18
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr     Fu
     1    LD       F0     0   R1     1      9    10    Load1    No
     1    MULTD    F4    F0   F2     2     14    15    Load2    No
     1    SD       F4     0   R1     3     18          Load3    Yes    64
     2    LD       F0     0   R1     6     10    11    Store1   Yes    80    [80]*R2
     2    MULTD    F4    F0   F2     7     15    16    Store2   Yes    72    [72]*R2
     2    SD       F4     0   R1     8                 Store3   Yes    64     Mult1
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0       R1
           Add2  No                                   MULTD      F4    F0      F2
           Add3  No                                   SD         F4    0       R1
           Mult1 Yes Multd         R(F2) Load3        SUBI       R1    R1      #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...     F30
     18     64    Fu Load3         Mult1



                                                                                       92
          Loop Example Cycle 19
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr     Fu
     1    LD       F0     0   R1     1      9    10    Load1    No
     1    MULTD    F4    F0   F2     2     14    15    Load2    No
     1    SD       F4     0   R1     3     18    19    Load3    Yes    64
     2    LD       F0     0   R1     6     10    11    Store1   No
     2    MULTD    F4    F0   F2     7     15    16    Store2   Yes    72    [72]*R2
     2    SD       F4     0   R1     8     19          Store3   Yes    64     Mult1
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0       R1
           Add2  No                                   MULTD      F4    F0      F2
           Add3  No                                   SD         F4    0       R1
           Mult1 Yes Multd         R(F2) Load3        SUBI       R1    R1      #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...     F30
     19     56    Fu Load3         Mult1



                                                                                       93
          Loop Example Cycle 20
Instruction status:     Exec Write
   ITER Instruction      j    k    Issue Comp Result            Busy Addr    Fu
     1    LD       F0     0   R1     1      9    10    Load1    Yes    56
     1    MULTD    F4    F0   F2     2     14    15    Load2    No
     1    SD       F4     0   R1     3     18    19    Load3    Yes    64
     2    LD       F0     0   R1     6     10    11    Store1   No
     2    MULTD    F4    F0   F2     7     15    16    Store2   No
     2    SD       F4     0   R1     8     19    20    Store3   Yes    64    Mult1
Reservation Stations:               S1     S2    RS
   Time    Name Busy Op       Vj    Vk     Qj    Qk    Code:
           Add1  No                                   LD         F0    0      R1
           Add2  No                                   MULTD      F4    F0     F2
           Add3  No                                   SD         F4    0      R1
           Mult1 Yes Multd         R(F2) Load3        SUBI       R1    R1     #8
           Mult2 No                                   BNEZ       R1   Loop
Register result status
 Clock      R1           F0   F2   F4      F6    F8    F10 F12        ...    F30
     20     56    Fu Load1         Mult1

• Once again: In-order issue, out-of-order execution
  and out-of-order completion.
                                                                                     94
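The multiply timings in the loop-example tables follow a simple rule that can be checked with a few lines of Python. This is a sketch of the arithmetic only, read off the tables (the countdown is loaded in the cycle the load writes F0 on the CDB); the function name `multiply_timing` is made up for the illustration.

```python
# Timing rule read off the loop-example tables (a simplification):
# the multiply countdown starts in the cycle its last operand is written
# on the CDB, and the result is written one cycle after execution completes.

MULT_LATENCY = 4  # "assume Multiply takes 4 clocks"

def multiply_timing(operand_write_cycle, latency=MULT_LATENCY):
    """Return (exec complete cycle, write result cycle) for a multiply."""
    exec_complete = operand_write_cycle + latency
    write_result = exec_complete + 1
    return exec_complete, write_result


print(multiply_timing(10))  # iteration 1: the load writes F0 in cycle 10
print(multiply_timing(11))  # iteration 2: the load writes F0 in cycle 11
```

The two results, (14, 15) and (15, 16), match the Exec Comp and Write Result columns for the two MULTD rows in the Cycle 20 table.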
Why can Tomasulo
overlap iterations of loops?

Register renaming
  Multiple iterations use different physical destinations
  for registers (dynamic loop unrolling)
Reservation stations
  Permit instruction issue to advance past integer
  control flow operations
  Also buffer old values of registers - totally avoiding the
  WAR stall that we saw in the scoreboard
Other perspective: Tomasulo builds the data flow
dependency graph on the fly


                                                               95
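The renaming idea can be sketched in Python. This is a minimal illustration, not the hardware: the `issue` helper, the list of station names, and the dictionary used as the register result status are assumptions made for the example. Each issued instruction is bound to a station tag, and later readers of its destination register pick up that tag instead of the register name.

```python
# Minimal sketch of Tomasulo-style renaming (hypothetical helper names).
# register_status maps an architectural register to the tag of the
# reservation station or load buffer that will produce its next value.

def issue(program, stations):
    """Rename sources to producer tags; return the renamed instructions."""
    register_status = {}
    renamed = []
    for (op, dest, sources), tag in zip(program, stations):
        # A source is a pending tag (Qj/Qk) or a ready register value (Vj/Vk).
        operands = [register_status.get(src, f"R({src})") for src in sources]
        renamed.append((tag, op, operands))
        if dest is not None:
            register_status[dest] = tag   # later readers of dest wait on this tag
    return renamed


# Loads and multiplies from two iterations of the loop body.
program = [
    ("LD",    "F0", []),
    ("MULTD", "F4", ["F0", "F2"]),
    ("LD",    "F0", []),
    ("MULTD", "F4", ["F0", "F2"]),
]
stations = ["Load1", "Mult1", "Load2", "Mult2"]

renamed = issue(program, stations)
for entry in renamed:
    print(entry)
```

Although both iterations name F0 and F4, the first multiply waits on Load1 and the second on Load2, so the two iterations no longer conflict. This matches the Qk entries of Mult1 and Mult2 in the Cycle 7 table.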
Tomasulo’s scheme offers 2 major
advantages

(1) Distribution of the hazard detection logic
  Distributed reservation stations and the CDB
  If multiple instructions are waiting on a single result, and
  each already has its other operand, then they can all
  be released simultaneously by the broadcast on the CDB
  If a centralized register file were used, the units would
  have to read their results from the registers when
  register buses are available
(2) Elimination of stalls for WAW and WAR
hazards


                                                              96
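Advantage (1) can be illustrated with a short Python sketch of a CDB broadcast. The dictionary layout, the `broadcast` function, and the operand values are assumptions made for the illustration; only the field names Vj, Vk, Qj, Qk come from the reservation station tables.

```python
# Sketch of a Common Data Bus broadcast (illustrative data layout, not real hardware).

def broadcast(stations, tag, value):
    """Deliver value to every station waiting on tag; return names now ready."""
    released = []
    for st in stations:
        hit = False
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if st[q] == tag:
                st[v], st[q] = value, None   # capture the value, clear the tag
                hit = True
        if hit and st["Qj"] is None and st["Qk"] is None:
            released.append(st["name"])
    return released


stations = [
    {"name": "Add1",  "Vj": None, "Vk": 2.0,  "Qj": "Load1", "Qk": None},
    {"name": "Mult1", "Vj": None, "Vk": 3.0,  "Qj": "Load1", "Qk": None},
    {"name": "Mult2", "Vj": None, "Vk": None, "Qj": "Load1", "Qk": "Add1"},
]

ready = broadcast(stations, "Load1", 10.0)
print(ready)   # stations that now hold both operands
```

Add1 and Mult1 are released by the same broadcast, while Mult2 captures the value but keeps waiting on Add1.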
Multiple Issue

Allow multiple instructions to issue in a single
clock cycle (CPI < 1)
Two flavors
  Superscalar
     Issue varying number of instructions per clock
     Can be statically (compiler tech.) or dynamically
     (Tomasulo) scheduled
  VLIW (Very Long Instruction Word)
     Issue a fixed number of instructions formatted as a
     single long instruction or as a fixed instruction
     packet
                                                           97
         Multiple Issue with Dynamic
         Scheduling
[Figure: Tomasulo organization for multiple issue. The instruction unit feeds the FP Op Queue and the FP Registers; Load Buffers Load1 to Load6 are filled from memory; Store Buffers Store1 to Store3 drain to memory; reservation stations Add1 to Add3 feed the FP adders and Mult1 to Mult2 feed the FP multipliers.]

               Issue: 2 instructions per clock cycle
                                                                                 98
Multiple Issue with Dynamic
Scheduling: An Example
   Loop:   L.D          F0, 0(R1)
           ADD.D        F4,F0,F2
           S.D          0(R1), F4
           DADDIU       R1,R1,#-8
           BNE          R1,R2,Loop

Assumptions:
2-issue processor: can issue any pair of instructions
if reservation stations are available
Resources: ALU (int + effective address),
a separate pipelined FP unit for each operation type,
branch prediction hardware, 1 CDB
2 cc for loads, 3 cc for FP Add
Branches single issue, branch prediction is perfect
                                                        99
        Execution in
        Dual-issue Tomasulo Pipeline
                                    Exe.      Mem.     Write
Iter. Inst.              Issue  (begins)    Access    at CDB    Comment
 1    L.D    F0,0(R1)        1         2         3         4    First issue
 1    ADD.D  F4,F0,F2        1         5                   8    Wait for L.D
 1    S.D    0(R1),F4        2         3         9              Wait for ADD.D
 1    DADDIU R1,R1,#-8       2         4                   5    Wait for ALU
 1    BNE    R1,R2,Loop      3         6                        Wait for DADDIU
 2    L.D    F0,0(R1)        4         7         8         9    Wait for BNE
 2    ADD.D  F4,F0,F2        4        10                  13    Wait for L.D
 2    S.D    0(R1),F4        5         8        14              Wait for ADD.D
 2    DADDIU R1,R1,#-8       5         9                  10    Wait for ALU
 2    BNE    R1,R2,Loop      6        11                        Wait for DADDIU
 3    L.D    F0,0(R1)        7        12        13        14    Wait for BNE
 3    ADD.D  F4,F0,F2        7        15                  18    Wait for L.D
 3    S.D    0(R1),F4        8        13        19              Wait for ADD.D
 3    DADDIU R1,R1,#-8       8        14                  15    Wait for ALU
 3    BNE    R1,R2,Loop      9        16                        Wait for DADDIU
                                                                            100
Multiple Issue with Dynamic
Scheduling: Resource Usage
Clock   Int ALU     FP ALU     Data Cache   CDB
  2     1/L.D
  3     1/S.D                  1/L.D
  4     1/DADDIU                            1/L.D
  5                 1/ADD.D                 1/DADDIU
  6
  7     2/L.D
  8     2/S.D                  2/L.D        1/ADD.D
  9     2/DADDIU               1/S.D        2/L.D
 10                 2/ADD.D                 2/DADDIU
 11
 12     3/L.D
 13     3/S.D                  3/L.D        2/ADD.D
 14     3/DADDIU               2/S.D        3/L.D
 15                 3/ADD.D                 3/DADDIU
 16
 17
 18                                         3/ADD.D
 19                            3/S.D
                                                                101
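The resource usage table can be derived mechanically from the timing table. The sketch below is illustrative only: the unit assignment in EXEC_UNIT is an assumption read off the resource table (BNE is omitted because the table does not list it), and the names ROWS, EXEC_UNIT and resource_usage are hypothetical.

```python
from collections import defaultdict

# Timing rows copied from the dual-issue table (one integer ALU, one CDB):
# (iteration, op, execute_start, mem_access, cdb_write); None = stage unused.
ROWS = [
    (1, "L.D", 2, 3, 4), (1, "ADD.D", 5, None, 8), (1, "S.D", 3, 9, None),
    (1, "DADDIU", 4, None, 5), (1, "BNE", 6, None, None),
    (2, "L.D", 7, 8, 9), (2, "ADD.D", 10, None, 13), (2, "S.D", 8, 14, None),
    (2, "DADDIU", 9, None, 10), (2, "BNE", 11, None, None),
    (3, "L.D", 12, 13, 14), (3, "ADD.D", 15, None, 18), (3, "S.D", 13, 19, None),
    (3, "DADDIU", 14, None, 15), (3, "BNE", 16, None, None),
]

# Assumption: the unit each op occupies in its first execute cycle.
# BNE is left out because the resource table does not list it.
EXEC_UNIT = {"L.D": "Int ALU", "S.D": "Int ALU",
             "DADDIU": "Int ALU", "ADD.D": "FP ALU"}


def resource_usage(rows):
    """Return {clock: {resource: "iter/op"}}, failing on a double booking."""
    usage = defaultdict(dict)

    def claim(clock, resource, tag):
        if resource in usage[clock]:
            raise ValueError(f"structural conflict on {resource} at clock {clock}")
        usage[clock][resource] = tag

    for iteration, op, exe, mem, cdb in rows:
        tag = f"{iteration}/{op}"
        if op in EXEC_UNIT:
            claim(exe, EXEC_UNIT[op], tag)
        if mem is not None:
            claim(mem, "Data Cache", tag)
        if cdb is not None:
            claim(cdb, "CDB", tag)
    return usage


usage = resource_usage(ROWS)
for clock in sorted(usage):
    print(clock, usage[clock])
```

Because no ValueError is raised, each resource is claimed at most once per clock, which is consistent with the single ALU and single CDB being the limiting resources in this configuration.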
Multiple Issue with Dynamic
Scheduling

DADDIU waits for the integer ALU, which S.D is using
for its effective address calculation
  Add one ALU dedicated to
  effective address calculation
  Use 2 CDBs
Draw the timing table for this dual-issue version of
Tomasulo’s pipeline




                                           102
        Multiple Issue with Dynamic
        Scheduling
                                    Exe.      Mem.     Write
Iter. Inst.              Issue  (begins)    Access    at CDB    Comment
 1    L.D    F0,0(R1)        1         2         3         4    First issue
 1    ADD.D  F4,F0,F2        1         5                   8    Wait for L.D
 1    S.D    0(R1),F4        2         3         9              Wait for ADD.D
 1    DADDIU R1,R1,#-8       2         3                   4    Executes earlier
 1    BNE    R1,R2,Loop      3         5                        Wait for DADDIU
 2    L.D    F0,0(R1)        4         6         7         8    Wait for BNE
 2    ADD.D  F4,F0,F2        4         9                  12    Wait for L.D
 2    S.D    0(R1),F4        5         7        13              Wait for ADD.D
 2    DADDIU R1,R1,#-8       5         6                   7    Executes earlier
 2    BNE    R1,R2,Loop      6         8
 3    L.D    F0,0(R1)        7         9        10        11    Wait for BNE
 3    ADD.D  F4,F0,F2        7        12                  15
 3    S.D    0(R1),F4        8        10        16
 3    DADDIU R1,R1,#-8       8         9                  10
 3    BNE    R1,R2,Loop      9        11
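
The effect of the extra address adder and second CDB can be read off the two timing tables. A rough sketch of that arithmetic follows; it uses only the clock at which each iteration's S.D reaches memory. The names STORE_DONE and summarize are hypothetical, and the overall IPC figure covers just these three iterations including start-up, so it is not a steady-state number.

```python
# Clock at which each iteration's S.D performs its memory access,
# copied from the two dual-issue timing tables.
STORE_DONE = {
    "1 ALU, 1 CDB": [9, 14, 19],
    "ALU + address adder, 2 CDBs": [9, 13, 16],
}
INSTRUCTIONS_PER_ITERATION = 5  # L.D, ADD.D, S.D, DADDIU, BNE


def summarize(store_done):
    """Return (clocks between consecutive iterations, overall IPC)."""
    gaps = [later - earlier for earlier, later in zip(store_done, store_done[1:])]
    total_instructions = INSTRUCTIONS_PER_ITERATION * len(store_done)
    return gaps, total_instructions / store_done[-1]


for name, clocks in STORE_DONE.items():
    gaps, ipc = summarize(clocks)
    print(f"{name}: iteration gaps {gaps}, overall IPC {ipc:.2f}")
```

With one ALU and one CDB the iterations finish 5 clocks apart; with the added hardware the gaps shrink to 4 and then 3 clocks, and the three iterations finish at clock 16 instead of 19.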
                                                                               103
      Multiple Issue with Dynamic
      Scheduling: Resource Usage
Clock   Int ALU     Addr. Adder   FP ALU     Data Cache   CDB#1       CDB#2
  2                 1/L.D
  3     1/DADDIU    1/S.D                    1/L.D
  4                                                       1/L.D       1/DADDIU
  5                               1/ADD.D
  6     2/DADDIU    2/L.D
  7                 2/S.D                    2/L.D        2/DADDIU
  8                                                       1/ADD.D     2/L.D
  9     3/DADDIU    3/L.D         2/ADD.D    1/S.D
 10                 3/S.D                    3/L.D        3/DADDIU
 11                                                       3/L.D
 12                               3/ADD.D                 2/ADD.D
 13                                          2/S.D
 14
 15                                                       3/ADD.D
 16                                          3/S.D



                                                                              104
What about Precise Interrupts?

Tomasulo had:
In-order issue, out-of-order execution, and
out-of-order completion
Need to “fix” the out-of-order completion
aspect so that we can find a precise breakpoint
in the instruction stream




                                                105
Hardware-based Speculation

With wide-issue processors, control
dependences become a burden, even with
sophisticated branch predictors
Speculation: speculate on the outcome of
branches and execute the program as if our
guesses were correct => need a mechanism
to handle situations in which the speculation
was incorrect



                                             106
Relationship between
precise interrupts and speculation

Speculation is a form of guessing
Important for branch prediction:
  Need to “take our best shot” at predicting
  branch direction
If we speculate and are wrong, need to back
up and restart execution from the point at which we
predicted incorrectly:
  This is exactly the same as for precise exceptions!
Technique for both precise
interrupts/exceptions and speculation:
in-order completion or commit
                                                107
   HW support for precise interrupts
Need HW buffer for results of uncommitted instructions:
reorder buffer (ROB)
  4 fields: instr. type, destination, value, ready
  Use reorder buffer number instead
  of reservation station
  when execution completes
  Supplies operands between
  execution complete & commit
  (Reorder buffer can be operand
  source => more registers, like RS)
  Instructions commit in order
  Once instruction commits,
  result is put into register
  As a result, easy to undo
  speculated instructions
  on mispredicted branches
  or exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, and two sets of
reservation stations, each feeding an FP adder.]
                                                                       108
Four Steps of Speculative Tomasulo
Algorithm
1. Issue—get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr &
   send operands & reorder buffer no. for destination (this stage
   sometimes called “dispatch”)
2. Execution—operate on operands (EX)
   When both operands ready then execute; if not ready, watch
   CDB for result; when both in reservation station, execute; checks
   RAW (sometimes called “issue”)
3. Write result—finish execution (WB)
   Write on Common Data Bus to all awaiting FUs
   & reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
   When instr. at head of reorder buffer & result present, update
   register with result (or store to memory) and remove instr from
   reorder buffer. Mispredicted branch flushes reorder buffer
   (sometimes called “graduation”)
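
The commit step above can be sketched in a few lines. This is a simplified model under stated assumptions, not the hardware design: reservation stations and the CDB are collapsed into a direct write_result call, at most one instruction commits per call, and the names ROBEntry and ReorderBuffer are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ROBEntry:
    kind: str                  # "reg" (writes a register), "store", or "branch"
    dest: object = None        # register name or memory address
    value: object = None
    ready: bool = False        # set when the result has been written
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head = oldest instruction
        self.registers = {}
        self.memory = {}

    def issue(self, kind, dest=None):
        """Allocate a slot in program order; None means issue must stall."""
        if len(self.entries) == self.size:
            return None
        entry = ROBEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Write result: may happen out of order, state is not yet updated."""
        entry.value, entry.ready, entry.mispredicted = value, True, mispredicted

    def commit(self):
        """Retire the head entry if its result is present, else do nothing."""
        if not self.entries or not self.entries[0].ready:
            return None
        head = self.entries.popleft()
        if head.kind == "branch":
            if head.mispredicted:
                self.entries.clear()  # flush all younger, speculative entries
        elif head.kind == "store":
            self.memory[head.dest] = head.value
        else:
            self.registers[head.dest] = head.value
        return head


rob = ReorderBuffer(size=4)
load = rob.issue("reg", "F0")
branch = rob.issue("branch")
add = rob.issue("reg", "F4")       # issued speculatively after the branch
rob.write_result(add, 3.0)         # completes first, out of order
blocked = rob.commit()             # None: head (F0) has no result yet
rob.write_result(load, 1.0)
rob.write_result(branch, mispredicted=True)
rob.commit()                       # F0 commits
rob.commit()                       # branch commits and flushes F4
print(rob.registers)
```

The speculative write to F4 never reaches the register file, because the mispredicted branch ahead of it flushes the buffer when it commits.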
                                                                        109
           What are the hardware complexities
           with reorder buffer (ROB)?
           How do you find the latest version of a register?
                    (As specified by Smith paper) need associative comparison network
                    Could use future file or just use the register result status buffer to track
                    which specific reorder buffer entry has received the value
           Need as many ports on ROB as register file

[Figure: reorder table with Dest Reg, Result, Exceptions?, Valid, and
Program Counter fields, searched through a comparison network; datapath
with FP Op Queue, Reorder Buffer, FP Regs, and two sets of reservation
stations, each feeding an FP adder.]
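
The two ways of finding the latest version of a register can be contrasted in a short sketch. It is illustrative only: the reorder buffer is modeled as a plain list ordered oldest to youngest, and the names latest_by_search and RegisterStatus are hypothetical.

```python
def latest_by_search(rob, reg):
    """Associative lookup: scan from the youngest entry to the oldest."""
    for index in range(len(rob) - 1, -1, -1):
        if rob[index]["dest"] == reg:
            return index
    return None  # no in-flight writer: read the register file instead


class RegisterStatus:
    """Table lookup: remember the ROB slot of each register's newest writer."""

    def __init__(self):
        self.producer = {}

    def on_issue(self, reg, rob_index):
        self.producer[reg] = rob_index

    def on_commit(self, reg, rob_index):
        # Clear only if no younger instruction has claimed the register since.
        if self.producer.get(reg) == rob_index:
            del self.producer[reg]

    def latest(self, reg):
        return self.producer.get(reg)


# Two in-flight writers of F0: both methods must pick the younger one.
rob = [{"dest": "F0"}, {"dest": "F4"}, {"dest": "F0"}]
status = RegisterStatus()
for index, entry in enumerate(rob):
    status.on_issue(entry["dest"], index)
print(latest_by_search(rob, "F0"), status.latest("F0"))  # prints "2 2"
```

The search compares the register name against every entry, which is what the comparison network does in parallel in hardware; the status table replaces that with one indexed read at the cost of an update on every issue and commit.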


                                                                                                            110
Summary
Reservation stations: implicit register renaming to a larger set
of registers + buffering of source operands
   Prevents registers from becoming a bottleneck
   Avoids WAR, WAW hazards of Scoreboard
   Allows loop unrolling in HW
Not limited to basic blocks
(integer unit gets ahead, beyond branches)
Today, helps with cache misses as well
   Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
Lasting Contributions
   Dynamic scheduling
   Register renaming
   Load/store disambiguation
360/91 descendants include Pentium III; PowerPC 604; MIPS
R10000; HP-PA 8000; Alpha 21264
                                                                        111

				