

          ILP and Dynamic Execution: Branch Prediction
                         & Multiple Issue

                      Original by
               Prof. David A. Patterson

                             Review Tomasulo
          • Reservation stations: implicit register renaming to
            larger set of registers + buffering source operands
             – Prevents registers from becoming a bottleneck
             – Avoids WAR, WAW hazards of Scoreboard
             – Allows loop unrolling in HW
          • Not limited to basic blocks
            (integer unit gets ahead, beyond branches)
          • Lasting Contributions
             – Dynamic scheduling
             – Register renaming
          • 360/91 descendants are Pentium III; PowerPC 604;
            MIPS R10000; HP-PA 8000; Alpha 21264

              Tomasulo Algorithm and Branch

          • 360/91 predicted branches, but did not
            speculate: pipeline stopped until the branch
            was resolved
             – No speculation; only instructions that can complete
          • Speculation with Reorder Buffer allows
            execution past branch, and then discards the
            results if the branch was mispredicted
             – just need to hold instructions in buffer until branch can
               commit

             Case for Branch Prediction when
           Issue N instructions per clock cycle

          1. Branches will arrive up to n times faster in
             an n-issue processor
          2. Amdahl’s Law => relative impact of the
             control stalls will be larger with the lower
             potential CPI in an n-issue processor
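The Amdahl's Law point can be illustrated with a small calculation. This is only a sketch: the branch frequency, misprediction rate, and penalty below are hypothetical values chosen for illustration, not figures from the slides.

```python
def effective_cpi(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Ideal CPI plus the average control-stall cycles per instruction."""
    return ideal_cpi + branch_freq * mispredict_rate * penalty_cycles


def stall_share(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Fraction of the effective CPI that is spent on control stalls."""
    cpi = effective_cpi(ideal_cpi, branch_freq, mispredict_rate, penalty_cycles)
    return (cpi - ideal_cpi) / cpi


# Hypothetical workload: 20% branches, 10% mispredicted, 3-cycle penalty.
for n in (1, 2, 4):
    ideal = 1.0 / n                      # best-case CPI of an n-issue machine
    share = stall_share(ideal, 0.20, 0.10, 3)
    print(f"{n}-issue: ideal CPI {ideal:.2f}, stall share {share:.1%}")
```

The stall cycles per instruction stay the same, so their share of the total grows as the ideal CPI shrinks.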

               7 Branch Prediction Schemes

          1. 1-bit Branch-Prediction Buffer
          2. 2-bit Branch-Prediction Buffer
          3. Correlating Branch Prediction (two-level)
          4. Tournament Branch Predictor
          5. Branch Target Buffer
          6. Integrated Instruction Fetch Units
          7. Return Address Predictors

                     Dynamic Branch Prediction

          • Performance = ƒ(accuracy, cost of misprediction)
          • Branch History Table (BHT): Lower bits of PC
            address index table of 1-bit values
             – Says whether or not branch taken last time
             – No address check (saves HW, but may not be right branch)
          • Problem: in a loop, 1-bit BHT will cause
            2 mispredictions (avg is 9 iterations before exit):
             – End of loop case, when it exits instead of looping as before
             – First time through loop on next time through code, when it
               predicts exit instead of looping
             – Only 80% accuracy even if loop 90% of the time
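The loop case above can be checked with a short simulation. This is a minimal sketch that assumes a single loop branch with its own 1-bit entry (no aliasing), taken 9 times and then not taken on exit, with the loop re-entered many times.

```python
def one_bit_accuracy(outcomes, initial=True):
    """Predict each outcome as whatever the branch did last time."""
    prediction, correct = initial, 0
    for taken in outcomes:
        correct += (prediction == taken)
        prediction = taken               # 1-bit state: remember last outcome only
    return correct / len(outcomes)


# Loop branch: taken 9 times, then not taken on exit; repeated 100 times.
history = ([True] * 9 + [False]) * 100
print(f"{one_bit_accuracy(history):.1%}")
```

After the first pass, every pass through the loop costs two mispredictions out of ten branches, which is where the 80% figure comes from.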

                 Dynamic Branch Prediction
                              (Jim Smith, 1981)

     • Solution: 2-bit scheme that changes the prediction only
       after it mispredicts twice (Figure 3.7, p. 249)

       [Figure: 2-bit prediction state diagram with two "Predict Taken"
        states and two "Predict Not Taken" states; transitions are
        labeled T (taken) and NT (not taken)]

     • Red: stop, not taken
     • Green: go, taken
     • Adds hysteresis to decision making process
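A 2-bit scheme can be sketched as a saturating counter. This is an assumption for illustration: the exact state transitions in the referenced figure may differ from a plain saturating counter, and the loop pattern (9 taken, 1 not taken) is the same hypothetical one used for the 1-bit case.

```python
def two_bit_accuracy(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


history = ([True] * 9 + [False]) * 100
print(f"{two_bit_accuracy(history):.1%}")
```

The single not-taken exit only weakens the counter, so the first iteration of the next pass is still predicted taken: one misprediction per pass instead of two.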
          Prediction Accuracy of a 4096-entry 2-bit
            Prediction Buffer vs. an Infinite Buffer

                          BHT Accuracy

          • Mispredict because either:
            – Wrong guess for that branch
            – Got branch history of wrong branch when indexing the table
          • 4096 entry table programs vary from 1%
            misprediction (nasa7, tomcatv) to 18%
            (eqntott), with spice at 9% and gcc at 12%
          • For SPEC92,
            4096 about as good as infinite table

              Improve Prediction Strategy By
                   Correlating Branches
          • Consider the worst case for the 2-bit predictor
             if (aa==2) then aa=0;            if the first 2 fail then the 3rd
             if (bb==2) then bb=0;            will always be taken
              if (aa != bb) then whatever
              – single level predictors can never get this case
          • Correlating or 2-level predictors
              – Correlation = what happened on the last branch
                  » Note that the last correlator branch may not always be
                    the same
              – Predictor = which way to go
                  » 4 possibilities: which way the last one went chooses the
                    prediction
                       •   (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)

                                       From 柯皓仁, 交通大學
                     Correlating Branches

          • Hypothesis: recently executed branches are
            correlated; that is, behavior of recently
            executed branches affects prediction of
            current branch
          • Idea: record m most recently executed
            branches as taken or not taken, and use
            that pattern to select the proper branch
            history table
          • In general, (m,n) predictor means record
            last m branches to select between 2^m
            history tables each with n-bit counters
            – Old 2-bit BHT is then a (0,2) predictor

               Example of Correlating Branch
   if (d==0)            BNEZ   R1, L1        ;branch b1 (d!=0)
      d = 1;            DADDIU R1, R0, #1    ;d==0, so d=1
   if (d==1)       L1:  DADDIU R3, R1, #-1
      …                 BNEZ   R3, L2        ;branch b2 (d!=1)

4/16/06                  From 柯皓仁, 交通大學
                      A Problem for 1-bit Predictor
                           without Correlators
     initial       d!=0?    b1           value of d    d!=1?    b2
     value of d                          before b2
     0             NO       not taken    1             NO       not taken
     1             YES      taken        1             NO       not taken
     2             YES      taken        2             YES      taken

     d=?    b1            b1 action    New b1        b2            b2 action    New b2
            prediction                 prediction    prediction                 prediction
     2      NT            T            T             NT            T            T
     0      T             NT           NT            T             NT           NT
     2      NT            T            T             NT            T            T
     0      T             NT           NT            T             NT           NT
               Example of Correlating Branch
                  Predictors (1,1) (Cont.)
            Prediction        Prediction if last        Prediction if last
               bits           branch not taken            branch taken
             NT/NT                   NT                        NT
              NT/T                    NT                        T
              T/NT                    T                        NT
               T/T                    T                         T

    d=?      b1          b1 action    New b1          b2      b2 action       New b2
          prediction                 prediction    prediction                prediction
      2    NT/NT            T          T/NT         NT/NT           T          NT/T
      0     T/NT           NT          T/NT          NT/T           NT         NT/T
      2     T/NT            T          T/NT          NT/T           T          NT/T
      0     T/NT           NT          T/NT          NT/T           NT         NT/T
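Both tables can be reproduced with a short simulation of branches b1 and b2. This is a sketch under stated assumptions: all prediction bits start as not taken, the "last branch" is initially treated as not taken, and the function name `mispredictions` is made up for illustration.

```python
def mispredictions(d_values, correlating):
    """Count mispredictions of b1 and b2 over a sequence of initial d values."""
    # table[branch][last_outcome] holds a 1-bit prediction (False = not taken).
    table = {"b1": [False, False], "b2": [False, False]}
    last = False                         # outcome of the most recent branch
    misses = 0
    for d in d_values:
        b1_taken = (d != 0)              # b1: BNEZ R1, L1
        if not b1_taken:
            d = 1                        # fall through: d = 1
        b2_taken = (d != 1)              # b2: BNEZ R3, L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last) if correlating else 0
            misses += (table[name][slot] != taken)
            table[name][slot] = taken    # 1-bit update
            last = taken
    return misses


print(mispredictions([2, 0, 2, 0], correlating=False))  # 8: every branch missed
print(mispredictions([2, 0, 2, 0], correlating=True))   # 2: only the first d=2 pass
```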
              In general: (m,n) BHT (prediction buffer)

          •   p bits of buffer index = 2^p entries of BHT
          •   Use last m branches = global branch history
          •   Use n bit predictor
          •   Total bits for the (m, n) BHT prediction buffer:
                      $\text{Total memory bits} = 2^{m} \times n \times 2^{p}$
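The storage formula is easy to evaluate for concrete sizes. A minimal sketch; the function name `bht_bits` is made up for illustration.

```python
def bht_bits(m, n, p):
    """Total storage of an (m, n) predictor with 2**p entries per history table."""
    return (2 ** m) * n * (2 ** p)


print(bht_bits(0, 2, 12))   # 4096-entry 2-bit BHT, i.e. a (0,2) predictor: 8192 bits
print(bht_bits(2, 2, 5))    # (2,2) predictor, 4 banks of 32 entries: 256 bits
```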

             (2,2) Predictor Implementation
             4 banks = each with 32 2-bit predictor entries



          Accuracy of Different Schemes

                     Tournament Predictors

          • Adaptively combine local and global
             – Multiple predictors
                » One based on global information: Results of
                  recently executed m branches
                » One based on local information: Results of past
                  executions of the current branch instruction
             – Selector to choose which predictors to use
          • Advantage
             – Ability to select the right predictor for the right

          Misprediction Rate Comparison

                Branch Target Buffer (BTB)

          • To reduce the branch penalty to 0
             – Need to know what the address is by the end of IF
             – But the instruction is not even decoded yet
             – So use the instruction address rather than wait for decode
                » If prediction works then penalty goes to 0!
          • BTB Idea -- Cache to store taken branches (no
            need to store untaken)
             – Match tag is instruction address → compare with current PC
             – Data field is the predicted PC
          • May want to add predictor field
             – To avoid the mispredict twice on every loop phenomenon
             – Adds complexity since we now have to track untaken branches
               as well

                                    Need Address
                              at Same Time as Prediction
 • Branch Target Buffer (BTB): Address of branch index to get
   prediction AND branch address (if taken)
      – Note: must check for branch match now, since can’t use wrong branch address
        (Figure 3.19, p. 262)

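  [Figure 3.19: Branch target buffer. The PC of the instruction being fetched
   is compared (=?) against the stored branch PCs; each entry holds the
   branch PC, the predicted PC, and extra prediction state bits.
     – No: branch not predicted, proceed normally (Next PC = PC+4)
     – Yes: instruction is a branch; use the predicted PC as the next PC]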
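The lookup can be sketched as a small direct-mapped table. This is a minimal sketch that stores only taken branches; the entry count, the 4-byte instruction size, and the class name are assumptions for illustration.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries    # each slot: (branch_pc, predicted_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte, word-aligned instructions

    def next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # tag match: predicted-taken branch
            return slot[1]
        return pc + 4                    # no match: proceed normally

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target) # remember the taken branch and its target
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None         # branch fell through: drop its entry


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x40)))            # no entry yet: PC + 4
btb.update(0x40, taken=True, target=0x100)
print(hex(btb.next_pc(0x40)))            # tag matches: predicted PC
```

The tag comparison is what keeps a different instruction that maps to the same slot from using the wrong target.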
           Integrated Instruction Fetch Units

          • Integrated branch prediction
             – The branch predictor becomes part of the instruction
               fetch unit and is constantly predicting branches
          • Instruction prefetch
             – To deliver multiple instructions per clock, the
               instruction fetch unit will likely need to fetch ahead
          • Instruction memory access and buffering

                   Return Address Predictor

          • Indirect jump – jumps whose destination address
            varies at run time
             – indirect procedure call, select or case, procedure return
             – SPEC89 benchmarks: 85% of indirect jumps are procedure
               returns
          • Accuracy of BTB for procedure returns is low
             – if procedure is called from many places, and the calls from
               one place are not clustered in time
          • Use a small buffer of return addresses operating as
            a stack
             – Cache the most recent return addresses
             – Push a return address at a call, and pop one off at a return
             – If the cache is sufficiently large (max call depth) → perfect
               prediction
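The push/pop behaviour can be sketched as a bounded stack. A minimal sketch: the depth and the policy of dropping the oldest entry on overflow are assumptions for illustration.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: the oldest return address is lost
        self.stack.append(return_address)

    def predict_return(self):
        """Return the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for address in (0x104, 0x208, 0x30C):    # three nested calls, only two slots
    ras.call(address)
print(ras.predict_return() == 0x30C)     # innermost return predicted correctly
print(ras.predict_return() == 0x208)
print(ras.predict_return())              # outermost address was overwritten
```

When the call depth exceeds the buffer size, the outermost returns are the ones that mispredict.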

          Dynamic Branch Prediction Summary

          • Branch History Table: 2 bits for loop
          • Correlation: Recently executed branches
            correlated with next branch.
            – Either different branches
            – Or different executions of same branches
          • Tournament Predictor: give more resources to
            competing predictors and pick between them
          • Branch Target Buffer: include branch
            address & prediction
          • Return address stack for prediction of
            indirect jump

                     Getting CPI < 1:
            Issuing Multiple Instructions/Cycle
          • Vector Processing
             – Explicit coding of independent loops as operations on large
               vectors of numbers
             – Multimedia instructions being added to many processors
          • Superscalar
             – varying number instructions/cycle (1 to 8), scheduled by
               compiler or by HW (Tomasulo)
             – IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
          • (Very) Long Instruction Words (V)LIW:
             – fixed number of instructions (4-16) scheduled by the
               compiler; put operations into wide templates
             – Intel Architecture-64 (IA-64) 64-bit address
                  » Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
          • Anticipated success of multiple instructions led to
            use of Instructions Per Clock cycle (IPC) instead of CPI

                        Getting CPI < 1: Issuing
                       Multiple Instructions/Cycle
  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything
          – Fetch 64-bits/clock cycle; Int on left, FP on right

     Type                 Pipe Stages
     Int. instruction     IF   ID   EX   MEM  WB
     FP ADD instruction   IF   ID   EX   EX   EX   WB
     Int. instruction          IF   ID   EX   MEM  WB
     FP ADD instruction        IF   ID   EX   EX   EX   WB
     Int. instruction               IF   ID   EX   MEM  WB
     FP ADD instruction             IF   ID   EX   EX   EX   WB

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only
    for programs with:
          – Exactly 50% FP operations AND No hazards

                       Multiple Issue Issues

          • Issue packet: group of instructions from fetch
            unit that could potentially issue in 1 clock
             – If instruction causes structural hazard or a data hazard
               either due to earlier instruction in execution or to earlier
               instruction in issue packet, then instruction does not issue
             – 0 to N instruction issues per clock cycle, for N-issue
          • Performing issue checks in 1 cycle could limit
            clock cycle time
             – Issue stage usually split and pipelined
             – 1st stage decides how many instructions from within this
               packet can issue, 2nd stage examines hazards among selected
               instructions and those already issued
             – higher branch penalties => prediction accuracy important
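The issue check on a packet can be sketched as below. This is a simplified sketch: it assumes in-order issue (the first blocked instruction also blocks everything after it), checks only register data hazards, and ignores structural hazards; the tuple format and the name `issue_count` are made up for illustration.

```python
def issue_count(packet, busy_regs=frozenset()):
    """packet: list of (dest, sources) in program order.

    busy_regs: destination registers of earlier instructions still in execution.
    Returns how many leading instructions of the packet can issue this cycle.
    """
    pending = set(busy_regs)             # registers whose new value is not ready yet
    issued = 0
    for dest, sources in packet:
        if any(src in pending for src in sources):   # RAW hazard
            break
        if dest is not None and dest in pending:     # WAW hazard
            break
        if dest is not None:
            pending.add(dest)
        issued += 1
    return issued


packet = [("R1", ("R2", "R3")),          # R1 = R2 op R3
          ("R4", ("R1",)),               # needs R1 from the same packet
          ("R5", ("R6",))]
print(issue_count(packet))               # 1: the second instruction must wait
```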

            Studies of The Limitations of ILP

          • Perfect hardware model
             – Rename as much as you need
                 » Infinite registers
                 » No WAW or WAR
             – Branch prediction is perfect
                 » Never happens in reality
             – Jump prediction (even computed such as return) is perfect
                 » Similarly unreal
             – Perfect memory disambiguation
                 » Almost perfect is not too hard in practice
             – Issue an unlimited number of instructions at once & no
               restriction on types of instructions issued
             – One-cycle latency

          What A Perfect Processor Must Do?

          • Look arbitrarily far ahead to find a set of
            instructions to issue, predicting all branches
          • Rename all register uses to avoid WAW and WAR
          • Determine whether there are any dependences among
            the instructions in the issue packet; if so, rename
          • Determine if any memory dependences exist among
            the issuing instructions and handle them appropriately
          • Provide enough replicated function units to allow all
            the ready instructions to issue

          How many instructions would issue on
           the perfect machine every cycle?

                               Window Size

          • The set of instructions that is examined for
            simultaneous execution is called the window
          • The window size will be determined by the cost of
            determining whether n issuing instructions have any
            register dependences among them
             – In theory, this cost will be O(n^2); refer to p.243, why?
          • Each instruction in the window must be kept in the processor
          • Window size is limited by the required storage, the
            comparisons, and a limited issue rate
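One way to see the O(n^2) growth is to count the register comparisons directly. This is a sketch under a simplifying assumption: every instruction has two register sources, each compared against the destination of every earlier instruction in the window.

```python
def dependence_comparisons(n, sources_per_instr=2):
    """Comparisons needed to check n instructions for register dependences."""
    # Instruction i (0-based) has i earlier instructions to compare against.
    return sum(sources_per_instr * i for i in range(n))


for n in (4, 50, 2000):
    print(n, dependence_comparisons(n))  # equals n*n - n with two sources each
```

With two sources per instruction the count is n^2 - n, so a 2000-instruction window needs close to 4 million comparisons.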

          Effects of Reducing the Window

              Effects of Realistic Branch and
                     Jump Prediction
          • Perfect
          • Tournament-based
             – Uses a correlating 2 bit and non-correlating 2 bit plus a
               selector to choose between the two
             – Prediction buffer has 8K (13 address bits from the branch)
             – 3 entries per slot - non-correlating, correlating, select
          • Standard 2 bit
             – 512 (9 address bits) entries
             – Plus 16 entry buffer to predict RETURNS
          • Static
             – Based on profile - predict either T or NT but it stays fixed
          • None

          The Effect of Branch-Prediction

          Effects of Finite Registers

            Models for Memory Alias Analysis

          • Perfect
          • Global/Stack Perfect
             – Perfect prediction for global and stack references and
               assume all heap references conflict
             – The best that compiler-based analysis schemes can do
          • Inspection
             – Pointers to different allocation areas (such as the
               global area and the stack area) are assumed not to conflict
             – Addresses using the same register with different offsets
               are assumed not to conflict
          • None
             – All memory references are assumed to conflict

          Effects of Memory Aliasing

