
									    ILP and Dynamic Execution: Branch Prediction
                 & Multiple Issue




                      Original by
               Prof. David A. Patterson




4/16/06
                             Review Tomasulo
          • Reservation stations: implicit register renaming to
            larger set of registers + buffering source operands
             – Prevents registers from becoming a bottleneck
             – Avoids WAR, WAW hazards of Scoreboard
             – Allows loop unrolling in HW
          • Not limited to basic blocks
            (integer unit gets ahead, beyond branches)
          • Lasting Contributions
             – Dynamic scheduling
             – Register renaming
          • 360/91 descendants are Pentium III; PowerPC 604;
            MIPS R10000; HP-PA 8000; Alpha 21264


              Tomasulo Algorithm and Branch
                        Prediction

          • 360/91 predicted branches, but did not
            speculate: pipeline stopped until the branch
            was resolved
             – No speculation; only instructions that can complete
          • Speculation with Reorder Buffer allows
            execution past branch, and then discards the
            results if the branch was mispredicted
             – just need to hold instructions in buffer until branch can
               commit




             Case for Branch Prediction when
          Issuing N Instructions per Clock Cycle

          1. Branches will arrive up to n times faster in
             an n-issue processor
          2. Amdahl’s Law => relative impact of the
             control stalls will be larger with the lower
             potential CPI in an n-issue processor
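
To make point 2 concrete, here is a small worked example. The numbers (20% branches, 10% mispredicted, 3-cycle penalty) and the helper name `stall_fraction` are illustrative assumptions, not figures from the slides.

```python
def stall_fraction(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Fraction of the total CPI that is spent on branch misprediction stalls."""
    stall_cpi = branch_freq * mispredict_rate * penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


# Same branch behavior, two different ideal CPIs (assumed values).
single_issue = stall_fraction(base_cpi=1.0, branch_freq=0.2,
                              mispredict_rate=0.1, penalty_cycles=3)
four_issue = stall_fraction(base_cpi=0.25, branch_freq=0.2,
                            mispredict_rate=0.1, penalty_cycles=3)

print(f"1-issue: {single_issue:.1%} of cycles are control stalls")
print(f"4-issue: {four_issue:.1%} of cycles are control stalls")
```

Under these assumptions the same stall cycles cost roughly 6% of execution time on the single-issue machine and roughly 19% on the 4-issue machine.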




               7 Branch Prediction Schemes

          1. 1-bit Branch-Prediction Buffer
          2. 2-bit Branch-Prediction Buffer
          3. Correlating Branch Prediction (two-level)
             Buffer
          4. Tournament Branch Predictor
          5. Branch Target Buffer
          6. Integrated Instruction Fetch Units
          7. Return Address Predictors



                     Dynamic Branch Prediction

          • Performance = ƒ(accuracy, cost of misprediction)
          • Branch History Table (BHT): Lower bits of PC
            address index table of 1-bit values
             – Says whether or not branch taken last time
             – No address check (saves HW, but may not be right branch)
          • Problem: in a loop, 1-bit BHT will cause
            2 mispredictions (avg is 9 iterations before exit):
             – End of loop case, when it exits instead of looping as before
             – First time through loop on next time through code, when it
               predicts exit instead of looping
             – Only 80% accuracy even if the loop branch is taken 90% of the time
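
The 80% figure can be checked with a minimal sketch of a single 1-bit entry. The function name and the choice of 100 repetitions are assumptions made for illustration; the loop shape (9 taken iterations, then an exit) follows the slide.

```python
def one_bit_accuracy(outcomes, initial_prediction=False):
    """Simulate one 1-bit BHT entry: predict whatever the branch did last time."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # the single bit remembers only the last outcome
    return correct / len(outcomes)


# Loop branch: taken 9 times, not taken once at loop exit, repeated 100 times.
loop_branch = ([True] * 9 + [False]) * 100
accuracy = one_bit_accuracy(loop_branch)
print(f"1-bit accuracy: {accuracy:.0%}")
```

Each pass through the loop mispredicts twice: once on re-entry and once at the exit.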




                 Dynamic Branch Prediction
                              (Jim Smith, 1981)

     • Solution: 2-bit scheme that changes the prediction only
       after two mispredictions (Figure 3.7, p. 249)

     [Figure: state diagram with four states, two labeled Predict Taken
      and two labeled Predict Not Taken, connected by transitions
      labeled T (taken) and NT (not taken)]

     • Red: stop, not taken
     • Green: go, taken
     • Adds hysteresis to decision making process
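
A minimal sketch of the 2-bit idea, written as a saturating counter. This is one common variant and an assumption here; the exact transitions in Figure 3.7 may differ. The counter is assumed to start in the strongest taken state, and the loop pattern is the same 9-taken, 1-not-taken pattern as on the previous slide.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Simulate one 2-bit saturating counter (0..3); predict taken when counter >= 2."""
    correct = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


loop_branch = ([True] * 9 + [False]) * 100
accuracy = two_bit_accuracy(loop_branch)
print(f"2-bit accuracy: {accuracy:.0%}")
```

With the hysteresis, the single not-taken outcome at loop exit no longer flips the prediction, so only the exit itself is mispredicted.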
          Prediction Accuracy of a 4096-entry 2-bit
            Prediction Buffer vs. an Infinite Buffer




                          BHT Accuracy

          • Mispredict because either:
            – Wrong guess for that branch
            – Got branch history of wrong branch when indexing the
              table
          • With a 4096-entry table, misprediction rates vary from 1%
            (nasa7, tomcatv) to 18% (eqntott), with spice at 9%
            and gcc at 12%
          • For SPEC92, a 4096-entry table is about as good as an
            infinite table
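
The "wrong branch" case comes from indexing with only the low PC bits and no tag check. A small sketch, assuming 4-byte instructions and made-up branch addresses:

```python
def bht_index(pc, entries=4096, instruction_bytes=4):
    """Index a BHT with the low-order bits of the instruction address."""
    return (pc // instruction_bytes) % entries


# Two hypothetical branches exactly 4096 instructions apart.
branch_a = 0x00401000
branch_b = branch_a + 4096 * 4

print(bht_index(branch_a), bht_index(branch_b))
```

Both branches map to the same entry, so each one reads and overwrites the other's history.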




              Improve Prediction Strategy By
                   Correlating Branches
          • Consider the worst case for the 2-bit predictor
              if (aa==2) then aa=0;
              if (bb==2) then bb=0;
              if (aa != bb) then whatever
              – if the first 2 branches are not taken (both conditions are
                true, so aa and bb are both set to 0), then the 3rd branch
                will always be taken
              – single level predictors can never get this case
          • Correlating or 2-level predictors
              – Correlation = what happened on the last branch
                  » Note that the last correlator branch may not always be
                    the same
              – Predictor = which way to go
                  » 4 possibilities: which way the last one went chooses the
                    prediction
                       •   (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)




                                       From 柯皓仁, 交通大學
                     Correlating Branches

          • Hypothesis: recently executed branches are
            correlated; that is, behavior of recently
            executed branches affects prediction of
            current branch
          • Idea: record m most recently executed
            branches as taken or not taken, and use
            that pattern to select the proper branch
            history table
          • In general, an (m,n) predictor records the
            last m branches to select between 2^m
            history tables, each with n-bit counters
            – Old 2-bit BHT is then a (0,2) predictor


                              From 柯皓仁, 交通大學
               Example of Correlating Branch
                        Predictors
   if (d==0)            BNEZ   R1, L1       ;branch b1 (d!=0)
      d = 1;            DADDIU R1, R0, #1   ;d==0, so d=1
   if (d==1)       L1:  DADDIU R3, R1, #-1
      …                 BNEZ   R3, L2       ;branch b2 (d!=1)
                        …
                   L2:




                         From 柯皓仁, 交通大學
                      A Problem for 1-bit Predictor
                           without Correlators
     initial       d!=0?   b1          value of d   d!=1?   b2
     value of d                        before b2
     0             NO      not taken   1            NO      not taken
     1             YES     taken       1            NO      not taken
     2             YES     taken       2            YES     taken

     d=?   b1           b1       New b1       b2           b2       New b2
           prediction   action   prediction   prediction   action   prediction
     2     NT           T        T            NT           T        T
     0     T            NT       NT           T            NT       NT
     2     NT           T        T            NT           T        T
     0     T            NT       NT           T            NT       NT
                                                                                         13
4/16/06                               From 柯皓仁, 交通大學
               Example of Correlating Branch
                  Predictors (1,1) (Cont.)
     Prediction   Prediction if last   Prediction if last
     bits         branch not taken     branch taken
     NT/NT        NT                   NT
     NT/T         NT                   T
     T/NT         T                    NT
     T/T          T                    T

     d=?   b1           b1       New b1       b2           b2       New b2
           prediction   action   prediction   prediction   action   prediction
     2     NT/NT        T        T/NT         NT/NT        T        NT/T
     0     T/NT         NT       T/NT         NT/T         NT       NT/T
     2     T/NT         T        T/NT         NT/T         T        NT/T
     0     T/NT         NT       T/NT         NT/T         NT       NT/T
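
The two tables (1-bit without correlation, and the (1,1) predictor) can be reproduced with a short simulation. This is a sketch: the function names are made up, all prediction bits start at NT as in the tables, and the initial "last branch" is assumed to be not taken.

```python
def branch_outcomes(d):
    """Taken/not-taken outcomes of b1 and b2 for one pass with initial value d."""
    b1_taken = d != 0          # BNEZ R1, L1
    if not b1_taken:
        d = 1                  # fall-through path sets d = 1
    b2_taken = d != 1          # BNEZ R3, L2
    return [("b1", b1_taken), ("b2", b2_taken)]


def misses_one_bit(d_values):
    """One prediction bit per branch, no correlation."""
    prediction = {"b1": False, "b2": False}
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            misses += prediction[name] != taken
            prediction[name] = taken
    return misses


def misses_1_1(d_values):
    """(1,1) predictor: two bits per branch, selected by the last branch outcome."""
    # Slot 0 is used if the last branch was not taken, slot 1 if it was taken.
    prediction = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial global history
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            slot = int(last_taken)
            misses += prediction[name][slot] != taken
            prediction[name][slot] = taken
            last_taken = taken
    return misses


d_sequence = [2, 0, 2, 0]
print("1-bit misses:", misses_one_bit(d_sequence))
print("(1,1) misses:", misses_1_1(d_sequence))
```

On this alternating sequence the plain 1-bit scheme mispredicts all 8 branches, while the (1,1) predictor mispredicts only the first pass.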
                                                                                   14
4/16/06                              From 柯皓仁, 交通大學
              In general: (m,n) BHT (prediction
                           buffer)

          •   p bits of buffer index = 2^p entries of BHT
          •   Use last m branches = global branch history
          •   Use n-bit predictor
          •   Total bits for the (m, n) BHT prediction
              buffer:
                      $\text{Total memory bits} = 2^m \times n \times 2^p$
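
As a quick check of the formula, a small helper (the name `bht_total_bits` is made up) applied to two configurations that appear in these slides: the (2,2) predictor with p=5, and the 4096-entry 2-bit buffer viewed as a (0,2) predictor with p=12.

```python
def bht_total_bits(m, n, p):
    """Total storage of an (m, n) prediction buffer with 2**p entries per table."""
    return (2 ** m) * n * (2 ** p)


print(bht_total_bits(m=2, n=2, p=5))    # 4 banks x 32 entries x 2 bits
print(bht_total_bits(m=0, n=2, p=12))   # 4096 entries x 2 bits
```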




                                                              15
4/16/06                        From 柯皓仁, 交通大學
             (2,2) Predictor Implementation
             4 banks = each with 32 2-bit predictor entries

    p=5, m=2, n=2

    [Figure: predictor banks indexed through a 5:32 decoder]
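
A minimal software sketch of an (m,n) predictor organized this way: 2^m banks of 2^p saturating n-bit counters, with the bank chosen by an m-bit global history. The class and method names, the zero initial state, and the use of PC bits above the low two as the index are assumptions for illustration, not a description of the hardware in the figure.

```python
class CorrelatingPredictor:
    """(m, n) predictor: 2**m banks, each with 2**p n-bit saturating counters."""

    def __init__(self, m, n, p):
        self.m, self.n, self.p = m, n, p
        self.history = 0  # outcomes of the last m branches, newest in bit 0
        self.banks = [[0] * (1 << p) for _ in range(1 << m)]

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.p) - 1)  # assumes 4-byte instructions

    def predict(self, pc):
        counter = self.banks[self.history][self._index(pc)]
        return counter >= (1 << (self.n - 1))   # upper half of range: taken

    def update(self, pc, taken):
        bank, i = self.banks[self.history], self._index(pc)
        if taken:
            bank[i] = min(bank[i] + 1, (1 << self.n) - 1)
        else:
            bank[i] = max(bank[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50  # a branch that alternates T, NT, T, NT, ...
misses_2_2 = count_misses(CorrelatingPredictor(m=2, n=2, p=5), 0x400, alternating)
misses_0_2 = count_misses(CorrelatingPredictor(m=0, n=2, p=5), 0x400, alternating)
print(misses_2_2, misses_0_2)
```

On this alternating pattern the (2,2) configuration mispredicts only while it warms up, whereas the plain (0,2) counter keeps missing every taken outcome.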




                        From 柯皓仁, 交通大學
          Accuracy of Different Schemes




                     Tournament Predictors

          • Adaptively combine local and global
            predictors
             – Multiple predictors
                » One based on global information: Results of
                  recently executed m branches
                » One based on local information: Results of past
                  executions of the current branch instruction
             – Selector to choose which predictors to use
          • Advantage
             – Ability to select the right predictor for the right
               branch




                                                                     18
4/16/06                         From 柯皓仁, 交通大學
          Misprediction Rate Comparison




                Branch Target Buffer (BTB)

          • To reduce the branch penalty to 0
             – Need to know what the address is by the end of IF
             – But the instruction is not even decoded yet
             – So use the instruction address rather than wait for decode
                » If prediction works then penalty goes to 0!
          • BTB Idea -- Cache to store taken branches (no
            need to store untaken)
             – Match tag is instruction address => compare with current PC
             – Data field is the predicted PC
          • May want to add predictor field
             – To avoid the mispredict twice on every loop phenomenon
             – Adds complexity since we now have to track untaken branches
               as well


                                    Need Address
                              at Same Time as Prediction
 • Branch Target Buffer (BTB): Address of branch index to get
   prediction AND branch address (if taken)
      – Note: must check for branch match now, since can’t use wrong branch address
        (Figure 3.19, p. 262)

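   [Figure: the PC of the instruction being fetched is compared (=?)
    against the Branch PC field of the BTB; each entry also holds the
    Predicted PC and extra prediction state bits.
      Yes: instruction is a branch; use the predicted PC as the next PC
      No: branch not predicted; proceed normally (Next PC = PC+4)]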
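
The lookup flow can be sketched in a few lines. This is a simplified model under stated assumptions: an unbounded dictionary stands in for the finite cache, only taken branches are stored, there are no extra prediction state bits, and the class and method names are made up.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # branch PC (tag) -> predicted target PC

    def next_pc(self, pc):
        """At fetch: a tag match means 'predicted taken branch', else fall through."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """After the branch executes: keep taken branches, drop untaken ones."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first_fetch = btb.next_pc(0x100)       # no entry yet: PC + 4
btb.resolve(0x100, taken=True, target=0x80)
second_fetch = btb.next_pc(0x100)      # hit: predicted PC
print(hex(first_fetch), hex(second_fetch))
```

Because the tag is the full branch address, a hit can only come from the matching branch, which is the "must check for branch match" point above.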
           Integrated Instruction Fetch Units

          • Integrated branch prediction
             – The branch predictor becomes part of the instruction
               fetch unit and is constantly predicting branches
          • Instruction prefetch
             – To deliver multiple instructions per clock, the
               instruction fetch unit will likely need to fetch ahead
          • Instruction memory access and buffering




                   Return Address Predictor

          • Indirect jump – jumps whose destination address
            varies at run time
             – indirect procedure call, select or case, procedure return
             – SPEC89 benchmarks: 85% of indirect jumps are procedure
               returns
          • Accuracy of BTB for procedure returns is low
             – if procedure is called from many places, and the calls from
               one place are not clustered in time
          • Use a small buffer of return addresses operating as
            a stack
             – Cache the most recent return addresses
             – Push a return address at a call, and pop one off at a return
             – If the cache is sufficiently large (max call depth) => perfect
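
A minimal sketch of such a return-address stack. The class name, the default depth of 16, and the policy of dropping the oldest entry on overflow are assumptions for illustration.

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: the oldest return address is lost
        self.stack.append(return_address)

    def predict_return(self):
        """Pop the predicted target of a return; None if nothing is cached."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for address in (0x10, 0x20, 0x30):     # three nested calls, but only 2 slots
    ras.on_call(address)
predictions = [ras.predict_return() for _ in range(3)]
print(predictions)
```

The two innermost returns are predicted correctly; the outermost one has been overwritten, which is why the buffer needs to cover the maximum call depth for perfect prediction.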



                                                                              23
4/16/06                           From 柯皓仁, 交通大學
          Dynamic Branch Prediction Summary

          • Branch History Table: 2 bits for loop
            accuracy
          • Correlation: Recently executed branches
            correlated with next branch.
            – Either different branches
            – Or different executions of same branches
          • Tournament Predictor: more resources to
            competitive solutions and pick between them
          • Branch Target Buffer: include branch
            address & prediction
          • Return address stack for prediction of
            indirect jump

                     Getting CPI < 1:
            Issuing Multiple Instructions/Cycle
          • Vector Processing
             – Explicit coding of independent loops as operations on large
               vectors of numbers
             – Multimedia instructions being added to many processors
          • Superscalar
             – varying number of instructions/cycle (1 to 8), scheduled by
               compiler or by HW (Tomasulo)
             – IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
          • (Very) Long Instruction Words (V)LIW:
             – fixed number of instructions (4-16) scheduled by the
               compiler; put operations into wide templates
             – Intel Architecture-64 (IA-64) 64-bit address
                  » Renamed: “Explicitly Parallel Instruction Computer (EPIC)”
          • Anticipated success of multiple instructions led to
            Instructions Per Clock cycle (IPC) vs. CPI

                        Getting CPI < 1: Issuing
                       Multiple Instructions/Cycle
  • Superscalar MIPS: 2 instructions, 1 FP & 1 anything
          – Fetch 64 bits/clock cycle; Int on left, FP on right

     Type                 Pipe Stages
     Int. instruction     IF  ID  EX  MEM WB
     FP ADD instruction   IF  ID  EX  EX  EX  WB
     Int. instruction         IF  ID  EX  MEM WB
     FP ADD instruction       IF  ID  EX  EX  EX  WB
     Int. instruction             IF  ID  EX  MEM WB
     FP ADD instruction           IF  ID  EX  EX  EX  WB

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only
    for programs with:
          – Exactly 50% FP operations AND No hazards
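
The 50% condition can be illustrated with a toy issue model. It assumes in-order issue, no hazards of any kind, and that two instructions pair only when an integer instruction is immediately followed by an FP instruction (Int on left, FP on right). The function name and instruction labels are made up.

```python
def issue_cycles(stream):
    """Cycles needed to issue a stream of 'int' / 'fp' instructions, hazards ignored."""
    cycles = 0
    i = 0
    while i < len(stream):
        pairs = (stream[i] == "int"
                 and i + 1 < len(stream)
                 and stream[i + 1] == "fp")
        i += 2 if pairs else 1   # dual-issue only for an (int, fp) pair
        cycles += 1
    return cycles


balanced = ["int", "fp"] * 4
integer_only = ["int"] * 8
print("balanced CPI:", issue_cycles(balanced) / len(balanced))
print("integer-only CPI:", issue_cycles(integer_only) / len(integer_only))
```

Only the perfectly interleaved stream reaches a CPI of 0.5; any other mix leaves one of the two slots empty in some cycles.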



                       Multiple Issue Issues

          • Issue packet: group of instructions from fetch
            unit that could potentially issue in 1 clock
             – If instruction causes structural hazard or a data hazard
               either due to earlier instruction in execution or to earlier
               instruction in issue packet, then instruction does not issue
             – 0 to N instruction issues per clock cycle, for N-issue
          • Performing issue checks in 1 cycle could limit
            clock cycle time
             – Issue stage usually split and pipelined
             – 1st stage decides how many instructions from within this
               packet can issue, 2nd stage examines hazards among selected
               instructions and those already issued
             – higher branch penalties => prediction accuracy important
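
A simplified sketch of the issue check for one packet. It assumes in-order issue without register renaming, models only RAW and WAW register hazards (structural hazards are ignored), and represents each instruction as a hypothetical `(dest, sources)` pair.

```python
def instructions_issued(packet, in_flight_dests=frozenset()):
    """Count how many instructions at the head of the packet can issue together."""
    pending = set(in_flight_dests)  # registers still waiting to be written
    issued = 0
    for dest, sources in packet:
        raw_hazard = any(src in pending for src in sources)
        waw_hazard = dest in pending
        if raw_hazard or waw_hazard:
            break                   # in-order issue stops at the first hazard
        pending.add(dest)
        issued += 1
    return issued


dependent = [("R1", ("R2", "R3")), ("R4", ("R1", "R5")), ("R6", ("R7", "R8"))]
independent = [("R1", ("R2", "R3")), ("R4", ("R5", "R6")), ("R7", ("R8", "R9"))]
print(instructions_issued(dependent), instructions_issued(independent))
```

The second instruction of `dependent` reads R1, which the first one writes, so only one instruction issues from that packet; anywhere from 0 to N can issue depending on the hazards found.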


            Studies of The Limitations of ILP

          • Perfect hardware model
             – Rename as much as you need
                 » Infinite registers
                 » No WAW or WAR
             – Branch prediction is perfect
                 » Never happens in reality
             – Jump prediction (even computed such as return) is perfect
                 » Similarly unreal
             – Perfect memory disambiguation
                 » Almost perfect is not too hard in practice
             – Issue an unlimited number of instructions at once & no
               restriction on types of instructions issued
             – One-cycle latency



          What Must a Perfect Processor Do?

          • Look arbitrarily far ahead to find a set of
            instructions to issue, predicting all branches
            perfectly
          • Rename all register uses to avoid WAW and WAR
            hazards
          • Determine whether there are any dependences among
            the instructions in the issue packet; if so, rename
            accordingly
          • Determine if any memory dependences exist among
            the issuing instructions and handle them appropriately
          • Provide enough replicated function units to allow all
            the ready instructions to issue


          How many instructions would issue on
           the perfect machine every cycle?




                               Window Size

          • The set of instructions that is examined for
            simultaneous execution is called the window
          • The window size will be determined by the cost of
            determining whether n issuing instructions have any
            register dependences among them
             – In theory, this cost will be O(n^2) (refer to p. 243; why?)
          • Each instruction in the window must be kept in
            processor
          • Window size is limited by the required storage, the
            comparisons, and a limited issue rate
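
One way to see the quadratic growth: assume each instruction has two source registers and each source must be compared against the destination of every earlier instruction in the window. The helper name and the two-sources assumption are illustrative.

```python
def register_comparisons(n, sources_per_instruction=2):
    """Comparisons needed to check n instructions for register dependences."""
    # The i-th instruction (0-based) has i earlier instructions to check against.
    return sum(sources_per_instruction * earlier for earlier in range(n))


for window in (4, 50, 2000):
    print(window, register_comparisons(window))
```

Under this assumption the count is 2(0 + 1 + ... + (n-1)) = n^2 - n, so a 2000-entry window needs close to 4 million comparisons.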




          Effects of Reducing the Window
                       Size




              Effects of Realistic Branch and
                     Jump Prediction
          • Perfect
          • Tournament-based
             – Uses a correlating 2-bit and a non-correlating 2-bit predictor
               plus a selector to choose between the two
             – Prediction buffer has 8K entries (13 address bits from the branch)
             – 3 entries per slot: non-correlating, correlating, select
          • Standard 2 bit
             – 512 (9 address bits) entries
             – Plus 16 entry buffer to predict RETURNS
          • Static
             – Based on profile - predict either T or NT but it stays fixed
          • None


          The Effect of Branch-Prediction
                     Schemes




          Effects of Finite Registers




            Models for Memory Alias Analysis

          • Perfect
          • Global/Stack Perfect
             – Perfect prediction for global and stack references and
               assume all heap references conflict
             – The best that compiler-based analysis schemes can do
          • Inspection
             – Pointers to different allocation areas (such as the
               global area and the stack area) are assumed not to conflict
             – Addresses using the same register with different offsets
               are assumed not to conflict
          • None
             – All memory references are assumed to conflict


          Effects of Memory Aliasing



