      CS 152 Computer Architecture and Engineering

  Lecture 14 - Branch Prediction

                Krste Asanovic
   Electrical Engineering and Computer Sciences
         University of California at Berkeley

      http://www.eecs.berkeley.edu/~krste
      http://inst.eecs.berkeley.edu/~cs152
Last time in Lecture 13
• Register renaming removes WAR, WAW hazards by
  giving a new internal destination register for every
  new result
• Pipeline is structured with in-order
  fetch/decode/rename, followed by out-of-order
  execution/complete, followed by in-order commit
• At commit time, can detect exceptions and roll back
  buffer to provide precise interrupts




3/20/2008             CS152-Spring’08                    2
Recap: Overall Pipeline Structure
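  [Figure: pipeline Fetch -> Decode -> Reorder Buffer -> Commit, with Execute
   attached to the Reorder Buffer. Fetch and Decode are in-order, execution is
   out-of-order, Commit is in-order. An "Exception?" check at Commit drives the
   Kill signals and "Inject handler PC" at Fetch.]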


  • Instructions fetched and decoded into instruction
   reorder buffer in-order
  • Execution is out-of-order (⇒ out-of-order completion)
  • Commit (write-back to architectural state, i.e., regfile &
   memory) is in-order
   Temporary storage needed to hold results before commit
   (shadow registers and store buffers)

  Control Flow Penalty
Modern processors may have > 10 pipeline stages between
next PC calculation and branch resolution!

How much work is lost if pipeline doesn’t follow
correct instruction flow?

   ~ Loop length x pipeline width

  [Figure: pipeline PC -> I-cache (Fetch) -> Fetch Buffer -> Decode -> Issue Buffer
   -> Func. Units (Execute) -> Result Buffer -> Commit -> Arch. State. "Next fetch
   started" is marked at the PC; "Branch executed" is marked at the Execute stage.]
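As a rough worked example of the estimate above, the sketch below multiplies the length of the branch resolution loop (in stages) by the pipeline width to count the instruction slots discarded on a wrong-path fetch. The function name and the sample numbers (a 10-stage loop, 4-wide issue) are illustrative assumptions, not figures for any particular processor.

```python
def lost_issue_slots(loop_length_stages: int, pipeline_width: int) -> int:
    """Instruction slots discarded when fetch followed the wrong path.

    loop_length_stages: stages between next-PC calculation and branch resolution.
    pipeline_width: instructions the pipeline can fetch/issue per cycle.
    """
    return loop_length_stages * pipeline_width


# Hypothetical configuration: 10-stage resolution loop, 4-wide superscalar.
print(lost_issue_slots(10, 4))
```

The product grows with both depth and width, which is why deep, wide pipelines depend so heavily on accurate prediction.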
MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces
of information from the preceding instruction:
           1) Is the preceding instruction a taken branch?
           2) If so, what is the target address?

Instruction     Taken known?            Target known?
J               After Inst. Decode      After Inst. Decode
JR              After Inst. Decode      After Reg. Fetch
BEQZ/BNEZ       After Reg. Fetch*       After Inst. Decode

                  *Assuming zero detect on register read
Branch Penalties in Modern Pipelines
    UltraSPARC-III instruction fetch pipeline stages
  (in-order issue, 4-way superscalar, 750MHz, 2000)

               A   PC Generation/Mux
               P   Instruction Fetch Stage 1
               F   Instruction Fetch Stage 2
               B   Branch Address Calc/Begin Decode
               I   Complete Decode
               J   Steer Instructions to Functional units
               R   Register File Read
               E   Integer Execute
                   Remainder of execute pipeline
                   (+ another 6 stages)

  [Figure annotations: "Branch Target Address Known" is marked beside the
   B stage (Branch Address Calc); "Branch Direction & Jump Register Target
   Known" is marked beside the R/E stages (Register File Read, Integer Execute).]

Reducing Control Flow Penalty
   Software solutions
         • Eliminate branches - loop unrolling
             Increases the run length
         • Reduce resolution time - instruction scheduling
             Compute the branch condition as early
             as possible (of limited value)

   Hardware solutions
         • Find something else to do - delay slots
             Replaces pipeline bubbles with useful work
             (requires software cooperation)
         • Speculate - branch prediction
             Speculative execution of instructions beyond
             the branch



Branch Prediction
 Motivation:
      Branch penalties limit performance of deeply pipelined
      processors

      Modern branch predictors have high accuracy
      (>95%) and can reduce branch penalties significantly

 Required hardware support:
      Prediction structures:
         • Branch history tables, branch target buffers, etc.

      Mispredict recovery mechanisms:
         • Keep result computation separate from commit
         • Kill instructions following branch in pipeline
         • Restore state to state following branch


Static Branch Prediction
 Overall probability a branch is taken is ~60-70% but:


  [Figure: a backward JZ branch is taken ~90% of the time;
   a forward JZ branch is taken ~50% of the time.]




 ISA can attach preferred direction semantics to branches,
 e.g., Motorola MC88110
     bne0 (preferred taken) beq0 (not taken)

 ISA can allow arbitrary choice of statically predicted direction,
 e.g., HP PA-RISC, Intel IA-64
      typically reported as ~80% accurate
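One static heuristic suggested by the backward/forward figures above is "backward taken, forward not taken". The slide does not name this scheme, so the sketch below is an illustration only; the function names and the synthetic trace are assumptions made for the example.

```python
def predict_taken_static(branch_pc: int, target_pc: int) -> bool:
    """Predict taken for backward branches (target at or before the branch)."""
    return target_pc <= branch_pc


def accuracy(trace) -> float:
    """trace: iterable of (branch_pc, target_pc, actually_taken) tuples."""
    trace = list(trace)
    hits = sum(predict_taken_static(pc, tgt) == taken for pc, tgt, taken in trace)
    return hits / len(trace)


# Hypothetical trace: a backward branch taken 9 times out of 10,
# plus a forward branch that is taken half of the time.
loop_branch = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
forward_branch = [(0x60, 0x80, True), (0x60, 0x80, False)] * 5
print(accuracy(loop_branch + forward_branch))
```

The heuristic needs no tables at all, but it cannot adapt to a forward branch that happens to be strongly biased, which is the gap dynamic prediction fills.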


 Dynamic Branch Prediction
 learning based on past behavior


Temporal correlation
  The way a branch resolves may be a good
  predictor of the way it will resolve at the next
  execution


Spatial correlation
  Several branches may resolve in a highly
  correlated manner (a preferred path of
  execution)




Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!


  [Figure: four-state diagram with states "take right", "take wrong",
   "¬take right" and "¬take wrong"; edges are labeled with the actual
   branch outcome (taken / ¬taken).]



BP state:
       (predict take/¬take) x (last prediction right/wrong)
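The BP state described above can be sketched as a pair (predicted direction, whether the last prediction was right). The transition taken after the second consecutive mistake is an assumption here: the sketch flips the direction and enters the "right" state for the new direction. The names `State`, `predict`, `update`, and `count_mispredicts` are hypothetical.

```python
from typing import Iterable, Tuple

# (predict_taken, last_prediction_right)
State = Tuple[bool, bool]


def predict(state: State) -> bool:
    return state[0]


def update(state: State, taken: bool) -> State:
    predict_taken, last_right = state
    if predict_taken == taken:
        return (predict_taken, True)    # correct: keep direction, mark "right"
    if last_right:
        return (predict_taken, False)   # first mistake: keep direction, mark "wrong"
    return (taken, True)                # second consecutive mistake: flip (assumed target state)


def count_mispredicts(outcomes: Iterable[bool], state: State = (True, True)) -> int:
    misses = 0
    for taken in outcomes:
        if predict(state) != taken:
            misses += 1
        state = update(state, taken)
    return misses


# Loop branch: taken three times, then falls through; repeated four times.
outcomes = ([True] * 3 + [False]) * 4
print(count_mispredicts(outcomes))
```

With two bits, the single not-taken outcome at each loop exit costs one mispredict but does not disturb the prediction for the next loop entry, which a one-bit scheme would get wrong as well.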

  Branch History Table
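  [Figure: k bits of the Fetch PC (above the two low-order "00" bits) form the
   BHT Index into a 2^k-entry BHT with 2 bits/entry, which supplies the
   Taken/¬Taken? prediction. In parallel, the I-Cache supplies the instruction;
   its opcode answers "Branch?" and its offset feeds an adder (+) that
   produces the Target PC.]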

   4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
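A minimal sketch of the table lookup described above, assuming word-aligned instructions (so the two low PC bits are dropped) and a 2-bit saturating counter per entry. The saturating counter is a common encoding of the 2 BP bits and is used here for brevity; it is not necessarily the exact state machine from the previous slide. The class and method names are hypothetical.

```python
class BranchHistoryTable:
    """2^k-entry table of 2-bit counters indexed by low-order PC bits."""

    def __init__(self, index_bits: int = 12):  # 2**12 = 4K entries
        self.mask = (1 << index_bits) - 1
        # Counters 0..3; values >= 2 predict taken. Initial value is an assumption.
        self.counters = [2] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Skip the two low "00" bits of a word-aligned PC, keep the next k bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(index_bits=12)
bht.update(0x1000, False)
bht.update(0x1000, False)
alias_pc = 0x1000 + (4 << 12)      # same index bits, different upper bits
print(bht.predict(0x1000), bht.predict(alias_pc), bht.predict(0x1004))
```

Because entries carry no tag, two branches whose PCs share the same index bits share one counter; the second printed value shows that aliasing.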

Exploiting Spatial Correlation
Yeh and Patt, 1992

                 if (x[i] < 7) then
                       y += 1;
                 if (x[i] < 5) then
                       c -= 4;

 If first condition false, second condition also false


 History register, H, records the direction of the last
 N branches executed by the processor




Two-Level Branch Predictor
 Pentium Pro uses the result from the last two branches
 to select one of the four sets of BHT bits (~95% correct)
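  [Figure: k bits of the Fetch PC (above the two low-order "00" bits) index the
   BHT; a 2-bit global branch history shift register, which shifts in the
   Taken/¬Taken results of each branch, selects one of the four sets of BHT
   bits to produce the Taken/¬Taken? prediction.]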
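The organization described above can be sketched as one table of 2-bit counters per global-history pattern, with the history register choosing the table and the PC bits choosing the entry. This is a simplified model under stated assumptions (saturating counters, initial counter value 1, a 10-bit PC index); it is not a description of the actual Pentium Pro hardware, and all names are hypothetical.

```python
class TwoLevelPredictor:
    """Global history selects one of 2^h counter tables; PC bits select the entry."""

    def __init__(self, index_bits: int = 10, history_bits: int = 2):
        self.history_bits = history_bits
        self.history = 0                       # last `history_bits` outcomes, newest in bit 0
        self.pc_mask = (1 << index_bits) - 1
        # One set of 2-bit counters per history pattern (initial value 1 is an assumption).
        self.tables = [[1] * (1 << index_bits) for _ in range(1 << history_bits)]

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.pc_mask

    def predict(self, pc: int) -> bool:
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        table = self.tables[self.history]
        i = self._index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the resolved outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.history_bits) - 1)


def count_mispredicts(predictor: TwoLevelPredictor, pc: int, outcomes) -> int:
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 10
print(count_mispredicts(TwoLevelPredictor(), 0x400, alternating))
```

An alternating branch defeats a single 2-bit counter, but here each history pattern gets its own counter, so after a short warm-up the pattern is predicted correctly.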
Limitations of BHTs
   Only predicts branch direction. Therefore, cannot redirect
   fetch stream until after branch target is determined.

               UltraSPARC-III fetch pipeline

                           A     PC Generation/Mux
                           P     Instruction Fetch Stage 1
                           F     Instruction Fetch Stage 2
                           B     Branch Address Calc/Begin Decode
                           I     Complete Decode
                           J     Steer Instructions to Functional units
                           R     Register File Read
                           E     Integer Execute
                                 Remainder of execute pipeline
                                 (+ another 6 stages)

  [Figure annotations: "Correctly predicted taken branch penalty" is marked
   beside stages A through B; "Jump Register penalty" is marked beside the
   later stages (J, R).]
Branch Target Buffer
  [Figure: k bits of the PC index both IMEM and a Branch Target Buffer with
   2^k entries; each BTB entry holds a predicted target and BP bits (BPb).]

  BP bits are stored with the predicted target address.

  IF stage: If (BP=taken) then nPC=target else nPC=PC+4
  later:     check prediction, if wrong then kill the instruction
              and update BTB & BPb else update BPb
Address Collisions

 Assume a 128-entry BTB

   Instruction memory:
        132   Jump 100
       1028   Add .....

   BTB entry:
       target   BPb
       236      take

 What will be fetched after the instruction at 1028?
              BTB prediction       = 236
              Correct target       = 1032

               kill PC=236 and fetch PC=1032

                        Is this a common occurrence?
                        Can we avoid these bubbles?
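The collision above can be reproduced with a small model of an untagged BTB. The sketch assumes the index is simply the PC modulo the number of entries, which is the mapping under which addresses 132 and 1028 land in the same slot of a 128-entry table; the class and method names are hypothetical.

```python
class UntaggedBTB:
    """BTB indexed by low PC bits with no record of which PC wrote each entry."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.table = [None] * entries   # each slot: (target, predict_taken) or None

    def _index(self, pc: int) -> int:
        return pc % self.entries        # assumed indexing function

    def update(self, pc: int, target: int, taken: bool) -> None:
        self.table[self._index(pc)] = (target, taken)

    def next_pc(self, pc: int) -> int:
        entry = self.table[self._index(pc)]
        if entry is not None and entry[1]:
            return entry[0]             # BP = taken: redirect fetch to stored target
        return pc + 4


btb = UntaggedBTB(128)
btb.update(132, 236, True)              # "Jump 100" at PC 132 resolves to 132 + 4 + 100
print(btb.next_pc(132), btb.next_pc(1028))
```

The second lookup returns the jump's target for an Add instruction, so the fetch at 236 has to be killed and restarted at 1032.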

BTB is only for Control Instructions


 BTB contains useful information for branch and
 jump instructions only
       Do not update it for other instructions

 For all other instructions the next PC is PC+4 !

 How to achieve this effect without decoding the
 instruction?




Branch Target Buffer (BTB)
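  [Figure: k bits of the PC index a 2^k-entry direct-mapped BTB (can also be
   associative) in parallel with the I-Cache access. Each entry holds an Entry PC,
   a Valid bit, and a predicted target PC; the Entry PC is compared (=) with the
   fetch PC to produce match, alongside valid and target.]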
   •   Keep both the branch PC and target PC in the BTB
   •   PC+4 is fetched if match fails
   •   Only taken branches and jumps held in BTB
   •   Next PC determined before branch fetched and decoded
Consulting BTB Before Decoding

   Instruction memory:
        132   Jump 100
       1028   Add .....

   BTB entry:
       entry PC   target   BPb
       132        236      take




• The match for PC=1028 fails and 1028+4 is fetched
        eliminates false predictions after ALU instructions

• BTB contains entries only for control transfer instructions
        more room to store branch targets
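A minimal sketch of the tagged lookup, assuming a direct-mapped table indexed by PC modulo the number of entries and holding only taken branches and jumps. An empty slot stands in for a cleared valid bit, and the class and method names are hypothetical.

```python
class TaggedBTB:
    """Direct-mapped BTB that stores the branch PC alongside the predicted target."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.table = [None] * entries   # slot: (entry_pc, target); None means invalid

    def _index(self, pc: int) -> int:
        return pc % self.entries        # assumed indexing function

    def insert(self, branch_pc: int, target: int) -> None:
        """Record a taken branch or jump after it resolves."""
        self.table[self._index(branch_pc)] = (branch_pc, target)

    def invalidate(self, branch_pc: int) -> None:
        """Drop the entry, e.g. when the branch resolves not taken."""
        i = self._index(branch_pc)
        if self.table[i] is not None and self.table[i][0] == branch_pc:
            self.table[i] = None

    def next_pc(self, pc: int) -> int:
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:   # valid and entry PC matches
            return entry[1]
        return pc + 4                              # match fails: fetch PC+4


btb = TaggedBTB(128)
btb.insert(132, 236)
print(btb.next_pc(132), btb.next_pc(1028))
```

The Add at 1028 still maps to the same slot as the jump at 132, but the entry PC comparison fails, so fetch proceeds to 1032 without a bubble.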


CS152 Administrivia
• Lab 4, branch predictor competition, due April 3
     – PRIZE (TBD) for winners in both unlimited and realistic categories
• Quiz 4, Tuesday April 8




Combining BTB and BHT
 • BTB entries are considerably more expensive than BHT, but can
   redirect fetches at earlier stage in pipeline and can accelerate
   indirect branches (JR)
 • BHT can hold many more entries and is more accurate

                          A    PC Generation/Mux
                   BTB    P    Instruction Fetch Stage 1
                          F    Instruction Fetch Stage 2
                   BHT    B    Branch Address Calc/Begin Decode
                          I    Complete Decode
                          J    Steer Instructions to Functional units
                          R    Register File Read
                          E    Integer Execute

  BHT in later pipeline stage corrects when BTB misses a predicted taken branch

  BTB/BHT only updated after branch resolves in E stage

Uses of Jump Register (JR)
• Switch statements (jump to address of matching case)

       BTB works well if same case used repeatedly

• Dynamic function call (jump to run-time function address)

       BTB works well if same function usually called (e.g., in
       C++ programming, when objects have same type in
       virtual function call)

• Subroutine returns (jump to return address)
       BTB works well if usually return to the same place
         Often one function called from many distinct call sites!

 How well does BTB work for each of these cases?

Subroutine Return Stack
Small structure to accelerate JR for subroutine returns,
 typically much more accurate than BTBs.
                  fa() { fb(); }
                  fb() { fc(); }
                  fc() { fd(); }

 Push call address when function call executed
 Pop return address when subroutine return decoded

  [Figure: stack holding &fd(), &fc(), &fb() from top to bottom;
   k entries (typically k=8-16)]
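A minimal sketch of such a return stack, assuming the pushed value is the address the matching return should jump to (call PC + 4 in this example) and that the oldest entry is dropped when more than k calls are outstanding. The overflow policy, class name, and addresses are assumptions for illustration.

```python
class ReturnAddressStack:
    """Small fixed-depth stack used to predict subroutine return targets."""

    def __init__(self, k: int = 8):
        self.k = k
        self.stack = []

    def push(self, return_address: int) -> None:
        if len(self.stack) == self.k:
            self.stack.pop(0)           # overflow: discard the oldest entry (assumed policy)
        self.stack.append(return_address)

    def pop(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=8)
# Hypothetical call sites: fa calls fb at 0x100, fb calls fc at 0x200, fc calls fd at 0x300.
for call_pc in (0x100, 0x200, 0x300):
    ras.push(call_pc + 4)
print([hex(ras.pop()) for _ in range(3)])
```

Returns are predicted in last-in, first-out order, so a function called from many distinct call sites is still handled correctly, unlike a BTB entry that remembers only one target.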
Mispredict Recovery

In-order execution machines:
   – Assume no instruction issued after branch can write-back before
     branch resolves
   – Kill all instructions in pipeline behind mispredicted branch


Out-of-order execution?

   – Multiple instructions following branch in program
     order can complete before branch resolves




In-Order Commit for Precise Exceptions
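  [Figure: pipeline Fetch -> Decode -> Reorder Buffer -> Commit, with Execute
   attached to the Reorder Buffer. Fetch and Decode are in-order, execution is
   out-of-order, Commit is in-order. An "Exception?" check at Commit drives the
   Kill signals and "Inject handler PC" at Fetch.]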


  • Instructions fetched and decoded into instruction
   reorder buffer in-order
  • Execution is out-of-order (⇒ out-of-order completion)
  • Commit (write-back to architectural state, i.e., regfile &
   memory) is in-order
Temporary storage needed in ROB to hold results before commit

Branch Misprediction in Pipeline
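  [Figure: pipeline PC -> Fetch -> Decode -> Reorder Buffer -> Commit, with
   Execute and Complete attached to the Reorder Buffer. "Branch Prediction" is
   marked near the front of the pipeline and "Branch Resolution" near the
   Reorder Buffer; Kill signals go to the stages in between, and
   "Inject correct PC" feeds back to the PC.]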


        • Can have multiple unresolved branches in ROB
        • Can resolve branches out-of-order by killing all the
          instructions in ROB that follow a mispredicted branch
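The kill rule in the last bullet can be sketched with a reorder buffer kept in program order: when a branch turns out to be mispredicted, every younger entry is removed, regardless of which branch resolves first. The sequence numbering, class name, and instruction labels below are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Program-ordered buffer; oldest entry (next to commit) is on the left."""

    def __init__(self):
        self.entries = deque()
        self.next_seq = 0

    def dispatch(self, op: str) -> int:
        seq = self.next_seq
        self.next_seq += 1
        self.entries.append({"seq": seq, "op": op})
        return seq

    def squash_after(self, branch_seq: int) -> list:
        """Kill every entry younger than the mispredicted branch; return killed ops."""
        killed = []
        while self.entries and self.entries[-1]["seq"] > branch_seq:
            killed.append(self.entries.pop()["op"])
        return killed[::-1]


rob = ReorderBuffer()
for op in ["add", "beq_1", "sub", "beq_2", "mul"]:
    rob.dispatch(op)
print(rob.squash_after(3))   # younger branch beq_2 (seq 3) resolves first, mispredicted
print(rob.squash_after(1))   # older branch beq_1 (seq 1) then resolves, also mispredicted
```

Resolving the younger branch first only trims the tail; when the older branch later turns out to be wrong as well, the remaining younger entries, including the already-resolved branch, are removed too.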

   Recovering ROB/Renaming Table
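  [Figure: out-of-order datapath with a Rename Table (entries r1, r2, ..., each
   with a tag t and valid bit v), a stack of Rename Snapshots, a Register File,
   and a Reorder buffer whose entries hold Ins#, use, exec, op, p1, src1, p2,
   src2, pd, dest, data for tags t1 ... tn. Ptr2 marks "next to commit", Ptr1
   marks "next available", and a rollback "next available" pointer is also
   shown. The Load Unit, FUs and Store Unit return < t, result > pairs; Commit
   is marked at the output of the Reorder buffer.]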

             Take snapshot of register rename table at each predicted
             branch, recover earlier snapshot if branch mispredicted
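The snapshot scheme above can be sketched as a dictionary copy per predicted branch. This is a toy model under stated assumptions: tags are allocated from a simple counter, freeing of tags is not modeled, and branches are identified by an increasing sequence number. All names are hypothetical.

```python
class RenameTable:
    """Toy rename table with one saved copy per unresolved predicted branch."""

    def __init__(self, num_arch_regs: int = 4):
        # Architectural register -> current tag: r0 -> t0, r1 -> t1, ...
        self.table = {f"r{i}": f"t{i}" for i in range(num_arch_regs)}
        self.next_tag = num_arch_regs
        self.snapshots = {}   # branch sequence number -> saved copy of the table

    def rename_dest(self, reg: str) -> str:
        tag = f"t{self.next_tag}"
        self.next_tag += 1
        self.table[reg] = tag
        return tag

    def snapshot(self, branch_seq: int) -> None:
        self.snapshots[branch_seq] = dict(self.table)

    def branch_correct(self, branch_seq: int) -> None:
        self.snapshots.pop(branch_seq, None)    # snapshot no longer needed

    def recover(self, branch_seq: int) -> None:
        self.table = self.snapshots[branch_seq]
        # Snapshots of this branch and of all younger branches are now stale.
        self.snapshots = {s: t for s, t in self.snapshots.items() if s < branch_seq}


rt = RenameTable()
rt.rename_dest("r1")     # r1 -> t4, before any branch
rt.snapshot(0)           # first predicted branch
rt.rename_dest("r1")     # speculative: r1 -> t5
rt.rename_dest("r2")     # speculative: r2 -> t6
rt.snapshot(1)           # second predicted branch
rt.rename_dest("r1")     # speculative: r1 -> t7
rt.recover(0)            # first branch was mispredicted
print(rt.table)
```

Recovery is a single table restore rather than an undo of each speculative rename, at the cost of storing one table copy per in-flight branch.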
Speculating Both Directions
 An alternative to branch prediction is to execute
 both directions of a branch speculatively
        • resource requirement is proportional to the
          number of concurrent speculative executions

        • only half the resources engage in useful work
          when both directions of a branch are executed
          speculatively


        • branch prediction takes fewer resources
          than speculative execution of both paths
 With accurate branch prediction, it is more cost
 effective to dedicate all resources to the predicted
 direction
Acknowledgements
• These slides contain material developed and
  copyright by:
     –   Arvind (MIT)
     –   Krste Asanovic (MIT/UCB)
     –   Joel Emer (Intel/MIT)
     –   James Hoe (CMU)
     –   John Kubiatowicz (UCB)
     –   David Patterson (UCB)


• MIT material derived from course 6.823
• UCB material derived from course CS252



