
8 Processing of control transfer instructions by iij26547


•   8.1 Introduction
•   8.2 Basic approaches to branch handling
•   8.3 Delayed branching
•   8.4 Branch processing
•   8.5 Multiway branching
•   8.6 Guarded execution

8.1 Intro to Branch
• Branches modify, conditionally or unconditionally,
  the value of the PC.
• To transfer control
• To alter the sequence of instructions
Major types of branches
Branch: To transfer control
Branch: e.g.
8.1.2 How to check the results of operations for specified
conditions {branch} (e.g. equals 0, negative, and so on)
Alternatives for checking the operation results
Result State vs. Direct Check, e.g.
Result state approach: Disadvantages
• The generation of the result state is not straightforward.
   It requires an irregular structure and occupies
    additional chip area.
• The result state is a sequential concept.
   It cannot be applied without modification in
    architectures which have multiple execution units.
Retaining sequential consistency for
condition checking (in VLIW or superscalar architectures)
• Use multiple sets of condition codes or flags
   This relies on the programmer or compiler to use
    different sets of condition codes or flags for
    the outcomes generated by different execution units (EUs).
• Use the direct check approach.
Branch Statistics
• About 20% of the instructions in general-purpose code are branches
   on average, every fifth instruction is a branch

• About 5-10% of the instructions in scientific code are branches

• The majority of branches (80%) are conditional

• 75-80% of all branches are taken
Branch statistics: Taken or not Taken
Frequency of taken and not-taken branches
8.1.4 The branch problem:
The delay caused in pipelining
More branch problems
• Conditional branches can cause an even longer delay
   evaluating the specified condition takes extra time
   the branch may have to wait for an unresolved condition (the result is not yet available)
       e.g. waiting for the result of an FDIV may take 10-50 cycles
• Pipelines have grown deeper than 4 stages
   each branch then results in an even larger number of
    wasted cycles (called bubbles)
8.1.5 Performance measures of branch processing
   Pt : branch penalty for taken branches (in cycles)
   Pnt : branch penalty for not-taken branches (in cycles)
   ft : frequency of taken branches
   fnt : frequency of not-taken branches
   P : effective penalty of branch processing
• P = ft * Pt + fnt * Pnt
   e.g. 80386: P = 0.75 * 8 + 0.25 * 2 = 6.5 cycles
   e.g. i486: P = 0.75 * 2 + 0.25 * 0 = 1.5 cycles

• With branch prediction, the penalty depends on whether a branch is predicted correctly or mispredicted
   fc, Pc : frequency and penalty of correctly predicted branches
   fm, Pm : frequency and penalty of mispredicted branches
• P = fc * Pc + fm * Pm
   e.g. Pentium: P = 0.9 * 0 + 0.1 * 3.5 = 0.35 cycles
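The effective-penalty formulas above are weighted averages over two mutually exclusive outcomes, so one small helper covers both the taken/not-taken and the predicted/mispredicted form. The following is a minimal sketch; the function name `effective_penalty` and its parameter names are illustrative assumptions, and only the numeric figures come from the slide.

```python
def effective_penalty(f_a, p_a, f_b, p_b):
    """Weighted average penalty (in cycles) over two mutually exclusive outcomes.

    f_a, f_b: frequencies of the two outcomes (must sum to 1)
    p_a, p_b: penalties in cycles for each outcome
    """
    if abs(f_a + f_b - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return f_a * p_a + f_b * p_b


# Taken / not-taken form, using the figures quoted above
print("80386  :", effective_penalty(0.75, 8, 0.25, 2))
print("i486   :", effective_penalty(0.75, 2, 0.25, 0))

# Correctly predicted / mispredicted form, using the figures quoted above
print("Pentium:", round(effective_penalty(0.9, 0, 0.1, 3.5), 2))
```

The Pentium result is rounded for display only, because the product 0.1 * 3.5 is not exactly representable in binary floating point.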
Interpretation of the concept of branch penalty
Zero-cycle branching {in no time}
8.2 Basic approaches to branch handling
Review of the basic approaches to branch handling
Speculative   vs.   Multiway branching
-Delayed Branching: Occurrence of an unused
instruction slot (unconditional branch)
Basic scheme of delayed branching
Delayed branching: Performance Gain
• Ratio of the delay slots that can be filled with useful
  instructions: ff
   60-70% of the delay slots can be filled with useful instructions
       a slot can be filled only with an instruction that can be moved into it
        without violating data dependencies
       a slot can be filled only with an instruction that can be executed in a single pipeline cycle
• Frequency of branches: fb
   20-30% for general-purpose programs
   5-10% for scientific programs
• 100 instructions have 100 * fb delay slots,
• of which 100 * fb * ff can be utilized.
• Performance gain = (100 * fb * ff) / 100 = fb * ff
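As a rough numeric check of the gain formula above, the sketch below evaluates fb * ff at the corner values of the quoted ranges. The function name `delayed_branch_gain` is an illustrative assumption, and the model assumes one delay slot per branch, as the slide's derivation does.

```python
def delayed_branch_gain(f_b, f_f):
    """Fraction of instruction slots recovered by filling delay slots: f_b * f_f.

    f_b: frequency of branches (one delay slot per branch is assumed)
    f_f: ratio of delay slots that can be filled with useful instructions
    """
    for name, value in (("f_b", f_b), ("f_f", f_f)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return f_b * f_f


# Corner values of the ranges quoted for general-purpose programs
for f_b in (0.20, 0.30):
    for f_f in (0.60, 0.70):
        gain = delayed_branch_gain(f_b, f_f)
        print(f"f_b={f_b:.2f}, f_f={f_f:.2f} -> gain={gain:.1%}")
```

With the quoted ranges the gain falls between about 12% and 21% for general-purpose programs; the same function with fb between 0.05 and 0.10 gives about 3% to 7% for scientific programs.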
Delayed branching: for conditional branches
{Can be cancelled (annulled) or not}
Where to find the instruction to fill the delay slot
Possible annulment options provided by architectures (using
special instructions) with delayed branching {Scalar only}
8.4 Branch Processing: design space
Branch detection schemes
{early detection, better handling}
Branch detection in parallel with
decoding/issuing of other instructions (in I-Buffer)
Early detection of branches by
Looking ahead
Early detection of branches by inspecting the
instructions that enter the I-buffer
Early branch detection: {for scalar Processor}
Integrated instruction fetch and branch detection
• Detect branch instruction during fetching
• Guess taken or not taken
• Fetch next sequential instruction or target instruction
Handling of unresolved conditional branches
-Blocking branch processing
• The pipeline is simply stalled (stops and waits) until the
  specified condition can be resolved
-Basic kinds of branch predictions
-The fixed prediction approach
Always not taken vs. Always Taken
Always not taken: Penalty figures
Penalty figures for
the always taken prediction approach
-Static branch prediction
Static prediction: opcode based
e.g. implemented in the MC88110
-Dynamic branch prediction: a branch taken in
the last n occurrences is likely to be taken the next time
Dynamic branch prediction: e.g.
1-bit dynamic prediction: state transition diagram
2-bit dynamic prediction: state transition diagram
3-bit prediction
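The state transition diagrams for these slides are not reproduced here. As a stand-in, the sketch below models an n-bit history as a saturating counter, which is one common way to realise 1-bit, 2-bit and 3-bit prediction; it is an assumption that the original diagrams use this exact variant. The class name `SaturatingCounterPredictor`, the helper `accuracy`, the initial state and the test pattern are all illustrative.

```python
class SaturatingCounterPredictor:
    """n-bit saturating-counter predictor for a single branch (assumed variant)."""

    def __init__(self, bits=2, initial=0):
        self.max_state = (1 << bits) - 1   # e.g. 3 for a 2-bit counter
        self.threshold = 1 << (bits - 1)   # states >= threshold predict taken
        self.state = initial

    def predict(self):
        """Return True if the branch is guessed as taken."""
        return self.state >= self.threshold

    def update(self, taken):
        """Move one state towards 'taken' or 'not taken', saturating at the ends."""
        if taken:
            self.state = min(self.state + 1, self.max_state)
        else:
            self.state = max(self.state - 1, 0)


def accuracy(predictor, outcomes):
    """Fraction of outcomes guessed correctly, updating the predictor as it goes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical loop-closing branch: taken 9 times, not taken on exit, 10 loops
history = ([True] * 9 + [False]) * 10
print("1-bit:", accuracy(SaturatingCounterPredictor(bits=1), history))
print("2-bit:", accuracy(SaturatingCounterPredictor(bits=2), history))
```

On this pattern the 1-bit scheme mispredicts twice per loop (at loop exit and again at re-entry), while the 2-bit scheme, once warmed up, mispredicts only at loop exit.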
Implicit dynamic technique
• Schemes for accessing the branch target path are also
  used for branch prediction
• Branch Target Access Cache (BTAC)
   holds the most recently used branch addresses
• Branch Target Instruction Cache (BTIC)
   holds the most recently used target instructions
• BTAC or BTIC holds entries only
  for the taken branches
• The existence of an entry means that
   the corresponding branch was taken at its last occurrence,
   so its next occurrence is also guessed as taken
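The implicit technique above can be sketched as a small cache keyed by branch address: a hit means "guess taken and use the stored target", a miss means "guess not taken". This is a toy model only. The class name `ImplicitBTAC`, the capacity, and the replacement rule (evict the oldest entry) are assumptions, not details given in the slides.

```python
class ImplicitBTAC:
    """Toy branch target cache that doubles as an implicit predictor."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # branch address -> branch target address

    def predict(self, branch_addr):
        """Return the cached target (guess taken) or None (guess not taken)."""
        return self.entries.get(branch_addr)

    def resolve(self, branch_addr, taken, target_addr=None):
        """Update the cache once the real branch outcome is known."""
        self.entries.pop(branch_addr, None)  # drop any stale entry
        if taken:
            if len(self.entries) >= self.capacity:
                # Assumed policy: evict the oldest entry (dicts keep insertion order)
                del self.entries[next(iter(self.entries))]
            self.entries[branch_addr] = target_addr


btac = ImplicitBTAC()
print(btac.predict(0x100))   # no entry yet, so the branch is guessed not taken
btac.resolve(0x100, taken=True, target_addr=0x200)
print(btac.predict(0x100))   # entry exists, so the branch is guessed taken
btac.resolve(0x100, taken=False)
print(btac.predict(0x100))   # entry removed after a not-taken outcome
```

Only taken branches ever get an entry, which is what makes the mere presence of an entry usable as a prediction.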
=Implementation alternatives of history bits
Example of the implementation of the BHT
=Combining implicit and 2bit prediction
=The effect of branch accuracy on branch penalty
Simulation results of prediction accuracy on the SPEC
=Extent of speculative processing
Extent of speculative processing: e.g.
=Recovery from a misprediction: Basic Tasks
Necessary activities to allow, or to
shorten, recovery from a misprediction
Frequently employed schemes for
shortening recovery from a misprediction
Shortening recovery from a misprediction: needs
Using two instruction buffers in the SuperSparc
to shorten recovery from a misprediction: e.g.
Using three instruction buffers in the Nx586
to shorten recovery from a misprediction: e.g.
8.4.5 Branch penalty for taken guesses depends on
branch target accessing schemes
-Compute/fetch scheme
for accessing branch targets {IFAR vs. PC}
-BTAC scheme for accessing branch targets {associative
search for BA, if found get BTA} {0-cycle branch: BA=BA-4}
-BTIC scheme: store next BTA
-BTIC scheme: calculate next BTA
-Successor index in the I-cache scheme
to access the branch target path {index: next I, or target I}
Successor index in the I-cache scheme: e.g.
The microarchitecture of the UltraSparc
Predecode unit: detects branches and BTAs, makes predictions
(based on the compiler’s hint bit), sets up the I-cache next address
=Branch target accessing trends
8.5 Multiway branching: {two IFA’s or PC’s}
Threefold multiway branching: only one correct path!
8.6 Guarded Execution
• a means to eliminate branches
• by conditional operate instructions
   IF the condition associated with the instruction is met,
   THEN perform the specified operation
   ELSE do not perform the operation
• e.g. original
          beq r1, label // if (r1) = 0 branch to label
          move r2, r3 // move (r2) into r3
   label: …
• e.g. guarded
           cmovne r1, r2, r3 // if (r1) != 0, move (r2) into r3
           …
• Convert control dependencies into data dependencies
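To make the conversion concrete, the sketch below models both versions of the example in Python and checks that they leave the same value in r3. The register names come from the example above; the function names `with_branch` and `with_guard` are illustrative, and the arithmetic select is only a stand-in for the hardware's conditional move, not a description of how any processor implements it.

```python
def with_branch(r1, r2, r3):
    """Models: beq r1, label / move r2, r3 / label: ..."""
    if r1 == 0:   # beq r1, label: skip the move when (r1) = 0
        return r3
    r3 = r2       # move r2, r3: move (r2) into r3
    return r3


def with_guard(r1, r2, r3):
    """Models: cmovne r1, r2, r3 (no change of control flow)."""
    guard = int(r1 != 0)  # the guard is ordinary data: 1 if (r1) != 0, else 0
    # The result depends on the guard value, not on which path was executed
    return guard * r2 + (1 - guard) * r3


# Both versions agree for zero and non-zero values of r1
for r1 in (-1, 0, 1, 7):
    assert with_branch(r1, 5, 9) == with_guard(r1, 5, 9)
    print(r1, with_guard(r1, 5, 9))
```

In the guarded version the new value of r3 is a function of r1, r2 and the old r3, which is the data dependency that replaces the control dependency.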
Branches eliminated by full and restricted guarding
{full: all instructions guarded, restricted: only ALU instructions guarded}
Guarded Execution: Disadvantages
• guarding transforms instructions from both the taken
  and the not-taken paths into guarded instructions
   this increases the number of instructions
   by 33% for full guarding
   by 8% for restricted guarding
   {more instructions, more time and space}
• guarding requires additional hardware resources
  if an increase in processing time is to be avoided
