					    Compiling for the
    Intel® Itanium™
      Architecture
         Compiler Tricks
        Steve Skedzielewski
         Intel Corporation




    Agenda
     Architecture Principles

     Compiler Bag of Tricks
      – Speculation
      – Predication
      – Branching
      – Loop Generation



     Traditional Architectures:
     Limited Parallelism
     [Diagram: original source code -> compiler (parallelized code) -> sequential machine code -> hardware (parallelized code, multiple functional units)]
     Execution units available, used inefficiently
     Today’s processors are often 60% idle
        Itanium™ Architecture:
        Explicit Parallelism
     [Diagram: original source code -> compiler -> parallel machine code -> hardware (multiple functional units)]
     Itanium™ compiler views wider scope
     More efficient use of execution resources
     Increases parallel execution
    Itanium™ Architecture
    Principles
       Explicit parallelism:
         – Instruction level parallelism (ILP) in machine code
         – Compiler schedules across a wide scope
       Enhanced ILP :
         – Predication, Speculation, Software pipelining, ...
       Compatibility:
         – Across all Itanium™ processor family members
         – IA-32 in hardware and PA-RISC through instruction mapping
       Massive resources:
         – Many registers
         – Many functional units

      Speculation Review
      Traditional Architectures        Itanium™ Architecture
           instr 1                          ld.s
           instr 2                          instr 1
           ...                              instr 2
           br        <- Barrier             br

           Load                             chk.s
           use                              use
      ld.s advances a load, even above a branch
      Memory latency is a major performance bottleneck in today’s systems
       – CPU to memory gap increasing
    Speculating Uses
              Itanium™ Architecture
                     ld.s
                     instr 1
                     instr 2
                     br

                      chk.s
                      use

     Uses of speculative data can also be
     executed speculatively
      – distinguishes speculation from simple prefetch

          Enables Further Parallelism
    Introducing the NaT
    (“Not a Thing”)
            Itanium™ Architecture
                   ld.s     ; exception detection
                   instr 1  ; propagate exception
                   instr 2
                   br

                   chk.s    ; exception delivery
                   use
       NaT is the GR’s 65th bit that indicates:
         – whether or not an exception has occurred
         – when a branch to recovery code is required

       NaT set during ld.s, tested by chk.s
    Propagation
       All computations propagate NaTs, which reduces
        the number of checks
             ld8.s  r3 = (r9)
             ld8.s  r4 = (r10)
             shladd r6 = r3, 3, r4
             ld8.s  r5 = (r6)
             p1,p2 = cmp(...)
             chk.s  r5             ; needs only one chk, on the result
             sub    r7 = r5,r2

       Cmp propagates “false” when writing predicates
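
The propagation rule can be illustrated with a toy model. This is only a sketch under stated assumptions: `NAT`, `MEMORY`, `ld_s`, `shladd`, `chk_s` and `load_chain` are hypothetical Python names, a missing dictionary key stands in for a deferred fault, and the recovery code simply returns a marker instead of redoing the loads.

```python
# Toy model of control speculation with NaT ("Not a Thing") propagation.
# All names here are illustrative; none of them is a real API.

NAT = object()  # stands in for a register whose NaT bit is set

# Hypothetical memory image: address -> value
MEMORY = {0x100: 2, 0x200: 0x1000, 0x1010: 42}


def ld_s(addr):
    """Speculative load: defer a fault by returning NAT instead of raising."""
    if addr is NAT or addr not in MEMORY:
        return NAT
    return MEMORY[addr]


def shladd(a, count, b):
    """(a << count) + b; a NaT in either source propagates to the result."""
    if a is NAT or b is NAT:
        return NAT
    return (a << count) + b


def chk_s(value, recovery):
    """Return the value, or run the recovery code when the value is NaT."""
    return recovery() if value is NAT else value


def load_chain(r9, r10):
    """Mirror of the slide's sequence: three speculative loads, one check."""
    r3 = ld_s(r9)
    r4 = ld_s(r10)
    r6 = shladd(r3, 3, r4)
    r5 = ld_s(r6)
    # A single check on the final result covers all three loads.
    return chk_s(r5, recovery=lambda: "recovered")
```

With this memory image, `load_chain(0x100, 0x200)` follows the pointer chain to the value stored at `0x1010`, while a bad second address makes every dependent value NaT and the single check falls back to recovery.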


    Exception Deferral: More
    Than Skin Deep
       Costly exceptions can be deferred
       OS can control deferral of:
         – Page faults
         – Protection violations
         – …
       NaTs enable deferral with recovery

         Home block          Recovery code
           ld.s
           instr 1
           instr 2
           uses
           br
           chk.s               ld
                               uses
                               br home

            Enables aggressive code motion at
                      compile time
    Store Barrier
        Traditional Architectures
                instr 1
                instr 2
                ...
                Store(*)    Barrier
                Load (*)
                use


      Traditional architectures limited by
               the store barrier
     Introducing Data
     Speculation
     Compiler can issue a load prior to a
     preceding, possibly-conflicting store
Traditional Architectures       Itanium™ Architecture
       instr 1                        ld8.a
       instr 2                        instr 1
        ...                           instr 2
       st8         Barrier            st8

        ld8                            ld.c
        use                            use


              Unique to Itanium™ Architecture
    Data Speculation
     Uses can be speculated

             ld8.a           ld8.a
             instr 1         instr 1
             instr 2         use
             st8             instr 2
                             st8           Recovery code
             ld.c            chk.a          ld8
             use                            uses
                                            br home
        Synergy with control speculation
            increases performance
    Architectural Support for
    Data Speculation
     Instructions
      – ld.a - advanced loads
      – ld.c - check loads
      – chk.a - advanced load checks
     Speculative advanced load - ld.sa - is
      an advanced load with deferral
     ALAT - HW structure containing
      outstanding advanced loads
      Advanced Load Address
      Table - ALAT
         ld.a inserts entries
         Conflicting stores remove entries
           – Also: ld.c.clr, chk.a.clr
         Presence of entry indicates success
           – chk.a branches when no entry is found
         [Diagram: ld.a reg# = ... inserts a (reg #, Address) entry into the table; st is matched against the Address column; chk.a reg# looks up the reg # column]
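
The ALAT rules above can be sketched as a small model. This is an illustration only, under assumptions of mine: the class and method names are hypothetical, entries are keyed by register number with a full address, and memory is a plain dictionary. Details of the real hardware structure (such as how addresses are matched) are not modeled.

```python
class ALAT:
    """Toy Advanced Load Address Table: register number -> address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, addr, memory):
        """Advanced load: read memory early and record (reg, addr)."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """Store: write memory and remove every entry with a matching address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, reg, addr, early_value, memory):
        """Check load: keep the early value if the entry survived, else reload."""
        if reg in self.entries:
            return early_value
        return memory[addr]
```

A store to an unrelated address leaves the entry in place, so the early value is kept; a store to the same address removes the entry, so the check load reads memory again and picks up the stored value.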
    Speculation Benefits
     Reduces impact of memory latency
     Improves code with many cache accesses
      – Large databases
      – Operating systems
     Gives scheduling flexibility
    Agenda
     Architecture Principles

     Compiler Bag of Tricks
      – Speculation
      – Predication
      – Branching
      – Loop Generation



       Predication
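Traditional Architectures               Itanium™ Architecture
[Diagram: on the left, cmp branches to a then path or an else path; on the right, cmp is followed by instructions predicated on p1 and on p2, side by side]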



          Converts branches to conditional execution
            – Executes multiple paths simultaneously
          Exposes parallelism and reduces critical path
            – Better utilizes wider machines
            – Reduces mispredicted branches
    Complex Transformations
            • Mark from SPEC CPU95 130.li
            • Low ILP in each block




              Highly mispredicted branch




           Not your simple if-then-else
    Complex Transformations
                                        Set p1 = true



                         p1              p2

                         p1              p2

                         p1              p2

                         p1
                         set p1 or p2 based upon next path


          • One loop back branch     • Utilizes machine width
             - always taken

      Global control flow reduction
       Upward Code Movement
     cmp.unc.eq p1,p2 = r1,r2             cmp.unc.eq p1,p2 = r1,r2
      :                                    :
(p1) br --> label                          ld.s r4 = [r3]
      :                                    add r5 = r4,1
     ld r4 = [r3]                          :
     add r5 = r4,1                   (p1) br --> label
                                           chk.s r4, rec

                                Speculate both the load and the use



          Depending upon deferral mode, the
             add could cause cache miss
       Upward Code Movement
     cmp.unc.eq p1,p2 = r1,r2        cmp.unc.eq p1,p2 = r1,r2
      :                               :
(p1) br --> label               (p2) ld r4 = [r3]
      :                         (p2) add r5 = r4,1
     ld r4 = [r3]                     :
     add r5 = r4,1              (p1) br --> label

                                Predicate with fall-thru predicate
                                Motion bounded by compare



                    Predication can avoid
                   speculative side effects
Downward Code Movement
    Predication enables downward code movement
    from A to C without compensation code in B
    Use predication to merge sparse code in the
    compensation block with code in the merge block
    [Diagram: main trace A -> C with side block B; then A with a compensation block leading into merge block C]
      Code Motion Tradeoffs
      Slots available in hot path
      Predicate region formation occurs before scheduling
      Predication can pull instructions from lower weight path
      Scheduler can move instructions from above and below
      Solutions
      • Heuristic formation
      • Preschedule information
      • Reverse if-conversion
      [Diagram: blocks A -> (B, C) -> D, labeled with downward code motion at the top and upward code motion at the bottom]
    Introducing Parallel
    Compares
       Three new types of compares:
         – AND: both target predicates set FALSE if compare is false
         – OR: both target predicates set TRUE if compare is true
         – ANDOR: if true, sets one TRUE, sets other FALSE

       [Diagram: on the left, compares A, B, C in a chain ending at D; on the right, A, B, C side by side feeding D]
       Reduces Critical Path
    Method of Use
    Or Predicate
    • Initially clear predicate
    • All true compares will set
    • All false compares do nothing
      0    cmp.unc.ne p1 = r0,r0
      1    cmp.or.eq p1 = 40,r7
           cmp.or.eq p1 = 9,r7

    And Predicate
    • Initially set predicate
    • All true compares do nothing
    • All false compares will clear
      0    cmp.unc.eq   p1 = r0,r0
      1    cmp.and.ge   p1 = 48,r6
           cmp.and.lt   p1 = 58,r6
     Parallel Compare Example
       if (c1 && c2 && c3 && c4)
           then then_code
           else else_code
       Itanium™ Architecture Code
       0        cmp.unc.eq      p1,p2 = r0,0
       1        cmp.and.orcm    p1,p2 = c1
                cmp.and.orcm    p1,p2 = c2
                cmp.and.orcm    p1,p2 = c3
                cmp.and.orcm    p1,p2 = c4
       2   (p1) then_code
           (p2) else_code
       [Diagram: chain of tests c1, c2, c3, c4 leading to then / else]
       Significant control height reduction
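
A small model may help show why the four compares can share one cycle. This is a sketch, assuming the `and.orcm` form clears the first predicate and sets the second whenever its condition is false, and leaves both untouched otherwise; the function names are hypothetical.

```python
from itertools import permutations


def cmp_and_orcm(p1, p2, cond):
    """Assumed and.orcm behaviour: a false condition writes p1=False, p2=True;
    a true condition leaves both predicates unchanged."""
    if not cond:
        return False, True
    return p1, p2


def if_and_chain(conds):
    """Evaluate `if (c1 && c2 && ...)` with parallel compares."""
    p1, p2 = True, False          # step 0: initialize then/else predicates
    for cond in conds:            # step 1: order is irrelevant
        p1, p2 = cmp_and_orcm(p1, p2, cond)
    return p1, p2                 # step 2: (p1) then_code, (p2) else_code


def order_independent(conds):
    """Check that every ordering of the compares gives the same predicates."""
    results = {if_and_chain(order) for order in permutations(conds)}
    return len(results) == 1
```

Every compare that writes anything writes the same pair of values, so the result does not depend on the order in which the compares take effect. That is the property that lets them be issued together instead of as a chain of dependent branches.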
    Predication Benefits
       Reduces branches and mispredict penalties
       Parallel compares further reduce critical paths
       Greatly improves code with hard to predict
        branches
       Works in tandem with speculation
       Traditional architectures’ “bolt-on” approach can’t
        efficiently approximate predication
         – Cmove: 39% more instructions, 23% slower performance*
         – All instructions need predication








                                                * Source: S. Mahlke, 1995
    Agenda
     Architecture Principles

     Compiler Bag of Tricks
      – Speculation
      – Predication
      – Branching
      – Loop Generation



      Branch Instruction
      [Diagram: 128-bit bundle, bits 127 to 0, holding 41-bit slots (Branch, Instruction 1, Instruction 0) and a Template; the branch slot carries a qualifying predicate (QP) and a 21-bit IP-Offset]
         Two basic branch formats
           – Relative: IP := IP + Offset21
           – Indirect: IP := BR[I]
               – 8 branch registers for efficient branch execution
               – Call/Return linking through branch registers
         Loop branches with 64-bit loopcount register (LC)
           – Enables perfect branch prediction of counted loops
           – Traditional architectures always mispredict last iteration
               – Important for low trip count loops

    Branch Predicates
Unconditional branch         (p0) br target;

Conditional branches              cmp p1 = cond
                             (p1) br target;

   Compare and branch can be in same cycle
   Compiler-directed static prediction
    augments dynamic prediction
    – Reduced false mispredicts due to aliasing
    – Frees space in H/W predictor
    – Can give hint for dynamic predictor
     8 Queens Example
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
         Unconditional Compares
     1   R1=&b[j]
         R3=&a[i+j]
         R5=&c[i-j+7]
     2   ld R2=[R1]
         ld.s R4=[R3]
         ld.s R6=[R5]
     4   P1,P2 <-cmp.unc(R2==true)
     5   (p1) chk.s R4
         (p1) P3,P4 <-cmp.unc(R4==true)
     6   (p3) chk.s R6
         (p3) P5,P6 <-cmp.unc(R6==true)
     7   (p5) br then
         else
     [Diagram: 8 queens control flow, a chain of tests on P1/P2, P3/P4, P5/P6 ending in Then / Else]
       Eight Queens Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
                  R1=&b[j]
              1   R3=&a[i+j]
                  R5=&c[i-j+7]
                  p1 <- true
                  ld R2=[R1]
              2   ld R4=[R3]
                  ld R6=[R5]
                  p1,p2 <- cmp.and(R2==true)
              4   p1,p2 <- cmp.and(R4==true)
                  p1,p2 <- cmp.and(R6==true)
                  (p1) br then
                  else



           Major reduction in control flow
      Multi-way Branch
  w/o Speculation (3 branch cycles)
              ld8 r6 = (ra)
              (p1) br exit1
              ld8 r7 = (rb)
              (p3) br exit2
              ld8 r8 = (rc)
              (p5) br exit3

  Hoisting Loads
              ld8.s r6 = (ra)
              ld8.s r7 = (rb)
              ld8.s r8 = (rc)
              chk r6, rec0
              (p1) br exit1
              chk r7, rec1
              (p3) br exit2
              chk r8, rec2
              (p5) br exit3

  Hoisting Loads, branches grouped (1 branch cycle)
              ld8.s r6 = (ra)
              ld8.s r7 = (rb)
              ld8.s r8 = (rc)
              chk r6, rec0
              (p2) chk r7, rec1
              (p4) chk r8, rec2
              }{
              (p1) br exit1
              (p3) br exit2
              (p5) br exit3
              }

  [Diagram: control flow through predicate pairs P1/P2, P3/P4, P5/P6]
         Multi-way branches: more than 1 branch in a single cycle
            Allows n-way branching

             Supports Aggressive Speculation
    Multi-way Branch
    w/o Predication          Predication
         cmp p1, p2 = c1        cmp p1, p2 = c1
         cmp p3, p4 = c2        cmp p3, p4 = c2
         cmp p5, p6 = c3        cmp p5, p6 = c3
          :                      :
          :                      :
         st [r10] =             st [r10] =
     (p1) br exit1          (p2) st [r11] =
         st [r11] =         (p4) st [r12] =
     (p3) br exit2          (p1) br exit1
         st [r12] =         (p3) br exit2
     (p5) br exit3          (p5) br exit3




         Predication and Multi-way increase ILP
    Agenda
     Architecture Principles

     Compiler Bag of Tricks
      – Speculation
      – Predication
      – Branching
      – Loop Generation



    Loop Example
    Convert string to uppercase
    for (i = 0; i < len; i++) {
        if (IS_LOWERCASE(line[i]))
            newline[i] = CNVT_TO_UPPERCASE(line[i]);
        else
            newline[i] = line[i];
    }
    After macro expansion
    for (i = 0; i < len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = line[i] - 32;
        else
            newline[i] = line[i];
    }

                Typical integer-type loop
        Loop Assembly Code
           Traditional Arch
        loop:
 1         ld c = [ra], 1
 2         bgt c, 96 bottom
 3         blt c, 123 bottom
 4         sub c = c,32
        bottom:
 5         st [rb] = c, 1
           blt ra, end loop
        40 cycles for 8 iterations

           Itanium™ Architecture
        loop:
 1              ld        c = [ra], 1
                cmp       p1 = true
 2              cmp.and   p1 = (c > 96)
                cmp.and   p1 = (c < 123)
 3         (p1) sub       c = c,32
                st        [rb] = c, 1
 4              br.cloop  loop
        32 cycles for 8 iterations

            Fewer branches and no mispredictions.
                        Still low ILP.
    Unroll for ILP
      Unroll twice
      • 8 iterations in 33 cycles
      • 1.2x perf. improv.
      • Code size: 2x
      • Won’t gain by unrolling more

      ld c = [ra],1
    loop:
      ld d = [ra],1
      bgt c,115,b1
      blt c,96, b1
      sub c=c,36
    b1:
      st [rb] = c,1
      beq rb,end, exit
      ld c = [ra],1
      bgt d,115,b2
      blt d,96, b2
      sub d=d,36
    b2:
      st [rb] = d,1
      blt rb,end, loop

      Schedule
               ld c
      loop:    ld d   bgt c
                      blt c
               sub
      b1:      st c   beq
               ld c   bgt d
                      blt d
               sub
      b2:      st d   blt
    Software Pipelining
           Overlapping execution of different loop iterations
           [Diagram: iterations executed one after another vs. iterations overlapped]
           Whole loop computation in one cycle
           More iterations in same amount of time
        Software Pipelining
        Cycle
          1   ld
          2   ld cmps
          3   ld cmps   ?sub
          4   ld cmps   ?sub   st    <- Kernel
        [Diagram: Input -> ld -> cmps -> ?sub -> st -> Output]
        Data transferred from one
        functional unit to the next
    Introducing Rotating
    Registers
       GR 32-127, FR32-127 can rotate
       Separate Rotating Register Base for each: GRs, FRs
       Loop branches decrement all register rotating bases (RRB)
       Instructions contain a “virtual” register number
         – RRB + virtual register number = physical register number.
       References
         – “Overlapped Loop Support in the Cydra 5” - Dehnert et al., 1989
         – “Code Generation Schemas for Modulo-Scheduled Loops” -
           Rau et al., MICRO-25, 1992


                  Allows painless transfer of
                     data between stages
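
The renaming rule "RRB + virtual register number = physical register number" can be sketched as a wrap-around mapping. This is a minimal illustration with assumptions of mine: the function name and its `base`/`size` parameters are hypothetical, and the default of 96 registers starting at 32 simply restates the GR 32-127 range from the slide.

```python
def physical_register(virtual, rrb, base=32, size=96):
    """Map a virtual register number to a physical one.

    Registers outside [base, base + size) are treated as static and map to
    themselves; inside the region the RRB offset wraps around.
    """
    if not base <= virtual < base + size:
        return virtual
    return base + (virtual - base + rrb) % size
```

For example, with a six-register rotating region (r32 to r37, the size the pipelined-loop walkthrough draws), `physical_register(34, -1, size=6)` gives 33 and `physical_register(34, -3, size=6)` wraps to 37. Because the loop branch decrements RRB, the physical register written through virtual r34 in one iteration is read through virtual r35 in the next, which is how data moves between stages without copy instructions.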
          Pipelined Loop
          Kernel code
     loop:
s1           ld      r34 = [ra], 1
             cmp     p1 = true
s2           cmp.and p1 = (r35>96)
             cmp.and p1 = (r35<123)
s3      (p1) sub     r36 = r36, 32
s4           st      [rb] = r37, 1
             br.ctop loop

          RRB = 0
          Unit          Physical register file   Virtual register
          ld            r34 = xx                 r34 = xx
          cmp<, cmp>    r35 = xx                 r35 = xx
          sub           r36 = xx                 r36 = xx
          st            r37 = xx                 r37 = xx
 Fill the pipe ...
 Execute prologue stage
 Input: G o _ G r e y h
 RRB = 0
   Unit          Physical register file   Virtual register
   ld            r34 = G                  r34 = G
   cmp<, cmp>    r35 = xx                 r35 = xx
   sub           r36 = xx                 r36 = xx
   st            r37 = xx                 r37 = xx
 Fill the pipe ...
 Perform a loop branch
 • Decrement lc
 • Rotate registers by decrementing RRB
 RRB = 0
   Unit          Physical register file   Virtual register
   ld            r34 = G                  r34 = G
   cmp<, cmp>    r35 = xx                 r35 = xx
   sub           r36 = xx                 r36 = xx
   st            r37 = xx                 r37 = xx
 Fill the pipe ...
 Execute prologue stage
 RRB = -1
   Unit          Physical register file   Virtual register
   ld            r33 = o                  r34 = o
   cmp<, cmp>    r34 = G                  r35 = G
   sub           r35 = xx                 r36 = xx
   st            r36 = xx                 r37 = xx
 Fill the pipe ...
 Execute prologue stage
 Input: G o _ G r e y h
         Kernel code
 loop:
       ld      r34 = [ra], 1
       cmp     p16 = true
       cmp.and p16 = (r35>96)
       cmp.and p16 = (r35<123)
 (p17) sub     r36 = r36, 32
       st      [rb] = r37, 1
       br.ctop loop
 RRB = -2
   Unit          Physical register file   Virtual register
   ld            r32 = _                  r34 = _
   cmp<, cmp>    r33 = o                  r35 = o
   sub           r34 = G                  r36 = G
   st            r35 = xx                 r37 = xx
 Execute the Kernel
 Execute kernel
 Whole iteration per cycle
 Input: G o _ G r e y h
 RRB = -3
   Unit          Physical register file   Virtual register
   ld            r37 = G                  r34 = G
   cmp<, cmp>    r32 = _                  r35 = _
   sub           r33 = o                  r36 = o
   st            r34 = G                  r37 = G
 Output: G
 Execute the Kernel
 Execute kernel
 Whole iteration per cycle
 Input: G o _ G r e y h
 RRB = -4
   Unit          Physical register file   Virtual register
   ld            r36 = r                  r34 = r
   cmp<, cmp>    r37 = G                  r35 = G
   sub           r32 = _                  r36 = _
   st            r33 = O                  r37 = O
 Output: G O
 Execute the Kernel
 Execute kernel
 Whole iteration per cycle
 Input: G o _ G r e y h
 RRB = -5
   Unit          Physical register file   Virtual register
   ld            r35 = e                  r34 = e
   cmp<, cmp>    r36 = r                  r35 = r
   sub           r37 = G                  r36 = G
   st            r32 = _                  r37 = _
 Output: G O
    Pipelining Overhead
        Prologue and Epilogue are bad
        • Code size expansion
        • Overhead not good for low trip count loops - cache performance
        [Diagram: Prologue, Kernel, Epilogue]
        Can we avoid prologue and epilogue?
      Prologue Code
        Cycle
          1     ld
          2     ld cmps
          3     ld cmps   ?sub
          4     ld cmps   ?sub   st    <- Kernel
        Incrementally turn on functional units
      Avoid Pro and Epilogues
      Have enable bit on each functional unit
      Enablers are initialized to off
      Feed through a sequence of bits of length dependent upon
      loop count and pipe depth
      [Diagram: unit enabler bits, labeled Epilogue and Kernel (loop count), feeding the ld, cmp<, cmp>, sub and st units; physical register file r34 to r37 = xx]
    Revisiting Rotating
    Predicate Registers
       PR16-63 can rotate, with a separate Rotating Register Base
       Loop branches decrement all register rotating bases (RRB)
       Instructions contain a “virtual” predicate register number
        – RRB + virtual register number = physical register number.
       Some predicates control pipeline stages: Stage Predicates
       Qualifying Predicates can still be in the loop

                      Complete Loop Code
                      loop:
               s1     (p16)   ld         r34 = [ra], 1
                      (p16)   cmp.unc    p20 = true
               s2     (p17)   cmp.and    p21 = (r35>96)
                      (p17)   cmp.and    p21 = (r35<123)
               s3     (p22)   sub        r36 = r36, 32
               s4     (p19)   st         [rb] = r37, 1
                              br.ctop    loop
    How does this work
    Complete Loop Code
loop:
(p16)   ld         r34 = [ra], 1
(p16)   cmp        p20 = true
(p17)   cmp.and    p21 = (r35>96)
(p17)   cmp.and    p21 = (r35<123)
(p22)   sub        r36 = r36, 32
(p19)   st         [rb] = r37, 1
        br.ctop    loop

    RRB = 0
      Unit          Physical register file
      ld            r34 = G
      cmp<, cmp>    r35 = xx
      sub           r36 = xx
      st            r37 = xx
    [Diagram labels: Stage Predicates, Qualifying Predicate; Epilogue, Kernel]
     Auto Predicate Generation
Initialize
• lc to trip count
• ec to epilogue count
• p16 to true
Loop branches
• Rotate predicates by decrementing RRB
• When lc > 0
     - Decr. lc, set p16=true
• When lc = 0
     - Decr. ec, set p16=false
• Fall through when ec=0
[Diagram: lc, RRB and ec feed the Predicate Generator, which produces p16]
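
The rules above can be exercised with a small simulation of the uppercase loop. This is a sketch under assumptions of mine, not the architected definition of the loop branch: the names are hypothetical, `lc` starts at the trip count minus one and `ec` at the number of stages (chosen so that p16 is true for exactly one load per character), and the last decrement of `ec` is the fall-through. Rotation is modeled by shifting two Python lists instead of adjusting an RRB.

```python
STAGES = 4  # ld, cmps, sub, st


def run_pipelined_uppercase(text):
    """Run the four-stage kernel under rotating stage predicates.

    Returns (converted text, number of kernel trips).
    """
    n = len(text)
    if n == 0:
        return "", 0
    lc, ec = n - 1, STAGES                    # assumed initialization
    preds = [True] + [False] * (STAGES - 1)   # preds[k] models p(16 + k)
    pipe = [None] * STAGES                    # pipe[k]: [char, is_lower] at stage k
    src = iter(text)
    out = []
    trips = 0
    while True:
        trips += 1
        # Kernel: each enabled stage works on a different loop iteration.
        if preds[0]:                          # stage 1: load
            pipe[0] = [next(src), False]
        if preds[1]:                          # stage 2: parallel compares
            pipe[1][1] = "a" <= pipe[1][0] <= "z"
        if preds[2] and pipe[2][1]:           # stage 3: predicated sub
            pipe[2][0] = chr(ord(pipe[2][0]) - 32)
        if preds[3]:                          # stage 4: store
            out.append(pipe[3][0])
        # Loop branch: generate the next p16 and decide whether to continue.
        if lc > 0:
            lc -= 1
            p16, taken = True, True
        elif ec > 1:
            ec -= 1
            p16, taken = False, True
        else:
            ec = 0
            p16, taken = False, False
        # Rotate predicates and registers by one position.
        preds = [p16] + preds[:-1]
        pipe = [None] + pipe[:-1]
        if not taken:
            return "".join(out), trips
```

For the eight-character input used in the walkthrough, the kernel runs 8 + 4 - 1 = 11 times: the first three trips fill the pipe and the last three drain it, all with the same kernel code and no separate prologue or epilogue.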
    Fill the pipe again ...
                                                           RRB = 0
                                                           Physical
          Epilogue           Kernel                       register file

    Complete Loop Code                              ld      r34 = G
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                       cmp<
(p17)   cmp.and    p21 = (r35>96)                           r35 = xx
(p17)   cmp.and    p21 = (r35<123)                 cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                             sub      r36 = xx


                                                    st      r37 = xx



                                Stage Predicates
    Fill the pipe again ...
                                                           RRB = -1
                                                           Physical
             Epilogue          Kernel                     register file

    Complete Loop Code                              ld       r33 = o
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                       cmp<
(p17)   cmp.and    p21 = (r35>96)                           r34 = G
(p17)   cmp.and    p21 = (r35<123)                 cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                             sub      r35 = xx


                                                    st      r36 = xx



                                Stage Predicates
    Fill the pipe again ...
                                                           RRB = -2
                                                           Physical
                  Epilogue     Kernel                     register file

    Complete Loop Code                              ld       r32 = _
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                       cmp<
(p17)   cmp.and    p21 = (r35>96)                            r33 = o
(p17)   cmp.and    p21 = (r35<123)                 cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                             sub      r34 = G


                                                    st      r35 = xx



                                Stage Predicates
    Chunking thru kernel
                                                      RRB = -3
                                                      Physical
                     Epilogue        Kernel          register file

    Complete Loop Code                         ld      r37 = G
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                  cmp<
(p17)   cmp.and    p21 = (r35>96)                       r32 = _
(p17)   cmp.and    p21 = (r35<123)            cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                        sub       r33 = o


                                               st       r34 = G
    Chunking thru kernel
                                                      RRB = -4
                                                      Physical
                         Epilogue    Kernel
                                                     register file

    Complete Loop Code                         ld       r36 = r
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                  cmp<
(p17)   cmp.and    p21 = (r35>96)                      r37 = G
(p17)   cmp.and    p21 = (r35<123)            cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                        sub       r32 = _


                                               st       r33 = O
             G
    Chunking thru kernel
                                               RRB = -5
                                               Physical
                            Epilogue          register file

    Complete Loop Code                  ld       r35 = e
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true           cmp<
(p17)   cmp.and    p21 = (r35>96)                r36 = r
(p17)   cmp.and    p21 = (r35<123)     cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                 sub      r37 = G


                                        st       r32 = _
             G O
    Chunking thru kernel
                                                  RRB = -6
                                                  Physical
                               Epilogue
                                                 register file

    Complete Loop Code                     ld       r34 = y
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true              cmp<
(p17)   cmp.and    p21 = (r35>96)                   r35 = e
(p17)   cmp.and    p21 = (r35<123)        cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                    sub       r36 = r


                                           st       r37 = G
             G O
    Chunking thru kernel
                                                        RRB = -7
                                                        Physical
                                     Epilogue          register file

    Complete Loop Code                           ld       r33 = h
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true                    cmp<
(p17)   cmp.and    p21 = (r35>96)                         r34 = y
(p17)   cmp.and    p21 = (r35<123)              cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                          sub       r35 = e


                                                 st       r36 = r
             G O         G
    Draining the pipe
                                             RRB = -8
                                             Physical
                                            register file

    Complete Loop Code                ld      r32 = xx
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true         cmp<
(p17)   cmp.and    p21 = (r35>96)              r33 = h
(p17)   cmp.and    p21 = (r35<123)   cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop               sub       r34 = Y


                                      st       r35 = E
             G O         G R
    Draining the pipe
                                             RRB = -9
                                             Physical
                                            register file

    Complete Loop Code                ld      r33 = xx
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true         cmp<
(p17)   cmp.and    p21 = (r35>96)             r34 = xx
(p17)   cmp.and    p21 = (r35<123)   cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop               sub       r35 = H


                                      st       r36 = Y
             G O         G R E
    Draining the pipe
                                                 RRB = -10
          Fall through the loop                  Physical
          Don’t rotate                          register file

    Complete Loop Code                    ld      r32 = xx
loop:
(p16)   ld        r34 = [ra], 1
(p16)   cmp.unc   p20 = true             cmp<
(p17)   cmp.and    p21 = (r35>96)                 r33 = xx
(p17)   cmp.and    p21 = (r35<123)       cmp>
(p22)   sub       r36 = r36, 32
(p19)   st        [rb] = r37, 1
        br.ctop   loop                   sub      r34 = xx


                                          st       r35 = H
             G O         G R E       Y
       Example Summary
• 8 iterations in 12 cycles                           RRB = -10
• 2.6x speedup of initial code
                                                      Physical
• 2.75x over unrolled traditional                    register file
• No code expansion
• No mispredicts (4x, 1 10 cycle miss)         ld      r32 = xx
• Minimal register usage
   loop:                                      cmp<
   (p16)   ld        r34 = [ra], 1                     r33 = xx
   (p16)   cmp.unc   p20 = true
                                              cmp>
   (p17)   cmp.and    p21 = (r35>96)
   (p17)   cmp.and    p21 = (r35<123)
   (p22)   sub        r36 = r36, 32           sub      r34 = xx
   (p19)   st         [rb] = r37, 1
           br.ctop   loop
                                               st       r35 = H

                G O         G R E       Y H
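
The walkthrough can be replayed with a short Python model of the kernel. It is a sketch under stated assumptions rather than a cycle-accurate simulator: register and predicate rotation are modelled by shifting Python lists, `lc` is assumed to be set to trip count minus one and `ec` to four stages, and the function name `upcase_pipelined` is invented for this illustration. It counts kernel passes, not machine cycles.

```python
def upcase_pipelined(text):
    """Run the pipelined kernel over text; return (result, kernel passes)."""
    if not text:
        return "", 0
    src = [ord(ch) for ch in text]
    out = []
    gr = [0] * 8            # rotating r32..r39, index = register number - 32
    pr = [False] * 48       # rotating p16..p63, index = predicate number - 16
    pr[0] = True            # p16 = true
    lc, ec = len(src) - 1, 4    # assumed setup: trip count - 1, four stages
    ra = 0
    passes = 0
    while True:
        passes += 1
        if pr[0]:                   # (p16) ld r34 = [ra], 1
            gr[2] = src[ra]
            ra += 1
        pr[4] = pr[0]               # (p16) cmp.unc p20 = true
        if pr[1]:                   # (p17) cmp.and p21 = (r35>96), (r35<123)
            if not gr[3] > 96:
                pr[5] = False
            if not gr[3] < 123:
                pr[5] = False
        if pr[6]:                   # (p22) sub r36 = r36, 32
            gr[4] -= 32
        if pr[3]:                   # (p19) st [rb] = r37, 1
            out.append(gr[5])
        # br.ctop loop: update counters, choose the new p16, then rotate
        if lc > 0:
            lc -= 1
            p63, taken = True, True
        else:
            ec -= 1
            p63, taken = False, ec > 0
        gr = [gr[-1]] + gr[:-1]     # value in r34 is seen as r35 next pass
        pr = [p63] + pr[:-1]        # value written to p63 is seen as p16
        if not taken:
            break
    return "".join(chr(c) for c in out), passes
```

With the slide's input string the model stores the eight characters in order and leaves the underscore untouched, because only values between 97 and 122 keep p21 set.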
    Software Pipelining
   Itanium™ architecture features support SWP
     –   Full Predication
     –   Special branch handling features
     –   Register rotation: removes loop copy overhead
     –   Predicate rotation/generation: removes prologue & epilogue

   Traditional architectures use loop unrolling
     – High overhead: extra code for loop body, prologue, and
       epilogue

         Especially Useful for Integer Code with
           Small Number of Loop Iterations
    Compiler Bag of Tricks
       Predication
        – Removes branches and mispredictions
        – Enables aggressive code motion
        – Parallel compares increase parallelism
       Speculation
        – Hides memory latency
        – Enables aggressive code motion
        – Control speculation over branches
        – Data speculation over stores
        – Compiler-controlled recovery code
    Compiler Bag of Tricks
       Rich branch architecture
        – Multi-way branches increase ILP
        – Loop branches
        – Static direction hints assist prediction
       S/W pipelining support with minimal
        overhead encourages broad usage
        – Performance for small integer loops with
          unknown trip counts as well as monster FP
          loops

    BACKUP




    8 Queens Example
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
             Parallel Compares
    cycle 1:  R1=&b[j]
              R3=&a[i+j]
              R5=&c[i-j+7]
              p1 <- true
    cycle 2:  ld R2=[R1]
              ld R4=[R3]
              ld R6=[R5]
    cycle 4:  p1,p2 <- cmp.and(R2==true)
              p1,p2 <- cmp.and(R4==true)
              p1,p2 <- cmp.and(R6==true)
    cycle 5:  (p1) br then
              else
    [Figure: 8 queens control flow. Node labels P1 to P6; P1 = true leads to Then, P1 = false leads to Else.]


              Reduced from 7 cycles to 5
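
The reason the three compares can issue in the same cycle is that an and-type compare can only clear its target, so the order of the writes does not matter. A minimal Python sketch of that property follows; the helper names `and_compare` and `queens_test` are invented for this illustration, and only p1 is modelled.

```python
from itertools import permutations


def and_compare(pred, relation):
    """and-type compare: leaves the predicate alone or clears it."""
    return pred if relation else False


def queens_test(b_j, a_ij, c_ij):
    """Evaluate the three-way test with p1 preset to true.

    Every ordering of the three compares is tried to show that the
    outcome does not depend on which write lands first.
    """
    outcomes = set()
    for order in permutations([b_j, a_ij, c_ij]):
        p1 = True                      # p1 <- true
        for relation in order:
            p1 = and_compare(p1, relation)
        outcomes.add(p1)
    assert len(outcomes) == 1          # order independent
    return outcomes.pop()
```

The result is the logical AND of the three conditions, which is what the original `if` statement computes with short-circuit branches.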
    Five Predicate Compare
    Types
       (qp) p1,p2 <- cmp.relation
         – if(qp) {p1 = relation; p2 = !relation};
       (qp) p1,p2 <- cmp.relation.unc
         – p1 = qp&relation; p2 = qp&!relation;
       (qp) p1,p2 <- cmp.relation.and
         – if(qp & (relation==FALSE)) { p1=0; p2=0; }
       (qp) p1,p2 <- cmp.relation.or
         – if(qp & (relation==TRUE)) { p1=1; p2=1; }
       (qp) p1,p2 <- cmp.relation.or.andcm
         – if(qp & (relation==TRUE)) { p1=1; p2=0; }
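
The five rules above translate directly into small Python functions. This is only a restatement of the slide's pseudocode for experimentation; the function names are made up here, and each function takes the old predicate values and returns the new `(p1, p2)` pair.

```python
def cmp_normal(qp, relation, p1, p2):
    # Writes both targets only when the qualifying predicate is true.
    return (relation, not relation) if qp else (p1, p2)


def cmp_unc(qp, relation, p1, p2):
    # Unconditional form: always writes, clearing both when qp is false.
    return (qp and relation, qp and not relation)


def cmp_and(qp, relation, p1, p2):
    # Can only clear the targets.
    return (False, False) if qp and not relation else (p1, p2)


def cmp_or(qp, relation, p1, p2):
    # Can only set the targets.
    return (True, True) if qp and relation else (p1, p2)


def cmp_or_andcm(qp, relation, p1, p2):
    # Sets p1 and clears p2 when the relation holds.
    return (True, False) if qp and relation else (p1, p2)
```

The difference between the first two forms shows up when `qp` is false: `cmp_normal` leaves stale values in place, while `cmp_unc` forces both targets to false.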

    Control Speculation
    Summary
       All loads have a speculative form that sets
        the NaT bit when deferring exceptions
       Computational instructions propagate NaTs
       OS controls deferral of faults but supported
        directly in HW - “no-fault speculation”
         – Minimizes overhead of data that is not used
       Chk more effective than non-faulting load
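
The interaction of speculative loads, NaT propagation, and the check can be illustrated with a toy Python model. Everything here is an assumption made for illustration: a missing dictionary key stands in for a faulting address, the `NAT` sentinel stands in for a register with its NaT bit set, and the helper names are invented.

```python
NAT = object()   # stands in for a register whose NaT bit is set


def ld_s(memory, addr):
    """Speculative load: defer a fault by returning NaT instead of raising."""
    return memory.get(addr, NAT)


def add(a, b):
    """Computational instruction: a NaT input produces a NaT result."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Check: run recovery code only if the speculation failed."""
    return recovery() if value is NAT else value


def load_plus_one(memory, addr, taken):
    value = ld_s(memory, addr)        # load hoisted above the branch
    result = add(value, 1)            # dependent work also hoisted
    if not taken:
        return None                   # result unused, so no fault is raised
    # Recovery redoes the work non-speculatively and faults for real.
    return chk_s(result, lambda: memory[addr] + 1)
```

When the guarded path is not taken, a bad address costs nothing, which is the "minimizes overhead of data that is not used" point above.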



     More complex example
Killtime loop in m88ksim
for (i=0; i<32; i++)
  comptime[i] -= MIN(comptime[i], time);


     Initial Loop               Pipelined Loop
     loop:                      loop:
           ld r5=[r10],4         (p16)   ld     r36 = [r10],4
           cmp p1,p2 = r5,r32    (p18)   cmp p21,p23 = r38,r32
     (p1) br side                (p22)   sub r37 = r0,0
           sub r5=r5,r32         (p24)   sub r38 = r38,r32
            st [addr]=r5,4       (p20)   st     [r11] = r40,4
           br   cloop                    br.ctop loop
     side:
           add t=0,r0
           st4 [addr]=t,4
           br   cloop
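
The transformation from the branching loop to the predicated one can be checked with a small Python comparison. This is a sketch with assumed details: the sense of the compare (`<=`) is not visible on the slide, and the `if p1` / `if p2` lines stand for predicated instructions rather than branches.

```python
def killtime_branch(comptime, time):
    """Initial loop: a branch picks between the two stores."""
    out = []
    for value in comptime:
        if value <= time:
            out.append(0)             # side: store zero
        else:
            out.append(value - time)  # fall-through: subtract and store
    return out


def killtime_predicated(comptime, time):
    """If-converted loop: one compare, two complementary predicates."""
    out = []
    for value in comptime:
        p1 = value <= time            # cmp p1,p2 = ...
        p2 = not p1
        result = value
        if p1:                        # (p1) result = 0
            result = 0
        if p2:                        # (p2) result = value - time
            result = value - time
        out.append(result)            # single unconditional store
    return out
```

Both versions compute `value - min(value, time)`; removing the branch is what leaves a single straight-line body that can be software pipelined.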
    Software Pipelining Benefits
       Loop pipelining maximizes performance;
        minimizes overhead
        – High applicability
        – Minimum code size - fewer cache misses
        – Reduced register usage
        – Greater performance improvements in higher
          latency conditions
       Reduced overhead allows S/W pipelining of
        small loops with unknown trip counts
        – Good for integer scalar codes
    Memory Address Modes
     Register Indirect is the only address mode
      –Memory address comes from a General
       Register
      –no add in critical memory access path
     Post-Increment provided for efficient
     address arithmetic
      –can add 9-bit signed immediate value, or a
       value from a general register
      –uses idle ALU resources
      –avoids extra add instructions

         Benefits vector Floating Point Code
    Memory Address Modes
     Load Instructions
      –(qp) ld{1,2,4,8} r1 = [r3]        no post-inc
      –(qp) ld{1,2,4,8} r1 = [r3] , imm9
      –(qp) ld{1,2,4,8} r1 = [r3] , r2
     Store Instructions
      –(qp) st{1,2,4,8} [r3] = r2        no post-inc
      –(qp) st{1,2,4,8} [r3] = r2, imm9
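
The post-increment load form can be modelled in a few lines of Python to show why a loop over an array needs no separate add. The names `ld_imm` and `sum_bytes`, the dictionary-based memory, and the register dictionary are all assumptions made for this sketch.

```python
def ld_imm(memory, regs, dest, base, imm9=0):
    """Model of: ld dest = [base], imm9"""
    if not -256 <= imm9 <= 255:
        raise ValueError("post-increment immediate must fit in 9 signed bits")
    regs[dest] = memory[regs[base]]   # address comes straight from the register
    regs[base] += imm9                # base is updated as part of the same load


def sum_bytes(data):
    """Walk an array using only post-increment loads; return (sum, final base)."""
    memory = dict(enumerate(data))
    regs = {"r1": 0, "r3": 0, "acc": 0}
    for _ in data:
        ld_imm(memory, regs, "r1", "r3", 1)
        regs["acc"] += regs["r1"]
    return regs["acc"], regs["r3"]
```

Each call both fetches the element and advances the pointer, so the loop body contains no explicit address arithmetic.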



				