Single-dimension Software Pipelining for Multi-dimensional Loops
             Hongbo Rong
            Zhizhong Tang
            Alban Douillet
        Ramaswamy Govindarajan
             Guang R. Gao

        Presented by: Hongbo Rong
             IFIP Tele-seminar June 1, 2004
                  Introduction
   Loops and software pipelining are important
   Innermost loops are not enough [Burger&Goodman04]
       Billion-transistor architectures tend to have much more parallelism
   Previous methods for scheduling multi-dimensional loops are meeting new challenges
                                                         2
          Motivating Example
      int U[N1+1][N2+1], V[N1+1][N2+1];
L1:   for (i1=0; i1<N1; i1++) {
L2:    for (i2=0; i2<N2; i2++) {
        a: U[i1+1][i2]=V[i1][i2]+ U[i1][i2];
        b: V[i1][i2+1]=U[i1+1][i2];
      }
      }
 Dependence graph: a → b with distance <0,0>; b → a with distance <0,1>; a → a with distance <1,0>.
 A strong cycle (a → b → a) in the inner loop: no parallelism.
                                                            3
     Loop Interchange Followed by Modulo
         Scheduling of the Inner Loop

 After interchange the distances become: a → b <0,0>; b → a <1,0>; a → a <0,1>.
 (A sketch of the interchanged nest is given below.)

• Why not select a better loop to software pipeline?
  • Which and how?
                                                        4
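For concreteness, here is a minimal C sketch (my reconstruction, not taken from the slides) of the interchanged nest. Legality follows from the transformed distance vectors above: the b → a recurrence is now carried by the new outer loop i2, and the only cycle carried by the new inner loop is a → a with distance 1, so the inner i1 loop can be modulo scheduled.

    /* Interchanged loop nest: i2 is now the outer loop, i1 the inner loop. */
    for (i2 = 0; i2 < N2; i2++) {
      for (i1 = 0; i1 < N1; i1++) {
        U[i1+1][i2] = V[i1][i2] + U[i1][i2];   /* a */
        V[i1][i2+1] = U[i1+1][i2];             /* b */
      }
    }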
          Starting from A Naïve Approach

 [Figure: a schedule that greedily overlaps the outermost iterations i1 = 0..5, starting one new outermost iteration per cycle. Within column i1, a(i1,i2) issues at cycle i1 + 3*i2, followed by the 2-cycle b(i1,i2). With only 2 function units (a: 1 cycle, b: 2 cycles, N2 = 3), this naïve schedule runs into resource conflicts. Dependence graph: a → b <0,0>, b → a <0,1>, a → a <1,0>.]
                                                                                                  5
               Looking from Another Angle

 [Figure: the same schedule viewed as slices. Slice k holds the iteration points with i2 = k-1 across all i1; each slice is software pipelined with initiation interval T = 1 and a kernel of S = 3 stages. Starting every outermost iteration greedily still causes resource conflicts.]
                                                                                                 6
 [Figure: the same schedule (T = 1, kernel with S = 3 stages), annotated with the push-down that SSP applies: the second group of S outermost iterations must be delayed by Delay = (N2-1)*S*T cycles before it can start.]

          SSP
 (Single-dimension
  Software Pipelining)
                                                                                                   7
 [Figure: the SSP final schedule (T = 1, kernel with S = 3 stages). The group of outermost iterations i1 = 3..5 is pushed down by Delay = (N2-1)*S*T cycles, so at most S outermost iterations are active at any time.]

  An iteration point per cycle
  Filling & draining naturally overlapped
  Dependences are still respected!
  Resources fully used
  Data reuse exploited!

          SSP
 (Single-dimension
  Software Pipelining)
                                                                                                         8
                            Loop Rewriting

      int U[N1+1][N2+1], V[N1+1][N2+1];
L1':  for (i1=0; i1<N1; i1+=3) {
        b(i1-1, N2-1)  a(i1, 0)
        b(i1, 0)       a(i1+1, 0)
        b(i1+1, 0)     a(i1+2, 0)
L2':    for (i2=1; i2<N2; i2++) {
          a(i1, i2)                  b(i1+2, i2-1)
          b(i1, i2)    a(i1+1, i2)
          b(i1+1, i2)  a(i1+2, i2)
        }
      }
      b(i1-1, N2-1)
                                                     9
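For concreteness, a minimal C sketch (my expansion, not verbatim from the slide) of the rewritten loop, with a(i,j) and b(i,j) expanded into the original statements. The guard on the first group and the assumption that N1 is a multiple of S = 3 are mine; the real generated code also issues several of these operations in the same cycle, which a sequential C rendering cannot show.

    #define A(i, j) (U[(i)+1][(j)] = V[(i)][(j)] + U[(i)][(j)])  /* a(i,j) */
    #define B(i, j) (V[(i)][(j)+1] = U[(i)+1][(j)])              /* b(i,j) */

    for (i1 = 0; i1 < N1; i1 += 3) {           /* one group of S = 3 outermost iterations */
      if (i1 > 0) B(i1-1, N2-1);               /* drain the last point of the previous group */
      A(i1, 0);
      B(i1, 0);     A(i1+1, 0);
      B(i1+1, 0);   A(i1+2, 0);
      for (i2 = 1; i2 < N2; i2++) {            /* kernel: the repeating pattern */
        A(i1, i2);                  B(i1+2, i2-1);
        B(i1, i2);    A(i1+1, i2);
        B(i1+1, i2);  A(i1+2, i2);
      }
    }
    B(N1-1, N2-1);                             /* final drain */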
            Outline
 Motivation
 Problem Formulation &
  Perspective
 Properties

 Extensions

 Current and Future work

 Code Generation and experiments
                                10
          Problem Formulation
Given a loop nest L composed of n loops L1, …,
  Ln, identify the most profitable loop level Lx
  with 1 <= x <= n, and software pipeline it.
 Which loop to software pipeline?

 How to software pipeline the selected loop?

     How to handle the n-D dependences?
     How to enforce resource constraints?
     How can we guarantee that repeating patterns
      will definitely appear?
                                                     11
Single-dimension Software Pipelining

   A resource-constrained scheduling
    method for loop nests
   Can schedule at an arbitrary level
   Simplify n-D dependences to 1-D
   3 steps
       Loop Selection
       Dependence Simplification and 1-D Schedule
        Construction
       Final schedule computation                   12
                   Perspective
    Which loop to software pipeline?
        Most profitable one in terms of parallelism, data reuse, or others
    How to software pipeline the selected loop?
      Allocate iteration points to slices
      Software pipeline each slice
      Partition slices into groups
      Delay groups until resources available
      (Resource constraints are enforced in two steps.)
                                                        13
        Perspective (Cont.)
   How to handle dependences?
      If a dependence is respected before pushing down the groups, it will be respected afterwards
      Simplify dependences from n-D to 1-D
                                      14
 How to handle dependences?
 [Figure: the schedule again, with dependences within a slice (e.g., a(i1,i2) → b(i1,i2)) and dependences between slices (e.g., b(i1,i2) → a(i1,i2+1)) marked. Both kinds are still respected after pushing down the groups. Dependence graph: a → b <0,0>, b → a <0,1>, a → a <1,0>.]
                                                                                                15
         Simplify n-D Dependences
    Only the first distance component is used when building the 1-D DDG; the components at the deeper levels are ignorable, because the iteration points of one outermost iteration are executed in their original order within its slices.
    For the example: <1,0> becomes <1>, <0,0> becomes <0>, and the inner-loop-carried <0,1> dependence drops out of the 1-D DDG.
                                                                  16
          Step 1: Loop Selection
   Scan each loop.
   Evaluate parallelism
       Recurrence Minimum II (RecMII) from
        the cycles in 1-D DDG
   Evaluate data reuse
       average memory accesses of an S*S tile
        from the future final schedule (optimized
        iteration space).

                                                17
Example: Evaluate Parallelism
 Inner loop (1-D DDG: a →<0> b, b →<1> a):  RecMII = 3
 Outer loop (1-D DDG: a →<1> a, a →<0> b):  RecMII = 1
                                  18
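The RecMII used here is the standard recurrence-constrained lower bound from modulo scheduling, recalled for reference (stated from general knowledge, not copied from the talk):

    \mathrm{RecMII} \;=\; \max_{\text{cycles } C \text{ of the 1-D DDG}} \left\lceil \frac{\sum_{e \in C} \mathrm{delay}(e)}{\sum_{e \in C} \mathrm{distance}(e)} \right\rceil

With a taking 1 cycle and b taking 2 cycles, the inner loop's cycle a → b → a has total delay 3 and total distance 1, giving RecMII = 3; for the outer loop the only recurrence is the self-dependence on a (delay 1, distance 1), giving RecMII = 1.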
 Evaluate Data Reuse
 [Figure: the optimized (final-schedule) iteration space, traversed in groups of S outermost iterations: i1 = 0..S-1, then S..2S-1, …, up to N1-1.]
  Symbolic parameters
      S: total stages
      l: cache line size
  Evaluate data reuse [WolfLam91]
      Localized space = span{(0,1),(1,0)}
      Calculate equivalence classes for the temporal and spatial reuse spaces
      Average accesses = 2/l
                                      19
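One way to arrive at the 2/l figure (my reconstruction, not taken verbatim from the talk): once the temporal reuse along (0,1) and (1,0) is localized, an S x S tile of iteration points touches roughly S^2 distinct elements of U and S^2 of V, and consecutive i2 accesses share cache lines of l elements, so

    \text{average accesses per iteration point} \;\approx\; \frac{2S^2 / l}{S^2} \;=\; \frac{2}{l}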
      Step 2: Dependence Simplification and
            1-D Schedule Construction

    Dependence simplification: the 2-D DDG (a → b <0,0>, b → a <0,1>, a → a <1,0>) is simplified to the 1-D DDG (a → b <0>, a → a <1>).
    1-D schedule construction: build a kernel with initiation interval T that satisfies the modulo property, the dependence constraints, the resource constraints, and the sequential constraints.
                                                                             20
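A sketch of these constraints in standard modulo-scheduling notation (my rendering, with \sigma(o) the schedule time of operation o in outermost iteration 0):

    \sigma(o, i_1) = \sigma(o) + i_1 \cdot T                                   (modulo property)
    \sigma(o') - \sigma(o) \;\ge\; \mathrm{delay}(o \to o') - d \cdot T        (dependence constraint, for a 1-D dependence o \to o' with distance d)

In addition, the resource constraints bound the number of operations issued in any cycle modulo T, and SSP adds the sequential constraints indicated on the slide (not restated here).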
 Final Schedule Computation. Example: a(5,2)
 [Figure: the 1-D modulo schedule of the example, before the group i1 = 3..5 is pushed down.]
   Modulo schedule time = 5
   Distance = i2 * S * T = 2 * 3 * 1 = 6
   Delay = (N2-1) * S * T = (3-1) * 3 * 1 = 6
   Final schedule time = 5 + 6 + 6 = 17
                                                                                                     21
     Step 3: Final Schedule Computation
For any operation o and iteration point I = (i1, i2, …, in):

   f(o, I) = \sigma(o, i_1)                                                      (modulo schedule time)
           + \sum_{x=2}^{n} \Big( i_x \cdot \prod_{y=x+1}^{n} N_y \Big) \cdot S \cdot T
                                       (distance between o(i1, 0, …, 0) and o(i1, i2, …, in))
           + \Big\lfloor i_1 / S \Big\rfloor \cdot \Big( \prod_{x=2}^{n} N_x - 1 \Big) \cdot S \cdot T
                                       (delay from pushing down)
                                                                                22
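To make the formula concrete, a small self-contained C sketch (an illustration I wrote, not the paper's code) that reproduces the worked example on the previous slide for the 2-deep loop nest; the start cycles of a and b (0 and 1) are read off the 1-D schedule shown earlier, and S = 3, T = 1, N2 = 3 as in the example.

    #include <stdio.h>

    enum { S = 3, T = 1, N2 = 3 };              /* stages, initiation interval, inner trip count */

    /* 1-D modulo schedule time sigma(o, i1): a issues at cycle 0, b at cycle 1 in iteration 0. */
    static int sigma(char op, int i1) {
        int start = (op == 'a') ? 0 : 1;
        return start + i1 * T;
    }

    /* Final schedule time f(o, (i1, i2)) for n = 2. */
    static int f(char op, int i1, int i2) {
        int modulo_time = sigma(op, i1);                 /* sigma(o, i1)              */
        int distance    = i2 * S * T;                    /* offset of point (i1, i2)  */
        int delay       = (i1 / S) * (N2 - 1) * S * T;   /* push-down of i1's group   */
        return modulo_time + distance + delay;
    }

    int main(void) {
        printf("f(a, (5,2)) = %d\n", f('a', 5, 2));      /* prints 17, as on the slide */
        return 0;
    }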
            Outline
 Motivation
 Problem Formulation &
  Perspective
 Properties

 Extensions

 Current and Future work

 Code Generation and experiments
                                23
Correctness of the Final Schedule

   Respects the original n-D
    dependences
       Although we use 1-D
        dependences in scheduling
   No resource competition
   Repeating patterns definitely
    appear
                                    24
    Efficiency of the Final Schedule
   Schedule length <= that of the innermost-
    centric approach
       One iteration point per T cycles
       Draining and filling of pipelines naturally
        overlapped
   Execution time: even better
       Data reuse exploited from outermost
        and innermost dimensions

                                                  25
    Relation with Modulo Scheduling

   The classical MS for single
    loops is subsumed as a
    special case of SSP
     No sequential constraints
     f(o,I) = Modulo schedule time (σ(o, i1))




                                                 26
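This follows directly from the final-schedule formula reconstructed above (my observation): for a single loop, n = 1, so the distance term is an empty sum and the delay term contains the empty product \prod_{x=2}^{1} N_x = 1, giving \lfloor i_1/S \rfloor \cdot (1-1) \cdot S \cdot T = 0; hence f(o, I) = \sigma(o, i_1), which is exactly the modulo schedule.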
            Outline
 Motivation
 Problem Formulation &
  Perspective
 Properties

 Extensions

 Current and Future work

 Code Generation and experiments
                                27
    SSP for Imperfect Loop Nest
 Loop selection
 Dependence simplification
  and 1-D schedule
  construction
       Sequential constraints
   Final schedule
                                  28
SSP for Imperfect Loop Nest (Cont.)
 [Figure: final schedule for an imperfect loop nest. Operations a and b belong to the outer loop body (they appear once per i1), while c and d form the innermost loop body. Initiation interval T = 1, kernel with S = 3 stages; the "push from here" marks show where later groups are pushed down.]
                                                                                              29
            Outline
 Motivation
 Problem Formulation &
  Perspective
 Properties

 Extensions

 Current and Future work

 Code Generation and experiments
                                30
Compiler Platform Under Construction
  Base compiler pipeline: the Front End (gfec/gfecc/f90) lowers C/C++/Fortran through Very High, High, Middle, Low, and Very Low WHIRL, with Pre-Loop Selection, Consistency Maintenance, the Middle End, and the Back End along the way.
  SSP flow: Loop Selection → Selected Loop → Dependence Simplification → 1-D DDG → 1-D Schedule Construction → Intermediate kernel → Bundling → Bundled kernel → Register Allocation → Register-allocated kernel → Code generation → Assembly code
                                                                              31
        Current and Future Work
   Register allocation
   Implementation and evaluation
   Interaction and comparison with
    pre-transforming the loop nest
     Unroll-and-jam
     Tiling

     Loop interchange

      Loop skewing and peeling
      …
                                      32
 An (Incomplete) Taxonomy of Software Pipelining

Software Pipelining
       For 1-dimensional loops: modulo scheduling and others
       For n-dimensional loops
          Resource-constrained
             Innermost-loop centric: Outer Loop Pipelining [MuthukumarDoshi01], Hierarchical reduction [Lam88], Pipelining-dovetailing [WangGao96]
             SSP
          Parallelism-oriented: Linear scheduling with constants [DarteEtal00,94], Affine-by-statement scheduling [DarteEtal00,94], Statement-level rational affine scheduling [Ramanujam94], r-periodic scheduling [GaoEtAl93], Juggling problem [DarteEtAl02]
                                                                            33
            Outline
 Motivation
 Problem Formulation &
  Perspective
 Properties

 Extensions

 Current and Future work

 Code Generation and experiments
                                34
                 Code Generation
 Flow: Loop nest in CGIR → SSP → Intermediate Kernel → Register allocation → Register-allocated kernel → Code Generation → Final code

 Problem statement: given the register-allocated kernel generated by the SSP phases and a target architecture, generate the final schedule, while reducing code size and loop control overheads.

 Code generation issues:
   • Register assignment
   • Predicated execution
   • Loop and drain control
   • Generating SSP prolog and epilog
   • Generating SSP outermost loop pattern
   • Generating innermost loop pattern
   • Code-size optimizations
                                                         35
    Code Generation: Challenges
   Multiple repeating patterns
       Code emission algorithms
   Register Assignment
       Lack of multiple rotating register files
            Mix of rotating registers and static register renaming
             techniques
   Loop and drain control
       Predicated execution
       Loop counters
       Branch instructions
   Code size increase
       Code compression techniques
                                                                      36
           Experiments: Setup
   Stand-alone module at assembly level.
   Software-pipelining using Huff's modulo-scheduling.
   SSP kernel generation & register allocation by
    hand.
   Scheduling algorithms: MS, xMS, SSP, CS-SSP
   Other optimizations: unroll-and-jam, loop tiling
   Benchmarks: MM, HD, LU, SOR
   Itanium workstation 733MHz, 16KB/96KB/2MB/2GB



                                                      37
Experiments: Relative Speedup




 Speedup between 1.1 and 4.24, average 2.1.
 Better performance : better parallelism and/or better data reuse.
 Code-size optimized version performs as well as original version.
 Code duplication and code size do not degrade performance.
                                                                      38
 Experiments: Bundle Density




 Bundle density measures the average number of non-NOP instructions in a bundle.
 Average: MS/xMS: 1.90, SSP: 1.91, CS-SSP: 2.1
 CS-SSP produces a denser code.
 CS-SSP makes better use of available resources.
                                                                   39
Experiments: Relative Code Size




 SSP code is between 3.6 and 9.0 times bigger than MS/xMS code.
 CS-SSP code is between 2 and 6.85 times bigger than MS/xMS code.
 The increase is due to the multiple patterns and the code duplication in the innermost loop.
 However, the entire code (~4KB) easily fits in the L1 instruction cache.
                                                                     40
          Acknowledgement
    Prof. Bogong Su, Dr. Hongbo Yang
   Anonymous reviewers
   Chan, Sun C.
   NSF, DOE agencies




                                     41
                 Appendix
   The following slides are for the detailed
    performance analysis of SSP.




                                                42
Exploiting Parallelism from the Whole
            Iteration Space




                            (Matrix size is N*N)
     Represents a class of important applications
     Strong dependence cycle in the innermost loop
     The middle loop has a negative dependence, but it can be removed.   43
Exploiting Data Reuse from the
    Whole Iteration Space




                                 44
    Advantage of Code Generation
 [Figure: speedup as a function of problem size N.]
                                   45
Exploiting Parallelism from the
Whole Iteration Space (Cont.)




    Both have dependence cycles in the innermost loop
                                                        46
Exploiting Data Reuse from the
    Whole Iteration Space




                                 47
Exploiting Data Reuse from the
Whole Iteration Space (Cont.)




                                 48
Exploiting Data Reuse from the
Whole Iteration Space (Cont.)




              (Matrix size is jn*jn)
                                       49
              Advantage of Code Generation




 [Figure: speedup as a function of problem size N.]
    SSP considers all operations when constructing the 1-D schedule, and thus
     effectively offsets the overhead of the operations outside the
     innermost loop                                                50
    Performance Analysis from L2 Cache Misses
 [Figure: L2 cache misses relative to MS, for MS, xMS, SSP, and CS-SSP, on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT, and jik+T.]
                                                                                         51
    Performance Analysis from L3 Cache Misses
 [Figure: L3 cache misses relative to MS, for MS, xMS, SSP, and CS-SSP, on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT, and jik+T.]
                                                                                         52
Comparison with Linear Schedule
    Linear schedules
        Traditionally applied to multiprocessing, systolic arrays, etc., not to uniprocessors
        Parallelism-oriented; they do not consider
             Fine-grain resource constraints
             Register usage
             Data reuse
        Code generation
             Values are communicated through memory, message passing, etc.
                                                       53
   Optimized Iteration Space of a Linear Schedule
 [Figure: the optimized iteration space of a linear schedule (cycle vs. i1 = 0..9).]
                                                      54