# Single-dimension Software Pipelining for Multi-dimensional Loops

Hongbo Rong
Zhizhong Tang
Alban Douillet
Ramaswamy Govindarajan
Guang R. Gao

Presented by: Hongbo Rong
IFIP Tele-seminar June 1, 2004
Introduction

- Loops and software pipelining are important
- Innermost loops are not enough [Burger&Goodman04]
- Billion-transistor architectures tend to have much more parallelism
- Previous methods for scheduling multi-dimensional loops are meeting new challenges

Motivating Example

```
int U[N1+1][N2+1], V[N1+1][N2+1];
L1: for (i1=0; i1<N1; i1++) {
L2:   for (i2=0; i2<N2; i2++) {
a:      U[i1+1][i2] = V[i1][i2] + U[i1][i2];
b:      V[i1][i2+1] = U[i1+1][i2];
      }
    }
```

Dependences: a → b <0,0>, b → a <0,1>, a → a <1,0>. The cycle a → b → a is a strong cycle carried by the inner loop: no parallelism there.

Loop Interchange Followed by Modulo Scheduling of the Inner Loop

After interchange the distances become: a → b <0,0>, b → a <1,0>, a → a <0,1>.

- Why not select a better loop to software pipeline?
- Which one, and how?

Starting from a Naïve Approach

[Figure: naïvely software pipelining the outer loop, for i1 = 0..5 with N2 = 3, two function units, a taking 1 cycle and b taking 2 cycles. Overlapping every outer iteration at once causes resource conflicts (slots marked ---).]

Looking from Another Angle

[Figure: the same schedule viewed as slices (Slice 1, 2, 3). Initiation interval T = 1; kernel with S = 3 stages. Iteration points are allocated to slices and each slice is software pipelined; the resource conflicts remain.]

SSP (Single-dimension Software Pipelining)

[Figure: the sliced schedule with T = 1 and a kernel of S = 3 stages. Each group of S outer iterations is pushed down by Delay = (N2-1)*S*T cycles relative to the previous group.]

SSP (Cont.)

[Figure: the final SSP schedule after pushing down, with T = 1, S = 3 stages, and Delay = (N2-1)*S*T.]

- An iteration point per cycle
- Filling & draining naturally overlapped
- Dependences are still respected!
- Resources fully used
- Data reuse exploited!

Loop Rewriting

```
int U[N1+1][N2+1], V[N1+1][N2+1];
L1': for (i1=0; i1<N1; i1+=3) {
       b(i1-1, N2-1)  a(i1, 0)
                      b(i1, 0)    a(i1+1, 0)
                      b(i1+1, 0)  a(i1+2, 0)
L2':   for (i2=1; i2<N2; i2++) {
         a(i1, i2)    b(i1+2, i2-1)
         b(i1, i2)    a(i1+1, i2)
         b(i1+1, i2)  a(i1+2, i2)
       }
     }
     b(i1-1, N2-1)
```

Outline

- Motivation
- Problem Formulation & Perspective
- Properties
- Extensions
- Current and Future Work
- Code Generation and Experiments

Problem Formulation

Given a loop nest L composed of n loops L1, ..., Ln, identify the most profitable loop level Lx, with 1 <= x <= n, and software pipeline it.

- Which loop to software pipeline?
- How to software pipeline the selected loop?
  - How to handle the n-D dependences?
  - How to enforce resource constraints?
  - How can we guarantee that repeating patterns will definitely appear?

Single-dimension Software Pipelining

- A resource-constrained scheduling method for loop nests
- Can schedule at an arbitrary level
- Simplifies n-D dependences to 1-D
- Three steps:
  1. Loop selection
  2. Dependence simplification and 1-D schedule construction
  3. Final schedule computation

Perspective

- Which loop to software pipeline?
  - The most profitable one in terms of parallelism, data reuse, or other criteria
- How to software pipeline the selected loop?
  - Allocate iteration points to slices
  - Software pipeline each slice
  - Partition slices into groups
  - Delay groups until resources are available
  (The last two steps enforce resource constraints.)

Perspective (Cont.)

- How to handle dependences?
  - If a dependence is respected before pushing down the groups, it will be respected afterwards
  - Simplify dependences from n-D to 1-D

How to Handle Dependences?

[Figure: the final schedule, distinguishing dependences within a slice from dependences between slices. Both kinds are still respected after pushing down. DDG: a → b <0,0>, b → a <0,1>, a → a <1,0>.]

Simplify n-D Dependences

[Figure: only the first distance component is useful, e.g. the dependence from (i1, 0, ..., 0) to (i1+1, 0, ..., 0); the remaining components are ignorable. So <1,0> becomes <1>, while <0,0> and <0,1> both become <0>.]

Step 1: Loop Selection

- Scan each loop
- Evaluate parallelism
  - Recurrence Minimum II (RecMII) from the cycles in the 1-D DDG
- Evaluate data reuse
  - Average memory accesses of an S*S tile from the future final schedule (optimized iteration space)

Example: Evaluate Parallelism

Inner loop: RecMII = 3 (cycle a → b → a with distances <0> and <1>).
Outer loop: RecMII = 1 (recurrence a → a with simplified distance <1>).

Example: Evaluate Data Reuse [WolfLam91]

[Figure: an S*S tile of the optimized iteration space. Symbolic parameters: S = total stages, l = cache line size. The localized space = span{(0,1),(1,0)}; equivalence classes are calculated for temporal and spatial reuse; average accesses = 2/l.]

Step 2: Dependence Simplification and 1-D Schedule Construction

- Dependence simplification: <1,0> → <1>; <0,0> and <0,1> → <0>
- 1-D schedule construction, subject to:
  - the modulo property (initiation interval T)
  - sequential constraints
  - resource constraints
  - dependence constraints

[Figure: the kernel, with a and b staggered across stages, illustrating the modulo property.]

Final Schedule Computation

[Figure: computing the final schedule time of a(5,2). Modulo schedule time = 5; distance = i2 * S * T = 2 * 3 * 1 = 6; delay = (N2-1) * S * T = (3-1) * 3 * 1 = 6. Final schedule time = 5 + 6 + 6 = 17.]

Step 3: Final Schedule Computation

For any operation o and iteration point I = (i1, i2, ..., in):

f(o, I) = σ(o, i1)                                      (modulo schedule time)
        + ( Σ_{2≤x≤n} i_x · Π_{x<y≤n} N_y ) · S · T     (distance between o(i1, 0, ..., 0) and o(i1, i2, ..., in))
        + ⌊i1/S⌋ · ( Π_{2≤x≤n} N_x − 1 ) · S · T        (delay from pushing down)

Outline

- Motivation
- Problem Formulation & Perspective
- Properties
- Extensions
- Current and Future Work
- Code Generation and Experiments

Correctness of the Final Schedule

- Respects the original n-D dependences
  - Although we use 1-D dependences in scheduling
- No resource competition
- Repeating patterns definitely appear

Efficiency of the Final Schedule

- Schedule length <= the innermost-centric approach
  - One iteration point per T cycles
  - Draining and filling of pipelines naturally overlapped
- Execution time: even better
  - Data reuse exploited from the outermost and innermost dimensions

Relation with Modulo Scheduling

- Classical MS for single loops is subsumed as a special case of SSP
  - No sequential constraints
  - f(o, I) = modulo schedule time σ(o, i1)

Outline

- Motivation
- Problem Formulation & Perspective
- Properties
- Extensions
- Current and Future Work
- Code Generation and Experiments

SSP for Imperfect Loop Nests

- Loop selection
- Dependence simplification and 1-D schedule construction
  - Sequential constraints
- Final schedule

SSP for Imperfect Loop Nests (Cont.)

[Figure: SSP schedule for an imperfect loop nest with outer-loop operations a, b and inner-loop operations c, d, over i1 = 0..5. Initiation interval T = 1; kernel with S = 3 stages; groups are pushed down from the marked points, as in the perfect-nest case.]

Outline

- Motivation
- Problem Formulation & Perspective
- Properties
- Extensions
- Current and Future Work
- Code Generation and Experiments

Compiler Platform Under Construction

[Figure: an Open64-style compilation flow. The front end (gfec/gfecc/f90) lowers C/C++/Fortran through Very High, High, Middle, Low, and Very Low WHIRL to the back end. The SSP phases are: loop selection (selected loop) → dependence simplification (1-D DDG) → 1-D schedule construction (intermediate kernel) → bundling (bundled kernel) → register allocation (register-allocated kernel) → code generation (assembly code), supported by pre-loop selection and consistency maintenance.]

Current and Future Work

- Register allocation
- Implementation and evaluation
- Interaction and comparison with pre-transforming the loop nest
  - Unroll-and-jam
  - Tiling
  - Loop interchange
  - Loop skewing and peeling
  - ...

An (Incomplete) Taxonomy of Software Pipelining

- For 1-dimensional loops: modulo scheduling and others
- For n-dimensional loops:
  - Resource-constrained:
    - Innermost-loop centric:
      - Outer loop pipelining [MuthukumarDoshi01]
      - Hierarchical reduction [Lam88]
      - Pipelining-dovetailing [WangGao96]
    - SSP
  - Parallelism-oriented:
    - Linear scheduling with constants [DarteEtal00,94]
    - Affine-by-statement scheduling [DarteEtal00,94]
    - Statement-level rational affine scheduling [Ramanujam94]
    - r-periodic scheduling [GaoEtAl93]
    - Juggling problem [DarteEtAl02]

Outline

- Motivation
- Problem Formulation & Perspective
- Properties
- Extensions
- Current and Future Work
- Code Generation and Experiments

Code Generation

Problem statement: given a register-allocated kernel generated by SSP and a target architecture, generate the final schedule, while reducing code size and loop overhead.

Flow: loop nest in CGIR → SSP → intermediate kernel → register allocation → register-allocated kernel → code generation.

Code generation issues:
- Register assignment
- Predicated execution
- Loop and drain control
- Generating prolog and epilog
- Generating the outermost loop pattern
- Generating the innermost loop pattern
- Code-size optimizations

Code Generation: Challenges

- Multiple repeating patterns
  - Code emission algorithms
- Register assignment
  - Lack of multiple rotating register files
  - Mix of rotating registers and static register renaming techniques
- Loop and drain control
  - Predicated execution
  - Loop counters
  - Branch instructions
- Code size increase
  - Code compression techniques

Experiments: Setup

- Stand-alone module at the assembly level
- Software pipelining using Huff's modulo scheduling
- SSP kernel generation & register allocation by hand
- Scheduling algorithms: MS, xMS, SSP, CS-SSP
- Other optimizations: unroll-and-jam, loop tiling
- Benchmarks: MM, HD, LU, SOR
- Itanium workstation, 733 MHz, 16KB/96KB/2MB/2GB

Experiments: Relative Speedup

- Speedup between 1.1 and 4.24, average 2.1
- Better performance comes from better parallelism and/or better data reuse
- The code-size-optimized version performs as well as the original version
- Code duplication and code size do not degrade performance

Experiments: Bundle Density

- Bundle density measures the average number of non-NOP instructions in a bundle
- Averages: MS/xMS 1.90, SSP 1.91, CS-SSP 2.1
- CS-SSP produces denser code
- CS-SSP makes better use of available resources

Experiments: Relative Code Size

- SSP code is between 3.6 and 9.0 times bigger than MS/xMS
- CS-SSP code is between 2 and 6.85 times bigger than MS/xMS
- The growth comes from the multiple patterns and the code duplication in the innermost loop
- However, the entire code (~4KB) easily fits in the L1 instruction cache

Acknowledgement

- Prof. Bogong Su, Dr. Hongbo Yang
- Anonymous reviewers
- Chan, Sun C.
- The NSF and DOE agencies

Appendix

The following slides give a detailed performance analysis of SSP.

Exploiting Parallelism from the Whole Iteration Space

(Matrix size is N*N)
- Represents a class of important applications
- Strong dependence cycle in the innermost loop
- The middle loop has a negative dependence, but it can be removed

Exploiting Data Reuse from the Whole Iteration Space

[Figure: speedup as a function of matrix size N.]

Exploiting Parallelism from the Whole Iteration Space (Cont.)

Both have dependence cycles in the innermost loop.

Exploiting Data Reuse from the Whole Iteration Space (Cont.)

[Figures: data-reuse results; matrix size is jn*jn.]


[Figure: speedup as a function of N.]

SSP considers all operations in constructing the 1-D schedule, and thus effectively offsets the overhead of operations outside the innermost loop.

Performance Analysis from L2 Cache Misses

[Figure: L2 cache misses relative to MS (0 to 4.5) for MS, xMS, SSP, and CS-SSP across ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT, and jik+T.]

Performance Analysis from L3 Cache Misses

[Figure: L3 cache misses relative to MS (0 to 1.2) for the same scheduling algorithms and benchmark configurations.]

Comparison with Linear Schedules

- Linear schedules target systolic arrays, etc., not uniprocessors
- Parallelism-oriented; they do not consider:
  - Fine-grain resource constraints
  - Register usage
  - Data reuse
  - Code generation
- Values are communicated through memory, message passing, etc.

Optimized Iteration Space of a Linear Schedule

[Figure: the optimized iteration space of a linear schedule, cycles plotted against i1 = 0..9.]