# Fine-Grained and Coarse-Grained Parallelism

Chapters 5 & 6 of Allen & Kennedy, *Optimizing Compilers for Modern Architectures*
CSC 2/455, Fall 2004
Lecturer: Yutao Zhong, URCS

## Overview

- Theme: the use of dependence to automatically parallelize sequential code
- Outline
  - Fine-grained parallelism: vectorization
    - loop interchange
    - scalar expansion
    - scalar and array renaming
    - node splitting
  - Coarse-grained parallelism: multiple asynchronous processors
    - loop interchange
## Review of codegen (Ch. 2)

```
procedure codegen(R, k, D);
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected
    regions in the dependence graph D restricted to R;
  construct Rpi from R by reducing each Si to a single node and
    compute Dpi, the dependence graph naturally induced on Rpi by D;
  let {pi1, pi2, ..., pim} be the m nodes of Rpi numbered in an order
    consistent with Dpi (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pii is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges
        in D that are at level k+1 or greater and are internal to pii;
      codegen(pii, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pii in r(pii)-k+1 dimensions,
        where r(pii) is the number of loops containing pii;
  end
end
```

- Basic idea: find all the possible parallelism by loop distribution and statement reordering
- May not work as desired
  - not effective enough
    - loop interchange to increase parallelism
  - cyclic dependences exist
    - scalar expansion
    - scalar & array renaming
    - node splitting
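The graph machinery that codegen relies on can be sketched in Python. This is a minimal illustration (not the book's implementation): find the strongly connected regions of a dependence graph, reduce each to a single node, and number the condensed nodes consistently with the dependences via a topological sort. Graphs are plain dicts mapping a node to its set of successors.

```python
# Sketch of codegen's graph steps: SCCs, condensation, topological order.

def sccs(graph):
    """Tarjan's algorithm: return a list of SCCs (each a list of nodes)."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return out

def condense_and_order(graph):
    """Reduce each SCC to a single node and topologically order them."""
    comps = sccs(graph)
    comp_of = {v: i for i, comp in enumerate(comps) for v in comp}
    succs = {i: set() for i in range(len(comps))}
    for v, ws in graph.items():
        for w in ws:
            if comp_of[v] != comp_of[w]:
                succs[comp_of[v]].add(comp_of[w])
    # Kahn's algorithm gives the numbering consistent with Dpi.
    indeg = {i: 0 for i in succs}
    for ws in succs.values():
        for w in ws:
            indeg[w] += 1
    ready = [i for i, d in indeg.items() if d == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(comps[i])
        for w in succs[i]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return order

# S1 -> S2 -> S3 -> S2 has the cycle {S2, S3}; codegen would wrap that
# region in a sequential DO loop and can vectorize the acyclic node S1.
g = {"S1": {"S2"}, "S2": {"S3"}, "S3": {"S2"}}
print(condense_and_order(g))   # S1 first, then the cyclic region {S2, S3}
```

Cyclic regions (components with more than one node, or a self-edge) get a sequential loop and a recursive call at level k+1; acyclic nodes become vector statements.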

## Loop interchange

```fortran
      DO I = 1, N
        DO J = 1, M
S         A(I,J+1) = A(I,J) + B        ! DV(I,J) = (=,<)
        ENDDO
      ENDDO
```

- After interchanging the I-loop and the J-loop:

```fortran
      DO J = 1, M
        DO I = 1, N
S         A(I,J+1) = A(I,J) + B        ! DV(J,I) = (<,=)
        ENDDO
      ENDDO
```

- Vectorization:

```fortran
      DO J = 1, M
S       A(1:N,J+1) = A(1:N,J) + B
      ENDDO
```

## Loop interchange: safety

- Loop interchange is a reordering transformation
- Not all loop interchanges are legal (Theorem 2.3):

```fortran
      DO J = 1, M
        DO I = 1, N
          A(I,J+1) = A(I+1,J) + B      ! DV(J,I) = (<,>)
        ENDDO
      ENDDO
```

- Theorem 5.1: let D be a direction vector for a dependence in a perfect loop nest; the direction vector of the same dependence after any loop permutation is determined by applying the same permutation to the elements of D
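The legality of the first interchange can be checked empirically. The sketch below (not from the book) runs the nest in both loop orders on the same data and compares the results; since the dependence is carried by the J-loop in both orders, the values agree.

```python
# Empirical check that interchanging the I- and J-loops above is legal.

N, M = 5, 4

def original(A, B):
    for I in range(1, N + 1):
        for J in range(1, M + 1):
            A[I][J + 1] = A[I][J] + B
    return A

def interchanged(A, B):
    for J in range(1, M + 1):
        for I in range(1, N + 1):
            A[I][J + 1] = A[I][J] + B
    return A

def fresh():
    # 1-based indexing padded with an extra row/column, as in Fortran.
    return [[float(i * 10 + j) for j in range(M + 2)] for i in range(N + 2)]

assert original(fresh(), 1.0) == interchanged(fresh(), 1.0)
print("interchange preserves semantics")
```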

## Loop interchange: safety (continued)

```fortran
      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
            A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
          ENDDO
        ENDDO
      ENDDO
```

- The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row. For the nest above:

```
        I  J  K
        <  <  =
        <  =  >
```

- Theorem 5.2: a permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row
- Interchange-preventing dependences and interchange-sensitive dependences
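Theorem 5.2 translates directly into code. The following sketch (an illustration, not the book's algorithm) permutes the columns of a direction matrix and rejects any permutation that leaves a ">" as the leftmost non-"=" entry in some row.

```python
# Legality test for a loop permutation, per Theorem 5.2.

def legal_permutation(direction_matrix, perm):
    """perm[i] is the original loop index placed at nest position i."""
    for row in direction_matrix:
        for i in perm:
            if row[i] == "<":
                break            # leftmost non-"=" is "<": row is fine
            if row[i] == ">":
                return False     # leftmost non-"=" is ">": illegal
    return True

# Direction matrix from the example above (columns I, J, K).
DM = [("<", "<", "="),
      ("<", "=", ">")]

print(legal_permutation(DM, (0, 1, 2)))  # original order I,J,K -> True
print(legal_permutation(DM, (2, 0, 1)))  # K outermost: row 2 leads with ">" -> False
```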

## Loop interchange for vectorization

- Motivation
  - If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
  - Shift loops that carry no dependences inward
- Theorem 5.3: in a perfect loop nest, loops that carry no dependence can legally be shifted inward and will not carry any dependences in their new position

## Loop shifting: example (I)

```fortran
      DO I = 1, N
        DO J = 1, N
          DO K = 1, N
S           A(I,J) = A(I,J) + B(I,K)*C(K,J)
          ENDDO
        ENDDO
      ENDDO
```

- DV(I,J,K) = (=,=,<)
- Applying codegen: no vectorization possible

## Loop shifting: example (II)

- After shifting the K-loop outward:

```fortran
      DO K = 1, N
        DO I = 1, N
          DO J = 1, N
S           A(I,J) = A(I,J) + B(I,K)*C(K,J)
          ENDDO
        ENDDO
      ENDDO
```

- Applying codegen:

```fortran
      DO K = 1, N
        FORALL J = 1, N
          A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
        END FORALL
      ENDDO
```

## Loop interchange: profitability

- The architecture of the target machine is usually the principal factor:
  - SIMD
  - vector register machine
  - MIMD
    - will see more in Chapter 6
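The K-shift can be sanity-checked the same way as before. This sketch (not from the book) runs the matrix-multiply nest in IJK and KIJ order; because the accumulation into A(I,J) is a sum, the two orders produce identical integer results.

```python
# Empirical check that shifting the dependence-free K-loop outward
# preserves the matrix product.

N = 4
B = [[(i + 2 * j) % 5 for j in range(N)] for i in range(N)]
C = [[(3 * i + j) % 7 for j in range(N)] for i in range(N)]

def ijk():
    A = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                A[i][j] += B[i][k] * C[k][j]
    return A

def kij():   # K-loop shifted outward
    A = [[0] * N for _ in range(N)]
    for k in range(N):
        for i in range(N):
            for j in range(N):
                A[i][j] += B[i][k] * C[k][j]
    return A

assert ijk() == kij()
print("K-shift preserves the product")
```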

## Codegen revised

```
if pii is cyclic then
  if k is the deepest loop in pii
    then try_recurrence_breaking(pii, D, k)   // scalar expansion,
                                              // scalar renaming,
                                              // array renaming,
                                              // node splitting
  else begin
    select_loop_and_interchange(pii, D, k);
    generate a level-k DO statement;
    let Di be the dependence graph consisting of
      all dependence edges in D that are at level
      k+1 or greater and are internal to pii;
    codegen(pii, k+1, Di);
    generate the level-k ENDDO statement
  end
end
```

## Scalar expansion

```fortran
      DO I = 1, N
S1      T = A(I)
S2      A(I) = B(I)
S3      B(I) = T
      ENDDO
```

- Scalar expansion:

```fortran
      DO I = 1, N
S1      T$(I) = A(I)
S2      A(I) = B(I)
S3      B(I) = T$(I)
      ENDDO
      T = T$(N)
```

- Vectorization:

```fortran
S1    T$(1:N) = A(1:N)
S2    A(1:N) = B(1:N)
S3    B(1:N) = T$(1:N)
      T = T$(N)
```
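The effect of scalar expansion on this loop can be demonstrated directly. In the sketch below (not from the book), the scalar T, which carries an anti-dependence from S3 back to S1, is expanded into an array, after which each statement runs as a whole-array operation; the final scalar value is recovered from the last element.

```python
# Scalar expansion on the swap loop above: loop form vs. vector form.

N = 6
A0 = [float(i) for i in range(N)]
B0 = [float(10 + i) for i in range(N)]

def original(A, B):
    A, B = A[:], B[:]
    for i in range(N):
        T = A[i]         # S1
        A[i] = B[i]      # S2
        B[i] = T         # S3
    return A, B, T

def expanded(A, B):
    A, B = A[:], B[:]
    Texp = A[:]            # S1: T$(1:N) = A(1:N)   (vector copy)
    A = B[:]               # S2: A(1:N)  = B(1:N)
    B = Texp[:]            # S3: B(1:N)  = T$(1:N)
    return A, B, Texp[-1]  # T = T$(N)

assert original(A0, B0) == expanded(A0, B0)
print("scalar expansion preserves semantics")
```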

## Scalar expansion: safety & profitability

- Always safe to apply
- Not always profitable:

```fortran
      DO I = 1, N
        T = T + A(I) + A(I-1)
        A(I) = T
      ENDDO
```

- After scalar expansion:

```fortran
      T$(0) = T
      DO I = 1, N
S1      T$(I) = T$(I-1) + A(I) + A(I-1)
S2      A(I) = T$(I)
      ENDDO
      T = T$(N)
```

[Figure: dependence graphs between S1 and S2 before and after expansion; the level-1 true dependence through T$(I-1) remains, so the recurrence is not broken]

## Scalar expansion: profitability

- Dependences due to reuse of a memory location vs. reuse of values
  - dependences due to reuse of values must be preserved
  - dependences due to reuse of a memory location can be deleted by expansion
- Naïve approach
- Better approach: detect deletable edges first, then apply scalar expansion only if the new graph has vectorizable statements
  - covering definition

## Scalar expansion: drawbacks

- Increased memory consumption
- Solutions:
  - expand in a single loop
  - strip-mine the loop before expansion
  - forward substitution:

```fortran
      DO I = 1, N
        T = A(I) + A(I+1)
        A(I) = T + B(I)
      ENDDO
```

becomes

```fortran
      DO I = 1, N
        A(I) = A(I) + A(I+1) + B(I)
      ENDDO
```

## Scalar renaming: example

```fortran
      DO I = 1, 100
S1      T = A(I) + B(I)
S2      C(I) = T + T
S3      T = D(I) - B(I)
S4      A(I+1) = T * T
      ENDDO
```

- After renaming T:

```fortran
      DO I = 1, 100
S1      T1 = A(I) + B(I)
S2      C(I) = T1 + T1
S3      T2 = D(I) - B(I)
S4      A(I+1) = T2 * T2
      ENDDO
```

- Vectorization:

```fortran
S3    T2$(1:100) = D(1:100) - B(1:100)
S4    A(2:101) = T2$(1:100) * T2$(1:100)
S1    T1$(1:100) = A(1:100) + B(1:100)
S2    C(1:100) = T1$(1:100) + T1$(1:100)
      T = T2$(100)
```
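The renaming example can be checked in the same empirical style. In this sketch (not from the book), splitting T's two unrelated lifetimes into T1 and T2 lets the statements be reordered (S3, S4 before S1, S2) and executed as whole-array operations without changing the values computed.

```python
# Scalar renaming: original loop vs. renamed, reordered, vectorized form.

N = 8

def data():
    A = [float(i) for i in range(N + 2)]
    B = [float(2 * i + 1) for i in range(N + 1)]
    C = [0.0] * (N + 1)
    D = [float(3 * i) for i in range(N + 1)]
    return A, B, C, D

def original():
    A, B, C, D = data()
    for i in range(1, N + 1):
        T = A[i] + B[i]          # S1
        C[i] = T + T             # S2
        T = D[i] - B[i]          # S3
        A[i + 1] = T * T         # S4
    return A, C

def renamed_vectorized():
    A, B, C, D = data()
    # After renaming, S3/S4 no longer share T with S1/S2, so the loop
    # distributes into vector statements in the order S3, S4, S1, S2.
    T2 = [D[i] - B[i] for i in range(1, N + 1)]        # S3
    for i in range(1, N + 1):                          # S4
        A[i + 1] = T2[i - 1] * T2[i - 1]
    T1 = [A[i] + B[i] for i in range(1, N + 1)]        # S1
    for i in range(1, N + 1):                          # S2
        C[i] = T1[i - 1] + T1[i - 1]
    return A, C

assert original() == renamed_vectorized()
print("renaming + reordering preserves semantics")
```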

## Scalar renaming

- The renaming algorithm partitions all definitions and uses of a scalar S into equivalence classes, each of which can occupy a different memory location (Fig. 5.12)
  - def-use graph
  - reaching-definitions analysis
- Renaming works by removing a critical loop-independent anti- or output dependence to break a cycle

## Array renaming

```fortran
      DO I = 1, N
S1      A(I) = A(I-1) + X
S2      Y(I) = A(I) + Z
S3      A(I) = B(I) + C
      ENDDO
```

- After renaming A:

```fortran
      DO I = 1, N
S1      A$(I) = A(I-1) + X
S2      Y(I) = A$(I) + Z
S3      A(I) = B(I) + C
      ENDDO
```

- Vectorization:

```fortran
S3    A(1:N) = B(1:N) + C
S1    A$(1:N) = A(0:N-1) + X
S2    Y(1:N) = A$(1:N) + Z
```
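Array renaming can be verified the same way. In this sketch (not from the book), giving S1's definition of A the new name A$ removes the output and anti-dependences on A, after which S3, S1, S2 run as three vector statements in that order.

```python
# Array renaming: original loop vs. renamed, vectorized form.

N = 8
X, Z, C = 1.0, 2.0, 3.0

def data():
    A = [float(5 * i) for i in range(N + 1)]   # A(0:N)
    B = [float(i * i) for i in range(N + 1)]
    Y = [0.0] * (N + 1)
    return A, B, Y

def original():
    A, B, Y = data()
    for i in range(1, N + 1):
        A[i] = A[i - 1] + X      # S1
        Y[i] = A[i] + Z          # S2
        A[i] = B[i] + C          # S3
    return A, Y

def renamed_vectorized():
    A, B, Y = data()
    Ar = [0.0] * (N + 1)                                     # A$
    A[1:N + 1] = [B[i] + C for i in range(1, N + 1)]         # S3
    Ar[1:N + 1] = [A[i - 1] + X for i in range(1, N + 1)]    # S1
    Y[1:N + 1] = [Ar[i] + Z for i in range(1, N + 1)]        # S2
    return A, Y

assert original() == renamed_vectorized()
print("array renaming preserves semantics")
```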

## Node splitting

```fortran
      DO I = 1, N
S1      A(I) = X(I+1) + X(I)
S2      X(I+1) = B(I) + 10
      ENDDO
```

- Renaming does not work because the two dependences share one single access: X(I+1)
  - renaming tries to give both name spaces the original array name
- Solution: create a copy of the node from which the critical anti-dependence emanates

## Node splitting: example

- After node splitting:

```fortran
      DO I = 1, N
S1'     X$(I) = X(I+1)
S1      A(I) = X$(I) + X(I)
S2      X(I+1) = B(I) + 10
      ENDDO
```

- Vectorization:

```fortran
S1'   X$(1:N) = X(2:N+1)
S2    X(2:N+1) = B(1:N) + 10
S1    A(1:N) = X$(1:N) + X(1:N)
```
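The split can be checked empirically as well. In the sketch below (not from the book), copying X(I+1) into a new array X$ before it is overwritten breaks the anti-dependence from S1 to S2, after which the statements run as vector operations in the order S1', S2, S1.

```python
# Node splitting: original loop vs. split, vectorized form.

N = 8

def data():
    X = [float(7 * i + 1) for i in range(N + 2)]   # X(1:N+1) used
    B = [float(2 * i) for i in range(N + 1)]
    A = [0.0] * (N + 1)
    return X, B, A

def original():
    X, B, A = data()
    for i in range(1, N + 1):
        A[i] = X[i + 1] + X[i]   # S1
        X[i + 1] = B[i] + 10     # S2
    return X, A

def split_vectorized():
    X, B, A = data()
    Xs = [0.0] * (N + 1)                                  # X$
    Xs[1:N + 1] = [X[i + 1] for i in range(1, N + 1)]     # S1'
    X[2:N + 2] = [B[i] + 10 for i in range(1, N + 1)]     # S2
    A[1:N + 1] = [Xs[i] + X[i] for i in range(1, N + 1)]  # S1
    return X, A

assert original() == split_vectorized()
print("node splitting preserves semantics")
```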

## Node splitting: profitability

- Not always profitable. For example:

```fortran
      DO I = 1, N
S1      A(I) = X(I+1) + X(I)
S2      X(I+1) = A(I) + 10
      ENDDO
```

- After splitting as before:

```fortran
      DO I = 1, N
S1'     X$(I) = X(I+1)
S1      A(I) = X$(I) + X(I)
S2      X(I+1) = A(I) + 10
      ENDDO
```

- The recurrence is not broken
- Why? The target dependence must be critical!

## Fine-grained parallelism: summary

- The revised codegen driver applies loop selection and interchange to cyclic regions, and tries recurrence breaking (scalar expansion, scalar renaming, array renaming, node splitting) at the deepest loop of a cyclic region

## Coarse-grained parallelism

- Target machine: symmetric multiprocessor
  - multiple processors with a shared memory
  - parallelism employed by creating and executing a process on each processor
  - expensive overhead: process initiation and synchronization
- Parallelism concern for high performance:
  - find and package parallelism with a granularity large enough to compensate for the overhead
  - delicate trade-off between overhead minimization and load balancing

## Basics

- Sequential loop vs. parallel loop
  - a sequential loop carries a dependence
  - iterations of a parallel loop can be correctly run in any order
- Statement PARALLEL DO
  - represents a parallel loop whose iterations can be distributed across different processors
- Synchronization
  - barrier: forces all processes to reach a certain point before execution continues
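The PARALLEL DO idea can be sketched in Python (an illustration, not from the book): iterations of a dependence-free loop are distributed over a thread pool, and waiting for the pool to finish plays the role of the barrier that the following sequential code relies on.

```python
# A parallel loop over independent iterations, with an implicit barrier.

from concurrent.futures import ThreadPoolExecutor

N = 16
A = [float(i) for i in range(N + 1)]
B = [2.0] * (N + 1)

def body(i):
    # One iteration: no iteration reads a value another iteration
    # writes, so any execution order (and any interleaving) is correct.
    A[i] = A[i] + B[i]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(body, range(1, N + 1)))   # barrier: wait for all iterations

print(A[1], A[N])
```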

## Loop interchange: revisited

- Previously (fine-grained): move loops inward to vectorize more
- Now (coarse-grained): move dependence-free loops outward to generate large enough parallel units

```fortran
      DO I = 1, N
        DO J = 1, M           ! DV(I,J) = (<,=)
          A(I+1,J) = A(I,J) + B(I,J)
        ENDDO                 ! N barriers needed if J is parallelized
      ENDDO
```

## Loop interchange: example

- After interchanging the I-loop and the J-loop:

```fortran
      PARALLEL DO J = 1, M    ! DV(J,I) = (=,<)
        DO I = 1, N
          A(I+1,J) = A(I,J) + B(I,J)
        ENDDO
      END PARALLEL DO         ! 1 barrier needed
```

## Loop interchange: profitability

- Not always possible to move a parallel loop outward and have it remain free of dependence
- Example:

```fortran
      DO J = 1, N
        DO I = 1, N           ! DV(J,I) = (<,<)
          A(I+1,J+1) = A(I,J) + B(I,J)
        ENDDO
      ENDDO
```

  - the best we can do: parallelize the inner loop
- Theorem 6.3: in a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that loop contains only "=" entries

## Loop interchange: algorithm

1. Move any loop with only "=" entries into the outermost position and parallelize it; remove its column from the matrix
2. Move the loop with the most "<" entries into the next outermost position and sequentialize it; eliminate its column and any rows representing dependences it carries
3. Repeat the algorithm starting with step 1

## Loop selection and interchange

```fortran
      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
S1          A(I+1,J,K) = A(I,J,K) + X1
S2          B(I,J,K+1) = B(I,J,K) + X2
S3          C(I+1,J+1,K+1) = C(I,J,K) + X3
          ENDDO
        ENDDO
      ENDDO
```

- Direction matrix:

```
         I  J  K
    S1   <  =  =
    S2   =  =  <
    S3   <  <  <
```
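The three-step algorithm above can be sketched directly from the direction matrix. This is an illustration under the stated steps, not the book's code: parallelize any loop whose column is all "=" (Theorem 6.3), otherwise sequentialize the loop with the most "<" entries and drop the rows it now carries.

```python
# Loop-selection algorithm driven by the direction matrix.

def select_loops(loops, rows):
    """loops: list of loop names; rows: list of direction vectors.
    Returns the chosen nest order as (name, 'parallel'|'sequential')."""
    loops, rows = list(loops), [list(r) for r in rows]
    order = []
    while loops:
        # Step 1: a loop with an all-"=" column can go outermost
        # and be parallelized (Theorem 6.3).
        par = [j for j in range(len(loops))
               if all(r[j] == "=" for r in rows)]
        if par:
            j = par[0]
            order.append((loops[j], "parallel"))
        else:
            # Step 2: sequentialize the loop with the most "<" entries
            # and discard the rows (dependences) it now carries.
            j = max(range(len(loops)),
                    key=lambda c: sum(r[c] == "<" for r in rows))
            order.append((loops[j], "sequential"))
            rows = [r for r in rows if r[j] != "<"]
        loops.pop(j)
        rows = [r[:j] + r[j + 1:] for r in rows]   # remove the column
    return order

# The example nest: rows are the direction vectors of S1, S2, S3.
print(select_loops(["I", "J", "K"],
                   [["<", "=", "="],
                    ["=", "=", "<"],
                    ["<", "<", "<"]]))
# -> [('I', 'sequential'), ('J', 'parallel'), ('K', 'sequential')]
```

On the example matrix this reproduces the nest on the next slide: I sequential outermost (it carries the S1 and S3 dependences), then J parallel, then K sequential.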

## Loop selection and interchange (result)

```fortran
      DO I = 1, N
        PARALLEL DO J = 1, M
          DO K = 1, L
S1          A(I+1,J,K) = A(I,J,K) + X1
S2          B(I,J,K+1) = B(I,J,K) + X2
S3          C(I+1,J+1,K+1) = C(I,J,K) + X3
          ENDDO
        END PARALLEL DO
      ENDDO
```

- General case: NP-complete

## Summary

- Transformations to break recurrences:
  - scalar expansion
  - scalar/array renaming
  - node splitting
- Transformations to increase parallelism:
  - loop interchange
    - fine-grained: vectorization
    - coarse-grained: multiprocessor
