Fine-Grained and Coarse-Grained Parallelism

Chapters 5 & 6 of Allen & Kennedy: Optimizing Compilers for Modern Architectures
CSC 2/455, Fall 2004
Lecturer: Yutao Zhong


Overview
  • Theme: the use of dependence to automatically parallelize sequential code
  • Outline
      - Fine-grained parallelism: vectorization
          · loop interchange
          · scalar expansion
          · scalar and array renaming
          · node splitting
      - Coarse-grained parallelism: multiple asynchronous processors
          · loop interchange

URCS, CS2/455 2004




Review of codegen (Ch2)

procedure codegen(R, k, D);
    // R is the region for which we must generate code.
    // k is the minimum nesting level of possible parallel loops.
    // D is the dependence graph among statements in R.
    find the set {S_1, S_2, ..., S_m} of maximal strongly-connected
        regions in the dependence graph D restricted to R;
    construct R_pi from R by reducing each S_i to a single node and
        compute D_pi, the dependence graph naturally induced on R_pi by D;
    let {pi_1, pi_2, ..., pi_m} be the m nodes of R_pi numbered in an order
        consistent with D_pi (use topological sort to do the numbering);
    for i = 1 to m do begin
        if pi_i is cyclic then begin
            generate a level-k DO statement;
            let D_i be the dependence graph consisting of all dependence
                edges in D that are at level k+1 or greater and are
                internal to pi_i;
            codegen(pi_i, k+1, D_i);
            generate the level-k ENDDO statement;
        end
        else
            generate a vector statement for pi_i in r(pi_i)-k+1 dimensions,
                where r(pi_i) is the number of loops containing pi_i;
    end
end


Review of codegen (Ch2)
  • Basic idea: find all the possible parallelism by loop distribution and
    statement reordering
  • May not work as desired
      - not effective
          · loop interchange to increase parallelism
      - cyclic dependences exist
          · scalar expansion
          · scalar & array renaming
          · node splitting
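The recursion in codegen can be sketched in Python. This is only an illustration under stated assumptions, not the lecture's or the book's implementation: dependence edges are given as hypothetical (source, sink, level) triples with `inf` standing for a loop-independent edge, the output is a list of text lines rather than real code, and the three-statement loop in the comment is an invented example.

```python
from math import inf

def regions_in_topological_order(nodes, edges):
    """Maximal strongly connected regions, ordered consistently with the edges."""
    succ = {n: set() for n in nodes}
    for src, dst, _level in edges:
        succ[src].add(dst)

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    regions, placed = [], set()
    for n in nodes:
        if n not in placed:
            region = [m for m in nodes
                      if m == n or (m in reach[n] and n in reach[m])]
            regions.append(region)
            placed.update(region)

    ordered = []
    while regions:
        # Pick a region that no other remaining region can reach.
        ready = next(r for r in regions
                     if not any(r[0] in reach[o[0]] for o in regions if o is not r))
        ordered.append(ready)
        regions.remove(ready)
    return ordered

def codegen(region, k, edges, depth, out):
    """Append a textual schedule for `region` at nesting level k to `out`."""
    internal = [(s, d, l) for s, d, l in edges if s in region and d in region]
    for pi in regions_in_topological_order(region, internal):
        cyclic = len(pi) > 1 or any(s == d for s, d, _ in internal if s in pi)
        indent = "  " * (k - 1)
        if cyclic:
            out.append(f"{indent}DO level {k}")
            deeper = [(s, d, l) for s, d, l in internal
                      if s in pi and d in pi and l >= k + 1]
            codegen(pi, k + 1, deeper, depth, out)
            out.append(f"{indent}ENDDO")
        else:
            dims = depth[pi[0]] - k + 1      # r(pi_i) - k + 1
            out.append(f"{indent}{pi[0]}: vector statement in {dims} dimension(s)")
    return out

# Assumed example (not from the slides):
#   DO I = 1, N
#   S1  A(I+1) = A(I) + B(I)      S1 -> S1, carried at level 1
#   S2  C(I)   = A(I+1) * 2       S1 -> S2, loop independent
#   S3  D(I)   = C(I-1) + 1       S2 -> S3, carried at level 1
#   ENDDO
STATEMENTS = ["S3", "S2", "S1"]   # deliberately out of order
DEPTH = {"S1": 1, "S2": 1, "S3": 1}
EDGES = [("S1", "S1", 1), ("S1", "S2", inf), ("S2", "S3", 1)]
schedule = codegen(STATEMENTS, 1, EDGES, DEPTH, [])
print("\n".join(schedule))
```

In this sketch S1 stays inside a sequential level-1 loop (a statement in 0 dimensions is a scalar statement), while S2 and S3 are distributed out of the loop and become one-dimensional vector statements.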




Loop interchange

    DO I = 1, N
      DO J = 1, M
S       A(I,J+1) = A(I,J) + B            DV(I,J) = (=,<)
      ENDDO
    ENDDO

  • After interchanging I-loop and J-loop

    DO J = 1, M
      DO I = 1, N
S       A(I,J+1) = A(I,J) + B            DV(J,I) = (<,=)
      ENDDO
    ENDDO

  • Vectorization

    DO J = 1, M
S     A(1:N,J+1) = A(1:N,J) + B
    ENDDO


Loop interchange: safety
  • Loop interchange is a reordering transformation
  • Not all loop interchanges are legal (Theorem 2.3)

    DO J = 1, M
      DO I = 1, N
        A(I,J+1) = A(I+1,J) + B          DV(J,I) = (<,>)
      ENDDO
    ENDDO

  • Theorem 5.1: let D be a direction vector for a dependence in a perfect
    loop nest. The direction vector of the same dependence after any loop
    permutation is determined by applying the same permutation to the
    elements of D.
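As a sanity check on the first example (direction vector (=,<)), the following Python sketch runs the original nest, the interchanged nest, and a column-at-a-time version that mimics the vector statement, then compares the results. Python lists stand in for Fortran arrays; the array sizes, the initial data, and the function names are assumptions made for illustration.

```python
def make_array(n, m):
    # 1-based indexing: row 0 and column 0 are unused padding.
    return [[(3 * i + 7 * j) % 11 for j in range(m + 2)] for i in range(n + 1)]

def original(a, n, m, b):
    for i in range(1, n + 1):              # DO I
        for j in range(1, m + 1):          #   DO J
            a[i][j + 1] = a[i][j] + b
    return a

def interchanged(a, n, m, b):
    for j in range(1, m + 1):              # DO J
        for i in range(1, n + 1):          #   DO I
            a[i][j + 1] = a[i][j] + b
    return a

def vectorized(a, n, m, b):
    for j in range(1, m + 1):
        # Vector semantics: read A(1:N,J) in full, then store A(1:N,J+1).
        column = [a[i][j] + b for i in range(1, n + 1)]
        for i in range(1, n + 1):
            a[i][j + 1] = column[i - 1]
    return a

N, M, B = 4, 5, 2
results = [f(make_array(N, M), N, M, B) for f in (original, interchanged, vectorized)]
print(results[0] == results[1] == results[2])
```

All three versions agree here because the only dependence runs along J, which stays sequential in every variant.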




Loop interchange: safety

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
          A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
        ENDDO
      ENDDO
    ENDDO

    Direction matrix:
        <  <  =
        <  =  >

  • The direction matrix for a nest of loops is a matrix in which each row
    is a direction vector for some dependence between statements contained
    in the nest and every such direction vector is represented by a row.


Loop interchange: safety
  • Theorem 5.2: A permutation of the loops in a perfect nest is legal if
    and only if the direction matrix, after the same permutation is applied
    to its columns, has no ">" direction as the leftmost non-"=" direction
    in any row.

        I  J  K
        <  <  =
        <  =  >

  • Interchange-preventing dependence and interchange-sensitive dependence
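Theorems 5.1 and 5.2 translate almost directly into a small check. The sketch below is illustrative only: the function names and the encoding of directions as the strings "<", "=", ">" are assumptions. It permutes the columns of the direction matrix from the I, J, K example and lists the loop orders that pass the test.

```python
from itertools import permutations

def permute(vector, order):
    """Theorem 5.1: apply a loop permutation to one direction vector.
    order[p] is the original position of the loop placed at position p."""
    return tuple(vector[src] for src in order)

def is_legal(matrix, order):
    """Theorem 5.2: after permuting the columns, no row may have '>' as its
    leftmost non-'=' entry."""
    for row in matrix:
        for direction in permute(row, order):
            if direction == "<":
                break              # leftmost non-'=' is '<': row is fine
            if direction == ">":
                return False
    return True

LOOPS = "IJK"
MATRIX = [("<", "<", "="), ("<", "=", ">")]
legal = ["".join(LOOPS[p] for p in order)
         for order in permutations(range(3)) if is_legal(MATRIX, order)]
print(legal)   # ['IJK', 'IKJ', 'JIK']
```

Every order that places K ahead of I is rejected, because the second row would then start with ">".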




Loop interchange for vectorization
  • Motivation
      - If we deal with loops containing cyclic dependences early on in the
        loop nest, we can potentially vectorize more loops
      - Inward-shifting loops that carry no dependences
  • Theorem 5.3: In a perfect loop nest, a loop that carries no dependence
    can legally be shifted inward, and it will not carry any dependences in
    its new position.


Loop shifting: example (I)

    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
S         A(I,J) = A(I,J) + B(I,K)*C(K,J)
        ENDDO
      ENDDO
    ENDDO

  • Applying codegen: no vectorization possible
  • DV(I,J,K) = (=,=,<)




Loop shifting: example (II)
  • After shifting K-loop outward:

    DO K = 1, N
      DO I = 1, N
        DO J = 1, N
S         A(I,J) = A(I,J) + B(I,K)*C(K,J)
        ENDDO
      ENDDO
    ENDDO

  • Applying codegen:

    DO K = 1, N
      FORALL J = 1, N
        A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
      END FORALL
    ENDDO


Loop interchange: profitability
  • The architecture of the target machine is usually the principal factor:
      - SIMD
      - Vector register machine
      - MIMD
          · will see more in Chapter 6
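The matrix-multiply example can be checked numerically. In this Python sketch (0-based lists instead of 1-based Fortran arrays, integer test data chosen arbitrarily, function names hypothetical), the K-outermost version updates each column of A as a whole, in the manner of the FORALL body, and the result is compared with the original I, J, K nest.

```python
def matmul_original(a, b, c, n):
    for i in range(n):                     # DO I
        for j in range(n):                 #   DO J
            for k in range(n):             #     DO K
                a[i][j] += b[i][k] * c[k][j]
    return a

def matmul_k_outermost(a, b, c, n):
    for k in range(n):                     # sequential K-loop, now outermost
        for j in range(n):                 # FORALL J
            # A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J): compute the whole
            # right-hand side first, then store it.
            new_column = [a[i][j] + b[i][k] * c[k][j] for i in range(n)]
            for i in range(n):
                a[i][j] = new_column[i]
    return a

def zeros(n):
    return [[0] * n for _ in range(n)]

N = 4
B = [[i + 2 * k for k in range(N)] for i in range(N)]
C = [[3 * k - j for j in range(N)] for k in range(N)]
reference = matmul_original(zeros(N), B, C, N)
shifted = matmul_k_outermost(zeros(N), B, C, N)
print(reference == shifted)
```

The only dependence is carried by K, so once K is outermost and sequential, both I and J are free of carried dependences and can be executed as vector operations.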




Codegen revised

    if pi_i is cyclic then
        if k is the deepest loop in pi_i
            then try_recurrence_breaking(pi_i, D, k)
        else begin
            select_loop_and_interchange(pi_i, D, k);
            generate a level-k DO statement;
            let D_i be the dependence graph consisting of
                all dependence edges in D that are at level
                k+1 or greater and are internal to pi_i;
            codegen(pi_i, k+1, D_i);
            generate the level-k ENDDO statement
        end
    end

  • Recurrence-breaking transformations: scalar expansion, scalar renaming,
    array renaming, node splitting


Scalar expansion

    DO I = 1, N
S1    T = A(I)
S2    A(I) = B(I)
S3    B(I) = T
    ENDDO

  • Scalar expansion:

    DO I = 1, N
S1    T$(I) = A(I)
S2    A(I) = B(I)
S3    B(I) = T$(I)
    ENDDO
    T = T$(N)

  • Vectorization:

S1  T$(1:N) = A(1:N)
S2  A(1:N) = B(1:N)
S3  B(1:N) = T$(1:N)
    T = T$(N)
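A quick way to convince oneself that the expanded, vectorized form of the swap loop is equivalent is to run both. The Python sketch below uses list copies to play the role of the array statements; the function names and sample data are made up for the illustration.

```python
def swap_with_scalar(a, b):
    a, b = a[:], b[:]
    t = None
    for i in range(len(a)):
        t = a[i]            # S1
        a[i] = b[i]         # S2
        b[i] = t            # S3
    return a, b, t

def swap_expanded_vectorized(a, b):
    t_exp = a[:]            # S1: T$(1:N) = A(1:N)
    a = b[:]                # S2: A(1:N)  = B(1:N)
    b = t_exp[:]            # S3: B(1:N)  = T$(1:N)
    t = t_exp[-1]           # T = T$(N) restores the scalar's final value
    return a, b, t

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
print(swap_with_scalar(A, B) == swap_expanded_vectorized(A, B))
```

The trailing assignment T = T$(N) matters only if T is live after the loop; the sketch keeps it so that both versions return the same final scalar.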




Scalar expansion: safety & profitability
  • Always safe to apply
  • Not always profitable

    DO I = 1, N
      T = T + A(I) + A(I-1)
      A(I) = T
    ENDDO

      - after scalar expansion:

    T$(0) = T
    DO I = 1, N
S1    T$(I) = T$(I-1) + A(I) + A(I-1)
S2    A(I) = T$(I)
    ENDDO
    T = T$(N)

    [Figure: dependence graph between S1 and S2, with edges labeled
     δ1, δ∞, δ∞^-1 and δ1]


Scalar expansion: profitability
  • Dependences due to reuse of memory location vs. reuse of values
      - Dependences due to reuse of values must be preserved
      - Dependences due to reuse of memory location can be deleted by
        expansion
  • Naïve approach
  • Better approach: detecting deletable edges first, then applying scalar
    expansion only if the new graph has vectorizable statements
      - covering definition




Scalar expansion: drawbacks
  • Increased memory consumption
  • Solutions:
      - Expand in a single loop
      - Strip mine loop before expansion
      - Forward substitution:

    DO I = 1, N
      T = A(I) + A(I+1)
      A(I) = T + B(I)
    ENDDO

        becomes

    DO I = 1, N
      A(I) = A(I) + A(I+1) + B(I)
    ENDDO


Scalar renaming: example

    DO I = 1, 100
S1    T = A(I) + B(I)
S2    C(I) = T + T
S3    T = D(I) - B(I)
S4    A(I+1) = T * T
    ENDDO

  • After renaming T:

    DO I = 1, 100
S1    T1 = A(I) + B(I)
S2    C(I) = T1 + T1
S3    T2 = D(I) - B(I)
S4    A(I+1) = T2 * T2
    ENDDO

  • Vectorization:

S3  T2$(1:100) = D(1:100) - B(1:100)
S4  A(2:101) = T2$(1:100) * T2$(1:100)
S1  T1$(1:100) = A(1:100) + B(1:100)
S2  C(1:100) = T1$(1:100) + T1$(1:100)
    T = T2$(100)
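The statement order in the vectorized renaming example (S3, S4, S1, S2) is easy to get wrong, so here is a small Python check. It is a sketch with assumptions: lists are padded so that indices match the 1-based Fortran code, the loop bound is shortened from 100 to a parameter `n`, and the input data are arbitrary.

```python
def original(a, b, d, n):
    a, c, t = a[:], [0] * (n + 1), None
    for i in range(1, n + 1):
        t = a[i] + b[i]            # S1
        c[i] = t + t               # S2
        t = d[i] - b[i]            # S3
        a[i + 1] = t * t           # S4
    return a, c, t

def renamed_vectorized(a, b, d, n):
    a, c = a[:], [0] * (n + 1)
    idx = range(1, n + 1)
    t2 = [None] + [d[i] - b[i] for i in idx]     # S3: T2$ = D - B
    for i in idx:                                # S4: A(2:N+1) = T2$ * T2$
        a[i + 1] = t2[i] * t2[i]
    t1 = [None] + [a[i] + b[i] for i in idx]     # S1: reads the updated A
    for i in idx:                                # S2: C = T1$ + T1$
        c[i] = t1[i] + t1[i]
    return a, c, t2[n]                           # T = T2$(N)

N = 6
A = [None] + [2 * i + 1 for i in range(1, N + 2)]   # A(1:N+1)
B = [None] + [i % 3 for i in range(1, N + 1)]       # B(1:N)
D = [None] + [5 - i for i in range(1, N + 1)]       # D(1:N)
print(original(A, B, D, N) == renamed_vectorized(A, B, D, N))
```

S4 has to run before S1 because iteration I+1 of S1 reads the A(I+1) that S4 wrote in iteration I; renaming T removes the dependences that previously forced the two halves of the loop body to stay together.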




Scalar renaming
  • The renaming algorithm partitions all definitions and uses of a scalar S
    into equivalence classes, each of which can occupy a different memory
    location (Fig 5.12)
      - def-use graph
      - reachable analysis
  • Renaming works by removing a critical loop-independent anti- or output
    dependence to break a cycle


Array renaming
  • Original:

    DO I = 1, N
S1    A(I) = A(I-1) + X
S2    Y(I) = A(I) + Z
S3    A(I) = B(I) + C
    ENDDO

  • After renaming A:

    DO I = 1, N
S1    A$(I) = A(I-1) + X
S2    Y(I) = A$(I) + Z
S3    A(I) = B(I) + C
    ENDDO

  • Vectorization:

S3  A(1:N) = B(1:N) + C
S1  A$(1:N) = A(0:N-1) + X
S2  Y(1:N) = A$(1:N) + Z
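The array renaming example can be verified the same way. The Python sketch below assumes A is indexed 0..N and B, Y are indexed 1..N (index 0 unused); the data and function names are illustrative only.

```python
def original(a, b, x, z, c, n):
    a, y = a[:], [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = a[i - 1] + x        # S1
        y[i] = a[i] + z            # S2
        a[i] = b[i] + c            # S3
    return a, y

def renamed_vectorized(a, b, x, z, c, n):
    a, y = a[:], [0] * (n + 1)
    idx = range(1, n + 1)
    for i in idx:                                    # S3: A(1:N) = B(1:N) + C
        a[i] = b[i] + c
    a_new = [None] + [a[i - 1] + x for i in idx]     # S1: A$(1:N) = A(0:N-1) + X
    for i in idx:                                    # S2: Y(1:N) = A$(1:N) + Z
        y[i] = a_new[i] + z
    return a, y

N, X, Z, C = 5, 3, 7, 11
A = [100 + i for i in range(N + 1)]              # A(0:N)
B = [None] + [i * i for i in range(1, N + 1)]    # B(1:N)
print(original(A, B, X, Z, C, N) == renamed_vectorized(A, B, X, Z, C, N))
```

S3 moves to the front because, in the original loop, the A(I-1) read by S1 is the value S3 stored in the previous iteration (or the initial A(0) when I = 1).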




Node splitting

    DO I = 1, N
S1    A(I) = X(I+1) + X(I)
S2    X(I+1) = B(I) + 10
    ENDDO

  • Renaming does not work because the two dependences share a single
    access: X(I+1)
  • Renaming tries to give both name spaces the original array name
  • Solution: create a copy of the node from which the critical
    antidependence emanates


Node splitting: example
  • Original:

    DO I = 1, N
S1    A(I) = X(I+1) + X(I)
S2    X(I+1) = B(I) + 10
    ENDDO

  • After node splitting:

    DO I = 1, N
S1'   X$(I) = X(I+1)
S1    A(I) = X$(I) + X(I)
S2    X(I+1) = B(I) + 10
    ENDDO

  • Vectorization:

S1' X$(1:N) = X(2:N+1)
S2  X(2:N+1) = B(1:N) + 10
S1  A(1:N) = X$(1:N) + X(1:N)
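The node-splitting example can be checked by executing the original loop and the three vector statements side by side. This Python sketch pads lists so that indices follow the 1-based Fortran code (X is indexed 1..N+1); the data and function names are assumptions for illustration.

```python
def original(x, b, n):
    x, a = x[:], [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = x[i + 1] + x[i]     # S1
        x[i + 1] = b[i] + 10       # S2
    return a, x

def split_vectorized(x, b, n):
    x, a = x[:], [0] * (n + 1)
    idx = range(1, n + 1)
    x_copy = [None] + [x[i + 1] for i in idx]    # S1': X$(1:N) = X(2:N+1)
    for i in idx:                                # S2:  X(2:N+1) = B(1:N) + 10
        x[i + 1] = b[i] + 10
    for i in idx:                                # S1:  A = X$ + X(1:N)
        a[i] = x_copy[i] + x[i]
    return a, x

N = 5
X = [None] + [7 * i for i in range(1, N + 2)]    # X(1:N+1)
B = [None] + [i + 1 for i in range(1, N + 1)]    # B(1:N)
print(original(X, B, N) == split_vectorized(X, B, N))
```

The copy X$ preserves the old value of X(I+1) that S1 needs, which lets S2 run first and supply the new X(I) values that S1 also reads.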




    Node splitting: profitability                                      Fine-grained parallelism: summary
ENot always profitable            EAfter    splitting as before:             if pii is cyclic then                              scalar expansion
                                                                                 if k is the deepest loop in pii                scalar renaming
                                   DO I = 1, N
EFor example                                                                        then try_recurrence_breaking(pii, D, k)     array renaming
                                   S1’  X$(I) = X(I+1)
         DO I = 1, N                                                             else begin                                     node splitting
                                   S1    A(I) = X$(I) + X(I)
S1         A(I) = X(I+1) + X(I)                                                     select_loop_and_interchange(pii, D, k);
                                   S2    X(I+1) = A(I) + 10                          generate a level-k DO statement;
S2         X(I+1) = A(I) + 10
                                   ENDDO                                            let Di be the dependence graph consisting of
         ENDDO
                                  ERecurrence      not broken                          all dependence edges in D that are at level
                                                                                       k+1 or greater and are internal to pii;
                                  EWhy?                                             codegen (pii, k+1, Di);
                                                                                     generate the level-k ENDDO statement
         The target dependence must be critical!                                 end
                                                                              end


     URCS                    CS2/455 2004                     23            URCS                        CS2/455 2004                         24




Coarse-grained parallelism
  • Target machine: symmetric multiprocessor
      - multiple processors with a shared memory
      - parallelism employed by creating and executing a process on each
        processor
      - expensive overhead: process initiation and synchronization
  • Parallelism concern for high performance:
      - find and package parallelism with a granularity large enough to
        compensate for the overhead
      - delicate trade-off between overhead minimization and load balancing


Basics
  • Sequential loop vs. parallel loop
      - a sequential loop carries a dependence
      - iterations of a parallel loop can be correctly run in any order
  • Statement PARALLEL DO
      - represents a parallel loop that can be distributed on different
        processors
  • Synchronization
      - barrier: forces all processes to reach a certain point before
        execution continues




Loop interchange: revisited
  • Previously: fine-grained
      - move loops inward to vectorize more
  • Now: coarse-grained
      - move dependence-free loops outward to generate large enough
        parallel units

    DO I = 1, N
      DO J = 1, M
        A(I+1,J) = A(I,J) + B(I,J)       DV(I,J) = (<,=)
      ENDDO
    ENDDO

    N barriers needed if the inner J-loop is parallelized


Loop interchange: example
  • After interchanging I-loop and J-loop:

    PARALLEL DO J = 1, M
      DO I = 1, N
        A(I+1,J) = A(I,J) + B(I,J)       DV(J,I) = (=,<)
      ENDDO
    END PARALLEL DO

    1 barrier needed
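The barrier counts can be made concrete with a thread pool. The sketch below is an analogy rather than a model of a real multiprocessor: Python threads stand in for processors (CPython threads will not give a real speedup here), each completed `pool.map` is counted as one barrier, and the sizes and data are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def make(n, m):
    a = [[i + 3 * j for j in range(m + 1)] for i in range(n + 2)]   # A(1:N+1,1:M)
    b = [[i * j for j in range(m + 1)] for i in range(n + 1)]       # B(1:N,1:M)
    return a, b

def sequential(a, b, n, m):
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i + 1][j] = a[i][j] + b[i][j]
    return a

def inner_parallel(a, b, n, m, pool):
    """I stays outermost; the J-loop is parallel: one barrier per I iteration."""
    barriers = 0
    def body(i, j):
        a[i + 1][j] = a[i][j] + b[i][j]
    for i in range(1, n + 1):
        # list() waits until every J iteration for this I has finished.
        list(pool.map(lambda j, i=i: body(i, j), range(1, m + 1)))
        barriers += 1
    return a, barriers

def outer_parallel(a, b, n, m, pool):
    """After interchange: PARALLEL DO J outside, sequential I inside."""
    def column(j):
        for i in range(1, n + 1):
            a[i + 1][j] = a[i][j] + b[i][j]
    list(pool.map(column, range(1, m + 1)))     # the single barrier
    return a, 1

N, M = 3, 4
reference = sequential(*make(N, M), N, M)
with ThreadPoolExecutor(max_workers=4) as pool:
    inner_result, inner_barriers = inner_parallel(*make(N, M), N, M, pool)
    outer_result, outer_barriers = outer_parallel(*make(N, M), N, M, pool)
print(inner_barriers, outer_barriers, inner_result == reference == outer_result)
```

Each worker in the interchanged version owns a whole column, so the unit of parallel work is N times larger and synchronization happens once instead of N times.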




Loop interchange: profitability
  • Not always possible to move a parallel loop outward and have it remain
    free of dependence
  • Example:

    DO J = 1, N
      DO I = 1, N
        A(I+1,J+1) = A(I,J) + B(I,J)     DV(J,I) = (<,<)
      ENDDO
    ENDDO

      - The best we can do: parallelize the inner loop


Loop interchange: profitability
  • Theorem 6.3: In a perfect nest of loops, a particular loop can be
    parallelized at the outermost level if and only if its column in the
    direction matrix for that nest contains only "=" entries.
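One way to see why only the inner loop of the (<,<) example can run in parallel is to execute the iterations in a different order and compare results, since a parallel loop must tolerate any order. In the Python sketch below, reversing a loop stands in for an arbitrary parallel schedule; B(I,J) is taken to be 1 everywhere and the initial contents of A are arbitrary assumptions.

```python
def run(n, j_order, i_order):
    # A(1:N+1, 1:N+1) with padding at index 0; B(I,J) is assumed to be 1.
    a = [[i + 10 * j for j in range(n + 2)] for i in range(n + 2)]
    for j in j_order:
        for i in i_order:
            a[i + 1][j + 1] = a[i][j] + 1
    return a

N = 4
forward = range(1, N + 1)
backward = range(N, 0, -1)

reference = run(N, forward, forward)
inner_reordered = run(N, forward, backward)    # J sequential, I in another order
outer_reordered = run(N, backward, forward)    # J in another order

print(inner_reordered == reference, outer_reordered == reference)
```

With J kept sequential, each I iteration reads column J and writes column J+1, so the order of I does not matter. Reordering J breaks the dependence carried by the outer loop and changes the result.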




Loop interchange: algorithm
  1. Move any loop with only "=" entries into outermost position and
     parallelize it, remove the column from the matrix
  2. Move the loop with the most "<" entries into next outermost position
     and sequentialize it, eliminate the column and any rows representing
     carried dependences
  3. Repeat the algorithm starting with step 1


Loop selection and interchange

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
S1        A(I+1,J,K) = A(I,J,K) + X1
S2        B(I,J,K+1) = B(I,J,K) + X2
S3        C(I+1,J+1,K+1) = C(I,J,K) + X3
        ENDDO
      ENDDO
    ENDDO

    Direction matrix:
              I  J  K
        S1    <  =  =
        S2    =  =  <
        S3    <  <  <
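The selection heuristic can be sketched in a few lines of Python and applied to the direction matrix for S1, S2 and S3. The sketch makes assumptions the slides do not spell out: directions are encoded as strings, the matrix contains only "<" and "=" entries, ties in step 2 go to the leftmost loop, and legality of the resulting order is not rechecked.

```python
def select_loops(loops, matrix):
    """Return [(loop, 'parallel' | 'sequential'), ...] from outermost to innermost."""
    loops = list(loops)
    rows = [list(r) for r in matrix]
    plan = []
    while loops:
        # Step 1: a column with only '=' entries can be parallelized outermost.
        col = next((c for c in range(len(loops))
                    if all(r[c] == "=" for r in rows)), None)
        if col is not None:
            plan.append((loops.pop(col), "parallel"))
            for r in rows:
                del r[col]
            continue
        # Step 2: otherwise sequentialize the loop with the most '<' entries,
        # then drop its column and every row whose dependence it carries.
        col = max(range(len(loops)),
                  key=lambda c: sum(r[c] == "<" for r in rows))
        plan.append((loops.pop(col), "sequential"))
        rows = [r[:col] + r[col + 1:] for r in rows if r[col] != "<"]
    return plan

MATRIX = [("<", "=", "="),     # S1
          ("=", "=", "<"),     # S2
          ("<", "<", "<")]     # S3
plan = select_loops("IJK", MATRIX)
print(plan)   # [('I', 'sequential'), ('J', 'parallel'), ('K', 'sequential')]
```

Making I sequential satisfies the S1 and S3 dependences, which leaves only the S2 row; J then has an all-"=" column and becomes the parallel loop, with K sequential inside it.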




Loop selection and interchange

    DO I = 1, N
      PARALLEL DO J = 1, M
        DO K = 1, L
S1        A(I+1,J,K) = A(I,J,K) + X1
S2        B(I,J,K+1) = B(I,J,K) + X2
S3        C(I+1,J+1,K+1) = C(I,J,K) + X3
        ENDDO
      END PARALLEL DO
    ENDDO

  • General case: NP-complete


Summary
  • Transformations to break recurrence
      - scalar expansion
      - scalar/array renaming
      - node splitting
  • Transformations to increase parallelism
      - loop interchange
          · fine-grained: vectorization
          · coarse-grained: multiprocessor



