Fine-Grained and Coarse-Grained Parallelism
Chapters 5 & 6 of Allen & Kennedy: Optimizing Compilers for Modern Architectures
CSC 2/455, Fall 2004
Lecturer: Yutao Zhong
URCS CS2/455 2004

Overview
- Theme: the use of dependence to automatically parallelize sequential code
- Outline
  - Fine-grained parallelism: vectorization
    - loop interchange
    - scalar expansion
    - scalar and array renaming
    - node splitting
  - Coarse-grained parallelism: multiple asynchronous processors
    - loop interchange

Review of codegen (Ch. 2)
- Basic idea: find all the possible parallelism by loop distribution and statement reordering
- May not work as desired:
  - not effective enough: loop interchange can increase parallelism
  - cyclic dependences exist: scalar expansion, scalar and array renaming, node splitting

  procedure codegen(R, k, D);
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions
    in the dependence graph D restricted to R;
  construct Rpi from R by reducing each Si to a single node, and compute
    Dpi, the dependence graph naturally induced on Rpi by D;
  let {pi1, pi2, ..., pim} be the m nodes of Rpi numbered in an order
    consistent with Dpi (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pii is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges
        in D that are at level k+1 or greater and are internal to pii;
      codegen(pii, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pii in r(pii)-k+1 dimensions,
        where r(pii) is the number of loops containing pii;
  end
  end
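The core of codegen above (find maximal strongly-connected regions, order them topologically, then emit cyclic regions as sequential loops and acyclic statements as vector code) can be sketched in Python. This is a hedged illustration, not Allen and Kennedy's implementation; the statement names and the `distribute` helper are invented for the example.

```python
def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; returns the SCCs of the dependence graph
    in topological order of the condensation (sources first)."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in edges.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in nodes:
        if v not in index:
            visit(v)
    return sccs[::-1]                   # Tarjan completes sink SCCs first

def distribute(nodes, edges):
    """Mimic codegen's top-level decision: cyclic regions stay
    sequential, acyclic single statements become vector code."""
    plan = []
    for scc in strongly_connected_components(nodes, edges):
        cyclic = len(scc) > 1 or scc[0] in edges.get(scc[0], [])
        plan.append(("sequential" if cyclic else "vector", scc))
    return plan
```

For a dependence graph in which S2 and S3 form a recurrence while S1 only feeds it, `distribute` keeps S1 vectorizable and serializes the {S2, S3} region.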
Loop interchange
  DO I = 1, N
    DO J = 1, M
S     A(I,J+1) = A(I,J) + B        ! DV(I,J) = (=, <)
    ENDDO
  ENDDO
After interchanging the I-loop and the J-loop:
  DO J = 1, M
    DO I = 1, N
S     A(I,J+1) = A(I,J) + B        ! DV(J,I) = (<, =)
    ENDDO
  ENDDO
Vectorization:
  DO J = 1, M
S   A(1:N,J+1) = A(1:N,J) + B
  ENDDO

Loop interchange: safety
- Loop interchange is a reordering transformation
- Not all loop interchanges are legal (Theorem 2.3):
  DO J = 1, M
    DO I = 1, N
      A(I,J+1) = A(I+1,J) + B      ! DV(J,I) = (<, >)
    ENDDO
  ENDDO
- Theorem 5.1: Let D be a direction vector for a dependence in a perfect loop nest. The direction vector of the same dependence after any loop permutation is determined by applying the same permutation to the elements of D.

Loop interchange: safety (cont.)
  DO I = 1, N
    DO J = 1, M
      DO K = 1, L
        A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
      ENDDO
    ENDDO
  ENDDO
Direction matrix:
  I J K
  < < =
  < = >
- Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
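Theorem 5.2 translates directly into a small legality test. The sketch below is Python rather than the slides' Fortran, and the function name is mine; the direction matrix is given as rows of "<", "=", ">".

```python
def permutation_is_legal(direction_matrix, perm):
    """True iff no row of the column-permuted direction matrix has
    '>' as its leftmost non-'=' entry (Theorem 5.2)."""
    for row in direction_matrix:
        for d in (row[j] for j in perm):
            if d == "=":
                continue        # keep scanning for the leftmost non-'='
            if d == ">":
                return False    # the dependence would run backward
            break               # leftmost non-'=' is '<': this row is safe
    return True
```

For the direction matrix above, with rows (<,<,=) and (<,=,>), swapping the I- and K-loops is illegal (the second row would start with ">"), while swapping the J- and K-loops is legal.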
- The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.
- The example above exhibits an interchange-preventing dependence and an interchange-sensitive dependence.

Loop interchange for vectorization
- Motivation:
  - if we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
  - shift loops that carry no dependences inward
- Theorem 5.3: In a perfect loop nest, loops that carry no dependence can legally be shifted inward and will not carry any dependences in their new position.

Loop shifting: example (I)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
S       A(I,J) = A(I,J) + B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO
- Applying codegen: no vectorization possible
- DV(I,J,K) = (=, =, <)

Loop shifting: example (II)
After shifting the K-loop outward:
  DO K = 1, N
    DO I = 1, N
      DO J = 1, N
S       A(I,J) = A(I,J) + B(I,K)*C(K,J)
      ENDDO
    ENDDO
  ENDDO
Applying codegen:
  DO K = 1, N
    FORALL J = 1, N
      A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
    END FORALL
  ENDDO

Loop interchange: profitability
- The architecture of the target machine is usually the principal factor:
  - SIMD
  - vector register machine
  - MIMD (more in Chapter 6)

Codegen revised
  if pii is cyclic then
    if k is the deepest loop in pii
      then try_recurrence_breaking(pii, D, k)
           // scalar expansion, scalar renaming, array renaming, node splitting
    else begin
      select_loop_and_interchange(pii, D, k);
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to pii;
      codegen(pii, k+1, Di);
      generate the level-k ENDDO statement
    end

Scalar expansion
  DO I = 1, N
S1  T = A(I)
S2  A(I) = B(I)
S3  B(I) = T
  ENDDO
Scalar expansion:
  DO I = 1, N
S1  T$(I) = A(I)
S2  A(I) = B(I)
S3  B(I) = T$(I)
  ENDDO
  T = T$(N)
Vectorization:
S1 T$(1:N) = A(1:N)
S2 A(1:N) = B(1:N)
S3 B(1:N) = T$(1:N)
   T = T$(N)
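The scalar-expansion example above can be checked concretely. A minimal Python sketch, with lists standing in for Fortran arrays and `T_` playing the role of T$:

```python
def swap_sequential(A, B):
    """The original loop: S1 T = A(I); S2 A(I) = B(I); S3 B(I) = T."""
    A, B = A[:], B[:]
    T = None
    for i in range(len(A)):
        T = A[i]        # S1
        A[i] = B[i]     # S2
        B[i] = T        # S3
    return A, B, T

def swap_vectorized(A, B):
    """After expanding T to T_, the statements run as whole-array ops."""
    T_ = A[:]           # S1: T_(1:N) = A(1:N)
    A2 = B[:]           # S2: A(1:N) = B(1:N)
    B2 = T_[:]          # S3: B(1:N) = T_(1:N)
    return A2, B2, T_[-1]   # T = T_(N) keeps the live-out value of T correct
```

Both versions swap A and B and leave the same final value in T, so expansion preserves the loop's semantics while removing the dependence cycle on the scalar T.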
Scalar expansion: safety & profitability
- Always safe to apply
- Not always profitable:
  - dependences due to reuse of values must be preserved
  - dependences due to reuse of a memory location can be deleted by expansion
- Naive approach:
  DO I = 1, N
    T = T + A(I) + A(I-1)
    A(I) = T
  ENDDO
  after scalar expansion:
  T$(0) = T
  DO I = 1, N
S1  T$(I) = T$(I-1) + A(I) + A(I-1)
S2  A(I) = T$(I)
  ENDDO
  T = T$(N)

Scalar expansion: profitability (cont.)
- Dependences due to reuse of a memory location vs. reuse of values
  (figure: the dependence graph between S1 and S2, with its loop-carried and loop-independent edges)
- Better approach: detect deletable edges first, then apply scalar expansion only if the new graph has vectorizable statements
  - covering definition

Scalar expansion: drawbacks
- Increased memory consumption
- Solutions:
  - expand in a single loop
  - strip-mine the loop before expansion
  - forward substitution:
    DO I = 1, N
      T = A(I) + A(I+1)
      A(I) = T + B(I)
    ENDDO
    becomes
    DO I = 1, N
      A(I) = A(I) + A(I+1) + B(I)
    ENDDO

Scalar renaming: example
  DO I = 1, 100
S1  T = A(I) + B(I)
S2  C(I) = T + T
S3  T = D(I) - B(I)
S4  A(I+1) = T * T
  ENDDO
After renaming T:
  DO I = 1, 100
S1  T1 = A(I) + B(I)
S2  C(I) = T1 + T1
S3  T2 = D(I) - B(I)
S4  A(I+1) = T2 * T2
  ENDDO
Vectorization:
S3 T2$(1:100) = D(1:100) - B(1:100)
S4 A(2:101) = T2$(1:100) * T2$(1:100)
S1 T1$(1:100) = A(1:100) + B(1:100)
S2 C(1:100) = T1$(1:100) + T1$(1:100)
   T = T2$(100)

Scalar renaming
- The renaming algorithm partitions all definitions and uses of a scalar S into equivalence classes, each of which can occupy a different memory location (Fig 5.12)
  - def-use graph
  - reachability analysis

Array renaming
Original:
  DO I = 1, N
S1  A(I) = A(I-1) + X
S2  Y(I) = A(I) + Z
S3  A(I) = B(I) + C
  ENDDO
After renaming A:
  DO I = 1, N
S1  A$(I) = A(I-1) + X
S2  Y(I) = A$(I) + Z
S3  A(I) = B(I) + C
  ENDDO
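The scalar-renaming example can likewise be validated by executing both forms. This Python sketch mirrors the four-statement loop above (the function names and test data are illustrative; `T2` stands in for T2$):

```python
def run_original(A, B, C, D):
    """Sequential loop with the single scalar T (two live ranges)."""
    A, C = A[:], C[:]
    n = len(B)                  # A has n+1 elements so A(I+1) exists
    for i in range(n):
        T = A[i] + B[i]         # S1
        C[i] = T + T            # S2
        T = D[i] - B[i]         # S3
        A[i + 1] = T * T        # S4
    return A, C, T

def run_renamed(A, B, C, D):
    """After renaming, the groups {S3,S4} and {S1,S2} are independent
    and run as array operations in the order S3, S4, S1, S2."""
    A = A[:]
    n = len(B)
    T2 = [D[i] - B[i] for i in range(n)]        # S3 (vector)
    for i in range(n):                          # S4 (vector)
        A[i + 1] = T2[i] * T2[i]
    T1 = [A[i] + B[i] for i in range(n)]        # S1 (vector)
    C = [T1[i] + T1[i] for i in range(n)]       # S2 (vector)
    return A, C, T2[-1]                         # T = T2$(N) stays live-out
```

Note that S1 is correct after S4 because S1 at iteration I reads exactly the A(I) that S4 wrote at iteration I-1 (or the original A(1)); renaming only broke the anti- and output dependences caused by reusing T.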
- Renaming works by removing a critical loop-independent anti- or output dependence to break a cycle
Vectorization:
S3 A(1:N) = B(1:N) + C
S1 A$(1:N) = A(0:N-1) + X
S2 Y(1:N) = A$(1:N) + Z

Node splitting
  DO I = 1, N
S1  A(I) = X(I+1) + X(I)
S2  X(I+1) = B(I) + 10
  ENDDO
- Renaming does not work here because the two dependences share a single access: X(I+1)
- Renaming tries to give both name spaces the original array name
- Solution: create a copy of the node from which the critical anti-dependence emanates

Node splitting: example
After node splitting:
  DO I = 1, N
S1' X$(I) = X(I+1)
S1  A(I) = X$(I) + X(I)
S2  X(I+1) = B(I) + 10
  ENDDO
Vectorization:
S1' X$(1:N) = X(2:N+1)
S2  X(2:N+1) = B(1:N) + 10
S1  A(1:N) = X$(1:N) + X(1:N)

Node splitting: profitability
- Not always profitable. For example:
  DO I = 1, N
S1  A(I) = X(I+1) + X(I)
S2  X(I+1) = A(I) + 10
  ENDDO
After splitting as before:
  DO I = 1, N
S1' X$(I) = X(I+1)
S1  A(I) = X$(I) + X(I)
S2  X(I+1) = A(I) + 10
  ENDDO
- The recurrence is not broken. Why? The target dependence must be critical!

Fine-grained parallelism: summary
- The revised codegen drives everything: at the deepest loop of a cyclic region it calls try_recurrence_breaking (scalar expansion, scalar renaming, array renaming, node splitting); otherwise it selects and interchanges a loop, generates the level-k DO and ENDDO statements, and recurses at level k+1.

Coarse-grained parallelism
- Target machine: symmetric multiprocessor
  - multiple processors with a shared memory
  - parallelism exploited by creating and executing a process on each processor
  - expensive overhead: process initiation and synchronization
- Parallelism concerns for high performance:
  - find and package parallelism with a granularity large enough to compensate for the overhead
  - delicate trade-off between overhead minimization and load balancing
Basics
- Sequential loop vs. parallel loop
  - a sequential loop carries a dependence
  - the iterations of a parallel loop can be correctly run in any order
- The PARALLEL DO statement
  - represents a parallel loop whose iterations can be distributed over different processors
- Synchronization
  - barrier: forces all processes to reach a certain point before execution continues

Loop interchange: revisited
- Previously (fine-grained): move loops inward to vectorize more
- Now (coarse-grained): move dependence-free loops outward to generate large enough parallel units

Loop interchange: example
  DO I = 1, N
    DO J = 1, M
      A(I+1, J) = A(I, J) + B(I, J)     ! DV(I,J) = (<, =)
    ENDDO
  ENDDO
  N barriers needed
After interchanging the I-loop and the J-loop:
  PARALLEL DO J = 1, M
    DO I = 1, N
      A(I+1, J) = A(I, J) + B(I, J)     ! DV(J,I) = (=, <)
    ENDDO
  END PARALLEL DO
  1 barrier needed

Loop interchange: profitability
- Not always possible to move a parallel loop outward and have it remain free of dependences
- Example:
  DO J = 1, N
    DO I = 1, N
      A(I+1, J+1) = A(I, J) + B(I, J)   ! DV(J,I) = (<, <)
    ENDDO
  ENDDO
- The best we can do here: parallelize the inner loop
- Theorem 6.3: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if that loop's column of the direction matrix for the nest contains only "=" entries.
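Theorem 6.3 gives a one-line test for outermost parallelism. A hedged Python sketch (the function name is mine; the direction matrix uses "<", "=", ">" entries as above):

```python
def outermost_parallel_loops(loop_names, direction_matrix):
    """Loops whose direction-matrix column contains only '=' entries
    can be moved outermost and run in parallel (Theorem 6.3)."""
    return [name for j, name in enumerate(loop_names)
            if all(row[j] == "=" for row in direction_matrix)]
```

For the interchange example with DV(I,J) = (<,=), only J qualifies; for the (<,<) example, no loop can be outermost-parallel, which is why the best we could do was parallelize the inner loop.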
Loop interchange: algorithm
1. Move any loop whose column has only "=" entries into the outermost position, parallelize it, and remove its column from the matrix.
2. Move the loop with the most "<" entries into the next outermost position, sequentialize it, and eliminate its column along with any rows representing dependences it carries.
3. Repeat, starting with step 1.

Loop selection and interchange
  DO I = 1, N
    DO J = 1, M
      DO K = 1, L
S1      A(I+1,J,K) = A(I,J,K) + X1
S2      B(I,J,K+1) = B(I,J,K) + X2
S3      C(I+1,J+1,K+1) = C(I,J,K) + X3
      ENDDO
    ENDDO
  ENDDO
Direction matrix:
     I J K
S1   < = =
S2   = = <
S3   < < <

After loop selection and interchange:
  DO I = 1, N
    PARALLEL DO J = 1, M
      DO K = 1, L
S1      A(I+1,J,K) = A(I,J,K) + X1
S2      B(I,J,K+1) = B(I,J,K) + X2
S3      C(I+1,J+1,K+1) = C(I,J,K) + X3
      ENDDO
    END PARALLEL DO
  ENDDO
- General case: NP-complete

Summary
- Transformations to break recurrences:
  - scalar expansion
  - scalar/array renaming
  - node splitting
- Transformations to increase parallelism:
  - loop interchange
    - fine-grained: vectorization
    - coarse-grained: multiprocessor
URCS CS2/455 2004
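The three-step selection heuristic above can be sketched in Python. This is a simplified illustration (ties are broken by taking the first candidate, and the general problem is NP-complete, so this greedy pass is not guaranteed optimal); the function name is mine, and the result lists loops outermost-first.

```python
def select_loops(loop_names, rows):
    """Greedy loop selection over a direction matrix given as rows of
    '<', '=', '>' entries, one column per loop."""
    names = list(loop_names)
    rows = [list(r) for r in rows]
    order = []                          # (loop, kind), outermost first
    while names:
        # Step 1: a loop whose column is all '=' can run in parallel.
        par = next((j for j in range(len(names))
                    if all(r[j] == "=" for r in rows)), None)
        if par is not None:
            order.append((names.pop(par), "parallel"))
            rows = [r[:par] + r[par + 1:] for r in rows]
            continue
        # Step 2: sequentialize the loop with the most '<' entries and
        # drop its column plus the rows of dependences it now carries.
        seq = max(range(len(names)),
                  key=lambda j: sum(r[j] == "<" for r in rows))
        order.append((names.pop(seq), "sequential"))
        rows = [r[:seq] + r[seq + 1:] for r in rows if r[seq] != "<"]
    return order
```

On the direction matrix for S1-S3 above, the sketch reproduces the slide's result: I sequential at the outermost level (it carries the S1 and S3 dependences), then J parallel, then K sequential innermost.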