POSTED ON: 8/21/2010 Public Domain
Control Dependences
Chapter 7

Control Dependences
• Roadmap
  - If-conversion
  - Control dependence
Optimizing Compilers for Modern Architectures

Control Dependences
• Constraints posed by control flow

          DO 100 I = 1, N
    S1      IF (A(I-1).GT.0.0) GO TO 100
    S2      A(I) = A(I) + B(I)*C
    100   CONTINUE

    Data dependence: S2 δ1 S1

If we vectorize by...

    S2    A(1:N) = A(1:N) + B(1:N)*C
          DO 100 I = 1, N
    S1      IF (A(I-1).GT.0.0) GO TO 100
    100   CONTINUE

...we get the wrong answer
• We are missing dependences
• There is a dependence from S1 to S2: a control dependence

Control Dependences
• Two strategies to deal with control dependences:
  - If-conversion: expose control dependences by converting them to data dependences. Used for vectorization
  - Explicitly expose them as control dependences. Used for automatic parallelization

If-conversion
• Underlying idea: convert statements affected by branches to conditionally executed statements

          DO 100 I = 1, N
    S1      IF (A(I-1).GT.0.0) GO TO 100
    S2      A(I) = A(I) + B(I)*C
    100   CONTINUE

can be converted to:

          DO I = 1, N
            IF (A(I-1).LE.0.0) A(I) = A(I) + B(I)*C
          ENDDO

If-conversion

          DO 100 I = 1, N
    S1      IF (A(I-1).GT.0.0) GO TO 100
    S2      A(I) = A(I) + B(I) * C
    S3      B(I) = B(I) + A(I)
    100   CONTINUE

• can be converted to:

          DO 100 I = 1, N
    S2      IF (A(I-1).LE.0.0) A(I) = A(I) + B(I) * C
    S3      IF (A(I-1).LE.0.0) B(I) = B(I) + A(I)
    100   CONTINUE

• vectorize using the Fortran WHERE statement:

          DO 100 I = 1, N
    S2      IF (A(I-1).LE.0.0) A(I) = A(I) + B(I) * C
    100   CONTINUE
    S3    WHERE (A(0:N-1).LE.0.0) B(1:N) = B(1:N) + A(1:N)

If-conversion
• If-conversion assumes a target notation of guarded execution, in which each statement implicitly contains a logical expression controlling its execution

    S1    IF (A(I-1).GT.0.0) GO TO 100
    S2    A(I) = A(I) + B(I)*C
    100   CONTINUE

• with guards in place:

    S1    M = A(I-1).GT.0.0
    S2    IF (.NOT. M) A(I) = A(I) + B(I)*C
    100   CONTINUE

Branch Classification
• Forward Branch: transfers control to a target that occurs lexically after the branch but at the same level of nesting
• Backward Branch: transfers control to a statement occurring lexically before the branch but at the same level of nesting
• Exit Branch: terminates one or more loops by transferring control to a target outside a loop nest

If-conversion
• If-conversion is a composition of two different transformations:
  1. Branch relocation
  2. Branch removal

Branch removal
• Basic idea:
  - Make a pass through the program.
  - Maintain a Boolean expression cc that represents the condition that must be true for the current statement to be executed
  - On encountering a branch, conjoin the controlling expression into cc
  - On encountering the target of a branch, disjoin its controlling expression into cc

Branch Removal: Forward Branches
• Remove forward branches by inserting appropriate guards

          DO 100 I = 1,N
    C1      IF (A(I).GT.10) GO TO 60
    20      A(I) = A(I) + 10
    C2      IF (B(I).GT.10) GO TO 80
    40      B(I) = B(I) + 10
    60      A(I) = B(I) + A(I)
    80      B(I) = A(I) - 5
          ENDDO

          DO 100 I = 1,N
            m1 = A(I).GT.10
    20      IF(.NOT.m1) A(I) = A(I) + 10
            IF(.NOT.m1) m2 = B(I).GT.10
    40      IF(.NOT.m1.AND..NOT.m2) B(I) = B(I) + 10
    60      IF(.NOT.m1.AND..NOT.m2.OR.m1) A(I) = B(I) + A(I)
    80      IF(.NOT.m1.AND..NOT.m2.OR.m1.OR..NOT.m1.AND.m2) B(I) = A(I) - 5
          ENDDO

Branch Removal: Forward Branches
• We can simplify to:

          DO 100 I = 1,N
            m1 = A(I).GT.10
    20      IF(.NOT.m1) A(I) = A(I) + 10
            IF(.NOT.m1) m2 = B(I).GT.10
    40      IF(.NOT.m1.AND..NOT.m2) B(I) = B(I) + 10
    60      IF(m1.OR..NOT.m2) A(I) = B(I) + A(I)
    80      B(I) = A(I) - 5
          ENDDO

• vectorize to:

          m1(1:N) = A(1:N).GT.10
    20    WHERE(.NOT.m1(1:N)) A(1:N) = A(1:N) + 10
          WHERE(.NOT.m1(1:N)) m2(1:N) = B(1:N).GT.10
    40    WHERE(.NOT.m1(1:N).AND..NOT.m2(1:N)) B(1:N) = B(1:N) + 10
    60    WHERE(m1(1:N).OR..NOT.m2(1:N)) A(1:N) = B(1:N) + A(1:N)
    80    B(1:N) = A(1:N) - 5

Branch Removal: Forward Branches
• To show correctness we must establish:
  - The guard for a statement instance in the new program is true if and only if the corresponding statement in the old program is executed. The exception is a statement introduced to capture a guard variable value, which must be executed at the point where the conditional expression would have been evaluated
  - The order of execution of statements with true guards in the new program is the same as the order of execution of those statements in the original program
  - Any expression with side effects is evaluated exactly as many times in the new program as in the old program

Exit Branches

          DO J = 1, M
            DO I = 1, N
              A(I,J) = B(I,J) + X
    S         IF (L(I,J)) GO TO 200
              C(I,J) = A(I,J) + Y
            ENDDO
            D(J) = A(N,J)
    200     F(J) = C(10,J)
          ENDDO

• Exit branches are more complicated because they terminate a loop
• Solution: relocate exit branches and convert them to forward branches

Exit Branches

          DO J = 1, M
            DO I = 1, N
              A(I,J) = B(I,J) + X
    S         IF (L(I,J)) GO TO 200
              C(I,J) = A(I,J) + Y
            ENDDO
            D(J) = A(N,J)
    200     F(J) = C(10,J)
          ENDDO

          DO J = 1, M
            DO I = 1, N
              IF (C1) A(I,J) = B(I,J) + X
    Sa        Code to set C1 and C2
              IF (C2) C(I,J) = A(I,J) + Y
            ENDDO
    Sb      IF (.NOT.C1.OR..NOT.C2) GO TO 200
            D(J) = A(N,J)
    200     F(J) = C(10,J)
          ENDDO

• What should C1 and C2 be?

Exit Branches
• Statements in the inner loop should be executed only if the exit branch was not taken on any previous iteration
• For the ith iteration, C1 and C2 should be lm = AND( .NOT. L(k,J) ), 1 ≤ k ≤ i-1

          DO J = 1, M
            lm = .TRUE.
            DO I = 1, N
              IF (lm) A(I,J) = B(I,J) + X
              IF (lm) m1 = .NOT. L(I,J)
              lm = lm .AND. m1
              IF (lm) C(I,J) = A(I,J) + Y
            ENDDO
            m2 = lm
            IF (m2) D(J) = A(N,J)
    200     F(J) = C(10,J)
          ENDDO

Exit Branches
• After forward substitution and expansion of lm, we get:

          DO J = 1, M
            lm(0,J) = .TRUE.
            DO I = 1, N
              IF (lm(I-1,J)) A(I,J) = B(I,J) + X
              IF (lm(I-1,J)) m1 = .NOT.L(I,J)
              lm(I,J) = lm(I-1,J) .AND. m1
              IF (lm(I,J)) C(I,J) = A(I,J) + Y
            ENDDO
            IF (lm(N,J)) D(J) = A(N,J)
    200     F(J) = C(10,J)
          ENDDO

• codegen will produce four vectorized loops...

Exit Branches
• After running codegen:

          DO J = 1, M
            lm(0,J) = .TRUE.
            DO I = 1, N
              IF (lm(I-1,J)) m1 = .NOT.L(I,J)
              lm(I,J) = lm(I-1,J) .AND. m1
            ENDDO
          ENDDO
          WHERE(lm(0:N-1,1:M)) A(1:N,1:M) = B(1:N,1:M) + X
          WHERE(lm(1:N,1:M)) C(1:N,1:M) = A(1:N,1:M) + Y
          WHERE(lm(N,1:M)) D(1:M) = A(N,1:M)
    200   F(1:M) = C(10,1:M)

• Procedure relocate_branches()

Backward Branches
• Problems:
  - They create implicit loops. Backward control flow cannot be simulated by simple guards
  - They complicate removal of forward branches, because they may create loops into which forward branches jump

          IF (P) GO TO 200
          ...
    100   S1
          ...
    200   S2
          ...
          IF (Q) GO TO 100

• Applying forward if-conversion:

          m1 = .NOT. P
          ...
    100   IF (m1) S1
          ...
    200   S2
          ...
          IF (Q) GO TO 100

Backward Branches
• Solutions?
  - Leave any region spanned by a backward control flow edge unconverted
  - Eliminate backward branches through a variant of if-conversion
• Note that:
  - S1 is executed on the first pass through the code only if P is false
  - S1 is always executed when the backward branch is taken
• Use a backward branch guard!

Backward Branches
• Using a backward branch guard:

          IF (P) GO TO 200
          ...
    100   S1
          ...
    200   S2
          ...
          IF (Q) GO TO 100

• converted to:

          m = P
          ...
          bb = .FALSE.
    100   IF (.NOT.m .OR. (m.AND.bb)) S1
          ...
    200   S2
          ...
          IF (Q) THEN
            bb = .TRUE.
            GO TO 100
          ENDIF

Backward Branches
• In general, there are two ways a target of a backward branch can be reached:
  - Fall through
  - Branch around the statement but reach it via a backward branch
• Thus, if the current condition just prior to target y is cc, the branch condition is m, and the backward branch condition is bb, the guard at y should be: cc OR (m AND bb)

Complete Forward Branch Removal
1 Statement is a branch target: combine (disjoin) the set of conditions associated with branches to that target with the current condition passed from the lexical predecessor
2 Statement is any type except DO, ENDDO, CONTINUE: the current condition is conjoined to the guard for the current statement
3 Statement is a DO: invoke relocate_branches to remove exit branches. Recur on the body of the loop. This may generate some statements before the loop, which should be guarded by the current condition

Complete Forward Branch Removal
4 Statement is a conditional branch: two copies of the current condition cc are made.
  - The compiler-generated variable associated with the new condition is conjoined with cc and the result is appended to the list associated with the branch target
  - The negation of the variable is conjoined to cc and becomes the current condition for the next statement
5 Statement is an unconditional branch: the current condition, cc, is appended to the list of conditions for the branch target. The current condition for the next statement is set to false
6 Continue processing at step 1 for the next statement

Simplification
• General Boolean simplification is NP-complete
• Use Simplify, an O(N^2) algorithm obtained by tailoring the simplification process to the expressions produced by if-conversion

Iterative Dependences
• Iterative statements can also create control dependences:

    20    DO I = 1, 100
    40      L = 2*I
    60      DO J = 1,L
    80        A(I,J) = 0
            ENDDO
          ENDDO

• If we vectorize as:

    20    DO I = 1, 100
    40      L = 2*I
    100   ENDDO
    80    A(1:100,1:L) = 0

• Incorrect!
• We must capture the notion that the DO statement controls the number of times a particular statement is executed.

Iterative Dependences
• Notation used:
• A(I, J) (irange)
• where irange is a compiler-generated scalar which holds the iteration range
• Using this notation, the example will be converted to:

    20    irange1 = (1,100)
          DO I = irange1
    40      L = 2*I              (irange1)
    60      irange2 = (1,L)      (irange1)
            DO J = irange2
    80        A(I,J) = 0         (irange2)
            ENDDO
          ENDDO

Iterative Dependences
• Forward substituting constants and loop-independent variables:

    20    DO I = 1,100
    40      L = 2*I              (1,100)
    60      DO J = 1,L           (1,100)
    80        A(I,J) = 0         (1,L) (1,100)
            ENDDO
          ENDDO

• which vectorizes to:

    20    DO I = 1, 100
    40      L = 2*I
    80      A(I,1:L) = 0
          ENDDO

If-reconstruction
• If-conversion may degrade performance when vectorization is not possible

          DO 100 I = 1, N
            IF (A(I) .GT. 0) GOTO 100
            B(I) = A(I) * 2.0
            A(I+1) = B(I) + 1
    100   CONTINUE

• After if-conversion:

          DO 100 I = 1, N
            m1 = (A(I) .GT. 0)
            IF (.NOT. m1) B(I) = A(I) * 2.0
            IF (.NOT. m1) A(I+1) = B(I) + 1
    100   CONTINUE

If-reconstruction
• On a machine without predicated execution:

          DO 100 I = 1, N
            m1 = (A(I) .GT. 0)
            IF (m1) GOTO 10
            B(I) = A(I) * 2.0
    10      IF (m1) GOTO 20
            A(I+1) = B(I) + 1
    20      CONTINUE
    100   CONTINUE

• Overheads!
• If-reconstruction: replace sections of guarded code with a minimal set of branches that enforce the guarded execution

Control Dependence
• Disadvantages of if-conversion:
  - It unnecessarily complicates code when the code cannot be vectorized
  - The compiler cannot analyze code a priori to decide whether if-conversion will lead to parallel code.
• Alternate approach: explicitly expose constraints due to control flow as control dependences

Control Dependence
• A node x in a directed graph G with a single exit node postdominates node y in G if every path from y to the exit node of G must pass through x.
• A statement y is said to be control dependent on another statement x if:
  - there exists a non-trivial path from x to y such that every statement z ≠ x in the path is postdominated by y, and
  - x is not postdominated by y.
• In other words, a control dependence exists from S1 to S2 if one branch out of S1 forces execution of S2 and another doesn't
• Note that control dependences can be viewed as a property of basic blocks

Control Dependence: Example
[Figure: control dependence example]

Control Dependence: Example
• n nodes and O(n^2) control dependences.
• Control dependence graphs can thus get much larger than the corresponding CFG
• procedure ConstructCD constructs the control dependence relation

Control Dependence: Loops
• Loops can be converted to a CFG and then ConstructCD can be applied
• We want to treat loops as special cases to help in transforming loops
• Use a loop control node to represent the loop

    10    DO I = 1, 100
    20      A(I) = A(I) + B(I)
    30      IF (A(I).GT.0) GO TO 50
    40      A(I) = -A(I)
    50      B(I) = A(I) + C(I)
          ENDDO

Execution Model
• In Chapter 2, we annotated each statement S with the corresponding iteration vector i
• S(i) could execute whenever every statement instance that it depended on had already executed
• However...

          DO I = 1, N
    S0      IF (P) GO TO S2
    S1      ...
    S2      ...
          ENDDO

Execution Model
• Solution: use a doit flag for each statement instance: S(i).doit
• Statement instances that are not control dependent on any other statement start with doit = True
• All other statement instances start with doit = False
• How does doit get set to True?
  - When a conditional is evaluated, every statement that is control dependent on it and whose execution is forced by the sense of the condition gets doit = True
• Execute statement instance S(i) if its doit flag is set to True and every statement instance it depends on either has a false doit flag or has been executed

Execution Model
• Note that if doit is true for S, then there is a sequence of control statements S0, S1, ..., Sm = S such that S0 is executed unconditionally and the decision taken at Sk forces the execution of Sk+1, 0 ≤ k < m
• The sequence of control dependences defines a unique execution path

Execution Model
• Behavior of loop control nodes under this model:
• Case 1: Evaluation of the iteration range does not depend on quantities computed in the loop:
  - Set doit for the loop node to True
  - The range of iteration can be completely evaluated
  - Create a collection of statement instances for the loop body, one for each iteration of the loop
  - Set doit flags of statements control dependent on the loop header to True, all other doit flags to False

Execution Model
• Case 2: Evaluation of the iteration range depends on quantities computed in the loop:
  - If the range is non-empty, create a new instance of the loop header, adjusting the range to the remainder of the iterations
  - DO.doit = True if the dependence back to the DO is a data dependence and False if it is a control dependence
  - Set doit flags of statements control dependent on the loop header to True, all other doit flags to False

Execution Model
Theorem 7.1. Dependence graphs that are executed according to the execution model are equivalent in meaning to the programs from which they are created.
• Proof:
  - Show that the doit flag of a statement is true iff it is executed in the original program
  - Proof by contradiction: consider the shortest sequence S0, S1, ..., Sm-1, Sm such that Sm is the first statement to get the wrong doit flag
  - Focus on Sm-1:
    - All statements executed leading to Sm-1 in the original program must be executed in this model
    - Statements that are not executed leading to Sm-1 in the original program cannot be executed in this model

Control Dependence and Parallelization
• For simplicity, we shall only consider:
  - Forward branches, which create loop-independent control dependences
  - Control dependences due to loops
• From Chapter 2: most loop transformations are unaffected by loop-independent dependences
• Loop reversal, loop skewing, strip mining, index-set splitting, and loop interchange do not affect loop-independent dependences
• Might be problematic: loop fusion, loop distribution
• However, since exit branches are excluded, loop fusion is not a problem

Loop Distribution

          DO I = 1, N
    S1      IF (A(I).LT.B(I)) GOTO 20
    S2      B(I) = B(I) + C(I)
    20      CONTINUE
          ENDDO

    Data dependence: S1 δ-1 S2

• Distributing...

          DO I = 1, N
    S1      IF (A(I).LT.B(I)) GOTO 20
          ENDDO
          DO I = 1, N
    S2      B(I) = B(I) + C(I)
          ENDDO
    20    CONTINUE

• Incorrect!

Loop Distribution
• Problem: control dependences crossing between distributed loops
• Solution: keep a history of the evaluated conditions (similar to if-conversion).

          DO I = 1, N
    S1      IF (A(I).LT.B(I)) GOTO 20
    S2      B(I) = B(I) + C(I)
    20      CONTINUE
          ENDDO

• Convert to:

          DO I = 1, N
    S1      e(I) = A(I).LT.B(I)
          ENDDO
          DO I = 1, N
    S2      IF (e(I).EQ..FALSE.) B(I) = B(I) + C(I)
          ENDDO

Loop Distribution
• More complex example:

          DO I = 1, N
    1       IF (A(I).NE.0) THEN
    2         IF (B(I)/A(I).GT.1) GOTO 4
            ENDIF
    3       A(I) = B(I)
            GOTO 8
    4       IF (A(I).GT.T) THEN
    5         T = (B(I) - A(I)) + T
            ELSE
    6         T = (T + B(I)) - A(I)
    7         B(I) = A(I)
            ENDIF
    8       C(I) = B(I) + C(I)
          ENDDO

Loop Distribution
• Fusion into "like" regions
• Needs two execution variables, E2(I) and E4(I), to hold the results of the branches at statements 2 and 4 respectively

Loop Distribution
• Consider the branch at node 2:
• 3 cases may hold
  - Statement 2 is executed and the true branch to statement 4 is taken
  - Statement 2 is executed and the false branch to statement 3 is taken
  - Statement 2 is never executed because the false branch is taken at statement 1
• This corresponds to the condition for the doit variable of a statement S to be set:
  - A control dependence exists from S0 to S
  - S0 has its doit flag set
  - The value of the conditional expression is the label on the branch

Loop Distribution
• Use three corresponding values: True, False, Undefined
• procedure DistributeCDG implements these ideas. It inserts execution variables at appropriate places in the code and selectively converts control dependences to data dependences

Code Generation
• Problem: mapping the arbitrary control flow represented in the control dependence graph to real machines

          DO I = 1, N
    S1      IF (p1) GOTO 3
    S2      ...
            GOTO 4
    3       IF (p3) GOTO 5
    4       S4
    5       S5
          ENDDO

[Figure: loop distribution]

Code Generation
• Code generated for the first partition:

          DO I = 1, N
            E1(I) = p1
            IF (E1(I).EQ..FALSE.) THEN
    S2        ...
            ENDIF
    S5      ...
          ENDDO

• For the second partition:

          DO I = 1, N
            IF (((E1(I).EQ..TRUE.).AND..NOT.p3).OR.(E1(I).EQ..FALSE.)) THEN
    S4        ...
            ENDIF
          ENDDO

Code Generation
• Observation: generating code for graphs in which every vertex has at most one control dependence predecessor is relatively easy
• Thus, transform the graph into a canonical form consisting of a set of control dependence trees with the following properties:
  - each statement is control dependent on at most one other statement, i.e., each statement is a member of at most one tree
  - the trees can be ordered so that all data dependences between trees flow from trees earlier in the order to trees that are later in the order

Code Generation
[Figures: code generation examples]

Code Generation
• How can the statements be organized into groups of statements that are part of the same conditional statement?
  - Statements can be grouped together if there is no dependence path between them that passes through a statement that is not a child of the same conditional node with the same label
  - Typed Fusion!
  - Each statement is typed by (p, l), where
  - p: its unique control dependence predecessor
  - l: the truth label of the edge from p to the statement

Code Generation
• Simple recursive procedure
• Generate code for each subtree in an order consistent with the data dependences
• Roughly linear in the size of the original dependence graph
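
Supplementary sketch: the postdominator-based definition of control dependence given in these slides can be checked mechanically on a small control flow graph. The Python below is one possible way to do that and is not the ConstructCD procedure the slides mention, which this text does not reproduce. The function names, the `cfg` dictionary encoding, the "-" label for unconditional edges, and the `EXIT` node are all assumptions made for illustration. The graph encodes the loop body of the forward-branch example (statements C1, 20, C2, 40, 60, 80), with "T" marking the edge on which a branch is taken.

```python
# Illustrative sketch only: function names, the cfg encoding, and the EXIT
# node are assumptions for this example, not taken from the slides.

def postdominators(cfg, exit_node):
    """Map each node to the set of nodes that postdominate it (itself included)."""
    nodes = set(cfg) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}    # optimistic start: everything
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:                           # iterate to a fixed point
        changed = False
        for n in nodes - {exit_node}:
            succ_sets = [pdom[target] for _, target in cfg[n]]
            new = {n} | set.intersection(*succ_sets)
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(cfg, exit_node):
    """Return triples (x, label, z): z is control dependent on x via that edge."""
    pdom = postdominators(cfg, exit_node)
    deps = set()
    for x, edges in cfg.items():
        strict_pdom_x = pdom[x] - {x}
        for label, y in edges:
            # z postdominates the edge target y but does not strictly
            # postdominate x, so taking this edge forces z and another
            # path out of x avoids it.
            for z in pdom[y] - strict_pdom_x:
                deps.add((x, label, z))
    return deps


# Loop body of the forward-branch example; every node must reach EXIT.
cfg = {
    "C1": [("T", "60"), ("F", "20")],   # IF (A(I).GT.10) GO TO 60
    "20": [("-", "C2")],
    "C2": [("T", "80"), ("F", "40")],   # IF (B(I).GT.10) GO TO 80
    "40": [("-", "60")],
    "60": [("-", "80")],
    "80": [("-", "EXIT")],
}

deps = control_dependences(cfg, "EXIT")
for x, label, z in sorted(deps):
    print(f"{z} is control dependent on {x} (branch {label})")
```

Under these assumptions the result lines up with the simplified guards in the forward-branch example: statement 20 depends only on the false branch of C1 (guard .NOT.m1), statement 60 depends on the true branch of C1 and the false branch of C2 (guard m1.OR..NOT.m2), and statement 80 has no control dependence, which matches its unguarded form.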