Compiling Fortran D for MIMD Distributed-Memory Machines
Authors: Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng
Published: 1992
Presented by: Sunjeev Sikand
Monday, February 01, 2010

Problem
• Parallel computers represent the only plausible way to continue to increase the computational power available to scientists and engineers
• However, they are difficult to program
• In particular, MIMD machines require message passing between separate address spaces and explicit synchronization among processors

Problem cont.
• Because parallel programs are machine-specific, scientists are discouraged from writing them: they lose their investment when the program changes or a new architecture arrives
• By contrast, vectorizable programs are easy to maintain and debug, are portable, and leave the work to the compiler

Solution
• Previous Fortran dialects lack a means of specifying a data decomposition
• The authors believe that a program written in a data-parallel programming style with reasonable data decompositions can be implemented efficiently
• They therefore propose to develop compiler technology that establishes such a machine-independent programming model
• The goal is to reduce both communication and load imbalance

Data Decomposition
• A decomposition is an abstract problem or index domain; it does not require any storage
• Each element of a decomposition represents a unit of computation
• The DECOMPOSITION statement declares the name, dimensionality, and size of a decomposition for later use
• There are two levels of parallelism in data-parallel applications

Decomposition Statement
    DECOMPOSITION D(N,N)

Data Decomposition - Alignment
• The first level of parallelism is array alignment (problem mapping), that is, how arrays are aligned with respect to one another
• It represents the minimal requirements for reducing data movement for the program, given an unlimited number of processors
• It is machine-independent and depends on the fine-grained parallelism defined by the individual members of data arrays

Alignment cont.
• Corresponding elements in aligned arrays are always mapped to the same processor
• Array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned

Alignment Example
    REAL A(N,N)
    DECOMPOSITION D(N,N)
    ALIGN A(I,J) with D(J-2,I+3)

Data Decomposition - Distribution
• The other level of parallelism is distribution (machine mapping), that is, how arrays are distributed on the actual parallel machine
• It represents the translation of the problem onto the finite resources of the machine
• It is affected by the topology, communication mechanisms, size of local memory, and number of processors of the underlying machine

Distribution cont.
• A distribution is specified by assigning an independent attribute to each dimension
• Predefined attributes include BLOCK, CYCLIC, and BLOCK_CYCLIC
• The symbol : marks dimensions that are not distributed

Distribution Example 1
    DISTRIBUTE D(:,BLOCK)

Distribution Example 2
    DISTRIBUTE D(:,CYCLIC)

Fortran D Compiler
• The two major steps in writing a data-parallel program are selecting a data decomposition and using it to derive node programs with explicit data movement
• The former is left to the user
• The latter is generated automatically by the compiler once it is given a data decomposition
• The compiler translates the program into an SPMD program with explicit message passing that executes directly on the nodes of the distributed-memory machine

Fortran D Compiler Structure
1 Program analysis
    a - Dependence analysis
    b - Data decomposition analysis
    c - Partitioning analysis
    d - Communication analysis
2 Program optimization
    a - Message vectorization
    b - Collective communications
    c - Run-time processing
    d - Pipelined computations
3 Code generation
    a - Program partitioning
    b - Message generation
    c - Storage management

Partition Analysis
• Converting global to local indices

    Original program        SPMD node program
    REAL A(100)             REAL A(25)
    do i = 1, 100           do i = 1, 25
      A(i) = 0.0              A(i) = 0.0
    enddo                   enddo

Jacobi Relaxation
• In the grid approximation that discretizes the physical problem, the heat flow into any given point at a given moment is the sum of the four temperature differences between that point and each of the four points surrounding it
• Translated into an iterative method, the correct solution can be found if the temperature of a given grid point at a given iteration is taken to be the average of the temperatures of the four surrounding grid points at the previous iteration
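The averaging rule above can be sketched as a short sequential program. This is only an illustration under stated assumptions: it is written in Python rather than Fortran D so that it can be run directly, the function names `jacobi_sweep` and `jacobi` are hypothetical and do not come from the paper, and boundary values are assumed to stay fixed.

```python
def jacobi_sweep(grid):
    """Return a new grid after one Jacobi iteration.

    Each interior point becomes the average of its four neighbours,
    read from the previous iteration. Boundary points are copied
    unchanged (an assumption made for this sketch).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy, so reads always see the old values
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            new[i][j] = (grid[i][j - 1] + grid[i - 1][j]
                         + grid[i + 1][j] + grid[i][j + 1]) / 4
    return new


def jacobi(grid, iterations):
    """Apply jacobi_sweep a fixed number of times and return the result."""
    for _ in range(iterations):
        grid = jacobi_sweep(grid)
    return grid


if __name__ == "__main__":
    # 4 x 4 plate: top edge held at 100, everything else starts at 0.
    plate = [[100.0] * 4] + [[0.0] * 4 for _ in range(3)]
    for row in jacobi(plate, 2):
        print(row)
```

Writing into a separate `new` grid plays the same role as the two arrays A and B in the Fortran D version: every update reads only values from the previous iteration.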
Jacobi Relaxation Code
    REAL A(100,100), B(100,100)
    DECOMPOSITION D(100,100)
    ALIGN A, B with D
    DISTRIBUTE D(:,BLOCK)
    do k = 1,time
      do j = 2,99
        do i = 2,99
S1        A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
        enddo
      enddo
      do j = 2,99
        do i = 2,99
S2        B(i,j) = A(i,j)
        enddo
      enddo
    enddo

Jacobi Relaxation Processor Layout
• Compiling for a four-processor machine
• Both arrays A and B are aligned identically with decomposition D, so they have the same distribution as D
• Because the first dimension of D is local and the second dimension is block-distributed, the local index set for both A and B on each processor (in local indices) is [1:100,1:25]

Jacobi Relaxation cont.
[Figure: the 100 x 100 array divided by columns into four blocks (1:25, 26:50, 51:75, 76:100), one per processor, with the updated region 2:99 marked in each dimension]

Jacobi Relaxation cont.
• The iteration set of the loop nest (in global indices) is [1:time,2:99,2:99]
• Local iteration sets for each processor (in local indices):
• Proc(1) = [1:time, 2:25, 2:99]
• Proc(2:3) = [1:time, 1:25, 2:99]
• Proc(4) = [1:time, 1:24, 2:99]

Generated Jacobi
    REAL A(100,25), B(100,0:26)
    ! local j bounds: the first and last processors skip the global boundary columns
    if (Plocal = 1) lb1 = 2 else lb1 = 1
    if (Plocal = 4) ub1 = 24 else ub1 = 25
    do k = 1,time
      ! exchange boundary columns of B with the left and right neighbours
      if (Plocal > 1) send(B(2:99,1), Pleft)
      if (Plocal < 4) send(B(2:99,25), Pright)
      if (Plocal < 4) recv(B(2:99,26), Pright)
      if (Plocal > 1) recv(B(2:99,0), Pleft)
      do j = lb1,ub1
        do i = 2,99
          A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
        enddo
      enddo

Generated Jacobi cont.
      do j = lb1,ub1
        do i = 2,99
          B(i,j) = A(i,j)
        enddo
      enddo
    enddo
• The only true cross-processor dependences are carried by the k loop, so the messages can be vectorized: each boundary column is sent once per k iteration rather than element by element

Pipelined Computation
• In loosely synchronous computations, all processors execute in loose lockstep, alternating between phases of local computation and global communication, e.g. Red-Black SOR and Jacobi
• However, some computations, such as SOR, contain loop-carried dependences
• These present an opportunity to exploit parallelism through pipelining

Pipelined Computation cont.
• The key observation is that for some pipelined computations the program order must be changed
• Fine-grained pipelining interchanges cross-processor loops as deeply as possible to improve sequential computation, but incurs the most communication overhead
• Coarse-grained pipelining uses strip mining and loop interchange to adjust the granularity of the pipelining; it decreases communication overhead at the expense of some parallelism

Conclusions
• A usable and efficient machine-independent parallel programming model is needed to make large-scale parallel machines useful to scientific programmers
• The Fortran D compiler, working from the data decomposition model, performs message vectorization, collective communication, fine-grained pipelining, and several other optimizations for block-distributed arrays
• The Fortran D compiler will generate efficient code for a large class of data-parallel programs with minimal effort

Discussion
• Q: How is this applicable to sensor networks?
• A: The paper makes no explicit reference to sensor networks, as it was written over a decade ago. It does, however, provide a unified programming methodology for distributing data and communicating among processors. Replace processors with motes and you'll see it is indeed relevant.

Discussion cont.
• Q: What about issues such as fault tolerance?
• A: Point well taken. If a message is lost, the infrastructure to deal with it does not seem to be there. The model could be extended with redundant computation, or perhaps even checkpointing, but as someone mentioned, the limited memory of motes may be an issue here.
• Q: They provide a means for load balancing; is this even applicable to sensor networks?
• A: Yes. In sensor networks we want to balance the load so that no single mote exhausts its energy.