Compiling Fortran D
for MIMD Distributed-Memory Machines
Authors: Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng
Published: 1992
Presented by: Sunjeev Sikand
Monday, February 01, 2010
                 Problem
• Parallel computers represent the only
  plausible way to continue to increase the
  computational power available to scientists
  and engineers
• However, they are difficult to program
• In particular, MIMD machines require message
  passing between separate address spaces and
  explicit synchronization among processors
             Problem cont.
• Because parallel programs are machine-
  specific, scientists are discouraged from
  writing them: their investment is lost when
  the program changes or a new architecture
  arrives
• Vectorizable programs, by contrast, are easily
  maintained, debugged, and ported, and the
  compilers do all the work
                 Solution
• Previous Fortran dialects lack a means of
  specifying a data decomposition
• The authors believe that if a program is
  written in a data parallel programming style
  with reasonable data decompositions it can
  be implemented efficiently.
• Thus they propose to develop compiler
  technology to establish such a machine-
  independent programming model
• The goal is to reduce both communication
  and load imbalance
        Data Decomposition
• A decomposition is an abstract problem or
  index domain; it does not require any
  storage
• Each element of a decomposition represents
  a unit of computation
• The DECOMPOSITION statement declares
  the name, dimensionality, and size of a
  decomposition for later use
• There are two levels of parallelism in data
  parallel applications: alignment and
  distribution
Decomposition Statement

   DECOMPOSITION D(N,N)
Data Decomposition - Alignment
• The first level of parallelism is array
  alignment (problem mapping): how arrays are
  aligned with respect to one another
• Represents the minimal requirements for
  reducing data movement for the program,
  given an unlimited number of processors
• Machine independent; depends only on the
  fine-grained parallelism defined by the
  individual members of the data arrays
            Alignment cont.
• Corresponding elements in aligned arrays
  are always mapped to the same processor
• Array operations between aligned arrays are
  usually more efficient than array operations
  between arrays that are not known to be
  aligned.
         Alignment Example
REAL A(N,N)
DECOMPOSITION D(N,N)
ALIGN A(I,J) with D(J-2,I+3)
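A note on this example (my reading; not spelled out on the slide): the ALIGN statement maps element A(I,J) to decomposition element D(J-2,I+3), i.e. A is laid onto D transposed and shifted by constant offsets, so any other array aligned the same way has its corresponding elements placed on the same processor.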
        Data Decomposition -
            Distribution
• The second level of parallelism is
  distribution (machine mapping): how arrays
  are distributed on the actual parallel
  machine
• Represents the translation of the problem
  onto the finite resources of the machine
• Affected by the topology, communication
  mechanisms, size of local memory, and
  number of processors of the underlying
  machine
          Distribution cont.
• Specified by assigning an independent
  attribute to each dimension
• Predefined attributes include BLOCK,
  CYCLIC, and BLOCK_CYCLIC
• The symbol : marks dimensions that are not
  distributed (a sketch of the resulting column
  ownership follows the two examples below)
Distribution Example 1

   DISTRIBUTE D(:,BLOCK)
Distribution Example 2

   DISTRIBUTE D(:,CYCLIC)
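A minimal sketch (not from the slides) of what these attributes mean for ownership, assuming the 100-column decomposition and four processors used in the Jacobi example later in the deck:

      PROGRAM OWNER
C     Prints, for each column J, its owning processor (1..P)
C     under BLOCK (contiguous slabs of N/P columns) and under
C     CYCLIC (columns dealt out round-robin). N = 100 and
C     P = 4 are assumed values, not part of the examples above.
      INTEGER N, P, J, NB, NC
      PARAMETER (N = 100, P = 4)
      DO J = 1, N
         NB = (J - 1) / (N / P) + 1
         NC = MOD(J - 1, P) + 1
         PRINT *, J, NB, NC
      ENDDO
      END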
          Fortran D Compiler
• Two major steps in writing a data parallel program
  are selecting a data decomposition and using it to
  derive node programs with explicit data movement
• The former is left to the user
• The latter is generated automatically by the
  compiler, given a data decomposition
• The program is translated to an SPMD program with
  explicit message passing that executes directly on
  the nodes of the distributed-memory machine
    Fortran D Compiler Structure
1. Program Analysis
   a. Dependence Analysis
   b. Data Decomposition Analysis
   c. Partitioning Analysis
   d. Communication Analysis
2. Program Optimization
   a. Message Vectorization
   b. Collective Communications
   c. Run-Time Processing
   d. Pipelined Computations
3. Code Generation
   a. Program Partitioning
   b. Message Generation
   c. Storage Management
         Partition Analysis
• Converting global to local indices

Original program       SPMD node program
REAL A(100)            REAL A(25)
do i = 1, 100          do i = 1, 25
   A(i) = 0.0             A(i) = 0.0
enddo                  enddo
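A hedged sketch of the conversion being applied here, for the one-dimensional BLOCK distribution above (100 elements over 4 processors, block size 25); GTOL and MYP are illustrative names, not Fortran D constructs:

      INTEGER FUNCTION GTOL(G, MYP)
C     Convert global index G to the local index on processor
C     MYP (1-based): processor MYP owns global indices
C     (MYP-1)*25+1 through MYP*25, so its local A(25) holds
C     them shifted down by the block offset.
      INTEGER G, MYP
      GTOL = G - (MYP - 1) * 25
      END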
              Jacobi Relaxation
• In the grid approximation that discretizes the physical
  problem, the heat flow into any given point at a given
  moment is the sum of the four temperature differences
  between that point and each of the four points surrounding
  it.
• Translating this into an iterative method, the correct
  solution can be found if the temperature of a given grid
  point at a given iteration is taken to be the average of the
  temperatures of the four surrounding grid points at the
  previous iteration.
           Jacobi Relaxation Code
REAL A(100,100), B(100,100)
DECOMPOSITION D(100,100)
ALIGN A, B with D
DISTRIBUTE D(:,BLOCK)
do k = 1, time
     do j = 2, 99
          do i = 2, 99
S1            A(i,j) = (B(i,j-1)+B(i-1,j)+
                        B(i+1,j)+B(i,j+1))/4
          enddo
     enddo
     do j = 2, 99
          do i = 2, 99
S2            B(i,j) = A(i,j)
          enddo
     enddo
enddo
    Jacobi Relaxation Processor
              Layout
• Compiling for a four-processor machine.
• Both arrays A and B are aligned identically
  with decomposition D, so they have the
  same distribution as D.
• Because the first dimension of D is local
  and the second dimension is block-
  distributed, the local index set for both A
  and B on each processor (in local indices) is
  [1:100,1:25].
           Jacobi Relaxation cont.
(figure: the 100 x 100 grid divided column-wise among four
processors, each owning a 100 x 25 block; global column
boundaries fall at 25/26, 50/51, and 75/76, and the interior
points span rows and columns 2 through 99)
       Jacobi Relaxation cont.
• The iteration set of the loop nest (in global
  indices) is [1:time, 2:99, 2:99]
• Local iteration sets for each processor (in
  local indices), derived below:
• Proc(1)   = [1:time, 2:25, 2:99]
• Proc(2:3) = [1:time, 1:25, 2:99]
• Proc(4)   = [1:time, 1:24, 2:99]
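A hedged sketch of how these local sets arise (lb1 and ub1 anticipate the generated code on the next slide): processor P owns global columns 25(P-1)+1 through 25P, so intersecting with the global j range [2:99] and subtracting the block offset gives the local bounds:

C     Local j bounds for processor P (1-based), block size 25
      lb1 = max(2,  25*(P-1) + 1) - 25*(P-1)
      ub1 = min(99, 25*P)         - 25*(P-1)
C     P=1 gives [2:25], P=2 and 3 give [1:25], P=4 gives [1:24]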
                Generated Jacobi
REAL A(100,25), B(100,0:26)
if (Plocal = 1) lb1 = 2 else lb1 = 1
if (Plocal = 4) ub1 = 24 else ub1 = 25
do k = 1, time
    if (Plocal > 1) send(B(2:99,1),  Pleft)
    if (Plocal < 4) send(B(2:99,25), Pright)
    if (Plocal < 4) recv(B(2:99,26), Pright)
    if (Plocal > 1) recv(B(2:99,0),  Pleft)
    do j = lb1, ub1
         do i = 2, 99
              A(i,j) = (B(i,j-1)+B(i-1,j)+
                        B(i+1,j)+B(i,j+1))/4
         enddo
    enddo
         Generated Jacobi cont.
  do j = lb1, ub1
       do i = 2, 99
            B(i,j) = A(i,j)
       enddo
  enddo
enddo
• The only true cross-processor dependences are carried
  by the k loop, so the compiler can vectorize the
  messages, hoisting them out of the j and i loops
  (contrast with the sketch below)
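For contrast, a hedged sketch (not in the original slides) of the same exchange without message vectorization: the sends and receives sink into the element loop, and each boundary element travels in its own message, roughly 98 messages per neighbor per sweep instead of one:

do k = 1, time
    do i = 2, 99
C       one-element messages inside the loop
        if (Plocal > 1) send(B(i,1),  Pleft)
        if (Plocal < 4) send(B(i,25), Pright)
        if (Plocal < 4) recv(B(i,26), Pright)
        if (Plocal > 1) recv(B(i,0),  Pleft)
    enddo
C   (compute loops as in the vectorized version)
enddo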
      Pipelined Computation
• In loosely synchronous computations, all
  processors execute in loose lockstep,
  alternating between phases of local
  computation and global communication,
  e.g. Red-Black SOR and Jacobi
• However, some computations, such as SOR,
  contain loop-carried dependences
• These present an opportunity to exploit
  parallelism through pipelining
   Pipelined Computation cont.
• The observation is that for some pipelined
  computations the program order must be changed
• Fine-grained pipelining interchanges cross-
  processor loops as deeply as possible; this
  exposes the most parallelism but incurs the
  most communication overhead
• Coarse-grained pipelining uses strip mining and
  loop interchange to adjust the granularity of the
  pipelining, decreasing communication overhead at
  the expense of some parallelism (see the sketch
  below)
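A hedged sketch (not in the original slides) of coarse-grain pipelining for an SOR-style sweep, reusing Plocal, Pleft, Pright, lb1, and ub1 from the generated Jacobi code; it assumes A is dimensioned A(100,0:26) with ghost columns like the Jacobi B array, and that the strip size S evenly divides the 98 interior rows:

do k = 1, time
C   exchange last sweep's boundary columns once per sweep
    if (Plocal > 1) send(A(2:99,1),  Pleft)
    if (Plocal < 4) recv(A(2:99,26), Pright)
    do ii = 2, 99, S
C       wait for the left neighbor's updated boundary segment
        if (Plocal > 1) recv(A(ii:ii+S-1,0), Pleft)
        do i = ii, ii+S-1
            do j = lb1, ub1
                A(i,j) = (A(i,j-1)+A(i-1,j)+
                          A(i+1,j)+A(i,j+1))/4
            enddo
        enddo
C       forward this strip's boundary so the right neighbor
C       can begin its part of the wavefront
        if (Plocal < 4) send(A(ii:ii+S-1,25), Pright)
    enddo
enddo

Here S = 1 recovers fine-grained pipelining (most overlap, most messages), while S = 98 degenerates to one pipelined message per neighbor per sweep but serializes the processors.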
              Conclusions
• A usable and efficient machine independent
  parallel programming model is needed to
  make large-scale parallel machines useful to
  scientific programmers
• The Fortran D compiler, guided by the data
  decomposition, performs message vectorization,
  collective communication, fine-grained
  pipelining, and several other optimizations
  for block-distributed arrays
• It will generate efficient code for a large
  class of data parallel programs with minimal
  effort
               Discussion
• Q: How is this applicable to sensor
  networks?
• A: There is no explicit reference to sensor
  networks, as this paper was written over a
  decade ago. But the authors provide a unified
  programming methodology for distributing
  data and communicating among processors.
  Replace processors with motes and you'll see
  this is indeed relevant
             Discussion cont.
• Q: What about issues such as fault tolerance?
• A: Point well taken. If a message is lost, it doesn't
  seem as though the infrastructure is there to deal
  with it. The model could be extended with redundant
  computation, or perhaps even checkpointing, though as
  someone mentioned the limited memory of motes may be
  an issue here.
• Q: They provide a means for load balancing; is this
  even applicable to sensor networks?
• A: Yes, it is: in sensor networks we want to balance
  the load so energy isn't completely spent on a
  single mote.