Issues in Translation of High Performance Fortran


       Bryan Carpenter
   NPAC at Syracuse University
      Syracuse, NY 13244
       dbc@npac.syr.edu
Goals of this lecture
   Discuss translation of some elementary
    HPF examples to MPI code. Illustrate
    the need for a Distributed Array
    Descriptor (DAD).

   Develop an abstract model of a DAD,
    and show how it can be used to
    translate simple codes.
Contents of Lecture
   Introduction.
       Translation of simple HPF fragment to SPMD.
       The problem of procedures.
   Requirements for an array descriptor.
   Groups.
       Process grids.
       Restricted groups.
   Range objects.
   A DAD
A simple HPF program
   Here is a simple HPF program:

       !HPF$ PROCESSORS P(4)

             REAL A(50)
       !HPF$ DISTRIBUTE A(BLOCK) ONTO P

            FORALL (I = 1:50) A(I) = 1.0 * I


   We want to translate this to an MPI program.
Translation of simple program
      INTEGER W_RANK, W_SIZE, ERRCODE
      INTEGER BLK_SIZE
      PARAMETER (BLK_SIZE = (50 + 3)/4)
      REAL A(BLK_SIZE)
      INTEGER BLK_START, BLK_COUNT
      INTEGER L, I
      CALL MPI_INIT(ERRCODE)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
      IF (W_RANK < 4) THEN
         BLK_START = W_RANK * BLK_SIZE
         IF (50 - BLK_START >= BLK_SIZE) THEN
            BLK_COUNT = BLK_SIZE
         ELSEIF (50 - BLK_START > 0) THEN
            BLK_COUNT = 50 - BLK_START
         ELSE
            BLK_COUNT = 0
         ENDIF
         DO L = 1, BLK_COUNT
            I = BLK_START + L
            A(L) = 1.0 * I
         ENDDO
      ENDIF
      CALL MPI_FINALIZE(ERRCODE)
Setting up the environment

   Associated code:

    INTEGER W_RANK, W_SIZE, ERRCODE
    ...
    CALL MPI_INIT(ERRCODE)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
    ...
    CALL MPI_FINALIZE(ERRCODE)
Allocating segment of the distributed array
   Associated statements are:

      INTEGER BLK_SIZE
      PARAMETER (BLK_SIZE = (50 + 3)/4)
      REAL A(BLK_SIZE)

   Segment size is 50/4 rounded up, i.e. 13.
Testing that this processor holds a segment
   Associated code is:

       IF (W_RANK < 4) THEN
          ...
       ENDIF


   Assumes number of MPI processes is at least
    the size of the largest processor arrangement
    of HPF program.
Computing parameters of locally held segment
   Associated code:
        INTEGER BLK_START, BLK_COUNT
        ...
          BLK_START = W_RANK * BLK_SIZE
          IF (50 - BLK_START >= BLK_SIZE) THEN
             BLK_COUNT = BLK_SIZE
          ELSEIF (50 - BLK_START > 0) THEN
             BLK_COUNT = 50 - BLK_START
          ELSE
             BLK_COUNT = 0
          ENDIF

   BLK_START—position in global index space.
    BLK_COUNT—elements in segment.
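
   The segment arithmetic above can be checked in isolation. What follows is a minimal C++ sketch, assuming a hypothetical helper block_segment (not part of the lecture code) that reproduces BLK_START and BLK_COUNT for any extent, process count and rank.

```cpp
#include <cassert>

// Hypothetical helper, for illustration only: parameters of the locally
// held segment of an n-element array block-distributed over np processes.
struct Segment {
    int start;  // offset of the first local element in global index space (BLK_START)
    int count;  // number of locally held elements (BLK_COUNT)
};

Segment block_segment(int n, int np, int rank) {
    assert(n >= 0 && np > 0 && rank >= 0 && rank < np);
    const int blk_size = (n + np - 1) / np;  // n / np rounded up
    const int start = rank * blk_size;
    int count;
    if (n - start >= blk_size) {
        count = blk_size;       // a full block
    } else if (n - start > 0) {
        count = n - start;      // the partial last block
    } else {
        count = 0;              // this process holds no elements
    }
    return {start, count};
}
```

   For the 50-element array on 4 processes, rank 3 gets start 39 and count 11. The last branch matters for small arrays: with 5 elements on 4 processes, rank 3 holds nothing.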
Loop over local elements
   Associated code:
      INTEGER L, I
      ...
         DO L = 1, BLK_COUNT
           I = BLK_START + L
           A(L) = 1.0 * I
         ENDDO
An HPF procedure
   Superficially similar program:

            SUBROUTINE INIT(D)
             REAL D(50)
       !HPF$ INHERIT D
            FORALL (I = 1:50) D(I) = 1.0 * I
            END

   INHERIT directive means mapping of dummy
    should be same as actual, whatever that is.
Procedure call with block-distributed actual
    !HPF$ PROCESSORS P(4)
          REAL A(50)
    !HPF$ DISTRIBUTE A(BLOCK) ONTO P
         CALL INIT(A)

   Mapping of D: D(1:13) on P(1), D(14:26) on P(2),
    D(27:39) on P(3), D(40:50) on P(4).
Procedure call with cyclically distributed actual
    !HPF$ PROCESSORS P(4)
          REAL A(50)
    !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
         CALL INIT(A)

   Mapping of D: P(k) holds D(k), D(k+4), D(k+8), ...
    for k = 1, ..., 4.
Procedure call with strided alignment of actual
    !HPF$ PROCESSORS P(4)
          REAL A(100)
    !HPF$ DISTRIBUTE A(BLOCK) ONTO P
         CALL INIT(A(1:100:2))

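   Mapping of D: D(i) is aligned with A(2*i - 1), so
    D(1:13) is on P(1), D(14:25) on P(2),
    D(26:38) on P(3), D(39:50) on P(4).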
Procedure call with row-aligned actual
    !HPF$ PROCESSORS Q(2, 2)
          REAL A(6, 50)
    !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
         CALL INIT(A(2, :))
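   Mapping of D: row 2 of A lies in the first block of
    the first dimension, so D(1:25) is on Q(1, 1) and
    D(26:50) on Q(1, 2). Q(2, :) holds no part of D.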
The problem
   Somehow INIT must be translated to deal
    with data having any of these decompositions,
    or any legal HPF mapping. Actual mapping
    not known until run-time.

   Not an artificial example. Libraries that
    operate on distributed arrays (e.g. the
    communication libraries discussed later) must
    deal with exactly this situation.
Requirements for an array descriptor
   Seems that to translate procedure calls,
    need some non-trivial data structure to
    describe layout of actual argument.
   The Distributed Array Descriptor (DAD).
   Want to understand requirements and
    best organization of a DAD.
   Adopt object-oriented principles to build
    an abstract design.
Distributed array dimensions
   Obvious structural feature of HPF array:
    multidimensional.
   Each dimension mapped independently as:
       Collapsed (serial),
       Simple block distribution,
       Simple cyclic distribution,
       Block cyclic distribution,
       General block distribution (HPF 2.0),
       Linear alignment to any of above.
Converting block distribution to cyclic distribution
   Block distribution:

      BLK_SIZE = (N + NP - 1) / NP
      ...
      BLK_START = R * BLK_SIZE
      ...
      IF (N - BLK_START >= BLK_SIZE) THEN
         BLK_COUNT = BLK_SIZE
      ELSEIF (N - BLK_START > 0) THEN
         BLK_COUNT = N - BLK_START
      ELSE
         BLK_COUNT = 0
      ENDIF
      ...
      I = BLK_START + L

   Cyclic distribution:

      BLK_SIZE = (N + NP - 1) / NP
      ...
      BLK_START = R
      ...
      BLK_COUNT = (N - R + NP - 1) / NP
      ...
      I = BLK_START + NP * (L - 1) + 1
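
   The cyclic formulas above can be exercised directly. The C++ sketch below uses two hypothetical helpers, cyclic_count and cyclic_global (names invented for this illustration), and keeps the Fortran conventions: rank r is zero-based, local index l and global index are one-based.

```cpp
#include <cassert>

// Number of elements held by process r (zero-based) when n elements are
// distributed cyclically over np processes: BLK_COUNT in the cyclic case.
int cyclic_count(int n, int np, int r) {
    assert(n >= 0 && np > 0 && r >= 0 && r < np);
    return (n - r + np - 1) / np;
}

// One-based global index I of the one-based local element l on process r:
// I = BLK_START + NP * (L - 1) + 1 with BLK_START = R.
int cyclic_global(int np, int r, int l) {
    assert(np > 0 && r >= 0 && r < np && l >= 1);
    return r + np * (l - 1) + 1;
}
```

   With N = 50 and NP = 4 the counts are 13, 13, 12 and 12, which sum to 50, and the last local element on rank 1 maps to global index 50.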
Distributed ranges
   Have different kinds of array dimension
    (distribution format).
   Each kind of dimension has a different set of
    formulae for segment layout, index
    computation, etc.
   OO interpretation: virtual functions on a class
    hierarchy.
   Implement as the Range hierarchy.
   DAD for rank-r array will contain r Range
    objects, one per dimension.
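
   The virtual-function idea can be illustrated with a cut-down hierarchy. This is only a sketch under stated assumptions: the class names RangeSketch, BlockRangeSketch and CyclicRangeSketch, and the count and global methods, are invented here and are much simpler than the Range interface developed later in the lecture. Indices are zero-based.

```cpp
#include <cassert>

// Simplified stand-in for a range: extent n distributed over np processes.
class RangeSketch {
public:
    RangeSketch(int extent, int nprocs) : n(extent), np(nprocs) {
        assert(extent >= 0 && nprocs > 0);
    }
    virtual ~RangeSketch() {}
    // Number of elements held at process coordinate crd.
    virtual int count(int crd) const = 0;
    // Zero-based global index of local element l at coordinate crd.
    virtual int global(int crd, int l) const = 0;
protected:
    int n, np;
};

class BlockRangeSketch : public RangeSketch {
public:
    using RangeSketch::RangeSketch;
    int count(int crd) const override {
        const int blk = (n + np - 1) / np;
        const int left = n - crd * blk;
        return left >= blk ? blk : (left > 0 ? left : 0);
    }
    int global(int crd, int l) const override {
        return crd * ((n + np - 1) / np) + l;
    }
};

class CyclicRangeSketch : public RangeSketch {
public:
    using RangeSketch::RangeSketch;
    int count(int crd) const override { return (n - crd + np - 1) / np; }
    int global(int crd, int l) const override { return crd + np * l; }
};

// Distribution-independent code: sums the global indices owned at crd
// without knowing which distribution format is in use.
int sum_owned(const RangeSketch& r, int crd) {
    int s = 0;
    for (int l = 0; l < r.count(crd); l++) s += r.global(crd, l);
    return s;
}
```

   The point of the design is sum_owned: the loop is written once against the base class, and the distribution-specific formulae are selected at run time.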
Dealing with "hidden" dimensions of sections
   Array may be mapped to slice of grid:




   Rank-1 section only has one range object.
    Need some other structure to represent
    embedding in subgrid.
DAD groups
   Need a group concept similar to
    MPI_Group.
   Want lightweight structure for
    representing arbitrary slices of process
    grids.
   Object representing grid itself needs
    multidimensional structure (cf Cartesian
    Communicator in MPI).
Representing processor arrangements
   In an OO runtime descriptor, expect an entity like a
    processor arrangement to become an object.
   Use C++ for definiteness:
       !HPF$ PROCESSORS P(4)
    becomes
       Procs1 p(4);
    and
      !HPF$ PROCESSORS Q(2, 2)
    becomes
      Procs2 q(2, 2);
Hierarchy of process grids
Interface of Procs and Dimension
      class Procs {
      public:
         int member() const;
         Dimension dim(const int d) const;
         ...
      };
      class Dimension {
      public:
         int size() const;
         int crd() const;
         ...
      };
Using Procs in translation
      INTEGER W_RANK, ...
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
      ...
      IF (W_RANK < 4) THEN
         BLK_START = W_RANK * BLK_SIZE
         ...
      ENDIF

   Becomes:

      Procs1 p(4);
      ...
      if (p.member()) {
          blk_start = p.dim(0).crd() * blk_size;
          ...
      }
Restricted process groups
   Slice of process grid to which array
    section may be mapped.
   Portion of grid selected by specifying
    subset of dimension coordinates.
   Lightweight representation. Use
    bitmask to represent dimension set.
Example restricted groups in 2-dimensional grid
Representation of subgrids

   example   dimension set       lead process   tuple
   a)        {dim(0), dim(1)}    0              (p, 11₂, 0)
   b)        {dim(0)}            8              (p, 10₂, 8)
   c)        {dim(1)}            1              (p, 01₂, 1)
   d)        {}                  6              (p, 00₂, 6)
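
   The tuple representation in the table above can be sketched in a few lines of C++. Everything here is an assumption made for illustration: the names Grid2, GroupSketch and restrict_dim are invented, the grid is taken to be 4 x 3 with process id = crd0 + 4 * crd1, and bit d of the mask stands for dim(d), which may not be the bit order used in the table. Under those assumptions the lead processes 8, 1 and 6 of rows b), c) and d) are reproduced.

```cpp
#include <cassert>

// Hypothetical two-dimensional grid. The numbering id = crd0 + n0 * crd1
// (dimension 0 varying fastest) is an assumption of this sketch.
struct Grid2 {
    int n0, n1;
    int crd(int id, int d) const { return d == 0 ? id % n0 : id / n0; }
};

// Restricted group as a (grid, dimension-set bitmask, lead process) tuple.
struct GroupSketch {
    const Grid2* grid;
    unsigned mask;  // bit d set: dim(d) is still in the dimension set
    int lead;       // id of the first process in the group

    // Fix the coordinate in dimension d, removing d from the dimension set.
    void restrict_dim(int d, int coord) {
        assert(d == 0 || d == 1);
        assert(mask & (1u << d));
        assert(coord >= 0 && coord < (d == 0 ? grid->n0 : grid->n1));
        const int stride = (d == 0) ? 1 : grid->n0;
        lead += (coord - grid->crd(lead, d)) * stride;
        mask &= ~(1u << d);
    }

    // A process is a member if it matches the lead in every fixed dimension.
    bool member(int id) const {
        if (id < 0 || id >= grid->n0 * grid->n1) return false;
        for (int d = 0; d < 2; d++) {
            if (!(mask & (1u << d)) && grid->crd(id, d) != grid->crd(lead, d))
                return false;
        }
        return true;
    }
};
```

   The whole state is a pointer, a mask and an integer, so a group can be copied and restricted freely without touching the grid object.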
The Group class
    class Group {
    public:
        Group(const Procs& p);
        void restrict(Dimension d, const int coord);
        int member() const;
        ...
    };

   Lightweight—implementation in about 3
    words. Can freely copy and discard. DAD
    contains a Group object.
Ranges
   In DAD, range object describes extent
    and distribution format of one array
    dimension.
   Expect a class hierarchy of ranges.
   Each subclass corresponds to a different
    kind of distribution format for an array
    dimension.
A hierarchy of ranges
Interface of the Range class
   class Range {
   public:
      int size() const;
      Dimension dim() const;
      int volume() const;
      Range subrng(const int extent, const int base,
                   const int stride = 1) const;
      void block(Block* blk, const int crd) const;
      void location(Location* loc, const int glb) const;
      ...
   };
Translating simple HPF program to C++
   Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
           FORALL (I = 1:50) A(I) = 1.0 * I

   Translation:

      Procs1 p(4);
      BlockRange x(50, p.dim(0));
      float* a = new float [x.volume()];
      if (p.member()) {
          Block b;
          x.block(&b, p.dim(0).crd());
          for (int l = 0; l < b.count; l++) {
             const int i = b.glb_bas + b.glb_stp * l + 1;
             a [b.sub_bas + b.sub_stp * l] = 1.0 * i;
          }
      }
Features of C++ translation
   Arguments of BlockRange constructor are
    extent of range and process dimension.
   Fields of Block define count of local loop and
    base and step for local subscript and global
    index.
   If distribution directive is changed to:
         !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
    only change is x declaration becomes:
         CyclicRange x(50, p.dim(0));
    —apparently making progress toward writing
    code that works for any distribution.
The Block and Location structures

struct Block {
   int count;
   int glb_bas;
   int glb_stp;
   int sub_bas;
   int sub_stp;
};

struct Location {
   int sub;
   int crd;
   ...
};
Memory strides
   Fortran 90 program:

      REAL B(100, 100)
      ...
      CALL FOO(B(1, :))

      SUBROUTINE FOO(C)
      REAL C(:)
      ...
      END

   First dimension of B is most rapidly varying in memory.
   Second dimension has memory stride 100, which is
    inherited by C.
   Fortran compilers normally pass a dope vector
    containing r extents and r strides for a rank-r argument.
   Stride is not really a property of the distributed range.
    Store it separately in the DAD.
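
   The stride inherited by C can be made concrete with a small dope-vector sketch. The names View1 and row_of are hypothetical, and the array is stored column-major in a flat std::vector to mimic Fortran layout. Subscripts are zero-based.

```cpp
#include <cassert>
#include <vector>

// Dope-vector style view of a rank-1 section: base pointer, extent, stride.
struct View1 {
    float* base;
    int extent;
    int stride;
    // Element i of the section, addressed through the memory stride.
    float& at(int i) const {
        assert(i >= 0 && i < extent);
        return base[i * stride];
    }
};

// Row `row` of a column-major array with `rows` rows and `cols` columns.
// Consecutive elements of a row are `rows` storage units apart.
View1 row_of(std::vector<float>& b, int rows, int cols, int row) {
    assert(row >= 0 && row < rows);
    assert(static_cast<int>(b.size()) == rows * cols);
    return View1{b.data() + row, cols, rows};
}
```

   For a 100 x 100 array the first row is a view with extent 100 and stride 100, so element 3 of the view lives at flat offset 300.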
    A DAD

   Abstract DAD for a rank-r array is an
    object containing:
       A distribution group, and
       r range objects, and
       r integer strides.
Interface of the DAD class
 struct DAD {
    DAD(const int _rank, const Group& _group,
        Map _maps []);

    const Group& grp() const;

    Range rng(const int d) const;

    int str(const int d) const;
    ...
 };
Map structure

 struct Map {
    Map(Range _range, const int _stride);

      Range range;
      int stride;
 };
Translating HPF program with inherited mapping
   Source:

           SUBROUTINE INIT(D)
            REAL D(50)
      !HPF$ INHERIT D
           FORALL (I = 1:50) D(I) = 1.0 * I
           END

   Translation:

      void init(float* d, DAD* d_dad) {
          Group p = d_dad->grp();
          if (p.member()) {
              Range x = d_dad->rng(0);
              int   s = d_dad->str(0);
              Block b;
              x.block(&b, p.dim(0).crd());
              for (int l = 0; l < b.count; l++) {
                 const int i = b.glb_bas + b.glb_stp * l + 1;
                 d [s * (b.sub_bas + b.sub_stp * l)] = 1.0 * i;
              }
          }
      }
Translation of call with block-distributed actual
   Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
           CALL INIT(A)

   Translation:

      Procs1 p(4);
      BlockRange x(50, p.dim(0));
      float* a = new float [x.volume()];

      // Map has no default constructor, so initialize the array directly.
      Map maps [] = { Map(x, 1) };
      DAD dad(1, p, maps);

      init(a, &dad);
Translation of call with cyclically distributed actual
   Source:

      !HPF$ PROCESSORS P(4)
            REAL A(50)
      !HPF$ DISTRIBUTE A(CYCLIC) ONTO P
           CALL INIT(A)

   Translation:

      Procs1 p(4);
      CyclicRange x(50, p.dim(0));
      float* a = new float [x.volume()];

      // Map has no default constructor, so initialize the array directly.
      Map maps [] = { Map(x, 1) };
      DAD dad(1, p, maps);

      init(a, &dad);
Translation of call with strided alignment of actual
   Source:

      !HPF$ PROCESSORS P(4)
            REAL A(100)
      !HPF$ DISTRIBUTE A(BLOCK) ONTO P
           CALL INIT(A(1:100:2))

   Translation:

      Procs1 p(4);
      BlockRange x(100, p.dim(0));
      float* a = new float [x.volume()];

      // Create DAD for section a(::2)
      Range x2 = x.subrng(50, 0, 2);
      // Map has no default constructor, so initialize the array directly.
      Map maps [] = { Map(x2, 1) };
      DAD dad(1, p, maps);

      init(a, &dad);
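
   It may help to see what the Block parameters of such a strided subrange have to look like for the loop in init to work. The sketch below is an assumption about the required result, not the Adlib implementation: strided_block is a hypothetical free function for a block-distributed parent range, BlockSketch mirrors the field names of the Block structure, and all indices are zero-based.

```cpp
#include <cassert>

// Mirrors the fields of the Block structure from the lecture.
struct BlockSketch {
    int count;    // number of local elements of the subrange
    int glb_bas;  // global index (within the subrange) of the first one
    int glb_stp;  // step in global index
    int sub_bas;  // local subscript of the first one
    int sub_stp;  // step in local subscript
};

// Local block, at process coordinate crd, of the subrange
// {base + stride * k : 0 <= k < extent} of an n-element range that is
// block-distributed over np processes.
BlockSketch strided_block(int n, int np, int crd,
                          int extent, int base, int stride) {
    assert(n >= 0 && np > 0 && crd >= 0 && crd < np);
    assert(extent >= 0 && base >= 0 && stride > 0);
    assert(extent == 0 || base + stride * (extent - 1) < n);
    const int blk = (n + np - 1) / np;
    const int lo = crd * blk;                        // first parent index held here
    const int hi = (lo + blk < n) ? lo + blk : n;    // one past the last
    // Smallest k whose parent index is >= lo, and smallest k that is >= hi.
    const int kmin = lo > base ? (lo - base + stride - 1) / stride : 0;
    int kend = hi > base ? (hi - base + stride - 1) / stride : 0;
    if (kend > extent) kend = extent;
    const int count = kend > kmin ? kend - kmin : 0;
    return {count, kmin, 1, base + stride * kmin - lo, stride};
}
```

   For the call above (extent 50, base 0, stride 2 on a 100-element parent over 4 processes) the local counts are 13, 12, 13 and 12, and on coordinate 1 the first local subscript is 1 because parent element 25 is skipped.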
Translation of call with row-aligned actual
   Source:

      !HPF$ PROCESSORS Q(2, 2)
            REAL A(6, 50)
      !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
           CALL INIT(A(2, :))

   Translation:

      Procs2 q(2, 2);
      BlockRange x(6, q.dim(0)), y(50, q.dim(1));
      float* a = new float [x.volume() * y.volume()];

      // Create DAD for section A(2, :), i.e. zero-based row 1
      Location i;
      x.location(&i, 1);
      Group p = q;
      p.restrict(q.dim(0), i.crd);
      // Map has no default constructor, so initialize the array directly.
      Map maps [] = { Map(y, x.volume()) };
      DAD dad(1, p, maps);

      init(a + i.sub, &dad);
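
   The role of location() in this translation can be sketched for the block-distributed case. The function block_location and the struct LocationSketch are hypothetical names for this illustration; the two fields correspond to crd and sub of the Location structure, and the global index is zero-based.

```cpp
#include <cassert>

// Mirrors the crd and sub fields of the Location structure.
struct LocationSketch {
    int crd;  // process coordinate that holds the element
    int sub;  // local subscript of the element on that process
};

// Where zero-based global index glb lives in an n-element range that is
// block-distributed over np processes.
LocationSketch block_location(int n, int np, int glb) {
    assert(np > 0 && glb >= 0 && glb < n);
    const int blk = (n + np - 1) / np;  // block size, n / np rounded up
    return {glb / blk, glb % blk};
}
```

   For the 6-element first dimension over 2 processes, global index 1 is at coordinate 0 with local subscript 1. That coordinate is what restricts the group, and the subscript is the offset added to the base address.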
Other features of the Adlib DAD
   Support for block-cyclic distributions.
    Local loops traversing distributed data need
    outer loop over set of local blocks.
    LocBlocksIndex iterator class. offset()
    method computes overall memory offset.
   Support for ghost extension, other
    memory layouts. Shift in memory for ghost
    region not in local subscript (universal—
    memory-layout-independent). disp(), offset(),
    step() methods applied to local subscript.
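
   The need for an outer loop over local blocks can be shown without any of the Adlib classes. The sketch below is a plain illustration, not the LocBlocksIndex iterator: block_cyclic_owned is a hypothetical function that lists the zero-based global indices held at one process coordinate for a block-cyclic distribution with block size b.

```cpp
#include <cassert>
#include <vector>

// Global indices owned at coordinate crd when n elements are distributed
// block-cyclically, with block size b, over np processes.
std::vector<int> block_cyclic_owned(int n, int np, int b, int crd) {
    assert(n >= 0 && np > 0 && b > 0 && crd >= 0 && crd < np);
    std::vector<int> owned;
    // Outer loop: blocks crd, crd + np, crd + 2 * np, ... are local.
    for (int blk = crd; blk * b < n; blk += np) {
        const int lo = blk * b;
        const int hi = (lo + b < n) ? lo + b : n;  // last block may be short
        // Inner loop: the contiguous elements of one local block.
        for (int g = lo; g < hi; g++) owned.push_back(g);
    }
    return owned;
}
```

   With 10 elements, 2 processes and block size 3, coordinate 0 holds indices 0, 1, 2, 6, 7, 8 and coordinate 1 holds 3, 4, 5, 9.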
Other features of the Adlib DAD, II
   Support for loops over subranges.
    Additional block() methods take triplet
    arguments—directly traverse subranges.
    crds() methods define ranges of coordinates
    where local blocks actually exist.
   Other feature to support communication
    library. AllBlocksIndex.
   Miscellaneous inquiries and predicates.
    Useful in general libraries, and for checking
    programs for correctness at run time.
Next Lecture:
   Communication in Data Parallel
    Languages
       Patterns of communication needed to
        implement language constructs.
       Libraries that support these communication
        patterns.

						