Issues in Translation of High Performance Fortran
Bryan Carpenter
NPAC at Syracuse University
Syracuse, NY 13244
dbc@npac.syr.edu
Goals of this lecture
Discuss translation of some elementary
HPF examples to MPI code. Illustrate
the need for a Distributed Array
Descriptor (DAD).
Develop an abstract model of a DAD,
and show how it can be used to
translate simple codes.
Contents of Lecture
Introduction.
Translation of simple HPF fragment to SPMD.
The problem of procedures.
Requirements for an array descriptor.
Groups.
Process grids.
Restricted groups.
Range objects.
A DAD
A simple HPF program
Here is a simple HPF program:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
FORALL (I = 1:50) A(I) = 1.0 * I
We want to translate this to an MPI program.
Translation of simple program
INTEGER W_RANK, W_SIZE, ERRCODE
INTEGER BLK_SIZE
PARAMETER (BLK_SIZE = (50 + 3)/4)
REAL A(BLK_SIZE)
INTEGER BLK_START, BLK_COUNT
INTEGER L, I
CALL MPI_INIT(ERRCODE)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
IF (W_RANK < 4) THEN
BLK_START = W_RANK * BLK_SIZE
IF (50 - BLK_START >= BLK_SIZE) THEN
BLK_COUNT = BLK_SIZE
ELSEIF (50 - BLK_START > 0) THEN
BLK_COUNT = 50 - BLK_START
ELSE
BLK_COUNT = 0
ENDIF
DO L = 1, BLK_COUNT
I = BLK_START + L
A(L) = 1.0 * I
ENDDO
ENDIF
CALL MPI_FINALIZE(ERRCODE)
Setting up the environment
Associated code:
INTEGER W_RANK, W_SIZE, ERRCODE
...
CALL MPI_INIT(ERRCODE)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
...
CALL MPI_FINALIZE(ERRCODE)
Allocating segment of the
distributed array
Associated statements are:
INTEGER BLK_SIZE
PARAMETER (BLK_SIZE = (50 + 3)/4)
REAL A(BLK_SIZE)
Segment size is 50/4 rounded up: (50 + 3)/4 = 13 in integer arithmetic.
Testing this processor holds a
segment
Associated code is:
IF (W_RANK < 4) THEN
...
ENDIF
Assumes number of MPI processes is at least
the size of the largest processor arrangement
of HPF program.
Computing parameters of
locally held segment
Associated code:
INTEGER BLK_START, BLK_COUNT
...
BLK_START = W_RANK * BLK_SIZE
IF (50 - BLK_START >= BLK_SIZE) THEN
BLK_COUNT = BLK_SIZE
ELSEIF (50 - BLK_START > 0) THEN
BLK_COUNT = 50 - BLK_START
ELSE
BLK_COUNT = 0
ENDIF
BLK_START—position in global index space.
BLK_COUNT—elements in segment.
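The segment computation above can be restated as a small C++ function. This is only a sketch: the `Segment` type and the `block_segment` name are hypothetical, introduced here for illustration, and the general `n` and `np` stand in for the constants 50 and 4.

```cpp
#include <cassert>

// Hypothetical helper type: parameters of the segment held by one process.
struct Segment {
    int start;  // zero-based global offset of the first local element (BLK_START)
    int count;  // number of elements actually held locally (BLK_COUNT)
};

// Block distribution of n elements over np processes; rank is zero-based.
Segment block_segment(int n, int np, int rank) {
    assert(n >= 0 && np > 0 && rank >= 0 && rank < np);
    const int blk_size = (n + np - 1) / np;  // n / np rounded up
    const int start = rank * blk_size;
    int count = n - start;                   // elements remaining from start
    if (count > blk_size) count = blk_size;  // a full block
    if (count < 0) count = 0;                // this process lies past the end
    return {start, count};
}
```

For the example in the slides, `block_segment(50, 4, 3)` gives a start of 39 and a count of 11: the last process holds a short block.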
Loop over local elements
Associated code:
INTEGER L, I
...
DO L = 1, BLK_COUNT
I = BLK_START + L
A(L) = 1.0 * I
ENDDO
An HPF procedure
Superficially similar program:
SUBROUTINE INIT(D)
REAL D(50)
!HPF$ INHERIT D
FORALL (I = 1:50) D(I) = 1.0 * I
END
INHERIT directive means mapping of dummy
should be same as actual, whatever that is.
Procedure call with block-
distributed actual
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A)
Mapping of D: blocks of 13 elements. D(1:13) on P(1), D(14:26) on P(2), D(27:39) on P(3), D(40:50) on P(4).
Procedure call with cyclically
distributed actual
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
CALL INIT(A)
Mapping of D: elements dealt out in round-robin order. D(I) is on P(MOD(I - 1, 4) + 1).
Procedure call with strided
alignment of actual
!HPF$ PROCESSORS P(4)
REAL A(100)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A(1:100:2))
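Mapping of D: D(I) is aligned with A(2*I - 1), so D(1:13) on P(1), D(14:25) on P(2), D(26:38) on P(3), D(39:50) on P(4).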
Procedure call with row-
aligned actual
!HPF$ PROCESSORS Q(2, 2)
REAL A(6, 50)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
CALL INIT(A(2, :))
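Mapping of D: D lives only on the first row of the grid. D(1:25) on Q(1, 1), D(26:50) on Q(1, 2); Q(2, 1) and Q(2, 2) hold no elements.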
The problem
Somehow INIT must be translated to deal
with data having any of these decompositions,
or any legal HPF mapping. Actual mapping
not known until run-time.
Not an artificial example. Libraries that
operate on distributed arrays (e.g. the
communication libraries discussed later) must
deal with exactly this situation.
Requirements for an array
descriptor
Seems that to translate procedure calls,
need some non-trivial data structure to
describe layout of actual argument.
The Distributed Array Descriptor (DAD).
Want to understand requirements and
best organization of a DAD.
Adopt object-oriented principles to build
an abstract design.
Distributed array dimensions
Obvious structural feature of HPF array:
multidimensional.
Each dimension mapped independently as:
Collapsed (serial),
Simple block distribution,
Simple cyclic distribution,
Block cyclic distribution,
General block distribution (HPF 2.0),
Linear alignment to any of above.
Converting block distribution
to cyclic distribution
Block distribution:
BLK_SIZE = (N + NP - 1) / NP
...
BLK_START = R * BLK_SIZE
...
IF (N - BLK_START >= BLK_SIZE) THEN
BLK_COUNT = BLK_SIZE
ELSEIF (N - BLK_START > 0) THEN
BLK_COUNT = N - BLK_START
ELSE
BLK_COUNT = 0
ENDIF
...
I = BLK_START + L
Cyclic distribution:
BLK_SIZE = (N + NP - 1) / NP
...
BLK_START = R
...
BLK_COUNT = (N - R + NP - 1) / NP
...
I = BLK_START + NP * (L - 1) + 1
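As a sanity check on the cyclic formulae, the sketch below restates them in C++ and counts how often each global index is visited when every process runs its local loop. The function names are hypothetical. `r` is the zero-based process coordinate and `l` the one-based local loop index, as in the Fortran fragment.

```cpp
#include <cassert>
#include <vector>

// Number of elements held by process r under a cyclic distribution
// of n elements over np processes (BLK_COUNT in the cyclic column).
int cyclic_count(int n, int np, int r) {
    assert(n >= 0 && np > 0 && r >= 0 && r < np);
    return (n - r + np - 1) / np;
}

// One-based global index of the l-th local element (l is one-based).
int cyclic_global(int np, int r, int l) {
    return r + np * (l - 1) + 1;
}

// Visit count for each global index 1..n after all np local loops have run.
std::vector<int> cyclic_visits(int n, int np) {
    std::vector<int> visits(n + 1, 0);  // slot 0 is unused
    for (int r = 0; r < np; ++r)
        for (int l = 1; l <= cyclic_count(n, np, r); ++l)
            ++visits[cyclic_global(np, r, l)];
    return visits;
}
```

If the formulae are right, every entry of `cyclic_visits(50, 4)` from index 1 to 50 is exactly 1, so no element is skipped or assigned twice.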
Distributed ranges
Have different kinds of array dimension
(distribution format).
Each kind of dimension has a different set of
formulae for segment layout, index
computation, etc.
OO interpretation: virtual functions on a class
hierarchy.
Implement as the Range hierarchy.
DAD for rank-r array will contain r Range
objects, one per dimension.
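One way to picture "virtual functions on a class hierarchy" is the reduced sketch below. It is not the Range interface presented later in the lecture: the class names `RangeSketch`, `BlockSketch` and `CyclicSketch`, and the methods `count` and `global`, are assumptions made for illustration only. Indices are zero-based.

```cpp
#include <cassert>

// Illustrative base class: one distributed dimension of extent n over np processes.
class RangeSketch {
public:
    RangeSketch(int n, int np) : n_(n), np_(np) { assert(n >= 0 && np > 0); }
    virtual ~RangeSketch() {}
    int size() const { return n_; }
    virtual int count(int crd) const = 0;          // local element count at coordinate crd
    virtual int global(int crd, int l) const = 0;  // global index of local element l
protected:
    int n_, np_;
};

class BlockSketch : public RangeSketch {
public:
    BlockSketch(int n, int np) : RangeSketch(n, np) {}
    int count(int crd) const override {
        const int blk = (n_ + np_ - 1) / np_;
        int c = n_ - crd * blk;
        if (c > blk) c = blk;
        return c < 0 ? 0 : c;
    }
    int global(int crd, int l) const override {
        return crd * ((n_ + np_ - 1) / np_) + l;
    }
};

class CyclicSketch : public RangeSketch {
public:
    CyclicSketch(int n, int np) : RangeSketch(n, np) {}
    int count(int crd) const override { return (n_ - crd + np_ - 1) / np_; }
    int global(int crd, int l) const override { return crd + np_ * l; }
};

// Distribution-independent code: sums the global indices owned by one process.
int sum_owned(const RangeSketch& x, int crd) {
    int sum = 0;
    for (int l = 0; l < x.count(crd); ++l) sum += x.global(crd, l);
    return sum;
}
```

`sum_owned` never asks which distribution format it was given. That is the property a translated procedure with an inherited mapping needs.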
Dealing with "hidden"
dimensions of sections
Array may be mapped to slice of grid:
Rank-1 section only has one range object.
Need some other structure to represent
embedding in subgrid.
DAD groups
Need a group concept similar to
MPI_Group.
Want lightweight structure for
representing arbitrary slices of process
grids.
Object representing grid itself needs
multidimensional structure (cf. Cartesian
Communicator in MPI).
Representing processor
arrangements
In OO runtime descriptor, expect an entity like
a processor arrangement to become an object.
Use C++ for definiteness:
!HPF$ PROCESSORS P(4)
becomes
Procs1 p(4);
and
!HPF$ PROCESSORS Q(2, 2)
becomes
Procs2 q(2, 2);
Hierarchy of process grids
Interface of Procs and
Dimension
class Procs {
public:
int member() const;
Dimension dim(const int d) const;
...
};
class Dimension {
public:
int size() const;
int crd() const;
...
};
Using Procs in translation
INTEGER W_RANK, . . .
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
...
IF (W_RANK < 4) THEN
BLK_START = W_RANK * BLK_SIZE
...
ENDIF
Becomes:
Procs1 p(4);
...
if (p.member()) {
  blk_start = p.dim(0).crd() * blk_size;
  ...
}
Restricted process groups
Slice of process grid to which array
section may be mapped.
Portion of grid selected by specifying
subset of dimension coordinates.
Lightweight representation. Use
bitmask to represent dimension set.
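A minimal sketch of such a lightweight representation follows. Everything in it is an assumption made for illustration: the `Grid` and `RestrictedGroup` types, the rank numbering (dimension 0 varies fastest) and the convention that bit `d` of the mask stands for dimension `d`. None of these is claimed to match the lecture's library.

```cpp
#include <cassert>
#include <vector>

// Hypothetical process grid; ranks are numbered with dimension 0 varying fastest.
struct Grid {
    std::vector<int> extents;
    int stride(int d) const {
        int s = 1;
        for (int k = 0; k < d; ++k) s *= extents[k];
        return s;
    }
    int coord(int rank, int d) const { return (rank / stride(d)) % extents[d]; }
};

// A slice of the grid: a grid reference, a dimension bitmask and a lead process.
struct RestrictedGroup {
    const Grid* grid;
    unsigned dims;  // bit d set: the group still spans every coordinate of dimension d
    int lead;       // member whose coordinate is 0 in every spanned dimension

    explicit RestrictedGroup(const Grid& g)
        : grid(&g), dims((1u << g.extents.size()) - 1u), lead(0) {}

    // Fix dimension d at coordinate crd, removing it from the dimension set.
    void restrict_dim(int d, int crd) {
        assert(dims & (1u << d));
        assert(crd >= 0 && crd < grid->extents[d]);
        dims &= ~(1u << d);
        lead += crd * grid->stride(d);
    }

    // A rank belongs if it agrees with the lead process in every fixed dimension.
    bool member(int rank) const {
        for (int d = 0; d < static_cast<int>(grid->extents.size()); ++d)
            if (!(dims & (1u << d)) && grid->coord(rank, d) != grid->coord(lead, d))
                return false;
        return true;
    }
};
```

The state is a pointer, a mask and an integer, so copying or discarding a group costs almost nothing.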
Example restricted groups in
2-dimensional grid
Representation of subgrids
example   dimension set        lead process   tuple
a)        {dim(0), dim(1)}     0              (p, 11₂, 0)
b)        {dim(0)}             8              (p, 10₂, 8)
c)        {dim(1)}             1              (p, 01₂, 1)
d)        {}                   6              (p, 00₂, 6)
The Group class
class Group {
public:
Group(const Procs& p);
void restrict(Dimension d, const int coord);
int member() const;
...
};
Lightweight—implementation in about 3
words. Can freely copy and discard. DAD
contains a Group object.
Ranges
In DAD, range object describes extent
and distribution format of one array
dimension.
Expect a class hierarchy of ranges.
Each subclass corresponds to a different
kind of distribution format for an array
dimension.
A hierarchy of ranges
Interface of the Range class
class Range {
public:
int size() const;
Dimension dim() const;
int volume() const;
Range subrng(const int extent, const int base,
const int stride = 1) const;
void block(Block* blk, const int crd) const;
void location(Location* loc, const int glb) const;
...
};
Translating simple HPF
program to C++
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
FORALL (I = 1:50) A(I) = 1.0 * I
Translation:
Procs1 p(4);
BlockRange x(50, p.dim(0));
float* a = new float [x.volume()];
if (p.member()) {
  Block b;
  x.block(&b, p.dim(0).crd());
  for (int l = 0; l < b.count; l++) {
    const int i = b.glb_bas + b.glb_stp * l + 1;
    a [b.sub_bas + b.sub_stp * l] = 1.0 * i;
  }
}
Features of C++ translation
Arguments of BlockRange constructor are
extent of range and process dimension.
Fields of Block define count of local loop and
base and step for local subscript and global
index.
If distribution directive is changed to:
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
only change is x declaration becomes:
CyclicRange x(50, p.dim(0));
—apparently making progress toward writing
code that works for any distribution.
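To make the role of the Block fields concrete, here is a sketch of how they might be filled in for block and cyclic ranges. The field names follow the lecture's Block. The two `block_for_...` functions and `fill_local` are hypothetical stand-ins for the `block()` method and the translated loop, and the chosen values (zero-based bases, unit local step) are assumptions.

```cpp
#include <cassert>
#include <vector>

// Loop parameters for the locally held part of one distributed dimension.
struct Block {
    int count;    // trip count of the local loop
    int glb_bas;  // zero-based global index of the first local element
    int glb_stp;  // global index step between consecutive local elements
    int sub_bas;  // local subscript of the first element
    int sub_stp;  // local subscript step
};

Block block_for_block_range(int n, int np, int crd) {
    assert(np > 0 && crd >= 0 && crd < np);
    const int blk = (n + np - 1) / np;
    int count = n - crd * blk;
    if (count > blk) count = blk;
    if (count < 0) count = 0;
    return {count, crd * blk, 1, 0, 1};
}

Block block_for_cyclic_range(int n, int np, int crd) {
    assert(np > 0 && crd >= 0 && crd < np);
    return {(n - crd + np - 1) / np, crd, np, 0, 1};
}

// The loop is identical for both formats: only the Block contents differ.
void fill_local(const Block& b, std::vector<float>& a) {
    for (int l = 0; l < b.count; ++l) {
        const int i = b.glb_bas + b.glb_stp * l + 1;  // one-based Fortran index
        a[b.sub_bas + b.sub_stp * l] = 1.0f * i;
    }
}
```

With 50 elements on 4 processes, process 3 of the block range fills 11 local slots with 40..50. Process 1 of the cyclic range fills 13 slots with 2, 6, ..., 50.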
The Block and Location
structures
struct Block {
  int count;
  int glb_bas;
  int glb_stp;
  int sub_bas;
  int sub_stp;
};
struct Location {
  int sub;
  int crd;
  ...
};
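A Location answers the inverse question to a Block: given a global index, which process coordinate owns it and at which local subscript. The sketch below shows one plausible computation for block and cyclic formats. The function names `locate_block` and `locate_cyclic` are hypothetical, indices are zero-based, and only the two fields visible on the slide are modelled.

```cpp
#include <cassert>

struct Location {
    int sub;  // local subscript on the owning process
    int crd;  // process coordinate of the owner
};

// Block distribution of n elements over np processes.
Location locate_block(int n, int np, int glb) {
    assert(np > 0 && glb >= 0 && glb < n);
    const int blk = (n + np - 1) / np;  // block size, n / np rounded up
    return {glb % blk, glb / blk};
}

// Cyclic distribution of n elements over np processes.
Location locate_cyclic(int n, int np, int glb) {
    assert(np > 0 && glb >= 0 && glb < n);
    return {glb / np, glb % np};
}
```

For a range of extent 6 over 2 processes, `locate_block(6, 2, 1)` gives coordinate 0 and subscript 1: global index 1 (Fortran row 2) sits in the first block.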
Memory strides
Fortran 90 program:
REAL B(100, 100)
...
CALL FOO(B(1, :))
SUBROUTINE FOO(C)
REAL C(:)
...
END
First dimension of B most-rapidly-varying in memory.
Second dimension has memory stride 100, inherited by C.
Fortran compilers normally pass a dope vector containing r extents and r strides for rank-r argument.
Stride not really a property of the distributed range. Store separately in DAD.
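The stride bookkeeping can be illustrated in a few lines of C++. The `Dope1` type and both functions are hypothetical. They model a rank-1 dope vector for a row section of a column-major array, using zero-based indices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical dope vector for a rank-1 section.
struct Dope1 {
    int base;    // offset of the first element in the parent storage
    int extent;  // number of elements in the section
    int stride;  // memory distance between consecutive elements
};

// Row `row` of a column-major array with nrows rows and ncols columns.
Dope1 row_section(int nrows, int ncols, int row) {
    assert(row >= 0 && row < nrows);
    return {row, ncols, nrows};
}

// Element j of the section, addressed through the dope vector.
float& element(std::vector<float>& storage, const Dope1& c, int j) {
    assert(j >= 0 && j < c.extent);
    return storage[c.base + c.stride * j];
}
```

For the 100 by 100 array on the slide, the first row has base 0 and stride 100, so element 3 of the section lands at storage offset 300.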
A DAD
Abstract DAD for a rank-r array is an
object containing:
A distribution group, and
r range objects, and
r integer strides.
Interface of the DAD class
struct DAD {
DAD(const int _rank, const Group& _group,
Map _maps []);
const Group& grp() const;
Range rng(const int d) const;
int str(const int d) const;
...
};
Map structure
struct Map {
Map(Range _range, const int _stride);
Range range;
int stride;
};
Translating HPF program with
inherited mapping
Source:
SUBROUTINE INIT(D)
REAL D(50)
!HPF$ INHERIT D
FORALL (I = 1:50) D(I) = 1.0 * I
END
Translation:
void init(float* d, DAD* d_dad) {
  Group p = d_dad->grp();
  if (p.member()) {
    Range x = d_dad->rng(0);
    int s = d_dad->str(0);
    Block b;
    x.block(&b, p.dim(0).crd());
    for (int l = 0; l < b.count; l++) {
      const int i = b.glb_bas + b.glb_stp * l + 1;
      d [s * (b.sub_bas + b.sub_stp * l)] = 1.0 * i;
    }
  }
}
Translation of call with block-
distributed actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A)
Translation:
Procs1 p(4);
BlockRange x(50, p.dim(0));
float* a = new float [x.volume()];
Map maps [1];
maps [0] = Map(x, 1);
DAD dad(1, p, maps);
init(a, &dad);
Translation of call with
cyclically distributed actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
CALL INIT(A)
Translation:
Procs1 p(4);
CyclicRange x(50, p.dim(0));
float* a = new float [x.volume()];
Map maps [1];
maps [0] = Map(x, 1);
DAD dad(1, p, maps);
init(a, &dad);
Translation of call with strided
alignment of actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(100)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A(1:100:2))
Translation:
Procs1 p(4);
BlockRange x(100, p.dim(0));
float* a = new float [x.volume()];
// Create DAD for section a(::2)
Range x2 = x.subrng(50, 0, 2);
Map maps [1];
maps [0] = Map(x2, 1);
DAD dad(1, p, maps);
init(a, &dad);
Translation of call with row-
aligned actual
Source:
!HPF$ PROCESSORS Q(2, 2)
REAL A(6, 50)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
CALL INIT(A(2, :))
Translation:
Procs2 q(2, 2);
BlockRange x(6, q.dim(0)), y(50, q.dim(1));
float* a = new float [x.volume() * y.volume()];
// Create DAD for section a(1, :)
Location i;
x.location(&i, 1);
Group p = q;
p.restrict(q.dim(0), i.crd);
Map maps [1];
maps [0] = Map(y, x.volume());
DAD dad(1, p, maps);
init(a + i.sub, &dad);
Other features of the Adlib
DAD
Support for block-cyclic distributions.
Local loops traversing distributed data need
outer loop over set of local blocks.
LocBlocksIndex iterator class. offset()
method computes overall memory offset.
Support for ghost extension, other
memory layouts. Shift in memory for ghost
region not in local subscript (universal—
memory-layout-independent). disp(), offset(),
step() methods applied to local subscript.
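The need for an outer loop over local blocks in the block-cyclic case can be seen in a short sketch. It does not use `LocBlocksIndex` or any other library class. The function `block_cyclic_owned` is hypothetical and simply lists, in local storage order, the zero-based global indices a process owns under a CYCLIC(k) distribution.

```cpp
#include <cassert>
#include <vector>

// Global indices owned by process crd when n elements are distributed
// CYCLIC(k) over np processes.
std::vector<int> block_cyclic_owned(int n, int np, int k, int crd) {
    assert(n >= 0 && np > 0 && k > 0 && crd >= 0 && crd < np);
    std::vector<int> owned;
    // Outer loop: local blocks start at crd*k, (crd + np)*k, (crd + 2*np)*k, ...
    for (int start = crd * k; start < n; start += np * k) {
        const int end = (start + k < n) ? start + k : n;  // last block may be partial
        // Inner loop: the elements of one local block.
        for (int g = start; g < end; ++g) owned.push_back(g);
    }
    return owned;
}
```

With 10 elements, 2 processes and block size 3, process 0 owns 0, 1, 2, 6, 7, 8 and process 1 owns 3, 4, 5, 9. A single strided loop cannot express either set.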
Other features of the Adlib
DAD, II
Support for loops over subranges.
Additional block() methods take triplet
arguments—directly traverse subranges.
crds() methods define ranges of coordinates
where local blocks actually exist.
Other feature to support communication
library. AllBlocksIndex.
Miscellaneous inquiries and predicates.
Useful in general libraries, and for runtime
checking programs for correctness.
Next Lecture:
Communication in Data Parallel
Languages
Patterns of communication needed to
implement language constructs.
Libraries that support these communication
patterns.