Experiences with Co-array Fortran on Hardware Shared Memory Platforms


Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey, Daniel Chavarria-Miranda

Rice University, Houston, TX
Co-array Fortran

Global Address Space (GAS) language
SPMD programming model
Simple extension of Fortran 90
Explicit control over data placement and computation distribution
   Private data
   Shared data: both local and remote
One-sided communication (PUT and GET)
Team and point-to-point synchronization
   Co-array Fortran: Example
integer :: a(10,20)[*]

[Figure: images 1 through N, each with its own local a(10,20)]
if (this_image() > 1) then   ! copy halo from left neighbor (GET)
   a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
end if

[Figure: each image copies columns 19:20 of its left neighbor into its own columns 1:2]
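The example above pulls data with a GET; the same exchange can be
written as a PUT that pushes into the right neighbor's halo. A minimal
sketch (same declaration as above; sync_notify/sync_wait are the
point-to-point synchronization primitives of the Rice CAF dialect):

if (this_image() < num_images()) then
   ! PUT: push my boundary columns into the right neighbor's halo
   a(1:10,1:2)[this_image()+1] = a(1:10,19:20)
   call sync_notify(this_image()+1)   ! signal: halo data delivered
end if
if (this_image() > 1) then
   call sync_wait(this_image()-1)     ! wait for left neighbor's PUT
end if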
Compiling CAF

Source-to-source translation
Prototype: Rice's cafc compiler
   Fortran 90 pointer-based co-array representation
   ARMCI-based data movement
Goal: performance transparency
Challenges:
   Retain CAF source-level information
        Array contiguity, array bounds, lack of aliasing
   Exploit efficient fine-grain communication on SMPs
Outline

Co-array representation and data access
 Local data
 Remote data

Experimental evaluation
Conclusions
Representation and Access for
Local Data
Efficient local access to SAVE/COMMON co-arrays is crucial
for achieving the best performance on a target architecture.
Candidate representations:

Fortran 90 pointer
Fortran 90 pointer to structure
Cray pointer
Subroutine argument
COMMON block (needs support for symmetric shared objects)
   Fortran 90 Pointer Representation
CAF declaration:           real, save :: a(10,20)[*]
After translation:         type T1
                             integer(PtrSize) handle
                             real, pointer :: local(:,:)
                           end type T1
                           type (T1) ca
Local access:              ca%local(2,3)
   Portable representation
   Back-end compiler has no knowledge about:
           Potential aliasing (no-alias flags for some compilers)
           Contiguity
           Bounds
   Implemented in cafc
   Fortran 90 Pointer to Structure
   Representation
CAF declaration:     real, save :: a(10,20)[*]

After translation:   type T1
                       real :: local(10,20)
                     end type T1
                     type (T1), pointer :: ca



   Conveys constant bounds and contiguity
   Potential aliasing is still a problem
   Cray Pointer Representation
CAF declaration:     real, save :: a(10,20)[*]

After translation:   real :: a_local(10,20)
                     pointer (a_ptr, a_local)



   Conveys constant bounds and contiguity
   Potential aliasing is still a problem
   Cray pointers are not part of the Fortran 90 standard
   Subroutine Argument
   Representation
CAF source:          subroutine foo(…)
                       real, save :: a(10,20)[*]
                       a(i,j) = … + a(i-1,j) * …
                     end subroutine foo
After translation:
subroutine foo(…)
  ! F90 representation for co-array a
  call foo_body(ca%local(1,1), ca%handle, …)
end subroutine foo
subroutine foo_body(a_local, a_handle, …)
  real :: a_local(10,20)
  a_local(i,j) = … + a_local(i-1,j) * …
end subroutine foo_body
Subroutine Argument
Representation (cont.)

Avoids conservative back-end compiler assumptions about
co-array aliasing
Performance is close to optimal
Extra procedures and procedure calls
Implemented in cafc
   COMMON Block Representation
CAF declaration:       real :: a(10,20)[*]
                       common /a_cb/ a

After translation:     real :: ca(10,20)
                       common /ca_cb/ ca



   Yields best performance for local accesses
   OS must support symmetric data objects (each object has the
   same virtual address in every process)
Outline

Co-array representation and data access
 Local data
 Remote data

Experimental evaluation
Conclusions
Generating CAF Communication

Generic parallel architectures
   Library function calls to move data
Shared memory architectures (load/store)
   Fortran 90 pointers
   Vector of Fortran 90 pointers
   Cray pointers
   Communication Generation for
   Generic Parallel Architectures
CAF code:          a(:) = b(:)[p] + …
Translated code:   allocate( b_temp(…) )
                   call GET( b, p, b_temp, … )
                   a(:) = b_temp(:) + …
                   deallocate b_temp

  Portable: works on clusters and SMPs
  Function overhead per fine-grain access
  Uses temporary to hold off-processor data
  Implemented in cafc
   Communication Generation
   Using Fortran 90 Pointers
CAF code:          do j = 1, N
                     C(j) = A(j)[p]
                   end do
Translated code:   do j = 1, N
                     ptrA => A(j)
                     call CafSetPtr(ptrA, p, A_handle)
                     C(j) = ptrA
                   end do

  Function call overhead for each reference
  Implemented in cafc
   Pointer Initialization Hoisting
Naïvely translated code:   do j = 1, N
                             ptrA => A(j)
                             call CafSetPtr(ptrA, p, A_handle)
                             C(j) = ptrA
                           end do
Code with hoisted pointer initialization:
               ptrA => A(1:N)
               call CafSetPtr(ptrA,p,A_handle)
               do j = 1, N
                 C(j) = ptrA(j)
               end do
   Pointer initialization hoisting is not yet implemented in cafc
   Communication Generation Using
   Vector of Fortran 90 Pointers
CAF code:          do j = 1, N
                     C(j) = A(j)[p]
                   end do
Translated code:   … initialization …
                   do j = 1, N
                     C(j) = ptrVectorA(p)%ptrA(j)
                   end do

  Does not require pointer initialization hoisting and avoids
  per-reference function calls
  Performs worse than hoisted pointer initialization
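A sketch of the elided one-time initialization, assuming a derived
type that wraps one Fortran 90 pointer per image (the wrapper type
and setup loop are hypothetical; CafSetPtr as above):

! set up one pointer per image, each swizzled once to that image's
! instance of co-array A; later references need no per-access call
type PtrWrapper
  real, pointer :: ptrA(:)
end type PtrWrapper
type (PtrWrapper), allocatable :: ptrVectorA(:)

allocate(ptrVectorA(num_images()))
do p = 1, num_images()
  ptrVectorA(p)%ptrA => A(1:N)
  call CafSetPtr(ptrVectorA(p)%ptrA, p, A_handle)
end do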
   Communication Generation
   Using Cray Pointers
CAF code:           do j = 1, N
                      C(j) = A(j)[p]
                    end do
Translated code:    integer(PtrSize) :: addrA(:)
                    real :: A_rem(N)
                    pointer (ptrA, A_rem)   ! Cray pointer/pointee pair
                    … addrA initialization …
                    do j = 1, N
                      ptrA = addrA(p)
                      C(j) = A_rem(j)
                    end do

  addrA(p) – address of co-array A on image p
  Cray pointer initialization hoisting yields only marginal
  improvement
Outline

Co-array representation and data access
 Local data
 Remote data

Experimental evaluation
Conclusions
Experimental Platforms

SGI Altix 3000
   128 Itanium2 processors (1.5 GHz, 6 MB L3 cache)
   Linux (2.4.21 kernel)
   Intel Fortran Compiler 8.0


SGI Origin 2000
   16 MIPS R12000 processors (350 MHz, 8 MB L2 cache)
   IRIX64 6.5
   MIPSpro Compiler 7.3.1.3m
Benchmarks

STREAM
Random Access
Spark98
NAS MG and SP
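Random Access stresses exactly the fine-grain communication studied
here: each iteration XORs a pseudo-random value into a random element
of a table spread across images. A minimal CAF sketch of the update
loop (table_size, num_updates, and the LCG constants are stand-ins,
not the benchmark's exact generator):

integer(8) :: table(table_size)[*]   ! update table spread across images
integer(8) :: ran
integer    :: i, img, idx
ran = 1
do i = 1, num_updates
   ! advance a simple LCG stream (stand-in for the benchmark's PRNG)
   ran = ran * 6364136223846793005_8 + 1442695040888963407_8
   img = int(modulo(ran, int(num_images(), 8))) + 1
   idx = int(modulo(ran / num_images(), int(table_size, 8))) + 1
   ! fine-grain remote read-modify-write of one table element
   table(idx)[img] = ieor(table(idx)[img], ran)
end do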
   STREAM
                           Copy kernel
DO J = 1, N                    DO J = 1, N
  C(J) = A(J)                    C(J) = A(J)[p]
END DO                         END DO

                           Triad kernel
DO J = 1, N                     DO J = 1, N
  A(J)=B(J)+s*C(J)                A(J)=B(J)[p]+s*C(J)[p]
END DO                          END DO
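
The Scale kernel, referenced in the results below, follows the same
local/remote pattern (a sketch based on STREAM's standard Scale loop):

                           Scale kernel
DO J = 1, N                     DO J = 1, N
  B(J) = s*C(J)                   B(J) = s*C(J)[p]
END DO                          END DO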

Goal: investigate how much of the architecture's memory bandwidth
  can be delivered at the language level
STREAM: Local Accesses
COMMON block representation is best, where the platform allows it
Subroutine argument representation performs similarly to the
COMMON block representation
Pointer-based representations perform within 5% of the best on the
Altix (with the no-aliasing flag) and within 15% on the Origin
Without a flag asserting the lack of pointer aliasing, the Fortran 90
pointer representation achieves only 30% of the best performance on
the Altix
Array section statements with the Fortran 90 pointer representation
achieve only 40-50% of the best performance on the Origin
STREAM: Remote Accesses
COMMON block representation for local accesses + Cray pointers for
remote accesses is best
Subroutine argument representation + Cray pointers for remote
accesses performs similarly
Remote accesses with a function call per access perform very poorly
(24 times slower than the best on the Altix, five times slower on
the Origin)
The generic strategy (with intermediate temporaries) delivers only
50-60% of the best performance on the Altix and 30-40% on the Origin
for vectorized code (except for the Copy kernel)
Pointer initialization hoisting is crucial for Fortran 90 pointer
remote accesses and desirable for Cray pointers
A similarly coded OpenMP version performs comparably on the Altix
(90% for the Scale kernel) and at 86-90% on the Origin
Spark98

Based on CMU’s earthquake simulation code
Computes sparse matrix-vector product
Irregular application with fine-grain accesses
Matrix distribution and computation partitioning are done
offline (sf2 traces)
Spark98 computes partial products locally, then assembles
the result across processors
Spark98 (cont.)

Versions
   Serial (Fortran kernel, ported from C)
   MPI (Fortran kernel, ported from C)
   Hybrid (best shared memory threaded version)
   CAF versions (based on MPI version):
        CAF Packed PUTs
        CAF Packed GETs
        CAF GETs (computation with remote data accessed “in
         place”)
  Spark98 GETs Result Assembly
v2(:,:) = v(:,:)
call sync_all()
do s = 0, subdomains-1
  if (commindex(s) < commindex(s+1)) then
    pos = commindex(s)
    comm_len = commindex(s+1) - pos
    v(:, comm(pos:pos+comm_len-1)) =       &
        v(:, comm(pos:pos+comm_len-1)) +   &
        v2(:, comm_gets(pos:pos+comm_len-1))[s]
  end if
end do
call sync_all()
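
For contrast, the packed variants replace the vector-subscript GET
above with one contiguous transfer per neighbor. A sketch of how a
packed-GETs assembly might look (sendbuf, recvbuf, total_len, and a
symmetric packing order are assumptions, not the actual code of the
CAF Packed GETs version):

! each image packs its outgoing values into a contiguous co-array
! buffer before the exchange
sendbuf(:, 1:total_len) = v2(:, comm_gets(1:total_len))
call sync_all()
do s = 0, subdomains-1
  if (commindex(s) < commindex(s+1)) then
    pos = commindex(s)
    comm_len = commindex(s+1) - pos
    ! one contiguous GET per neighbor, then a local unpack/accumulate
    recvbuf(:, 1:comm_len) = sendbuf(:, pos:pos+comm_len-1)[s]
    v(:, comm(pos:pos+comm_len-1)) =       &
        v(:, comm(pos:pos+comm_len-1)) + recvbuf(:, 1:comm_len)
  end if
end do
call sync_all()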
Spark98 Performance on Altix

[Figure: Spark98 scaling on the Altix; chart not reproduced]
Performance of all CAF versions is comparable to that of MPI, and
better at large CPU counts
CAF GETs is simpler and more "natural" to code, but up to 13% slower
Without attention to locality, applications do not scale on NUMA
architectures (Hybrid)
The ARMCI library is more efficient than MPI
NAS MG and SP

Versions:
   MPI (NPB 2.3)
   CAF (based on MPI NPB 2.3)
     Generic code generation with the subroutine argument co-array
      representation (procedure splitting)
     Shared memory code generation (Fortran 90 pointers; vectorized
      source code) with the subroutine argument co-array representation
   OpenMP (NPB 3.0)
Problem size: Class C
NAS SP Performance on Altix

[Figure: NAS SP scaling on the Altix; chart not reproduced]
Performance of the CAF versions is comparable to that of MPI
CAF-generic outperforms CAF-shm because it uses memcpy, which
hides latency by keeping an optimal number of memory operations
in flight
OpenMP scales poorly
NAS MG Performance on Altix

[Figure: NAS MG scaling on the Altix; chart not reproduced]
Conclusions
Direct load/store communication improves the performance of
fine-grain accesses by a factor of 24 on the Altix 3000 and a
factor of five on the Origin 2000
"In-place" use of remote data in CAF statements incurs an
acceptable abstraction overhead
CAF performance is comparable to that of MPI codes for both
fine-grain and coarse-grain applications
We plan to implement optimal, architecture-dependent code
generation for local and remote co-array accesses in cafc
www.hipersoft.rice.edu/caf

								