Summer Institute presentation on 3D FFT, San Diego

Scaling Three-dimensional Fourier Transforms

Dmitry Pekurovsky
dmitry@sdsc.edu
San Diego Supercomputer Center

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
            Fast Fourier Transform (FFT)
                     F(k) = Σ_x f(x) exp(-i 2π k x / N)
•   A well-known, efficient, commonly used algorithm
•   1D FFT used in MANY science fields
•   Implemented in almost every scientific library (e.g. ESSL,
    NAG, MKL)
•   Open-source software
    • FFTW
• 3D FFT is fairly commonly used too
    •   Turbulence simulations
    •   Reverse tomography
    •   Cosmology
    •   Ocean science
    •   Molecular Dynamics
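To make the definition at the top of this slide concrete, here is a short sketch (Python/NumPy as a stand-in, not part of the original slides) comparing the direct O(N^2) sum with a library FFT:

```python
import numpy as np

N = 8
f = np.random.rand(N)

# Direct evaluation of F(k) = sum_x f(x) exp(-i 2 pi k x / N): O(N^2) work
k = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
F_direct = (f * np.exp(-2j * np.pi * k * x / N)).sum(axis=1)

# Library FFT: the same result in O(N log N)
F_fft = np.fft.fft(f)
```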
                                    1D FFT

• Cooley-Tukey algorithm (1965)
  •   Divide-and-conquer algorithm
  •   Break up a transform of size N into N1 x N2 – e.g. even and odd indices
  •   Keep going recursively until reaching size 2
  •   Complexity: O(N log N)
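The radix-2 divide-and-conquer recursion can be sketched as follows (an illustrative NumPy version, not the library code the slides use):

```python
import numpy as np

def fft_radix2(f):
    """Radix-2 Cooley-Tukey FFT for power-of-two sizes (recursive sketch)."""
    N = len(f)
    if N == 1:
        return f.astype(complex)
    # Divide: transform the even- and odd-indexed halves separately
    even = fft_radix2(f[0::2])
    odd = fft_radix2(f[1::2])
    # Conquer: combine the half-size transforms with twiddle factors
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + w * odd, even - w * odd])

f = np.random.rand(16)
F = fft_radix2(f)
```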




         3D Fourier Transform in parallel:
                     strategy
• Problem: grid N x N x N distributed over Np processors
     • Would like to do 1D FFT in each of the three dimensions
     • Not all data in a given dimension may be local to any processor

Strategy
1. Direct approach: parallel version of 1D FFT – many processors take part,
    sending data when needed
2.   Transpose approach
     • Rearrange data in advance to make it always local for the dimension to be
        transformed.
     • All processors call regular 1D FFT
Lots of communication in both cases, but transpose approach is
  better: need to communicate only once, right before the local
  1D FFT.
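The transpose strategy has a simple serial analogue: always transform along the leading axis, then permute the axes so the next dimension becomes local. A sketch (NumPy, illustrative only; in parallel the transposes become all-to-all exchanges):

```python
import numpy as np

N = 16
a = np.random.rand(N, N, N)

# At every stage the 1D FFTs run along the leading axis; a transpose then
# brings the next dimension into the leading position.
b = a.astype(complex)
for _ in range(3):
    b = np.fft.fft(b, axis=0)        # regular 1D FFTs on "local" data
    b = np.transpose(b, (1, 2, 0))   # rotate axes: next dimension becomes leading
```

After three FFT-plus-transpose stages the axes return to their original order, and the result equals a full 3D FFT.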
  Typical parallel 3D FFT up to now
Transpose strategy, 1D decomposition

• Each CPU has several planes (slabs)
   • Local geometry: N x N x L, where L = N/P
• Transform in the 2 dimensions local to the planes first
   • Use an established library for 1D FFT, e.g. ESSL, FFTW
• Transpose the data to localize the third dimension
   • Local geometry: N x L x N
• Complete the transform by operating along the third dimension
• Almost everyone has been using this approach. Examples:
   • PESSL
   • NAS Parallel Benchmarks
            1D decomposition scaling
• Communication: 1 call to MPI_Alltoall or equivalent all-to-all
  exchange.
• Demanding operation. Performance depends on
  interconnect bisection bandwidth.
       – Cut the network in two. Count the links cut.
       – This parameter depends on bandwidth of each link, and also network
         topology (e.g. fat-tree for Datastar, 3D torus for Blue Gene)
• Can scale well (depending on the interconnect), but only up to a
  point:
                             P <= N
• For a 4096^3 grid, no more than 4096 processors – but many more are
  needed just to fit the problem in memory.
• => A new approach is needed

                       2D decomposition

[figure: pencil (column) decomposition of the N x N x N grid along the x, y, and z axes]
                      2D decomposition
• 2D decomposition: processors form a 2D grid
  P1 x P2
  • Columns (pencils) instead of slabs
  • Local geometry:
     •   Stage 1: N x K x M
     •   Stage 2: K x N x M
     •   Stage 3: K x M x N
     •   K = N/P1, M=N/P2
     •   At each stage, transform along the local dimension (N)
• Software scaling limit removed! Now the limit is
     P <= N^2
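The local pencil sizes follow directly from the stage geometries above. A small worked example (the numbers P1 = 8, P2 = 4096 are illustrative, not from the slides):

```python
# Pencil sizes for the 2D decomposition of an N x N x N grid
N = 4096                  # grid size per dimension
P1, P2 = 8, 4096          # processor grid: P = P1 * P2 = 32768 > N,
                          # which a 1D decomposition could not handle
K = N // P1               # pencil width, = 512
M = N // P2               # pencil depth, = 1
# Stage 1 holds an N x K x M block; stages 2 and 3 permute which axis is local
stage1 = (N, K, M)
```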

                     MPI Implementation
• Setup stage: create two subcommunicators, for
  rows and columns:
  integer cartid(2),dims(2),comm_cart,comm_row,comm_col,ierr
  logical periodic(2),remain_dims(2)

  dims(1) = P1
  dims(2) = P2
  periodic(1) = .false.
  periodic(2) = .false.
  call MPI_Cart_create(MPI_COMM_WORLD,2,dims,periodic,.false.,comm_cart,ierr)
  call MPI_Cart_coords(comm_cart,taskid,2,cartid,ierr)
  remain_dims(1) = .true.
  remain_dims(2) = .false.
  call MPI_Cart_sub(comm_cart,remain_dims,comm_row,ierr)
  remain_dims(1) = .false.
  remain_dims(2) = .true.
  call MPI_Cart_sub(comm_cart,remain_dims,comm_col,ierr)

                 MPI implementation
• First transpose
call MPI_Alltoall(sendbuf,Ncount,MPI_DOUBLE_PRECISION,
  recvbuf,Ncount,MPI_DOUBLE_PRECISION,comm_row,ierr)

• Second transpose
call MPI_Alltoall(sendbuf,Ncount,MPI_DOUBLE_PRECISION,
  recvbuf,Ncount,MPI_DOUBLE_PRECISION,comm_col,ierr)




       What about overlapping
   communication with computation?
• In general it’s a good idea
• There isn’t a non-blocking version of
  MPI_Alltoall in the MPI standard today
  • Need to write your own, based on MPI_Isend, MPI_Irecv,
    MPI_Wait
• Performance advantage will depend on platform
  • Blue Gene/L is not equipped with capabilities for overlap.
    CPUs are busy transmitting data.
  • Next generation of machines likely to have RDMA
    capabilities (Remote Direct Memory Access)

                Some rules of thumb
• Aggregate data and call MPI as few times as
  possible
• Derived datatypes are convenient but are often a
  disadvantage in performance
• Try to minimize the number of memory copies –
  do in-place operations if feasible
• If possible, try to use array stride 1
• When calling 1D FFT give it many transforms to
  do at a time


         Implementation and Performance
Each processor does a (large) number of 1D FFTs
• Memory access stride is a crucial parameter
    1.    FFT in X, stride 1: 28% peak on Datastar
    2.    Transpose in X & Y
    3.    FFT in Y, stride Nx/P1: 25% peak on Datastar
    4.    Transpose in Y & Z
    5.    FFT in Z, stride (Nx Ny)/P: 10% peak on Datastar

•       An alternative is to rearrange the data in memory so that the
        stride is 1. This involves an extra pass over memory.
•       ESSL is about 3 times faster than FFTW.
    •     Always use a native high-performance library if available
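The stride effect and the rearrange-first alternative can be illustrated in a few lines (NumPy sketch, not the slides' Fortran code; timings will vary by platform):

```python
import numpy as np

a = np.random.rand(64, 64)   # C (row-major) order: the last axis is stride-1

F_fast = np.fft.fft(a, axis=1)   # unit-stride reads along rows
F_slow = np.fft.fft(a, axis=0)   # strided reads through memory

# Rearrange first so the transform sees stride-1 data, at the cost of a copy
F_copy = np.fft.fft(np.ascontiguousarray(a.T), axis=1).T
```

Both routes compute the same transform; the choice is between strided access and an extra memory copy, exactly the trade-off described above.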
               Implementation in a code
• Originally implemented in DNS (Direct Numerical Simulations) turbulence
  code
    •   Prof. P.K.Yeung, Georgia Tech
• Part of SAC project (Strategic Applications Collaboration,
  http://www.sdsc.edu/us/sac)
• Algorithm: straight 2nd-order Runge-Kutta in time, pseudospectral in space.
   • 3D FFT is ~70% of the computation, and ~95% of the communication
• Researchers are interested in grid sizes of 4096^3 and beyond
    • (The Earth Simulator is apparently the only machine to have handled this
      size to date, in 2002!)
• More than 4096 processors are likely needed due to memory
  and compute power.
    • This runs into the 1D decomposition software limitation
• Ran a 4096^3 test problem with 2D decomposition on Blue Gene
  at the IBM T.J. Watson lab, up to 32K CPUs.
Performance of DNS (2D) on IBM Blue Gene

[figure: DNS performance on Blue Gene. Y-axis: (Time x Np)/(N^3 log N);
X-axis: Np from 64 to 32768. Series: 512^3, 1024^3, 2048^3, and 4096^3 grids,
each in CO and VN mode. VN: two processors per node; CO: one processor per node.]
DNS (2D) scaling on BG/W

[figure: DNS scaling results with the 2D decomposition on Blue Gene Watson]
2D versus 1D performance

[figure: performance comparison of the 2D and 1D decompositions]
Communication and performance: 2D vs. 1D

• 1D: 1 All-to-all exchange over P tasks
• 2D: 2 All-to-all exchanges in groups:
   • P2 groups of P1 tasks each
   • P1 groups of P2 tasks each
• Which is better?
   • Message size decreases if the exchange group size increases
   • Some links are unused in each of the 2 transposes, and data volume
     is the same as in 1D case
      • 1D decomposition ends up winning
• But again: 1D is only good up to P = N



                  Choosing P1 and P2

• P1 x P2 = P
• If P <= N, choose P1 = 1 (1D decomposition is
  better)
• If P > N, in most cases it’s best to choose the
      smallest P1 (= P/N)
  • Better communication pattern
  • In some cases, better cache use: the arrays’ leading
    dimension is N/P1, which is a problem if it gets to be smaller
    than the cache line.


Choosing P1 and P2, cont’d

[figure: performance for different choices of P1 and P2]
                                  P3DFFT
• Open source library for efficient, highly scalable 3D
  FFT on parallel platforms
• Built on top of an optimized 1D FFT library
   • Currently ESSL or FFTW
   • In the future, more libraries
• Uses 2D decomposition, with option for 1D.
• Available at
  http://www.sdsc.edu/us/resources/p3dfft.php
• Current status: basic version available, more work is
  under way.
               P3DFFT: more features
• Implements real-to-complex (R2C) and complex-to-real
  (C2R) 3D transforms
• Written in Fortran 90 with MPI
• Implemented as F90 module
• Single or double precision
• Arbitrary dimensions
   • Handles many uneven cases (Ni does not have to be divisible by Pj)
   • Assumes Ni >= Pj
• Can do either an in-place or an out-of-place transform
   • In the out-of-place case the input and output arrays should not overlap
• Memory requirements: input and output arrays + 1 buffer

                        P3DFFT: Storage
• R2C transform input:
  • Global: (Nx,Ny,Nz) real array
  • Local: (Nx,Ny/P1,Nz/P2) real array
• R2C output:
  • Global: (Nx/2+1,Ny,Nz) complex array
     •   Conjugate symmetry: F(N-i) = F*(i)
     •   F(N/2+1) through F(N-1) are redundant
     •   F(0) and F(N/2) are real
     •   Total N/2+1 complex storage units
  • Local: ((Nx+2)/(2P1),Ny/P2,Nz) complex
• C2R: input and output interchanged from R2C
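The conjugate-symmetry storage savings are easy to verify with any FFT library; a NumPy sketch (illustrative, not P3DFFT itself):

```python
import numpy as np

N = 8
f = np.random.rand(N)
F = np.fft.fft(f)        # full complex transform: N values
R = np.fft.rfft(f)       # real-to-complex transform: only N/2 + 1 values stored

# Conjugate symmetry: F(N-i) = conj(F(i)), so the upper half is redundant
assert np.allclose(F[N - 1], np.conj(F[1]))
# F(0) and F(N/2) are purely real for real input
assert abs(F[0].imag) < 1e-10 and abs(F[N // 2].imag) < 1e-10
```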
                         Building P3DFFT
• Subdirectory build/
   • Edit makefile for your architecture
   • Specify -DSINGLE_PREC or -DDOUBLE_PREC
       • Default is single precision
   • Specify -DESSL or -DFFTW
   • Choose P3DFFT_ROOT
   • Type ‘make’ – will compile and install the library in P3DFFT_ROOT
     directory
• On SDSC platforms, it is already installed in
  /usr/local/apps/p3dfft
• Subdirectory sample/
   • Contains sample programs
• For manual see README file in upper level directory

                             Using P3DFFT
• ‘use p3dfft’
  • Defines mytype – 4 or 8 byte reals
• Initialization: p3dfft_setup(proc_dims,nx,ny,nz)
  •   proc_dims(2) – two integers, P1 and P2
  •   For 1D decomposition, set P1 = 1
  •   Sets up the 2D grid of MPI communicators in rows and columns
  •   Initializes local array dimensions
• Local dimensions: get_dims(istart,iend,isize,Q)
       • istart(3),iend(3),isize(3) – X,Y,Z bounds for local domains
       • Set Q=1 for R2C input, Q=2 for R2C output




                     Using P3DFFT (2)

• Now allocate your arrays
• When ready to call a transform:
  p3dfft_ftran_r2c(IN,OUT) – Forward transform
     • exp(-i 2π k x / N)
  p3dfft_btran_c2r(IN,OUT) – Backward (inverse)
    transform
     • exp(i 2π k x / N)




                              Future work
• Include more array distribution schemes
• Expand options for memory ordering
• Include more transform types
    • Derivatives
    • Complex-to-complex
•   C interface
•   Convenience functions
•   Break down in stages
•   Expand choice of 1D FFT libraries
•   Questions? Comments?
        dmitry@sdsc.edu
              P3DFFT and Petascale
• This algorithm is applicable to a large number of
  problems, in particular DNS turbulence
  • One of the three model problems for NSF’s Petascale
    platform procurement RFP
• This algorithm is likely to scale even to 10^5 - 10^6
  CPUs, assuming a powerful interconnect
• Overlap of communication and computation will
  help! (RDMA)
• Hybrid MPI/OpenMP
• Co-Array Fortran
  • PGAS – Partitioned Global Address Space model
                               Exercise
1.   Write a program that would:
     •   Initialize P3DFFT
     •   Allocate input and output 3D arrays (choose a global size that is not
         too big: 128^3 or 256^3 is reasonable)
     •   Initialize the arrays to your favorite functional form, such as a sine wave
     •   Do a forward transform
     •   Compare the data with what you expect
     •   Do a backward transform
     •   After multiplying by a normalization factor, compare the results with the
         original array
     •   Time the performance of the computational part
     •   Run at a few grid sizes and processor counts; study the performance
2.   Do the same for an in-place transform (output overwrites input)
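For checking the math before writing the parallel version, a serial NumPy analogue of the exercise might look like this (P3DFFT itself needs MPI and Fortran; this sketch only mirrors the forward/backward/normalize steps):

```python
import numpy as np

# Serial analogue of the exercise. Note: NumPy's inverse FFT already divides
# by the grid size; with FFTW or ESSL you would apply the 1/N^3
# normalization factor yourself, as the exercise says.
N = 32
x, y, z = np.meshgrid(*(np.arange(N),) * 3, indexing='ij')
f = np.sin(2 * np.pi * 3 * x / N)        # a single sine mode along x

F = np.fft.fftn(f)
# Expect the spectrum concentrated at k_x = 3 (and its conjugate k_x = N - 3)
k_peak = np.unravel_index(np.argmax(np.abs(F)), F.shape)[0]

f_back = np.fft.ifftn(F).real            # backward transform, already normalized
```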





								