Scalability of Gaussian 03 on SGI Altix:
The Importance of Data Locality on CC-NUMA Architecture

Roberto Gomperts (1), Michael Frisch (2), Jean-Pierre Panziera (1)
(1) SGI    (2) Gaussian, Inc.
Top Original
 top - 14:32:46 up 5 days, 1:05, 10 users, load average: 22.86, 21.96, 13.56
 Tasks: 924 total, 33 running, 891 sleeping, 0 stopped, 0 zombie
 Cpu(s): 2.0%us, 0.0%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
 Mem:  144829024k total,   5608320k used, 139220704k free,    12480k buffers
 Swap: 195417776k total,         0k used, 195417776k free,   774048k cached

    PID USER PR NI  VIRT  RES  SHR S %CPU %MEM  TIME  P COMMAND
 216786 chem 15  0  5200 2768 1632 R   14  0.0  0:00 42 top
 216693 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:39 35 l1002.exe
 216692 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:21 34 l1002.exe
 216691 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:41 33 l1002.exe
 216690 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:14 32 l1002.exe
 216689 chem 25  0 35.6g 2.7g 8016 R   99  1.9 10:36 31 l1002.exe
 216688 chem 25  0 35.6g 2.7g 8016 R   99  1.9 10:22 30 l1002.exe
 216687 chem 25  0 35.6g 2.7g 8016 R   99  1.9 10:43 29 l1002.exe
 ...
 216672 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:05 14 l1002.exe
 216671 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:32 13 l1002.exe
 216670 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:13 12 l1002.exe
 216669 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:33 11 l1002.exe
 216668 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:23 10 l1002.exe
 216667 chem 25  0 35.6g 2.7g 8016 R  100  1.9 10:33  9 l1002.exe
 216666 chem 25  0 35.6g 2.7g 8016 R   99  1.9 10:12  8 l1002.exe
 216665 chem 25  0 35.6g 2.7g 8016 R   99  1.9  6:13  7 l1002.exe
 216664 chem 25  0 35.6g 2.7g 8016 R   99  1.9  5:54  6 l1002.exe
 216663 chem 25  0 35.6g 2.7g 8016 R   99  1.9  6:08  5 l1002.exe
 216661 chem 25  0 35.6g 2.7g 8016 R  100  1.9  7:15  4 l1002.exe




Topics

  Gaussian.
  Molecular Model.
  Altix Architecture.
  Parallel Implementation.
  Tools.
  Effect of Localization.
  Summary.


Gaussian (www.gaussian.com)
  Gaussian 03 is the latest in the Gaussian series of
  electronic structure programs. Gaussian 03 is used by
  chemists, chemical engineers, biochemists, physicists
  and others for research in established and emerging
  areas of chemical interest.
  Starting from the basic laws of quantum mechanics,
  Gaussian predicts the energies, molecular structures,
  and vibrational frequencies of molecular systems, along
  with numerous molecular properties derived from these
  basic computation types. It can be used to study
  molecules and reactions under a wide range of
  conditions, including both stable species and compounds
  which are difficult or impossible to observe
  experimentally such as short-lived intermediates and
  transition structures.


α-pinene
  C10H16 (with a reactive 4-membered ring).
  A monoterpene that may be a significant factor affecting bacterial
  activities in nature.
  It is a major component of tea-tree oils and gives off a piney odor.
  Terpenes are widely used as flavorings, deodorants, and medicines
  (e.g., in the treatment of acne).
  Calculation:
   – IR (infrared) vibrational spectrum.
   – Density Functional Theory (DFT): B3LYP model.
   – 6-311G(df,p) basis set (346 basis functions).

  Image generated with Avogadro.
  Devlin, F.J., Stephens, P.J., Cheeseman, J.R., et al.: J. Phys. Chem. A
  101(51), 9912 (1997).




Altix Architecture CC-NUMA: Shared
Memory
 [Diagram: two groups of four compute nodes plus an IO node, connected by
  routers over the NUMAlink fabric; each compute node has two processors
  and memory DIMMs attached to a Hub chip.]

 Each compute node has:
  – 2 processor sockets.
  – Memory DIMMs.
  – Hub chip:
     • Memory controller for the local DIMMs.
     • Interfaces with the processors via the FSB (667 MHz).
     • Interfaces with the rest of the system via 2 NUMAlink ports.
 NUMAlink fabric:
  – 8-port routers (bandwidth: 3.2 GB/s in each direction).
  – Remote memory-load latency is 2.1-2.9x that of local access.
 In this study: Altix 450
  – 72 cores: Itanium 9130M, 1.66 GHz, 4 MB per-core L3 cache.
  – 144 GB physical memory.
  – Only 32 cores were used, so, with proper placement, a maximum of
    2 router hops.
Gaussian’s Parallelism and Memory
Usage Model

  Two parallelization models:
   – Shared memory (OpenMP).
   – Distributed memory (Linda).
  Possible to combine them (hybrid/hierarchical).
  Linda parallelism is a subset of the OpenMP parallelism.

  Gaussian shared memory with OpenMP:

  [Diagram: alternating serial (S) and parallel (P) sections; serial
   sections run on thread 0, parallel sections fan out to threads
   0, 1, 2, 3.]

  First-touch memory placement policy.
OpenMP Parallelization of Gaussian:
Basic Algorithm
 Subroutine CPHFdriver
   allocate scratch(LARGE)
   FA(:,:,:) = 0
   do while(iterative_solution)
     Call DirSCF(LARGE,scratch,FA)
     Call Update(FA)
   enddo
   ...

 Subroutine DirSCF(LARGE,scratch,FA)
   do while(.not. All_ijkl)
     Call CalcInt(LARGE,scratch)
     Call InFock(LARGE,scratch,FA)
   enddo
   ...

 Subroutine CalcInt(LARGE,scratch)
   do while(FitInLarge)
     do i,j,k,l
       scratch(i,j,k,l) = f(i,j,k,l)
     enddo
   enddo
   ...

 Subroutine InFock(LARGE,scratch,FA)    ! reads the shared density DA(:,:,:)
   do i = 1,LARGE
     do j = 1,nFock
       FA(j,ij,kl) = FA(j,ij,kl) +
            c(j,ij,kl)*scratch(i,j,k,l)
     enddo
   enddo
   ...
OpenMP Parallelization of
Gaussian: Modifications
 Original:

 Subroutine CPHFdriver
   allocate scratch(LARGE)
   FA(:,:,:) = 0
   do while(iterative_solution)
     Call DirSCF(LARGE,scratch,FA)
     Call Update(FA)
   enddo
   ...

 Modified:

 Subroutine CPHFdriver
   allocate scratch(LARGE)
   allocate FA_p(:,:,:,MAXPROC-1)
   FA(:,:,:) = 0
   sl = LARGE/np; lst_sl = LARGE - sl*(np-1)
   do while(iterative_solution)
 C$OMP Parallel DO Private(ip)
     do ip=1,np
       if(ip==np) then
         Call DirSCF(lst_sl,
           scratch(1+sl*(ip-1)),FA,ip,np)
       else
         FA_p(:,:,:,ip) = 0
         Call DirSCF(sl,scratch(1+sl*(ip-1)),
           FA_p(1,1,1,ip),ip,np)
       endif
     enddo
     do ip=1,np-1
       Call Daxpy(ij*kl,1.0d0,FA_p(1,1,1,ip),1,
         FA,1)
     enddo
     Call Update(FA)
   enddo
SGI Histx Performance Tool Suite:
Lipfpm & Histx
   Lipfpm (counters)
    – lipfpm “mlat” computes the average memory access latency seen at the
      processor interface (FSB) by dividing the cumulative number of memory
      read transactions outstanding per cycle by the total number of cycles.
    – lipfpm “dlatNN” simply counts the number of memory data-loads that
      took more than NN cycles to complete; possible values for NN are any
      power of two between 4 and 4096. E.g. lipfpm “dlat1024” measures the
      number of data-loads that took more than 1024 cycles.
   Histx (profiler)
    – histx “dlatNN” profiles the application based on the occurrence of
      data-loads with latency longer than NN cycles. E.g. with histx
      “dlat1024@2000” a sample is taken after every 2000 data-loads longer
      than 1024 cycles. So if an application generates 200 million data-loads
      longer than 1024 cycles, 100 thousand samples will be taken, allowing
      for a precise profile.
    – histx "numaNN" also samples data-loads longer than NN cycles (as
      histx "dlatNN" does), but it further identifies which node of the
      CC-NUMA system the accessed data resided on.




Original Code: Average Latencies and
Cycle Counts
  Four threads per node:
   – Load imbalance across threads.
   – Large latencies (cycles).

  [Chart: Average FSB Relative Memory Latency (nsec) per thread (0-28);
   latencies range up to ~1000 ns.]
  [Chart: Data-load miss latency >= 1024 cycles per thread (0-28);
   counts range up to ~225 million.]
Hot Spot: Routine Dgst01

       Do 500 IShC = 1, NShCom
         IJ = LookLT(C4IndR(IShC,1)+IJOff)
         KL = LookLT(C4IndR(IShC,6)+KLOff)
         R1IJKL = C4RS(IShC,IRS1,1)*( C4ERI(IShC,IJKL)
      $           - FactX*(C4ERI(IShC,IKJL)+C4ERI(IShC,ILJK)) )
 C
         If(Abs(R1IJKL).ge.CutOff) then
           Do 10 IMat = 1, NMatS
    10       FA(IMat,IJ) = FA(IMat,IJ) + DA(IMat,KL)*R1IJKL
           Do 20 IMat = 1, NMatS
    20       FA(IMat,KL) = FA(IMat,KL) + DA(IMat,IJ)*R1IJKL
         endIf

         [similar If/endIf constructs as above for the remaining
          index combinations]

   500 Continue

 Overall nesting level: 5
Parallelization of Gaussian:
Modifications for Local DA
 Without local DA (as on the previous slide):

 Subroutine CPHFdriver
   allocate scratch(LARGE)
   allocate FA_p(:,:,:,MAXPROC-1)
   FA(:,:,:) = 0
   sl = LARGE/np; lst_sl = LARGE - sl*(np-1)
   do while(iterative_solution)
 C$OMP Parallel DO Private(ip)
     do ip=1,np
       if(ip==np) then
         Call DirSCF(lst_sl,
           scratch(1+sl*(ip-1)),FA,ip,np)
       else
         FA_p(:,:,:,ip) = 0
         Call DirSCF(sl,scratch(1+sl*(ip-1)),
           FA_p(1,1,1,ip),ip,np)
       endif
     enddo
     ...

 With local DA (each thread allocates and fills its own copy of the
 read-only density, so first touch places it locally):

 Subroutine CPHFdriver
   allocate scratch(LARGE)
   allocate FA_p(:,:,:,MAXPROC-1)
   FA(:,:,:) = 0
   sl = LARGE/np; lst_sl = LARGE - sl*(np-1)
   do while(iterative_solution)
 C$OMP Parallel DO Private(ip,DA_p)
     do ip=1,np
       if(ip==np) then
         Call DirSCF(lst_sl,
           scratch(1+sl*(ip-1)),FA,ip,np)
       else
         allocate DA_p(:,:,:)
         DA_p(:,:,:) = DA(:,:,:)
         FA_p(:,:,:,ip) = 0
         Call DirSCF(sl,scratch(1+sl*(ip-1)),
           FA_p(1,1,1,ip),ip,np,DA_p)
       endif
     enddo
     do ip=1,np-1
       Call Daxpy(ij*kl,1.0d0,FA_p(1,1,1,ip),1,FA,1)
     enddo
     Call Update(FA)
   enddo
Local DA Code: Latencies
 [Charts: Average FSB Relative Memory Latency (nsec) per thread, before
  (original code) and after (local DA code); x-axis 0-1000 ns.]

 Dramatic reduction in average FSB relative memory latency:
  – Local: from ~470 ns to ~320 ns.
  – Remote: from ~950 ns to ~370 ns.
 Elapsed time:
  – Original code: 1140 s.
  – Local DA code: 670 s.
Memory to Cores Mapping
 [Diagram: physical memory layout (nodes 0-31) versus cores for the three
  placements.]

  – Original: each node holds its FA slice, but the single DA resides on
    node 0, so every thread's DA reads target node 0.
    Elapsed time: 1140 sec.
  – Intermediate (round-robin): FA and DA pages are interleaved across
    all the nodes.
    Elapsed time: 800 sec.
  – Local DA: each node holds its FA slice plus a private copy of DA.
    Elapsed time: 670 sec.
Suggestion

  It would be nice to have an extension/variation of the FirstPrivate
  clause: replication and copy-in of a read-only target variable/array.
   – No replication on thread 0. As an advanced extension, no replication
     for the n threads running on node 0; this requires proper placement.
   – No need for synchronization constructs (barriers) while creating the
     copies of the read-only data structures.
   – Conditional clauses to allow run-time decisions on whether or not to
     replicate.
   – Run-time definition of the size and shape of the replicated arrays;
     this could help with assumed-dimension arrays and when only a section
     of the array is needed in the parallel region.
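In the directive style used elsewhere in this talk, such a clause might look as follows (hypothetical syntax; `Replicate`, `UseRepl`, and the shape arguments are invented and are not part of any OpenMP specification):

```
C$OMP Parallel DO Private(ip) Replicate(DA: if(UseRepl), shape(n1,n2))
      do ip = 1, np
C        each thread reads a node-local replica of the read-only DA,
C        created without barriers and without a copy on thread 0
         Call DirSCF(sl, scratch(1+sl*(ip-1)), FA_p(1,1,1,ip), ip, np)
      enddo
```

This would let the runtime make the replication and placement decisions that the Local DA code on slide 13 currently hand-codes with explicit per-thread allocate/copy statements.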

Summary

  In a CC-NUMA architecture, performance bottlenecks can arise when many
  nodes make very frequent, nearly simultaneous (read-only) accesses to
  the same memory structure on one node.
  Spreading the accessed memory across nodes somewhat alleviates the
  problem.
  The best solution is to localize (by replication) the read-only data
  structures.
  SGI's Histx performance analysis tool suite offers an efficient way to
  diagnose the problem and pinpoint the hot spots.
designed. engineered. results.
