



                     Molecular Dynamics Simulations
                                      on
                 Massively Parallel Processor Architectures
                                     and
                                PC-Clusters

                                                     G. Sutmann

                                            (e-mail: g.sutmann@fz-juelich.de)



                                                       Research Centre Jülich (FZJ)
                                                       John von Neumann Institute for Computing (NIC)
                                                       Central Institute for Applied Mathematics (ZAM)
                                                       D - 52425 Jülich

                                                       http://www.kfa-juelich.de/zam




                                            Outline of the talk



                          • Molecular dynamics computer simulations

                          • Some applications

                          • Parallelization strategies

                          • Architectural considerations

                          • Some benchmarks




                      Essential elements of Molecular Dynamics simulations:

     • Model: interaction between particles through potential function Φ(r)
              - bonded and non-bonded interactions
              - forces on particles via  $F_i = -\sum_{j=1,\ j \neq i}^{N} \frac{\partial u_{ij}(r)}{\partial r_{ij}}$
     • Integrator: propagation through phase space
               - finite difference schemes
               - implicit vs. explicit integrators
               - stability considerations

     • Statistical ensemble: thermodynamic conditions
               - microcanonical ensemble
               - canonical ensemble
               - isothermal-isobaric ensemble

     • Results: thermodynamics and statistical mechanics
              - internal energy, pressure, temperature
              - response functions, correlation functions,
                linear response theory
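
The explicit integrator named above is most often realized as the velocity Verlet scheme; below is a minimal Python sketch (the harmonic test force and all units are illustrative assumptions, not taken from the talk):

    import numpy as np

    def velocity_verlet(x, v, f, mass, dt, force):
        """One explicit velocity Verlet step: advance x(t), v(t), F(t) to t + dt."""
        x_new = x + v * dt + 0.5 * (f / mass) * dt**2   # positions from current forces
        f_new = force(x_new)                            # forces at the new positions
        v_new = v + 0.5 * dt * (f + f_new) / mass       # velocities from averaged forces
        return x_new, v_new, f_new

    # toy usage: one particle in a harmonic well, F = -x (reduced units)
    x, v = np.array([1.0]), np.array([0.0])
    f = -x
    for _ in range(1000):
        x, v, f = velocity_verlet(x, v, f, 1.0, 0.01, lambda x: -x)
    print(x, v)   # stays on the circle x^2 + v^2 = 1 (good energy conservation)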




                                              Interactions between particles

                                                      Bonded vs Non-Bonded


Characterization of the interaction potential:

Assume a potential of the form $r^{-n}$ in $d$ dimensions:

$\int \mathrm{d}^{d}r \; \frac{1}{r^{n}} \;=\; \begin{cases} \text{finite}, & n > d \quad \text{(short ranged)} \\ \infty, & n \leq d \quad \text{(long ranged)} \end{cases}$


 Electrostatic Coulomb interaction between point charges or point dipoles is long ranged.
 Hard core repulsion interaction is short ranged.




                                Bonded Interactions
                        (local interactions - short ranged)

Bond stretching:     $u^{s}_{ij}(r_{ij}) = \tfrac{1}{2} k_{ij} \left( r_{ij} - b_{ij} \right)^{2}$

Bond bending:        $u^{b}_{ij}(\vartheta_{ijk}) = \tfrac{1}{2} k_{ijk} \left( \vartheta_{ijk} - \vartheta^{0}_{ijk} \right)^{2}$

Improper dihedrals:  $u^{id}_{ij}(\xi_{ijkl}) = \tfrac{1}{2} k_{ijkl} \left( \xi_{ijkl} - \xi_{0} \right)^{2}$

Proper dihedrals:    $u^{pd}_{ij}(\varphi_{ijkl}) = k_{\varphi} \left( 1 + \cos(n \varphi_{ijkl} - \varphi_{0}) \right)$

[Sketches: bond length b_ij between atoms i-j; bond angle ϑ between atoms i-j-k; dihedral angles over atoms i-j-k-l]
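
As a worked example of the harmonic terms above, a minimal Python sketch evaluating the bond-stretching energy and the force on atom i (the force constant and reference length are hypothetical values, not from the talk):

    import numpy as np

    def bond_stretch(r_i, r_j, k_ij, b_ij):
        """Harmonic bond: u = 1/2 k (|r_i - r_j| - b)^2 and the force on atom i."""
        d = r_i - r_j
        r = np.linalg.norm(d)
        u = 0.5 * k_ij * (r - b_ij)**2
        f_i = -k_ij * (r - b_ij) * d / r      # F_i = -du/dr_i, directed along the bond
        return u, f_i

    # hypothetical parameters: k = 1000 kJ/(mol Å^2), b = 1.0 Å
    u, f = bond_stretch(np.array([0.0, 0.0, 0.0]),
                        np.array([1.2, 0.0, 0.0]), 1000.0, 1.0)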




                             Short Range Non-Bonded Interactions

1. Hard core:    $u^{hard}_{ij} = A_{ij} \exp(-B_{ij} r_{ij})$   or   $u^{hard}_{ij} = \frac{C_{ij}}{r_{ij}^{12}}$

2. Dispersion:   $u^{disp}_{ij} = -\frac{D_{ij}}{r_{ij}^{6}}$

=> Buckingham potential:     $u^{B}_{ij} = A_{ij} \exp(-B_{ij} r_{ij}) - \frac{D_{ij}}{r_{ij}^{6}}$

=> Lennard-Jones potential:  $u^{LJ}_{ij} = 4 \varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]$

Mixing rules:  $\sigma_{ij} = \frac{\sigma_i + \sigma_j}{2}$ ,  $\varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$

[Figure: U(r) vs r [Å], comparing the Buckingham and Lennard-Jones potentials]

Fast calculation through a cut-off radius.
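
A minimal Python sketch of the truncated Lennard-Jones pair energy with the mixing rules above (the argon-like parameter values in the usage line are illustrative):

    import numpy as np

    def lj_energy(r, sigma_i, sigma_j, eps_i, eps_j, r_cut):
        """Lennard-Jones pair energy with Lorentz-Berthelot mixing, truncated at r_cut."""
        if r >= r_cut:
            return 0.0                        # pairs beyond the cut-off are skipped
        sigma = 0.5 * (sigma_i + sigma_j)     # arithmetic mean of the diameters
        eps = np.sqrt(eps_i * eps_j)          # geometric mean of the well depths
        sr6 = (sigma / r)**6
        return 4.0 * eps * (sr6**2 - sr6)

    # illustrative values: sigma = 3.4 Å, eps = 0.996 kJ/mol, r_cut = 8.5 Å
    print(lj_energy(3.8, 3.4, 3.4, 0.996, 0.996, 8.5))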




                          Long Range Non-Bonded Interactions

Coulomb interaction:   $u^{C}_{ij} = \frac{1}{4 \pi \varepsilon_0} \frac{q_i q_j}{r_{ij}}$

Global interactions: computationally intensive

Molten salts:                 $q_i = n e$ , $n \in \mathbb{Z}$
Molecules (partial charges):  $q_{i\alpha} = z e$ , $z \in \mathbb{R}$

Screened Coulomb interaction:   $u^{SC}_{ij} = \frac{1}{4 \pi \varepsilon_0} \, q_i q_j \, \frac{\exp(-\kappa r_{ij})}{r_{ij}}$

[Figure: U(r) vs r [Å], comparing the bare Coulomb and the Yukawa-Debye (screened) potential; sketches of a molten salt (charges ±ne) and a molecule with partial charges +z, -2z]




                 Calculation of Long Range Interactions: Lattice Sums

Lattice sums:

Periodic boundary conditions
=> infinite system              $U = \frac{1}{2} \sum_{\mathbf{n}}{}' \sum_{i,j=1}^{N} \frac{q_i q_j}{\left| \mathbf{r}_i - \mathbf{r}_j + \mathbf{n} L \right|}$
=> lattice summation

Problem: lattice sums in infinite systems are conditionally convergent.

Ewald summation splits the sum into a real-space and a reciprocal-space part:

$U = \underbrace{\frac{1}{2} \frac{1}{4 \pi \varepsilon_0} \sum_{\mathbf{n}}{}' \sum_{i,j=1}^{N} q_i q_j \frac{\operatorname{erfc}\!\left( \alpha \left| \mathbf{r}_{ij} + \mathbf{n} L \right| \right)}{\left| \mathbf{r}_{ij} + \mathbf{n} L \right|}}_{\Phi_{real}} + \underbrace{\frac{4 \pi}{L^3} \sum_{\mathbf{k} \neq 0} \frac{q_i q_j}{k^2} \, e^{i \mathbf{k} \mathbf{r}_{ij}} \, e^{-k^2 / 4 \alpha^2}}_{\Phi_{recip}} + \Phi_{self} + \Phi_{surf}$

                                                 Complexity: O(N^2) or O(N^{3/2})
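
A sketch of the real-space part Φ_real restricted to the n = 0 (minimum image) term, in reduced units with 1/4πε₀ = 1 (an assumption for brevity):

    import math

    def ewald_real_space(positions, charges, alpha, L, r_cut):
        """Phi_real for n = 0: sum over pairs of q_i q_j erfc(alpha r_ij) / r_ij."""
        n = len(positions)
        u = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                d = [positions[i][k] - positions[j][k] for k in range(3)]
                d = [x - L * round(x / L) for x in d]   # minimum image convention
                r = math.sqrt(sum(x * x for x in d))
                if r < r_cut:                           # erfc decays fast: cut-off is safe
                    u += charges[i] * charges[j] * math.erfc(alpha * r) / r
        return u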




                               Fast Multipole Methods
                                 Complexity: O(N)

Multipole expansion of a cluster of charges $q_i(r_i, \vartheta_i, \varphi_i)$, evaluated at a distant point $P(R, \theta, \phi)$:

$\Phi(P) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \frac{M_n^m}{R^{n+1}} \, Y_n^m(\theta, \phi)$

$M_n^m = \sum_{i=1}^{k} q_i \, r_i^n \, Y_n^{-m}(\vartheta_i, \varphi_i)$

[Figure: hierarchical subdivision of the simulation box into cells on levels i, i+1, i+2; well-separated cells interact through their multipole expansions]
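
To illustrate the idea behind the expansion (not the full FMM machinery), a Python sketch that keeps only the lowest moments, monopole (n = 0) and dipole (n = 1), of a charge cluster and compares them with the exact direct sum; units and the random test data are arbitrary:

    import numpy as np

    def far_field(positions, charges, P):
        """Monopole + dipole approximation of the potential at a far point P."""
        center = positions.mean(axis=0)
        Q = charges.sum()                                           # monopole moment
        p = (charges[:, None] * (positions - center)).sum(axis=0)   # dipole moment
        R_vec = P - center
        R = np.linalg.norm(R_vec)
        return Q / R + np.dot(p, R_vec) / R**3                      # 1/R and 1/R^2 terms

    rng = np.random.default_rng(0)
    pos = rng.uniform(-0.5, 0.5, (10, 3))
    q = rng.uniform(-1.0, 1.0, 10)
    P = np.array([10.0, 0.0, 0.0])
    exact = (q / np.linalg.norm(pos - P, axis=1)).sum()
    print(far_field(pos, q, P), exact)      # agree to O((cluster size / R)^2)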




                   The location of MD on time and spatial scales

Challenging problems in many-particle simulations, e.g.:
     - long time dynamics of solvated macromolecules (pattern recognition, protein folding)
     - simulation of nanostructures (cracks, crystal growth)

Simulation methods:

• Quantum molecular dynamics (QMD)
  (Schrödinger equation, density functional theory)

• Classical molecular dynamics (CMD)
  (Newton's equations of motion)

• Brownian dynamics (BD)
  (Smoluchowski / Fokker-Planck equation)

• Hydrodynamics
  (Navier-Stokes equation)

[Figure: the methods placed on a length (1 - 10000 Å) vs time (fs - µs) diagram: QMD at the smallest scales, followed by CMD, BD, and hydrodynamics at increasing scales]

          => Requirements for simulations on long time and large spatial scales




                          Large length scale simulations

Maximum atomic systems:

Simulation of more than 5 billion particles
     - memory optimized program IMD (~50 Byte/atom)
     - short range interacting particles
     - size of the system: 0.42 µm (1540 atoms)
     - time / integration step: 388 s
     - 10000 integration steps: ~1/4 year

Applications: shock dynamics
              crack propagation

System: CRAY-T3E 1200 (ZAM, Jülich)
        512 PEs, 262 GBytes RAM




                           Long time scale simulations

Challenge: understanding of protein folding

Simulation of the autonomously folding subdomain HP-36 from the villin headpiece for 1 µs

• 596 protein atoms + 3000 water molecules
• timestep: 2 fs → 5 × 10^8 integration steps
• parallel efficiency of 66 % on 256 PEs → 5 ns / day

Combination of atom- and spatial decomposition:
     - protein atoms are fixed to CPUs
     - water molecules are regrouped when updating lists (good cache behavior)
Truncation of electrostatic interactions (R_c = 8 Å)

System: CRAY-T3D and T3E (Pittsburgh Supercomputing Center), 256 PEs




                          Limits in Parallelization (ideal)

Amdahl's law:   $\sigma = \frac{1}{\frac{q}{p} + (1 - q)} = \frac{p}{q + p\,(1 - q)}$

p = number of PEs
q = parallelizable portion of the work

[Figure: speedup vs # PEs (log-log, 1 - 1024) for q = 100 % (ideal), 99 %, 95 %, and 50 % parallelizable work]
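
The curves are easy to regenerate for any q; a one-line Python model:

    def amdahl_speedup(q, p):
        """Amdahl's law: speedup when a fraction q of the work runs on p PEs in parallel."""
        return 1.0 / (q / p + (1.0 - q))

    for q in (1.0, 0.99, 0.95, 0.50):
        print(q, amdahl_speedup(q, 1024))   # q = 0.99 already caps the speedup near 91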




                  Parallel efficiency of molecular dynamics programs

Essential elements of MD:

• Force routine
• Neighbor lists
• Integrator
• Thermostats, barostats etc.

• Parameter setup
• File I/O

• Communication




       MD programs may be tuned to have
       99.999...% parallel efficiency




                                                      Distributed Memory Molecular Dynamics

Implementation:            Module-oriented program - tree structure
                           F language (subset of Fortran 90)

Communication interface:   Different message passing protocols and system architectures
                           Strong type checking and syntax simplification

MD features:               Short range interaction potentials
                           List techniques (Verlet, linked-cell, linked-cell-Verlet)
                           Different integrators (Verlet, predictor-corrector)

Parallelization:           Domain decomposition
                           Systolic loop particle decomposition
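
A minimal sketch of the Verlet list technique named above: record every pair within r_cut + skin once, then reuse the list over several steps until particles have moved more than skin/2 (the rebuild test is omitted for brevity):

    import numpy as np

    def build_verlet_list(x, L, r_cut, skin):
        """Verlet neighbor list: all pairs with minimum-image distance < r_cut + skin."""
        n, r_list2 = len(x), (r_cut + skin)**2
        pairs = []
        for i in range(n):
            for j in range(i + 1, n):
                d = x[i] - x[j]
                d -= L * np.round(d / L)          # minimum image convention
                if np.dot(d, d) < r_list2:
                    pairs.append((i, j))
        return pairs                               # the force loop visits only these pairs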




               Parallelization strategies for Molecular Dynamics



                                            • Atom – decomposition

                                            • Force – decomposition

                                            • Domain – decomposition




                 Particle Decomposition: Replicated Data Algorithm

• Distribute N particles on P processors: N_P = N/P
  (the number of particles per processor is fixed)
• N_P particle coordinates x are updated on one processor
• Copy the coordinates from the other (P-1) processors to the local PE
• Use Newton's 3rd law: F_ij = -F_ji
• Calculate the elements F_ij of the "checkerboard force matrix":
  F_ij = 0 if i > j and i + j odd ; F_ij = 0 if i < j and i + j even

[Figure: force matrix partitioned into blocks P_m:P_n; an all-to-all exchange among PEs 0-7 organized as a tree needs log2(P) send/receive operations]

Communication                                         Scaling
1. All-to-all communication of positions              (P-1) N_P
   (local copy of the total vector x on each PE)      or log2(P) N (tree code)
2. All-to-all communication of forces F_ij            (P-1) N_P
                                                      or log2(P) N (tree code)
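
A sketch of one replicated-data force step using mpi4py (my choice of MPI binding, not from the talk); force_blocks() stands for a hypothetical routine that evaluates only this PE's blocks of the checkerboard force matrix and returns zeros elsewhere:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()

    def replicated_data_forces(x_local, N_P, force_blocks):
        """x_local: the N_P coordinates owned by this PE; returns forces on them."""
        x_all = np.empty((P * N_P, 3))
        comm.Allgather(x_local, x_all)                # 1. replicate all positions
        f_partial = force_blocks(x_all, rank)         # local share of the force matrix
        f_all = np.empty_like(f_partial)
        comm.Allreduce(f_partial, f_all, op=MPI.SUM)  # 2. combine the partial forces
        lo = rank * N_P
        return f_all[lo:lo + N_P]                     # the integrator advances these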




                     Particle Decomposition: Systolic Loop

• N particles on P processors: N_P = N/P (fixed)
• N_P coordinates x are updated on one processor
• Split each iteration step into (P-1) substeps; in each substep a copy of a coordinate block travels from PE to PE around the ring
• Calculate the forces using Newton's 3rd law and send them back to the owning PE

[Figure: ring of 4 PEs; coordinate packets r_i circulate while partial forces F_ij are accumulated]

Communication                                    Scaling
1. All-to-all communication of positions         (P-1) N_P / 2
2. All-to-all communication of forces F_ij       (P-1) N_P / 2
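
A sketch of the systolic ring pattern with mpi4py; pair_forces() is a hypothetical kernel between the resident and the visiting block. For brevity this version evaluates each pair on both PEs instead of sending forces back, so it does twice the work of the half-loop variant on the slide:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % P, (rank - 1) % P

    def systolic_forces(x_local, pair_forces):
        """Accumulate forces on the local block by rotating a copy around the ring."""
        travelling = x_local.copy()
        f_local = np.zeros_like(x_local)
        for _ in range(P - 1):
            travelling = comm.sendrecv(travelling, dest=right, source=left)  # one shift
            f_local += pair_forces(x_local, travelling)   # interactions with visitors
        return f_local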




                Force Decomposition: Replicated Data Algorithm

• Replicate the coordinates on each PE
• Calculate the forces on particles i according to a subdivision of the force matrix F_ij
• The regions of the force matrix are subdivided into equal areas, i.e. A(L_1) = A(L_n); n = 1, ..., K
• Apply the principle of action and reaction
• Global reduction of the force vectors to get F_i
• Propagate positions and velocities for all particles on each PE

[Figure: force matrix (j = 1, ..., N across; i = 1, ..., N down) split into row stripes of heights L_1, ..., L_K with equal areas A(L_1), ..., A(L_K)]

Equal areas determine the stripe heights:

$A(L_k) = \left( N - \sum_{j=1}^{k-1} L_j - \frac{L_k + 1}{2} \right) L_k \; ; \quad k = 1, \ldots, P$

$L_k = \frac{Q_k - \sqrt{Q_k^2 - 4 N (N - 1)/P}}{2} \; ; \quad Q_k = 2 N - 1 - 2 \sum_{j=1}^{k-1} L_j$

Communication:                              Scaling:
Global reduction of the force vector        (P-1) N_P
                                            or log2(P) N (tree code)
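
The stripe heights follow directly from the closed form above; a short Python check (N and P are illustrative) that the resulting areas are indeed equal:

    import math

    def stripe_heights(N, P):
        """Row-stripe heights L_k giving equal force-matrix areas A = N(N-1)/(2P)."""
        heights, assigned = [], 0.0            # assigned = rows already distributed
        for _ in range(P):
            Q = 2 * N - 1 - 2 * assigned
            L = (Q - math.sqrt(Q * Q - 4 * N * (N - 1) / P)) / 2
            heights.append(L)
            assigned += L
        return heights

    N, P = 10000, 8
    hs = stripe_heights(N, P)
    print([round(L) for L in hs])    # thin stripes first: the top rows hold more pairs
    S = 0.0
    for L in hs:                     # every stripe area equals N(N-1)/(2P)
        print(round((N - S - (L + 1) / 2) * L))
        S += L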




             Force Decomposition: Distributed Data Implementation

• Distribute the N particles homogeneously on the main diagonal of a quadratic processor matrix: N_P = N/√P
• Copy the coordinates row- and column-wise
• Calculate the forces according to Newton's 3rd law
• Use the symmetry of the transposed matrix elements to reduce the force calculations
• Reduce the forces to the diagonal elements of the matrix and update the atom positions

[Figure: 4 x 4 processor matrix; the particle blocks 1-16 start on the diagonal and are replicated along the rows and columns of the PE grid]

Communication:                       Scaling:
Row-/column-wise replication         2 (√P - 1) N_P
Transpose exchange                   N_P
Force reduction                      (√P - 1) N_P
                                     or log2(√P) N_P (tree code)




                              Domain Decomposition

Principle: distribute the spatial domain as uniformly as possible among the processors (make the domains as cubic as possible, with side length D)

• Particles can move across different processors
• Only local communication is required (interacting particles reside on the local and the neighboring processors)
• In 3 dimensions (if D > R_c): communication with 26 neighboring PEs
• Only a small amount of data is transferred: communication grows with the surface area (∝ <N>^{2/3})

Efficient communication: in 3 dimensions only 6 communication steps are required - not 26! Each shift along ±x, ±y, ±z forwards the data received in the previous shifts, so edge and corner neighbors are reached implicitly (see the sketch below).
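
A sketch of this 6-step halo exchange on a 3-D Cartesian MPI topology with mpi4py; representing the boundary atoms as a dict keyed by atom id is a schematic assumption:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    cart = comm.Create_cart(MPI.Compute_dims(comm.Get_size(), 3), periods=[True] * 3)

    def halo_exchange(boundary_data):
        """Exchange boundary atoms with the 6 face neighbors only; data received
        along x is re-sent along y and z, which covers edge and corner neighbors."""
        halo = dict(boundary_data)
        for axis in range(3):
            for disp in (-1, +1):
                src, dst = cart.Shift(axis, disp)
                received = cart.sendrecv(halo, dest=dst, source=src)
                halo.update(received)          # forward previously received data too
        return halo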




                       Scaling of Domain Decomposition

Communication                                                    Scaling
1. Update the atom information of a domain -                     6 δN
   communicate particles to the neighboring boxes
2. List of boundary atoms to be sent                             6 δN
3. Communication of the boundary atoms                           6 δN

$\delta N = x N / P^{2/3}$ ;  $x = \delta r / L$




               Important factors influencing the performance
                          on parallel architectures



                                            •   Power of CPU‘s
                                            •   Cache size
                                            •   Speed of memory access
                                            •   Bandwidth and latency of the network
                                            •   ...




                            Symmetric Multiprocessor Systems (SMP)

   All processors have access to a global address space

   [Diagram: several CPUs attached to one shared memory]

   Typical systems: 1 - 32 nodes
   Largest system: SGI 3800, scalable from 32 to 512 processors

• Uniform Memory Access (UMA): all addresses in the memory space are equally available to all processors (Compaq HPC320, NEC SX-5)

• Non-Uniform Memory Access (NUMA): the speed of memory access may differ between PEs (Tera)

• Cache-Coherent Non-Uniform Memory Access (ccNUMA): buffers data from "far away" memory; the data are updated coherently with the cache entries (SGI Origin 2000, HP V-Class)

Problems: possible conflicts in memory access between PEs
          synchronization of the PEs




                                        Distributed Multiprocessor Systems


[Diagram: compute nodes, each with its own CPU and local memory, coupled by an interconnect]

                      CRAY-T3E 1200 (ZAM / Jülich)

Processors        512 compute nodes, type 1200
Cycle time        1.66 ns
Performance       1200 MFlops / PE
Overall peak      614 GFlops
Memory            512 MB / PE
No. of users      ~ 450

Streams:       maximization of the memory bandwidth
E-registers:   gather/scatter operations for local and remote memory




                                                SMP-Cluster




SMP node:              Advantage: performance    Drawback: scalability
Distributed system:    Advantage: scalability    Drawback: performance

An SMP cluster combines both:

[Diagram: nodes of several CPUs with local shared memory, coupled by an interconnect]

 • Good node performance
 • Scalable to very large systems (> 1000 nodes)




                                                ZAMpano




• 4 x Intel Pentium III Xeon, 550 MHz, 512 KB on-chip cache (per node)
• Intel 450 NX chipset
• 2 GB ECC-RAM
• Myrinet 64-bit PCI SAN card M2M-PCI64A-A, 4 MB
• 3COM 3c905B 10/100 MBit Ethernet card
• Overall peak performance: 19.8 GFLOPS
• Overall main memory: 18 GB
• Operating system: SUSE Linux
• Internet address: zampano.zam.kfa-juelich.de




                                            Programming models


                • Distributed systems:
                                     message passing (e.g. MPI)

                • SMP systems:
                                              shared memory implementation (e.g. OpenMP)

                • Coupled SMP-systems:
                                   1.) pure message passing
                                   2.) hybrid implementation
                                              - shared memory on the nodes
                                              - message passing between the nodes



  MP advantages: programs run on both message-passing and shared-memory systems
  SM advantages: easy implementation of parallel directives, step-by-step parallelization




                        Latency and bandwidth of different networks

Machine            CPU                             Network                  Latency            Bandwidth

CRAY T3E-1200      DEC 21164 (600 MHz)             CRAY T3E interconnect    ≈ 8 µs (2 µs)      ≈ 350 MB/s
(Jülich)

ZAMpano            Intel Pentium III Xeon          Myrinet                  ≈ 80 µs (15 µs)    ≈ 65 MB/s
(Jülich)           (550 MHz)

MPCB               Intel Pentium III (550 MHz)     Fast Ethernet            ≈ 470 µs           ≈ 10 MB/s
(Orléans)
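
Figures like these are usually measured with a ping-pong test between two PEs; a minimal mpi4py sketch (message size and repetition count are arbitrary choices):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nbytes, reps = 1 << 20, 100                 # 1 MB messages, 100 round trips
    buf = np.zeros(nbytes, dtype=np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    rtt = (MPI.Wtime() - t0) / reps             # round-trip time per repetition

    if rank == 0:
        print("bandwidth: %.1f MB/s" % (2 * nbytes / rtt / 1e6))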




Limits in Parallelization: 1. Global communication (all-to-all)

Extended version of Amdahl's law:

$\sigma = \frac{1}{\frac{q}{p} + c(p) + (1 - q)}$   with   $c(p) = (p - 1) \left( \lambda + \frac{\chi}{p} \right)$

λ = latency
χ = pure communication

[Figure: speedup vs # PEs (1 - 2048) for ideal and for q = 1 with (λ, χ) = (10^-4, 0), (0, 0.01), (10^-4, 0.01), (10^-4, 0.05)]

                         Speedup is limited by latency
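
The model curves are easy to regenerate; a Python sketch with the all-to-all cost above (the tree and local-communication slides that follow only change c(p)):

    def speedup(q, p, c):
        """Extended Amdahl's law: sigma = 1 / (q/p + c(p) + (1 - q))."""
        return 1.0 / (q / p + c(p) + (1.0 - q))

    def c_all_to_all(lam, chi):
        return lambda p: (p - 1) * (lam + chi / p)

    for p in (16, 256, 2048):
        # with lambda = 1e-4, chi = 0.01 the speedup peaks and then decays
        print(p, round(speedup(1.0, p, c_all_to_all(1e-4, 0.01)), 1))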




Limits in Parallelization: 2. Global communication (tree structure)

Extended version of Amdahl's law:

$\sigma = \frac{1}{\frac{q}{p} + c(p) + (1 - q)}$   with   $c(p) = \log_2(p) \, \lambda + \sum_{n=1}^{\log_2(p)} \frac{2^{(n-1)} \chi}{p}$

λ = latency
χ = data transfer

[Figure: speedup vs # PEs (1 - 2048) for the same (λ, χ) parameter sets as before]

                      Speedup is limited by communication




                Limits in Parallelization: 3. Local communication

Extended version of Amdahl's law:

$\sigma = \frac{1}{\frac{q}{p} + c(p) + (1 - q)}$   with   $c(p) = f(p) \left( \lambda + \frac{\chi}{p^{2/3}} \right)$ ,  $f(p) = \begin{cases} 0 & p = 1 \\ 2 & p = 2 \\ 4 & p = 4 \\ 6 & p \geq 8 \end{cases}$

λ = latency
χ = data transfer

[Figure: speedup vs # PEs (1 - 2048) for the same (λ, χ) parameter sets as before]

                      Speedup is limited by communication




                     Performance on different PC-clusters

[Figure: MPI_ssend bandwidth (MB/s) vs buffer length (10 - 10^7 bytes), each measured for 2 PEs on 1 node and for 2 nodes with 1 PE each. Left: ZAMpano Jülich (Myrinet). Right: MPCB Orléans (Fast Ethernet)]




                         Comparison with MPP-systems
                          (inter-node communication)

[Figure: bandwidth (MB/s) vs buffer length (10 - 10^7 bytes) for the CRAY communication network, Myrinet, and Fast Ethernet; the CRAY T3E interconnect reaches about 350 MB/s, far above both cluster networks]




                                               MD benchmarks (I)


                                     System parameters:

                                     • Lennard-Jones mixture (Argon - Krypton)
                                     • T = 116 K
                                     • n = 0.018 Å⁻³
                                     • δt = 20 fs
                                     • N_small = 2048 ; N_big = 32000
                                     • List-technique: Verlet


                                     Parallelization:

                                     • Atom decomposition
                                     • Systolic loop




                               MD benchmarks (II)

[Figure: time / particle / step (µs) and speedup vs number of PEs (2 - 32) on ZAMpano (top) and CRAY T3E-1200 (bottom), each for 32000 and 2048 particles, with and without Verlet lists]




                               MD benchmarks (III)

[Figure: time / particle / step (µs) and speedup vs number of PEs (2 - 32) on MPCB, for 32000 and 2048 particles, with and without Verlet lists]




                               MD benchmarks (IV)

[Figure: time / particle / step (µs) vs number of PEs (2 - 32) comparing CRAY T3E-1200, ZAMpano, and MPCB; left: 2048 particles, no list technique; right: 32000 particles, no neighbor lists]