                Parallelizing Molecular Dynamics Codes using Parti
                                Software Primitives

                          R. Das†            J. Saltz‡

         This paper is concerned with the implementation of the molecular dynamics code,
      CHARMM, on massively parallel distributed-memory computer architectures using a
      data-parallel approach. The implementation is carried out by creating a set of software
      tools which provide an interface between the parallelization issues and the sequential
      code. Large practical MD problems are solved on the Intel iPSC/860 hypercube. The
      overall solution efficiency is compared with that obtained when the implementation is
      done using data replication.

1 Introduction
The rapid growth in microprocessor technology over the last decade has led to the
development of massively parallel hardware capable of solving large computational problems.
Given the relatively slow increases in large mainframe supercomputer capabilities, it is now
generally acknowledged that the most feasible and economical means of solving extremely
large computational problems in the future will be with highly parallel distributed memory
architectures. Large molecular dynamics (MD) codes like CHARMM [3], GROMOS [10],
AMBER [11], etc. have parts which are extremely computationally intensive. Implementing
these codes on massively parallel machines not only reduces the total solution time but
also allows one to solve very large problems.
    There are two software issues that need to be addressed when one considers how to
effectively parallelize MD codes. The first is the development of efficient and inherently
parallelizable algorithms for the inter-particle force calculations, which consume the
bulk of the computation time in these codes. A number of parallel algorithms have
been designed and implemented for both SIMD and MIMD architectures [7, 1, 8, 9]. The
second is an implementation issue. Most often, the programmer is required to explicitly
distribute large arrays over multiple local processor memories, and keep track of which
portions of each array reside on which processors. In order to access a given element
of a distributed array, communication between the appropriate processors must be invoked.
The programmer should be relieved of such machine-dependent and architecture-specific
tasks. Such low-level implementation issues have severely limited the growth of parallel
computational applications, much in the same way as vectorized applications were inhibited
early on, prior to the development of efficient vectorizing compilers.
    Support for the authors has been provided by a pending DARPA contract, and by the National
Aeronautics and Space Administration under NASA contract NAS1-18605 while these authors were
in residence at ICASE, NASA Langley Research Center.
   † Institute for Advanced Computer Studies, Computer Science Department, University of Maryland,
College Park, MD
   ‡ Institute for Advanced Computer Studies, Computer Science Department, University of Maryland,
College Park, MD
    We have parallelized the MD code CHARMM using the Parti software primitives. This
is a data-parallel implementation for a distributed memory architecture. Several different
optimizations have been performed to improve the single node performance and also to
reduce the communication time. Most of the optimizations have been incorporated in the
software primitives. This paper is less concerned with the individual development of the
solver or the parallelizing primitives than with the interaction between these two
efforts resulting from a specific application: the implementation of the MD code CHARMM
on the Intel iPSC/860.

2 Molecular Dynamics
Molecular dynamics is a technique for simulating the thermodynamic and dynamic
properties of liquid and solid systems. For each timestep of the simulation, two separate
calculations are performed. First, there is the bonded and non-bonded force calculation
for each atom. The second part is the integration of Newton's equations of motion for each
atom. In most MD codes the bulk of the time (a little more than 90%) is spent in the
long-range force (i.e., the non-bonded force) calculation. Hence this is the part that needs
to be parallelized efficiently. This is an O(N^2) calculation, where N is the number of
atoms. Every atom interacts with every other atom, but usually a cutoff distance Rc is
specified and interactions outside this distance are neglected. The non-bonded force
calculation consists of two distinct parts. For each atom, first the pairlist needs to be
generated, and then the van der Waals and electrostatic force calculations have to be
performed. The pairlist generation is not performed every iteration but after every n
iterations, where n is a parameter that can be fixed by the user.
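The two phases above can be sketched in a few lines (an illustrative Python model, not CHARMM's Fortran; the Lennard-Jones 12-6 form stands in for the full van der Waals plus electrostatic terms, and `epsilon`/`sigma` are placeholder parameters):

```python
import numpy as np

def build_pairlist(coords, r_cut):
    """Return the atom pairs (i, j), i < j, separated by less than r_cut.

    A naive O(N^2) scan, matching the complexity discussed in the text;
    production codes accelerate this with cell or neighbor lists.
    """
    n = len(coords)
    pairs = []
    for i in range(n):
        # distances from atom i to every atom j > i
        d = np.linalg.norm(coords[i + 1:] - coords[i], axis=1)
        for k in np.nonzero(d < r_cut)[0]:
            pairs.append((i, int(i + 1 + k)))
    return pairs

def lj_energy(coords, pairs, epsilon=1.0, sigma=1.0):
    """Sum a Lennard-Jones 12-6 potential over the pairlist."""
    e = 0.0
    for i, j in pairs:
        r = np.linalg.norm(coords[j] - coords[i])
        sr6 = (sigma / r) ** 6
        e += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return e
```

As in the text, the pairlist would be rebuilt only every n timesteps, with the energy and force routines reusing it in between.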
    The MD code that we used was CHARMM (Chemistry at HARvard Macromolecular
Mechanics) [3], which was developed at Harvard University for biomolecular simulations. It
is a relatively efficient program which uses empirical energy functions to model molecular
systems. The code is written in Fortran and is about 110,000 lines long. The program is
capable of performing a wide range of analyses. Some of the important simulation routines
are dynamic analysis, trajectory manipulation, energy calculation and minimization,
and vibrational analysis. It also performs statistical analysis, time series and correlation
function analysis, and spectral analysis.

3 Parti Primitives
The PARTI primitives (Parallel Automated Runtime Toolkit at ICASE) [2] are designed
to ease the implementation of computational problems on parallel architecture machines
by relieving the user of low-level machine-specific issues. The primitives enable the
distribution and retrieval of globally indexed but irregularly distributed data sets over the
numerous local processor memories. In distributed memory machines, large data arrays
need to be partitioned between the local memories of processors. These partitioned data
arrays are called distributed arrays. Long-term storage of distributed array data is assigned
to specific memory locations in the distributed machine. A processor that needs to read an
array element must fetch a copy of that element from the memory of the processor on which
that array element is stored. Alternatively, a processor may need to store a value in an
off-processor distributed array element. Thus, each element in a distributed array is assigned
to a particular processor, and in order to access a given element of the array
we must know the processor on which it resides and its local address in that processor's
memory. We therefore build a translation table which, for each array element, lists the host
processor and local address. One of the primitives handles initialization of distributed translation
tables, and another primitive is used to access the distributed translation tables.
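A translation table of this kind can be sketched as follows (a minimal single-process Python illustration with a hypothetical function name; PARTI's actual table is itself distributed across processors so that no single node must store the whole mapping):

```python
def build_translation_table(owner_of, nprocs):
    """Map each global array index to (owner processor, local address).

    owner_of[g] gives the processor to which global element g is assigned;
    local addresses are handed out in order of appearance on each processor.
    """
    table = {}
    next_local = [0] * nprocs  # next free local slot on each processor
    for g, p in enumerate(owner_of):
        table[g] = (p, next_local[p])
        next_local[p] += 1
    return table
```

Looking up `table[g]` then answers the two questions posed in the text: which processor holds element g, and at what local address.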
    In distributed memory MIMD architectures, there is typically a non-trivial
communications latency or startup cost. For efficiency reasons, information to be transmitted
should be collected into relatively large messages. The cost of fetching array elements can be
reduced by precomputing what data each processor needs to send and to receive. In irregular
problems, such as inter-particle force calculations and sparse matrix algorithms, it is not
possible to predict at compile time what data must be prefetched. This lack of information
is dealt with by transforming the original parallel loop into two constructs, called an
inspector and an executor [6]. The PARTI primitives can be used directly by programmers
to generate inspector/executor pairs. Each inspector produces a communications schedule,
which is essentially a pattern of communication for gathering or scattering data. The
executor has embedded PARTI primitives to gather or scatter data. The primitives are
designed to minimize the effect on the source code, so that the final parallel code remains
as close as possible to the original sequential code. The primitives issue instructions to
gather, scatter, or accumulate data according to a specified schedule. Latency or start-up
cost is reduced by packing various small messages with the same destination into one large
message.

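The inspector/executor transformation can be modeled roughly as follows (a hypothetical Python sketch, not PARTI's actual interface; the schedule and ghost-buffer layouts here are assumptions):

```python
def inspector(index_list, table, my_rank):
    """Inspector: scan the loop's indirection array once and build a gather
    schedule listing, per owning processor, the off-processor elements needed."""
    schedule = {}      # owner -> list of (global index, local offset on owner)
    local_slots = {}   # global index -> slot in the local ghost buffer
    for g in index_list:
        owner, offset = table[g]
        if owner != my_rank and g not in local_slots:
            local_slots[g] = len(local_slots)
            schedule.setdefault(owner, []).append((g, offset))
    return schedule, local_slots

def executor(schedule, local_slots, remote_memories):
    """Executor: gather the scheduled data into a ghost buffer (simulated here
    by direct reads), after which the loop body can run on local copies."""
    ghost = [None] * len(local_slots)
    for owner, items in schedule.items():
        for g, offset in items:
            ghost[local_slots[g]] = remote_memories[owner][offset]
    return ghost
```

In a real run the executor would send one combined message per owning processor, which is exactly the small-message packing described above.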
4 Implementation and Results
We initially carry out a preprocessing phase to generate appropriately partitioned data
structures that are read in by each processor. The parallelized code looks very similar to
the sequential code except that it has a number of Parti primitive calls embedded in it.
Initially, all parallel irregular loops were transformed into inspector/executor pairs. Then
we performed optimizations associated with reusing copies of off-processor data and with
vectorizing communications [5]. The entire energy calculation portion of CHARMM has
been parallelized. This involves both the internal (bond, angle, etc.) energy calculations
and the external non-bonded energy calculations.
    We present some of the results that we have obtained on the Intel iPSC/860. In the
tables depicting the results, we present the total time required to carry out 100 CHARMM
timesteps (including 4 list regenerations), the communication time associated with 100
timesteps, and the cost associated with a single list regeneration. We used the TIP3P
water model of Jorgensen, a configuration that contains 216 water molecules (648
atoms) in a box. Table 1 shows the timings that were obtained for 4 to 32 processors. In
this case the atoms were partitioned by simple blocking. We compare these to the data
presented in Table 2, for the same case on the same numbers of processors. The atom
partitioning in this case was obtained by first partitioning the iterations in the non-bonded
loop by blocks and then carrying out the atom distribution based on the loop distribution.
We can see that partitioning the iterations in the non-bonded loop leads to a reduction
in the total time, due to better load balancing in the non-bonded force calculation. As
expected, we see that the pairlist regeneration time remains virtually unchanged.
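The difference between the two partitionings can be illustrated with a small sketch (hypothetical helper functions; per-atom work is measured as the number of pair interactions that atom's loop iteration performs):

```python
def block_partition_atoms(work_per_atom, nprocs):
    """BLOCK partitioning: equal-sized contiguous blocks of atoms, which can
    leave the non-bonded work badly unbalanced."""
    n = len(work_per_atom)
    block = (n + nprocs - 1) // nprocs
    return [sum(work_per_atom[p * block:(p + 1) * block]) for p in range(nprocs)]

def block_partition_iterations(work_per_atom, nprocs):
    """Iteration-based partitioning: cut the atom list by cumulative work so
    each processor receives a near-equal share of loop iterations."""
    total = sum(work_per_atom)
    loads = [0] * nprocs
    if total == 0:
        return loads
    done = 0
    for w in work_per_atom:
        p = min(done * nprocs // total, nprocs - 1)  # work quantile of this atom
        loads[p] += w
        done += w
    return loads
```

With per-atom work [9, 1, 1, 1] on 2 processors, blocking atoms yields loads [10, 2] while partitioning by iterations yields [9, 3]; this better balance is the effect seen in the comparison above.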
    The second test case we used was a simulation of carboxy-myoglobin hydrated with
3830 water molecules; the total number of atoms is 14026. The list generation cutoff
was 14 angstroms. The timings obtained for 100 steps are shown in Table 3. The
internal energy calculation is not very expensive and accounts for 1 percent of the total
energy calculation time. The two most computationally intensive parts of the code are the
non-bonded energy calculations and the atom pairlist generation. In our implementation,

                Number of    Total time per      Comm. time per      List regen. time
                processors   100 iterations (s)  100 iterations (s)  per generation (s)
                     4            111                  3.8                  5.0
                     8             62                  5.3                  2.8
                    16             40                  9.1                  1.6
                    32             27                 13.7                  1.1
                                          Table 1
      Timings of the dynamic analysis on the Intel iPSC/860 for TIP3P, BLOCK partitioning

every processor generates the pairlist for the atoms that have been assigned to that
processor. A communication phase precedes the pairlist generation. During this
communication phase, the current atomic coordinates are exchanged. To simplify the
implementation and to reduce the volume of communication, we have replicated some of
the data structures used during pairlist generation.
    For the smaller TIP3P problem, pairlist generation is relatively inexpensive and has
only a modest effect on the total cost (note that pairlist generation is only occasionally
performed; in our base sequential code, pairlist generation was carried out every 25 steps).
In the larger carboxy-myoglobin simulation, pairlist generation proved to be extremely
expensive. The reasons for the high cost do not appear to be fundamental; these costs are
in part prompting us to redesign the address translation mechanisms in our software.
    We utilize incremental scheduling [5] to reduce the volume of data to be communicated
by reusing live data that has been fetched for previous loops. Since we use incremental
scheduling, we can fetch all the required data (i.e., the off-processor atom coordinates) that
will be referenced. On the other hand, in the current implementation, we scatter
off-processor data during the calculation of energy and forces. This allows the ordering of
operations in the parallel code to more closely approximate the ordering of operations in
the sequential code (both parallelization and vectorization of loops with accumulations
typically require a reordering of operations within the loop). We could potentially achieve
a very significant savings in our communications overhead by delaying the off-processor
accumulation until the end of the energy calculation loops. This savings would require a
further compromise in the sequential operation ordering, but at least one group has found
that this modification does not have an adverse effect on the computational results [4].
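The deferred accumulation just described might look roughly like this (a hypothetical sketch, not our current implementation: off-processor contributions are buffered locally during the energy loops and flushed in one combined scatter-add at the end):

```python
class AccumulationBuffer:
    """Buffer off-processor contributions during the energy loop and apply
    them in one deferred scatter-add, trading exact sequential operation
    ordering for far fewer messages."""

    def __init__(self):
        self.pending = {}  # (owner, local offset) -> accumulated value

    def add(self, owner, offset, value):
        # Combine repeated contributions to the same remote element locally.
        key = (owner, offset)
        self.pending[key] = self.pending.get(key, 0.0) + value

    def flush(self, memories):
        # One combined message per owner would carry all of its entries;
        # here the remote memories are updated directly for illustration.
        for (owner, offset), v in self.pending.items():
            memories[owner][offset] += v
        self.pending.clear()
```

Because contributions to the same remote element are summed before sending, the number and size of messages no longer grow with the number of interactions, only with the number of distinct off-processor atoms touched.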
    These results can be compared with the ones presented by Brooks in [4]. Their
implementation replicates all the data structures on all the processors. For 64 processors on
the Intel iPSC/860, Brooks reports a total time of 1521.1 seconds for 1000 timesteps of the
carboxy-myoglobin benchmark; this compares to the 368 seconds we required to carry out
100 iterations of the same problem on 64 processors of an iPSC/860. While the drawback
of the replicated approach is that it limits the size of the problem that can be fit into a
machine, because of the limited memory on each processor node, at present that group is
able to obtain better results.

                Number of    Total time per      Comm. time per      List regen. time
                processors   100 iterations (s)  100 iterations (s)  per generation (s)
                     4            105                  3.1                  5.0
                     8             56                  5.9                  2.8
                    16             39                 10.1                  1.8
                    32             25                 13.0                  1.1
                                          Table 2
    Timings of the dynamic analysis on the Intel iPSC/860 for TIP3P, partitioning based on iterations

                Number of    Total time per      Comm. time per      List regen. time
                processors   100 iterations (s)  100 iterations (s)  per generation (s)
                    16            724                 102                 75.4
                    32            509                 173                 43.4
                    64            368                 191                 29.9
                                          Table 3
  Timings of the dynamic analysis on the Intel iPSC/860 for carboxy-myoglobin hydrated with water
References

 [1] R. C. G. A. I. Mel'cuk and H. Gould, Molecular dynamics simulation of liquids on the
     Connection Machine, Computers in Physics, May/Jun 1991, pp. 311-318.
 [2] H. Berryman, J. Saltz, and J. Scroggs, Execution time support for adaptive scientific
     algorithms on distributed memory architectures, Concurrency: Practice and Experience,
     3 (1991), pp. 159-
 [3] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus,
     CHARMM: A program for macromolecular energy, minimization, and dynamics calculations,
     Journal of Computational Chemistry, 4 (1983), p. 187.
 [4] B. R. Brooks and M. Hodoscek, Parallelization of CHARMM for MIMD machines, Chemical
     Design Automation News, 7 (1992), p. 16.
 [5] R. Das, R. Ponnusamy, J. Saltz, and D. Mavriplis, Distributed memory compiler methods
     for irregular problems - data copy reuse and runtime partitioning, in Compilers and Runtime
     Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, eds., Elsevier, Amsterdam,
     The Netherlands, 1992.
 [6] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley, Principles of runtime
     support for parallel processors, in Proceedings of the 1988 ACM International Conference on
     Supercomputing, July 1988, pp. 140-152.
 [7] J. P. M. P. Tamayo and B. M. Boghosian, Parallel approaches to short range molecular
     dynamics simulations, in Supercomputing '91, Albuquerque, NM, Nov. 1991.
 [8] S. Plimpton and G. Heffelfinger, Scalable parallel molecular dynamics on MIMD
     supercomputers, in SHPCC92, Williamsburg, April 1992.
 [9] J. A. M. Terry W. Clark and L. R. Scott, Parallel molecular dynamics, in Proceedings of the
     Fifth SIAM Conference on Parallel Processing for Scientific Computing, Houston, TX, March
[10] W. F. van Gunsteren and H. J. C. Berendsen, GROMOS: Groningen molecular simulation
     software, technical report, Laboratory of Physical Chemistry, University of Groningen,
     Nijenborgh, The Netherlands, 1988.
[11] P. K. Weiner and P. A. Kollman, AMBER: Assisted model building with energy refinement. A
     general program for modeling molecules and their interactions, Journal of Computational
     Chemistry, 2 (1981), p. 287.
