Efficient Parallelization Strategies for Hierarchical AMR Algorithms

Proposal for a Summer Undergraduate Research Fellowship 2004
Computer science / Applied and computational mathematics
                                    Randolf Rotta
               Institut für Informatik, Technische Universität Cottbus
                     Postfach 10 13 44, 03013 Cottbus, Germany
                       E-Mail: rrotta@informatik.tu-cottbus.de
      Highly resolved solutions of partial differential equations are nowadays important
      in many areas of science and technology. Only adaptive mesh refinement methods
      reduce the necessary work sufficiently to allow the calculation of realistic problems.
      The block-structured AMR method is well-suited for the time-explicit computation
      of large-scale fluid dynamical problems, but still requires the use of distributed
      memory computers, raising the need for adequate parallelization strategies. Here
      the most common strategies are summarized and open problems are outlined. The
      research ideas proposed for participation in SURF 2004 include the optimization
      of boundary data synchronization by overlapped non-blocking communication,
      combined with a simple hybrid parallelization strategy tailored to hybrid cluster
      architectures, and the incorporation and investigation of different partitioning
      algorithms.

1    Introduction
A large number of physical phenomena are modeled by partial differential equations.
An important sub-class is hydrodynamic flow processes, which are often approximated
with finite difference or finite volume methods. In particular, numerical
simulations of inviscid gas dynamics are challenging, because a wide range of
different scales needs to be resolved. Fig. 1 displays a simulated detonation wave
as it exits a rectangular tube; the large areas of uniform flow are apparent. On
the other hand, the entire physical process of detonation propagation is determined
by the chemical reaction at the detonation front, which is almost completely within
the grey shaded region. This region requires such a high local resolution that re-
alistic simulations would be impossible even on current high performance systems
employing uniform discretizations only. Problems with non-negligible small-scale

Figure 1: Numerical simulation of detonation wave leaving a rectangular tube.
Calculated with hierarchical adaptive mesh refinement.

    Figure 2: Unstructured (left) and structured (right) mesh refinement strategy.

processes necessarily require dynamically adaptive mesh refinement methods, which
can reduce the computational expense drastically.
    Mainly two different mesh refinement techniques are used nowadays, cf. Fig. 2:
unstructured (triangles or tetrahedra) and hierarchical structured (rectangular)
grids. While the unstructured approach allows refinement only in space, the
structured method additionally admits refinement in time.
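The coupling of spatial and temporal refinement can be sketched in a few lines of Python (an illustrative model only, not taken from any actual AMR code): with refinement factor r, every time step on level l triggers r substeps of size dt/r on level l+1, recursively.

```python
# Illustrative sketch (hypothetical, not from an actual AMR code): each
# step on a level triggers r substeps, each r times smaller, on the next
# finer level, so fine grids are refined in space and time simultaneously.

def advance(level, max_level, dt, r=2, log=None):
    """Advance one step on `level` and, recursively, r substeps on each
    finer level; returns the sequence of (level, dt) steps taken."""
    if log is None:
        log = []
    log.append((level, dt))          # one update on this level
    if level < max_level:
        for _ in range(r):           # r substeps on the finer level
            advance(level + 1, max_level, dt / r, r, log)
    return log
```

For two levels and r = 2, one coarse step of size dt is accompanied by two fine steps of size dt/2, which is exactly the coupling that purely spatial refinement does not provide.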
    But realistic simulations, in particular in three space-dimensions, nevertheless
require large clustered computers. These systems perform the calculations on lo-
calized sub-domains in parallel and can achieve significant speed-ups provided the
workload is well balanced among all nodes. Because of the dynamic character of
the applications and the recurrent mesh adaptation the load balancing operation is
crucial for the performance of the entire computation.

2      Block-structured Adaptive Mesh Refinement
The block-structured adaptive mesh refinement method (AMR or SAMR) after
Berger and Colella [2, 1] uses a hierarchy of successively embedded refinement levels
with the same refinement factor in space and time. A coarse base grid of uniform
finite volumes is refined patch-wise with subgrids, as shown in Fig. 3. Cells marked
for refinement are clustered into rectangular blocks forming the new subgrid
structure of the next higher level.
    The regularity of the AMR data structures allows high computational perfor-
mance by exploiting vectorization and cache reuse. While unstructured refinement
methods require complex graph structures, AMR only needs a global integer coor-
dinate system to calculate neighborhood and parent-child relationships on-the-fly.
By the incorporation of domain decomposition, load-balancing and process commu-
nication via MPI (Message Passing Interface) the block-structured AMR method is
also applicable in distributed memory computing environments, like ASCI Q or the
Earth Simulator.
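The on-the-fly computation of parent-child relationships from a global integer coordinate system can be illustrated with a small Python sketch (the 2D setting and all names are illustrative assumptions, not AMROC's actual interface):

```python
# Sketch of on-the-fly relationship lookup in a global integer coordinate
# system (hypothetical, for illustration): with refinement factor r, a
# level-l cell (i, j) covers the fine cells (r*i .. r*i+r-1, r*j .. r*j+r-1)
# on level l+1, so no explicit graph structure needs to be stored.

def parent(cell, r=2):
    """Coordinates of the coarse cell containing `cell` one level up."""
    i, j = cell
    return (i // r, j // r)

def children(cell, r=2):
    """All fine cells covered by `cell` one level down."""
    i, j = cell
    return [(r * i + di, r * j + dj) for di in range(r) for dj in range(r)]
```

Neighborhood relations follow the same pattern: adjacency is plain integer arithmetic on the coordinates, with no pointer chasing.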

                     Figure 3: Hierarchy of refinement grids.

3     AMR on Distributed Memory Systems
The distribution of data and work among the nodes of a distributed memory machine
is the most critical part of the whole parallelization. Load-imbalances result in
wait-cycles until the most overloaded node completes its work and can reduce the
achieved speed-up dramatically. Hence, the partitioning has to fulfill a number of
requirements: I. It must be fast, because it is carried out on-the-fly and interrupts
the work of all nodes. II. It must be workload-balancing. III. It should reduce
the redistribution overhead when the partitions change. IV. It should lead to small
communication costs when exchanging boundary data during the numerical update.
    Up to now, almost all authors have focused on requirements I., II. and
sometimes on III. [6]. But test calculations with the highly efficient AMROC code
[4], used by Caltech's ASCI group, show that the largest portion of the computing
time is spent on boundary synchronization. Therefore, this proposal addresses
especially requirement IV.

3.1    Research Ideas
In contrast to less communication-efficient AMR implementations that distribute
the work on each level of the AMR hierarchy independently [8], the AMROC code
already uses a parallelization strategy tailored especially for distributed memory
machines and follows a rigorous domain decomposition approach [5, 3]. The idea is
to preserve the physical locality of the data by keeping all refinement patches, lying
on top of each other, on the same node. Then inter-node communication is only
necessary for exchanging ghost cells at the borders of the node’s subdomain. This is
achieved by partitioning only the base grid and distributing the overlaid refinement
grids accordingly, cf. Fig. 4.
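The distribution rule can be sketched as follows (a minimal illustration with hypothetical names, not AMROC's API): only the base grid carries an explicit partition, and every refinement patch is assigned to the node that owns the base cell beneath its origin, so stacked patches end up on the same node.

```python
# Hedged sketch of the rigorous domain decomposition approach (names are
# illustrative): partition only the base grid, then project each refinement
# patch down to base-grid coordinates to find its owning node.

def owner_of_patch(patch_origin, level, base_partition, r=2):
    """Map a level-`level` patch origin to base-grid coordinates and look
    up the owning node in the base-grid partition."""
    factor = r ** level
    base_cell = (patch_origin[0] // factor, patch_origin[1] // factor)
    return base_partition[base_cell]

# Example: a 4x2 base grid split between two nodes along the middle.
base_partition = {(i, j): 0 if i < 2 else 1
                  for i in range(4) for j in range(2)}
```

A level-1 patch starting at fine cell (2, 1) lies over base cell (1, 0) and therefore stays on node 0, together with any further refinements above it.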

               Figure 4: Domain partitions generated by AMROC.

3.1.1   Effect of the Partitioner on the Boundary Volume
There exists a wide range of domain decomposition methods used in various areas
of computer science and mathematics for load balancing, distributed databases
(data balancing), task scheduling and the layout of large VLSI designs. Some
approaches used in practice are geometric methods like Hilbert's and Lebesgue's
space-filling curves, graph partitioning, and heuristics like iterative
bipartitioning and grid moving.
    Space-filling curves employ very simple recursive algorithms enumerating adja-
cent base cells of N-dimensional rectangular domains. This linearization is then
split up amongst the processors using sequence partitioning. A number of techniques
using inverse space-filling curves is presented in [9]. These techniques have
the advantage that they are very fast and require only little memory. Hilbert's
space-filling curve, as illustrated in Fig. 5, is implemented in AMROC, and some
typical partitioning results are shown in Fig. 4. However, measurements in [6] show
that Hilbert's curve tends to form spirals and does not have minimal boundaries
compared to Lebesgue's curve and multilevel graph partitioning techniques. On the
other hand, Lebesgue's space-filling curve can generate disconnected partitions.
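To make the idea concrete, the following Python sketch linearizes base cells with the Lebesgue (Morton) curve via bit interleaving and cuts the resulting sequence into contiguous pieces of roughly equal workload. It is a simplified illustration, not the AMROC implementation (which uses Hilbert's curve), and the function names are invented.

```python
# Illustrative sketch of space-filling-curve partitioning: the Lebesgue
# (Morton) curve orders 2D integer cell coordinates by bit interleaving;
# the resulting linearization is then cut into contiguous chunks of
# roughly equal workload ("sequence partitioning").

def morton_index(i, j, bits=16):
    """Interleave the bits of (i, j) to get the Lebesgue curve position."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

def sequence_partition(cells, weights, nparts):
    """Greedily cut the curve-ordered cell list into nparts contiguous
    chunks whose workloads are close to the average."""
    order = sorted(cells, key=lambda c: morton_index(*c))
    target = sum(weights[c] for c in cells) / nparts
    parts, current, acc = [], [], 0.0
    for c in order:
        current.append(c)
        acc += weights[c]
        if acc >= target and len(parts) < nparts - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts
```

Because neighboring curve positions are usually neighboring cells, each contiguous chunk tends to be spatially compact, which is what keeps the partition boundaries short.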
    Graph partitioning constitutes an NP-complete problem of discrete mathematics,
whose solutions can be approximated by heuristics using greedy algorithms or
multilevel graph decompositions, as implemented in libraries like Metis and Jostle.
These generally produce better partitions, with the number of boundary cells smaller
by factors between 1.3 and 3.5 in 2D and between 1.8 and 7.0 in 3D, depending on
the number of partitions and the domain size [6]. But the existing libraries require
the construction of a node graph, resulting in high memory usage and low speed.
    Addressing the dynamic imbalances of heterogeneous networks, an iterative hier-
archical partitioning approach, interleaving grid splitting and direct grid exchanges
between imbalanced nodes, is presented in [7]. The machine nodes are grouped into
homogeneous subnetworks, and these groups are grouped further, mapping well to
the currently common hybrid clusters of shared memory nodes and to grid computing
environments. Load balancing is then performed in each level of the group hierarchy
independently and in parallel over all subgroups. Unfortunately, no comparisons to
the other mentioned approaches seem to exist yet.

        Figure 5: Hilbert’s Space-Filling Curve on workload refined grids.

    Some of the partitioning algorithms recommended by [9] and [6] should be incor-
porated into AMROC and compared regarding how they reduce the boundary data
volume. In particular, graph partitioning algorithms seem to be an interesting
alternative, and it should be investigated whether the higher computational effort
of these methods is negligible compared to the gains in communication speed due
to a smaller boundary data volume.
    As higher refinement levels have to be updated considerably more frequently,
the splitting of higher level grids due to partitioning should be avoided whenever
possible. Even a minor workload imbalance will usually be less expensive. The
partitioning method currently used in AMROC does not account for the additional
communication costs due to subgrid splitting. It should be investigated how a
partitioning that reduces subgrid splitting could be achieved and how much impact
on the overall performance it might have.

3.1.2   Overlapping Non-blocking Communication
However, all practically usable partitioning algorithms are imperfect and might
achieve only a 20% increase in performance over the current partitioner implemented
in AMROC. Therefore, other possibilities for optimization have to be explored as well.
    All current implementations of SAMR for distributed environments use a
two-phase cycle for performing the grid updates: first all boundary data is
synchronized, then all grids are updated. Thereby no computations are done while
boundary data may already be available for some grids, and during the computation
phase no transfer of already updated boundary values takes place.
    The overlapping of non-blocking MPI communication with the numerical update
could eliminate the waiting time for boundary data exchanges almost completely,
reducing the computation time by up to 50%. Due to the dynamic nature of the
application and the network, the software has to be able to choose at runtime
which grid should be updated next. This can be achieved by decomposing the
computational work into single tasks that are scheduled according to their priority
and the completion of their synchronization dependencies. In this way, all
synchronization can be carried out in the background while other grids are updated.
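The scheduling idea can be modeled in a short Python sketch (a hypothetical simulation, not the proposed implementation): each grid update becomes a task that is released once its boundary synchronization has completed; among the released tasks, the one with the highest priority, e.g. the finest level, runs next, while communication for the remaining grids proceeds in the background.

```python
# Hedged sketch of dynamic task scheduling driven by synchronization
# completion (all names and the unit-cost time model are illustrative).
import heapq

def schedule(tasks, sync_done_at):
    """tasks: {name: priority}; sync_done_at: {name: time when its boundary
    data has arrived}. Returns the order in which grids are updated."""
    pending = sorted(tasks, key=lambda t: sync_done_at[t])
    ready, order, time = [], [], 0
    while pending or ready:
        # release every task whose boundary synchronization has completed
        while pending and sync_done_at[pending[0]] <= time:
            t = pending.pop(0)
            heapq.heappush(ready, (-tasks[t], t))
        if ready:
            _, t = heapq.heappop(ready)      # highest priority first
            order.append(t)
            time += 1                        # unit cost per grid update
        else:
            time = sync_done_at[pending[0]]  # idle until the next arrival
    return order
```

In the sketch, a grid whose ghost-cell data arrives late is simply overtaken by grids that are already ready, which is exactly the wait-time elimination argued for above.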

3.1.3   Investigation of Hybrid Parallelization Strategies
Nearly all modern high performance distributed memory machines are hybrid
clusters of shared memory systems with 2–16 processors. For example, ASCI Q has
four shared memory processors on each node. On many systems, like ASCI Q, a single
processor of a node can already saturate the inter-node network connections, and it
is often more efficient to perform all inter-node communication with only one
processor, which usually also reduces conflicts on the connecting network.
    Furthermore, only a quarter as many subdomains have to be generated by the
dynamic load balancer, reducing partitioning costs and allowing bigger domains. All
processors of a node participate in the computation of the grid updates by
parallelizing the AMR grid-update loop on each level with OpenMP. The dynamic task
scheduler proposed in the previous section is especially well-suited for distributing
the intra-node work over all processors of a node by assigning ready tasks to free
processors at runtime. This is a new way of utilizing the hybrid nature of modern
distributed memory architectures, combined with reduced boundary synchronization
costs, and could be implemented as an external library for reuse in similar projects.
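The intra-node side of this scheme can be sketched as follows (Python threads stand in for OpenMP worker threads; all names are illustrative): the grid updates of one node are drained from a shared work queue by several workers, while a single designated thread would handle the inter-node exchanges.

```python
# Hedged sketch of the intra-node worker pool (illustrative only): grids
# queued for update are processed in parallel by n_workers threads, the
# software analogue of the OpenMP-parallelized grid-update loop.
import queue
import threading

def node_main(grids, n_workers=3):
    """Update all `grids` of one node using a pool of worker threads."""
    work = queue.Queue()
    for g in grids:
        work.put(g)
    updated = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                g = work.get_nowait()
            except queue.Empty:
                return                  # queue drained, worker exits
            with lock:
                updated.append(g)       # stands in for the numerical update

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(updated)
```

Because ready tasks are pulled from a shared queue rather than statically assigned, a slow grid update on one processor does not stall the others, matching the dynamic scheduling proposed above.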


References

[1] J. Bell, M. Berger, J. Saltzman, and M. Welcome. Three-dimensional adaptive mesh
    refinement for hyperbolic conservation laws. SIAM J. Sci. Comp., 15(1):127–138, 1994.
[2] M. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. J.
    Comput. Phys., 82:64–84, 1988.
[3] R. Deiterding. Parallel adaptive simulation of multi-dimensional detonation structures.
    PhD thesis, Techn. Univ. Cottbus, Sep 2003.
[4] R. Deiterding. AMROC - Blockstructured Adaptive Mesh Refinement in Object-oriented
    C++. Available at http://amroc.sourceforge.net, Oct 2003.
[5] R. Deiterding. Construction and application of an AMR algorithm for distributed memory
    computers. Proc. of Chicago Workshop on Adaptive Mesh Refinement Methods,
    Sep 2003.
[6] S. Schamberger and J. M. Wierum. Graph partitioning in scientific simulations:
    Multilevel schemes versus space-filling curves. Proc. of Parallel Computing
    Technologies (PaCT-2003), 165–179, Sep 2003.
[7] Z. Lan, V. E. Taylor, and G. Bryan. Dynamic load balancing of SAMR applications on
    distributed systems. Proc. of the 2001 ACM/IEEE Conference on Supercomputing,
    36–36, 2001.
[8] C. A. Rendleman, V. E. Beckner, M. Lijewski, W. Crutchfield, and J. B. Bell.
    Parallelization of structured, hierarchical adaptive mesh refinement algorithms.
    Computing and Visualization in Science, 3, 2000.
[9] J. Steensland, S. Chandra, M. Thune, and M. Parashar. Characterization of domain-based
    partitioners for parallel SAMR applications. Proc. of the IASTED International
    Conference on Parallel and Distributed Computing and Systems, Las Vegas,
    425–430, Nov 2000.