Proposal for a Summer Undergraduate Research Fellowship 2004
Computer science / Applied and computational mathematics
Efficient Parallelization Strategies for
Hierarchical AMR Algorithms
Institut für Informatik, Technische Universität Cottbus
Postfach 10 13 44, 03013 Cottbus, Germany
Highly resolved solutions of partial differential equations are nowadays important in many areas of science and technology, and only adaptive mesh refinement methods reduce the necessary work sufficiently to make the calculation of realistic problems feasible. The block-structured AMR method is well suited for the time-explicit computation of large-scale fluid-dynamical problems, but still requires the use of distributed memory computers, raising the need for adequate parallelization strategies. Here the most common strategies are summarized and open problems are outlined. The research ideas proposed for participation in SURF 2004 include the optimization of boundary data synchronization by overlapped non-blocking communication, combined with a simple hybrid parallelization strategy tailored to hybrid cluster architectures, and the incorporation and investigation of different partitioning algorithms.
1 Introduction
A large number of physical phenomena are modeled by partial differential equations. An important sub-class is hydrodynamic flow processes, which are often approximated with finite difference or finite volume methods. In particular, numerical simulations of inviscid gas dynamics are challenging, because a wide range of different scales needs to be resolved. Fig. 1 displays a simulated detonation wave as it exits a rectangular tube; large areas of uniform flow are apparent. On the other hand, the entire physical process of detonation propagation is determined by the chemical reaction at the detonation front, which lies almost completely within
the grey shaded region. This region requires such a high local resolution that realistic simulations would be impossible, even on current high-performance systems, if only uniform discretizations were employed. Problems with non-negligible small-scale processes therefore necessarily require dynamically adaptive mesh refinement methods, which can reduce the computational expense drastically.
Figure 1: Numerical simulation of a detonation wave leaving a rectangular tube, calculated with hierarchical adaptive mesh refinement.
Figure 2: Unstructured (left) and structured (right) mesh refinement strategies.
Mainly two different mesh refinement techniques are in use today, cf. Fig. 2: unstructured (triangles or tetrahedra) and hierarchical structured (rectangular) grids. While the unstructured approach only allows refinement in space, the structured method additionally admits refinement in time.
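The practical consequence of refinement in time can be made concrete with a small calculation: with refinement factor r in space and time, a level-l grid must take r^l substeps per coarse time step, so fine levels dominate the update count. A minimal Python sketch (illustrative only, not part of any AMR code):

```python
def updates_per_coarse_step(cells_per_level, r):
    """Total cell updates in one coarse time step of a hierarchy.

    With refinement factor r in space *and* time, level l must take
    r**l substeps per coarse step, so its cells are updated r**l times.
    """
    return sum(n * r ** l for l, n in enumerate(cells_per_level))

# A small 3-level hierarchy with refinement factor 2: the finest level
# contributes the most updates although it has the fewest cells here.
print(updates_per_coarse_step([10000, 4000, 1600], r=2))  # 24400
```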
But realistic simulations, in particular in three space dimensions, nevertheless require large clustered computers. These systems perform the calculations on localized sub-domains in parallel and can achieve significant speed-ups, provided the workload is well balanced among all nodes. Because of the dynamic character of the applications and the recurrent mesh adaptation, the load-balancing operation is crucial for the performance of the entire computation.
2 Block-structured Adaptive Mesh Refinement
The block-structured adaptive mesh refinement method (AMR or SAMR) after Berger and Colella [2, 1] uses a hierarchy of successively embedded refinement levels with the same refinement factor in space and time. A coarse base grid of uniform finite volumes is refined patch-wise with subgrids, as shown in Fig. 3. Cells marked for refinement are clustered into rectangular blocks forming the new subgrid structure of the next higher level.
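The clustering step can be sketched as follows. This is a simplified bounding-box variant written for illustration (production SAMR codes typically use the more sophisticated Berger–Rigoutsos signature algorithm); the function name and the fill-ratio threshold are assumptions:

```python
def cluster(flags, min_fill=0.7):
    """Recursively cluster flagged cells (a set of (i, j) tuples) into
    rectangular patches, each given as ((x0, y0), (x1, y1))."""
    if not flags:
        return []
    xs = [c[0] for c in flags]
    ys = [c[1] for c in flags]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    area = (x1 - x0 + 1) * (y1 - y0 + 1)
    # Accept the bounding box as a patch if it is filled densely enough.
    if area == 1 or len(flags) / area >= min_fill:
        return [((x0, y0), (x1, y1))]
    # Otherwise split along the longer side and cluster both halves.
    if x1 - x0 >= y1 - y0:
        mid = (x0 + x1) // 2
        left = {c for c in flags if c[0] <= mid}
    else:
        mid = (y0 + y1) // 2
        left = {c for c in flags if c[1] <= mid}
    right = flags - left
    return cluster(left, min_fill) + cluster(right, min_fill)
```

Two well-separated groups of flagged cells end up in two tight rectangular patches instead of one huge sparse one.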
The regularity of the AMR data structures allows high computational performance by exploiting vectorization and cache reuse. While unstructured refinement methods require complex graph structures, AMR only needs a global integer coordinate system to calculate neighborhood and parent-child relationships on-the-fly.
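With refinement factor r, these relationships reduce to integer arithmetic on the global coordinates. A sketch of the idea (the function names are illustrative, not AMROC's actual interface):

```python
def parent_cell(i, j, r):
    """Coarse-level cell containing fine cell (i, j), refinement factor r."""
    return (i // r, j // r)

def children(i, j, r):
    """All fine-level cells covered by coarse cell (i, j)."""
    return [(r * i + di, r * j + dj) for di in range(r) for dj in range(r)]

def neighbors(i, j):
    """Face neighbors on the same level, found without any graph structure."""
    return [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
```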
By incorporating domain decomposition, load balancing and process communication via MPI (Message Passing Interface), the block-structured AMR method is also applicable in distributed memory computing environments such as ASCI Q.
Figure 3: Hierarchy of refinement grids.
3 AMR on Distributed Memory Systems
The distribution of data and work among the nodes of a distributed memory machine is the most critical part of the whole parallelization. Load imbalances result in wait cycles until the most overloaded node completes its work and can reduce the achieved speed-up dramatically. Hence, the partitioning has to fulfill a number of requirements: I. It must be fast, because it is carried out on-the-fly and interrupts the work of all nodes. II. It must balance the workload. III. It should reduce the redistribution overhead when the partitions change. IV. It should lead to small communication costs when exchanging boundary data during the numerical update.
Up to now, almost all authors have focused on requirements I. and II., and sometimes on III. But test calculations with the highly efficient AMROC code used by Caltech's ASCI group show that the largest portion of the computing time is wasted on boundary synchronization. Therefore, this proposal addresses especially requirement IV.
3.1 Research Ideas
In contrast to less communication-efficient AMR implementations that distribute the work on each level of the AMR hierarchy independently, the AMROC code already uses a parallelization strategy tailored especially to distributed memory machines and follows a rigorous domain decomposition approach [5, 3]. The idea is to preserve the physical locality of the data by keeping all refinement patches that lie on top of each other on the same node. Inter-node communication is then only necessary for exchanging ghost cells at the borders of a node's subdomain. This is achieved by partitioning only the base grid and distributing the overlaid refinement grids accordingly, cf. Fig. 4.
Figure 4: Domain partitions generated by AMROC.
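The distribution rule itself fits in a few lines: only base cells carry an owner, and every finer patch inherits the owner of the base cell it projects onto. The following sketch uses a single-corner lookup as an illustrative simplification (a real partitioner must also handle patches straddling a partition boundary):

```python
def patch_owner(patch_lo, level, r, base_owner):
    """Rank owning a refinement patch: project the patch's lower corner
    down to the base grid (level 0) and look up the base-grid partition.

    patch_lo   -- (i, j) lower corner of the patch in level-`level` cells
    r          -- refinement factor between successive levels
    base_owner -- dict mapping base cells to MPI ranks
    """
    scale = r ** level            # coarsening factor down to the base grid
    i, j = patch_lo
    return base_owner[(i // scale, j // scale)]

# A base grid of 4 cells split between two ranks; a level-1 patch at
# fine coordinate (5, 0) projects onto base cell (2, 0), owned by rank 1.
base_owner = {(0, 0): 0, (1, 0): 0, (2, 0): 1, (3, 0): 1}
print(patch_owner((5, 0), level=1, r=2, base_owner=base_owner))  # 1
```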
3.1.1 Effect of the Partitioner on the Boundary Volume
There exists a wide range of domain decomposition methods used in various areas of computer science and mathematics for load balancing, distributed databases (data balancing), task scheduling and the layout of large VLSI designs. Some approaches used in practice are: geometric methods like Hilbert's and Lebesgue's space-filling curves, graph partitioning, and heuristics like iterative bipartitioning and grid moving.
Space-filling curves employ very simple recursive algorithms enumerating adjacent base cells of N-dimensional rectangular domains. This linearization is then split up amongst the processors using sequence partitioning. A number of techniques using inverse space-filling curves is presented in . These techniques have the advantage that they are very fast and use only little memory. Hilbert's space-filling curve, as illustrated in Fig. 5, is implemented in AMROC, and some typical partitioning results are shown in Fig. 4. However, measurements in  show that Hilbert's curve tends to form spirals and does not yield minimal boundaries compared to Lebesgue's curve and multilevel graph partitioning techniques. On the other hand, Lebesgue's space-filling curve can generate disconnected partitions.
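As a concrete illustration of this class of methods, the standard iterative Hilbert-index computation followed by greedy sequence partitioning might look as follows (a sketch; AMROC's actual implementation differs, and the greedy cut rule is one of several possibilities):

```python
def hilbert_index(n, x, y):
    """Distance of cell (x, y) along the Hilbert curve on an n x n grid
    (n a power of two) -- the well-known iterative xy-to-d conversion."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def sequence_partition(cells, weight, nparts, n):
    """Order cells along the Hilbert curve, then cut the sequence into
    nparts contiguous pieces of roughly equal total weight."""
    order = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
    target = sum(weight[c] for c in cells) / nparts
    parts, cur, acc = [], [], 0.0
    for c in order:
        cur.append(c)
        acc += weight[c]
        if acc >= target and len(parts) < nparts - 1:
            parts.append(cur)
            cur, acc = [], 0.0
    parts.append(cur)
    return parts
```

Because each partition is a contiguous piece of the curve, spatially adjacent cells tend to land on the same processor, which is exactly the locality property that keeps boundary volumes moderate.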
Graph partitioning constitutes an NP-complete problem of discrete mathematics, to which solutions can be approximated by heuristics using greedy algorithms or multilevel graph decompositions, as implemented in libraries like Metis and Jostle. They generally generate better partitions, with fewer boundary cells by factors of about 3.5 to 1.3 in 2D and 7.0 to 1.8 in 3D, depending on the number of partitions and the domain size . But the existing libraries require the construction of a node graph, resulting in high memory usage and low speed.
Addressing the dynamic imbalances of heterogeneous networks, an iterative hierarchical partitioning approach, interleaving grid splitting and direct grid exchanges between imbalanced nodes, is presented in . The machine nodes are grouped into homogeneous subnetworks, and these groups are grouped further, mapping well to the currently common hybrid clusters of shared memory nodes and to grid computing environments. Load balancing is then performed on each level of the group hierarchy independently and in parallel over all subgroups. Unfortunately, no comparisons to the other mentioned approaches seem to exist yet.
Figure 5: Hilbert's space-filling curve on workload-refined grids.
Some of the partitioning algorithms recommended by  and  should be incorporated into AMROC and compared with regard to how much they reduce the boundary data volume. In particular, graph partitioning algorithms seem to offer an interesting alternative, and it should be investigated whether the higher computational effort of these methods is negligible compared to the gains in communication speed due to a smaller boundary data volume.
As higher refinement levels have to be updated considerably more frequently, the splitting of higher-level grids by the partitioner should be avoided whenever possible; even a minor workload imbalance will usually be less expensive. The partitioning method currently used in AMROC does not weigh the additional communication costs caused by subgrid splitting. It should be investigated how a partitioning that reduces subgrid splitting could be achieved and how much impact on the overall performance it might have.
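A split-aware partitioner could weigh these costs with a model along the following lines. Everything here is hypothetical (the parameter names, the weights, and the linear cost model are assumptions made for illustration, not AMROC's behavior):

```python
def split_is_worthwhile(imbalance_cells, cut_boundary_cells, level, r,
                        work_per_cell=1.0, comm_per_ghost_cell=5.0):
    """Accept a subgrid split only if the work saved by balancing
    outweighs the extra ghost-cell traffic the cut creates.

    A level-`level` grid is updated r**level times per coarse time step,
    so both terms are weighted by the substep count.
    """
    substeps = r ** level
    gain = imbalance_cells * work_per_cell * substeps
    cost = cut_boundary_cells * comm_per_ghost_cell * substeps
    return gain > cost

# Removing a 1000-cell imbalance at the price of 150 new ghost cells
# pays off under these weights; removing only 100 cells does not.
```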
3.1.2 Overlapping Non-blocking Communication
However, even the best practically usable partitioning algorithms are not perfect and might achieve only about a 20% increase in performance compared to the current partitioner implemented in AMROC. Therefore, other possibilities for optimization have to be considered as well.
All current implementations of SAMR for distributed environments use a two-phase cycle for performing the grid updates: first all boundary data is synchronized, then all grids are updated. Thereby no computations are done while boundary data may already be available for some grids, and during the computation phase no transfer of already updated boundary values takes place.
Overlapping non-blocking MPI communication with the numerical update could eliminate the waiting time for boundary data exchanges almost completely, reducing the computation time by up to 50%. Due to the dynamic nature of the application and the network, the software has to be able to choose at runtime which grid should be updated next. This can be achieved by decomposing the computational work into single tasks that are scheduled according to their priority and the completion of their synchronization dependencies. In this way, all synchronization can be carried out in the background while other grids are updated.
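Such a dependency-driven scheduler could be organized as sketched below. This is a toy model: the class and method names are invented for illustration, and a real implementation would mark dependencies complete by polling outstanding MPI requests (e.g. via MPI_Test) instead of being told explicitly:

```python
import heapq

class TaskScheduler:
    """Grid-update tasks become runnable once their boundary-sync
    dependencies have completed; among runnable tasks, the one with the
    highest priority (e.g. the finest level) runs first."""

    def __init__(self):
        self.waiting = {}      # task name -> set of pending dependencies
        self.priority = {}     # task name -> scheduling priority
        self.ready = []        # heap of (-priority, task name)

    def add_task(self, name, priority, deps):
        self.priority[name] = priority
        pending = set(deps)
        if pending:
            self.waiting[name] = pending
        else:
            heapq.heappush(self.ready, (-priority, name))

    def complete_sync(self, dep):
        """Called when a non-blocking boundary exchange finishes."""
        for task in list(self.waiting):
            self.waiting[task].discard(dep)
            if not self.waiting[task]:
                del self.waiting[task]
                heapq.heappush(self.ready, (-self.priority[task], task))

    def run_next(self):
        """Return the next runnable task, or None if all are blocked."""
        if not self.ready:
            return None
        _, task = heapq.heappop(self.ready)
        return task
```

While a high-priority fine grid waits for its ghost cells, the scheduler simply hands out whatever other grid is ready, so the processors never idle on communication alone.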
3.1.3 Investigation of Hybrid Parallelization Strategies
Nearly all modern high-performance distributed memory machines are hybrid clusters of shared memory systems with 2–16 processors; for example, ASCI Q has four shared memory processors on each node. On many systems, like ASCI Q, a single processor of a node can already saturate the inter-node network connection, and it is often more efficient to perform all inter-node communication through only one processor, which usually reduces conflicts on the connecting network.
Furthermore, only a quarter of the subdomains have to be generated by the dynamic load balancer, reducing partitioning costs and allowing larger domains. All processors of a node are then used in the computation of the grid updates by parallelizing the AMR grid-update loop on each level with OpenMP. The dynamic task scheduler proposed in the last section is especially well suited for distributing the intra-node work over all processors of a node by assigning ready tasks to free processors at runtime. This is a new way of utilizing the hybrid nature of modern distributed memory architectures, combined with reducing boundary synchronization costs, and could be implemented as an external library for reuse in similar projects.
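The intra-node half of such a hybrid scheme — all processors of a node sharing the per-level update loop while a single rank handles MPI traffic — can be mimicked in Python with a thread pool. This is only a sketch of the work-sharing pattern; the proposal envisions OpenMP inside a C++ code, and `advance_grid` is a stand-in for the real numerical update:

```python
from concurrent.futures import ThreadPoolExecutor

def advance_grid(grid):
    """Stand-in for the single-grid numerical update."""
    return [u + 1.0 for u in grid]

def update_level(grids, nprocs=4):
    """OpenMP-style parallel loop over the subgrids of one level:
    each free processor grabs the next not-yet-updated grid."""
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        return list(pool.map(advance_grid, grids))

# Three subgrids of one level, updated concurrently by up to 4 workers.
new_grids = update_level([[0.0, 1.0], [2.0], [3.0, 4.0]])
```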
References
[1] J. Bell, M. Berger, J. Saltzman, and M. Welcome. Three-dimensional adaptive mesh refinement for hyperbolic conservation laws. SIAM J. Sci. Comp., 15(1):127–138, 1994.
[2] M. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. J. Comput. Phys., 82:64–84, 1988.
[3] R. Deiterding. Parallel adaptive simulation of multi-dimensional detonation structures. PhD thesis, Techn. Univ. Cottbus, Sep 2003.
[4] R. Deiterding. AMROC - Blockstructured Adaptive Mesh Refinement in Object-oriented C++. Available at http://amroc.sourceforge.net, Oct 2003.
[5] R. Deiterding. Construction and application of an AMR algorithm for distributed memory computers. Proc. of Chicago Workshop on Adaptive Mesh Refinement Methods,
[6] S. Schamberger and J. M. Wierum. Graph partitioning in scientific simulations: Multilevel schemes versus space-filling curves. Proc. of Parallel Computing Technologies (PaCT-2003), 165–179, Sep 2003.
[7] Z. Lan, V. E. Taylor, and G. Bryan. Dynamic load balancing of SAMR applications on distributed systems. Proc. of the 2001 ACM/IEEE Conference on Supercomputing,
[8] C. A. Rendleman, V. E. Beckner, M. Lijewski, W. Crutchfield, and J. B. Bell. Parallelization of structured, hierarchical adaptive mesh refinement algorithms. Computing and Visualization in Science, 3, 2000.
[9] J. Steensland, S. Chandra, M. Thune, and M. Parashar. Characterization of domain-based partitioners for parallel SAMR applications. Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, 425–430,