Massively Parallel Cosmological Simulations with ChaNGa
Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
Simulations and Scientific Discovery
Help reconcile observation and theory
Calculate final states of theories of structure formation
What should we look for in space?
Direct observational programs
Help determine underlying structures and masses
Computational Challenges
N ~10^12
Direct summation of forces would take ~10^10 teraflop-years
Need efficient, scalable algorithms
Need multiple timestepping
Balance load across processors
Large dynamic ranges
Irregular domains
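The direct-summation figure above can be sanity-checked with a back-of-envelope calculation. This is only a sketch: N comes from the slide, while the flops per pairwise interaction and the number of timesteps are illustrative assumptions, not values from the talk.

```python
# Back-of-envelope check of the direct-summation cost quoted above.
# N comes from the slide; the other two inputs are illustrative assumptions.
N = 1e12                      # particles (from the slide)
FLOPS_PER_INTERACTION = 30    # assumed cost of one pairwise force evaluation
TIMESTEPS = 1e4               # assumed number of timesteps in a full run

SECONDS_PER_YEAR = 365.25 * 24 * 3600
TERAFLOP_YEAR = 1e12 * SECONDS_PER_YEAR  # flops delivered by 1 Tflop/s in one year


def direct_sum_teraflop_years(n, flops_per_interaction, timesteps):
    """Cost of evaluating all n*n pairwise forces at every timestep."""
    total_flops = n * n * flops_per_interaction * timesteps
    return total_flops / TERAFLOP_YEAR


cost = direct_sum_teraflop_years(N, FLOPS_PER_INTERACTION, TIMESTEPS)
print(f"direct summation: ~{cost:.1e} teraflop-years")
```

Under these assumptions the estimate lands at the same order of magnitude as the slide's figure. Different assumed constants shift the result, but the N^2 term dominates, which is the motivation for tree-based methods.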
ChaNGa
Uses Barnes-Hut algorithm
Based on Charm++
Processor virtualization
Asynchronous message-driven model
Computation and communication overlap
Load balancing
Intelligent, adaptive runtime system
Barnes-Hut Algorithm Overview
Space divided into cells
Cells form nodes of Barnes-Hut tree
Particles grouped into buckets
Buckets assigned to TreePieces
[Figure labels: TreePiece 1, TreePiece 2, TreePiece 3]
Computing Forces
Collect relevant nodes/particles at TreePiece
Traverse global tree to get force on each bucket
Nodes are “opened” (too close) or not (far enough)
[Figure legend: nodes involved in computation; nodes not involved]
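The traversal described above can be sketched in a few lines. This is a minimal, serial illustration and not ChaNGa's implementation. It works in 2D, walks the tree once per evaluation point rather than once per bucket, and uses a simple size/distance opening test. The names `THETA`, `BUCKET_SIZE`, `SOFTENING`, `build`, `accel` and `direct` are all hypothetical.

```python
import math
import random
from dataclasses import dataclass, field

THETA = 0.7        # assumed opening threshold (cell size / distance)
BUCKET_SIZE = 8    # assumed maximum number of particles per bucket
SOFTENING = 1e-3   # assumed softening length; avoids division by zero


@dataclass
class Node:
    cx: float                 # cell centre
    cy: float
    half: float               # half of the cell's side length
    particles: list           # (x, y, mass) tuples inside this cell
    mass: float = 0.0
    comx: float = 0.0         # centre of mass
    comy: float = 0.0
    children: list = field(default_factory=list)


def build(particles, cx, cy, half):
    """Recursively split a square cell until each leaf is a small bucket."""
    node = Node(cx, cy, half, particles)
    node.mass = sum(m for _, _, m in particles)
    node.comx = sum(x * m for x, _, m in particles) / node.mass
    node.comy = sum(y * m for _, y, m in particles) / node.mass
    if len(particles) > BUCKET_SIZE and half > 1e-9:
        h = half / 2
        for sx in (False, True):
            for sy in (False, True):
                sub = [p for p in particles
                       if (p[0] >= cx) == sx and (p[1] >= cy) == sy]
                if sub:
                    node.children.append(
                        build(sub, cx + (h if sx else -h), cy + (h if sy else -h), h))
    return node


def pair_accel(px, py, m, x, y):
    """Softened acceleration at (x, y) due to mass m at (px, py), with G = 1."""
    dx, dy = px - x, py - y
    r2 = dx * dx + dy * dy + SOFTENING ** 2
    s = m / (r2 * math.sqrt(r2))
    return dx * s, dy * s


def direct(particles, x, y):
    """Reference result: sum over every particle."""
    ax = ay = 0.0
    for px, py, m in particles:
        fx, fy = pair_accel(px, py, m, x, y)
        ax += fx
        ay += fy
    return ax, ay


def accel(node, x, y, theta=THETA):
    """Tree walk: open nodes that are too close, approximate the rest."""
    if not node.children:
        return direct(node.particles, x, y)        # leaf bucket: use its particles
    dist = math.hypot(node.comx - x, node.comy - y)
    if 2 * node.half > theta * dist:               # too close: open the node
        ax = ay = 0.0
        for child in node.children:
            fx, fy = accel(child, x, y, theta)
            ax += fx
            ay += fy
        return ax, ay
    # far enough: treat the whole cell as one mass at its centre of mass
    return pair_accel(node.comx, node.comy, node.mass, x, y)


rng = random.Random(0)
particles = [(rng.random(), rng.random(), 1.0 / 500) for _ in range(500)]
root = build(particles, 0.5, 0.5, 0.5)
print("tree:  ", accel(root, 1.2, 1.2))
print("direct:", direct(particles, 1.2, 1.2))
```

Setting `theta=0` opens every node and reproduces the direct sum. Larger values open fewer nodes and trade accuracy for speed.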
Algorithm Overview
[Figure: per-processor view of TreePieces. Timeline labels: Local Work; Global Work; Pref (n-1), Comp (n-1), Pref (n), Comp (n), Pref (n+1), Comp (n+1). Flow labels: TreePiece Needs Remote Particles; CacheManager: Have in Cache? (Yes / No); Request Particles; Reply with Particles; Receive Particles]
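The request flow in the diagram above (check the cache, request on a miss, deliver on reply) can be sketched as a small callback-driven class. This is an illustration under assumptions, not ChaNGa's CacheManager. The method names, the `send_request` hook, and the coalescing of duplicate requests for the same key are all hypothetical.

```python
class CacheManager:
    """Per-processor software cache for remote tree data (illustrative only)."""

    def __init__(self, send_request):
        self._send_request = send_request  # hypothetical hook: ships a request to the owner
        self._cache = {}                   # key -> particles already fetched
        self._waiting = {}                 # key -> callbacks waiting for a reply

    def request(self, key, on_ready):
        """Called when a TreePiece needs remote particles. Returns True on a cache hit."""
        if key in self._cache:             # "Have in cache?" -> yes
            on_ready(self._cache[key])
            return True
        waiters = self._waiting.setdefault(key, [])
        waiters.append(on_ready)
        if len(waiters) == 1:              # first miss for this key: send one request
            self._send_request(key)
        return False

    def receive(self, key, particles):
        """Called when the owner replies with particles."""
        self._cache[key] = particles
        for on_ready in self._waiting.pop(key, []):
            on_ready(particles)


sent = []                                  # stands in for the network
delivered = []                             # stands in for TreePiece computation
cache = CacheManager(sent.append)

cache.request("node-7", delivered.append)  # miss: one request goes out
cache.request("node-7", delivered.append)  # miss, but a request is already in flight
cache.receive("node-7", [(0.1, 0.2, 1.0)]) # reply arrives, both waiters resume
cache.request("node-7", delivered.append)  # hit: served from the cache
print(len(sent), "request sent;", len(delivered), "deliveries")
```

Because `request` returns immediately on a miss, the caller is free to continue with local work while the reply is in flight, which is the overlap the pipelined timeline depicts.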
Major Optimizations
Pipelined computation
Prefetch tree chunk before starting traversal
Tree-in-Cache
Aggregate trees from all chares on processor
Tunable computation granularity
Response time for data requests vs Scheduling overhead
Experimental Setup
lambs: 3 million particles
dwarf: 5 and 50 million particles
hrwh_LCDMs: 16 million particles
drgas: 700 million particles
Experimental Setup (contd.)
Platforms
Parallel Performance
Comparison of parallel performance with PKDGRAV ('Dwarf' dataset on Tungsten)
Scaling Tests
IBM BG/L
Cray XT3
Poor scaling
Towards Greater Scalability
Load Imbalance causes poor scaling
Static balancing not good enough
Even number of particles != Even work distribution
Must balance both computation & communication
Balancing Load to Improve Performance
[Figure: computation and communication over time (Time →); annotations: Increased communication, Greater balance]
LB algorithms must consider both computation and communication
Accounting for Communication: OrbRefineLB
Based on Charm++ OrbLB
ORB along object ident. line
[Figure: OrbLB, Dwarf dataset, 1024 BG/L processors; axes: Processors →, Time →]
OrbRefineLB: 'Refines' placement by exchanging load between processors in a shifting window
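The idea of bisecting along a line of objects and then refining within a window of processors can be illustrated with a toy model. This sketch is not the Charm++ OrbLB or OrbRefineLB code. It assumes objects are ordered along a line, each processor owns a contiguous run of them (a stand-in for keeping communicating objects together), the window spans two neighbouring processors, and only computation load is modelled. The function names `orb_cuts`, `proc_loads` and `refine` are hypothetical.

```python
def orb_cuts(loads, nprocs):
    """Recursive bisection of a line of objects into nprocs contiguous runs.

    Returns cuts such that processor p owns objects cuts[p]:cuts[p + 1].
    """
    cuts = [0] * (nprocs + 1)
    cuts[nprocs] = len(loads)

    def bisect(lo, hi, p_lo, p_hi):
        # objects [lo, hi) are shared among processors [p_lo, p_hi)
        if p_hi - p_lo == 1:
            return
        p_mid = (p_lo + p_hi) // 2
        target = sum(loads[lo:hi]) * (p_mid - p_lo) / (p_hi - p_lo)
        acc, cut = 0.0, lo
        while cut < hi and acc + loads[cut] / 2 <= target:
            acc += loads[cut]
            cut += 1
        cuts[p_mid] = cut
        bisect(lo, cut, p_lo, p_mid)
        bisect(cut, hi, p_mid, p_hi)

    bisect(0, len(loads), 0, nprocs)
    return cuts


def proc_loads(loads, cuts):
    return [sum(loads[cuts[p]:cuts[p + 1]]) for p in range(len(cuts) - 1)]


def refine(loads, cuts, sweeps=5):
    """Slide a two-processor window along the line, moving one boundary
    object at a time whenever that lowers the heavier load in the window."""
    cuts = list(cuts)
    for _ in range(sweeps):
        moved = False
        for p in range(len(cuts) - 2):
            left = sum(loads[cuts[p]:cuts[p + 1]])
            right = sum(loads[cuts[p + 1]:cuts[p + 2]])
            if left > right and cuts[p + 1] > cuts[p]:
                obj = loads[cuts[p + 1] - 1]       # last object on the left processor
                if max(left - obj, right + obj) < left:
                    cuts[p + 1] -= 1
                    moved = True
            elif right > left and cuts[p + 2] > cuts[p + 1]:
                obj = loads[cuts[p + 1]]           # first object on the right processor
                if max(left + obj, right - obj) < right:
                    cuts[p + 1] += 1
                    moved = True
        if not moved:
            break
    return cuts


# Static placement by object count, then refinement with (made-up) measured work.
counts = [1] * 12
work = [5, 1, 1, 1, 8, 1, 1, 1, 2, 9, 1, 1]
static_cuts = orb_cuts(counts, 4)
refined_cuts = refine(work, static_cuts)
print("static :", proc_loads(work, static_cuts))
print("refined:", proc_loads(work, refined_cuts))
```

The usage example also shows why equal object counts do not imply equal work: the initial cuts split the objects evenly, yet the per-processor work differs until boundary objects are shifted.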
Results with OrbRefineLB
Different datasets OrbRefineLB
Multistepped Simulations for Greater Efficiency
Group particles into 'rungs'
Lower rung means higher acceleration
Different rungs active at different times
Update particles on higher rungs less frequently
Less work done than singlestepping
[Figure: phase sequence over time (Time →): 0 1 0 2 0 1 0 2]
Computation split into phases
0: rung 0
1: rungs 0,1
2: rungs 0,1,2
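The phase pattern above follows from halving the update frequency with each higher rung. A minimal sketch, assuming rung r is updated every 2^r substeps and using made-up particle counts per rung, generates the phase sequence and compares the number of particle updates against singlestepping. The function names are hypothetical.

```python
def phase_sequence(max_rung, substeps):
    """Phase for substeps 1..substeps; phase k updates rungs 0..k."""
    phases = []
    for s in range(1, substeps + 1):
        trailing_zeros = (s & -s).bit_length() - 1   # largest r with 2**r dividing s
        phases.append(min(trailing_zeros, max_rung))
    return phases


def multistep_updates(rung_counts, substeps):
    """Particle updates when each phase only touches its active rungs."""
    max_rung = len(rung_counts) - 1
    return sum(sum(rung_counts[:phase + 1])
               for phase in phase_sequence(max_rung, substeps))


def singlestep_updates(rung_counts, substeps):
    """Particle updates when every particle is updated at every substep."""
    return sum(rung_counts) * substeps


rung_counts = [100, 1_000, 10_000]   # hypothetical particles on rungs 0, 1, 2
print(phase_sequence(2, 8))          # [0, 1, 0, 2, 0, 1, 0, 2]
print(multistep_updates(rung_counts, 4), "vs", singlestep_updates(rung_counts, 4))
```

With most particles on the highest rung, the small phases do very little work. The uneven work per phase is why load balancing has to treat each phase separately.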
Balancing Load in MS Runs
Different strategies for different phases
Multiphase instrumentation
Model-based load estimation (first few small steps)
[Figure: phases 0, 1, 0, 2]
Preliminary Results
Different timestepping schemes, Dwarf dataset, 32 BG/L processors
Singlestepped (613 s)
Multistepped (429 s)
Multistepped with load balancing (228 s)
Preliminary Results (contd.)
~50% reduction in execution time:
Lambb dataset
512 and 1024 BG/L processors
Singlestepped vs load-balanced multistepped
Multistepping and overdecomposition
Lambb dataset
1024 BG/L processors
Varying number of TreePieces
More TreePieces → greater load balance
Future Work
SPH
Alternative decomposition schemes
Runtime optimizations to reduce communication cost
More sophisticated load balancing algorithms
Account for:
Complete simulation space topology
Processor topology (reduce hop-bytes)
Conclusions
Introduced ChaNGa
Optimizations to reduce simulation time
Load imbalance issues tackled
Multiple timestepping beneficial
Balancing load in multistepped simulations