Massively Parallel Cosmological Simulations with ChaNGa

Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn

Simulations and Scientific Discovery
  - Help reconcile observation and theory
      - Calculate final states of theories of structure formation
  - What should we look for in space?
      - Direct observational programs
  - Help determine underlying structures and masses

Computational Challenges
  - N ~ 10^12
  - Direct summation of forces would take ~10^10 Teraflop years
  - Need efficient, scalable algorithms
  - Need multiple timestepping
  - Balance load across processors
  - Large dynamic ranges
  - Irregular domains

ChaNGa
  - Uses Barnes-Hut algorithm
  - Based on Charm++
      - Processor virtualization
      - Asynchronous message-driven model
      - Computation and communication overlap
      - Load balancing
      - Intelligent, adaptive runtime system

Barnes-Hut Algorithm Overview
  - Space divided into cells
  - Cells form nodes of Barnes-Hut tree
  - Particles grouped into buckets
  - Buckets assigned to TreePieces
  [Figure: tree divided among TreePiece 1, TreePiece 2 and TreePiece 3]

Computing Forces
  - Collect relevant nodes/particles at TreePiece
  - Traverse global tree to get force on each bucket
  - Nodes "opened" (too close) or not (far enough)
  [Figure legend: involved in computation / not involved]

Algorithm Overview
  [Figure: flow diagram. Labels: Processor; TreePieces; TreePiece needs remote particles; CacheManager; Have in cache? (yes / no); Request particles; Receive particles; Reply with particles; Local work; Global work; pipeline stages Pref (n-1), Comp (n-1), Pref (n), Comp (n), Pref (n+1), Comp (n+1)]

Major Optimizations
  - Pipelined computation
      - Prefetch tree chunk before starting traversal
  - Tree-in-Cache
      - Aggregate trees from all chares on processor
  - Tunable computation granularity
      - Response time for data requests vs scheduling overhead

Experimental Setup
  Dataset       Particles
  lambs         3 million
  dwarf         5 and 50 million
  hrwh_LCDMs    16 million
  drgas         700 million

Experimental Setup (contd.)
  - Platforms

Parallel Performance
  - A comparison of parallel performance with PKDGRAV (`Dwarf' dataset on Tungsten)

Scaling Tests
  [Figure: scaling plots on IBM BG/L and Cray XT3; annotation: poor scaling]

Towards Greater Scalability
  - Load imbalance causes poor scaling
  - Static balancing not good enough
  - Even number of particles != even work distribution
  - Must balance both computation and communication

Balancing Load to Improve Performance
  [Figure: timelines (time →) showing computation and communication; annotations: increased communication, greater balance]
  - LB algorithms must consider both computation and communication

Accounting for Communication: OrbRefineLB
  - Based on Charm++ OrbLB
  - ORB along object ident. line
  - OrbRefineLB: `refines' placement by exchanging load between processors in shifting window
  [Figure: OrbLB, Dwarf dataset, 1024 BG/L processors; axes: time →, processors →]

Results with OrbRefineLB
  [Figure: different datasets; OrbRefineLB]

Multistepped Simulations for Greater Efficiency
  - Group particles into `rungs'
      - Lower rung means higher acceleration
  - Different rungs active at different times
      - Update particles on higher rungs less frequently
  - Less work done than singlestepping
  - Computation split into phases
      0: rung 0
      1: rungs 0,1
      2: rungs 0,1,2
  [Figure: phase sequence over time →: 0 1 0 2 0 1 0 2]

Balancing Load in MS Runs
  - Different strategies for different phases
  - Multiphase instrumentation
  - Model-based load estimation (first few small steps)
  [Figure: phase sequence 0 1 0 2]

Preliminary Results
  Different timestepping schemes, Dwarf dataset, 32 BG/L processors
  - Singlestepped: 613 s
  - Multistepped: 429 s
  - Multistepped with load balancing: 228 s

Preliminary Results
  - ~50% reduction in execution time
  - Lambb dataset, 512 and 1024 BG/L processors
  - Singlestepped vs load-balanced multistepped

Multistepping and overdecomposition
  - Lambb dataset, 1024 BG/L processors
  - Varying num. TreePieces
  - More TreePieces → greater load balance

Future Work
  - SPH
  - Alternative decomposition schemes
  - Runtime optimizations to reduce communication cost
  - More sophisticated load balancing algorithms
      Account for:
      - Complete simulation space topology
      - Processor topology (reduce hop-bytes)

Conclusions
  - Introduced ChaNGa
  - Optimizations to reduce simulation time
  - Load imbalance issues tackled
  - Multiple timestepping beneficial
  - Balancing load in multistepped simulations
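
Illustration: multistepping phase schedule

The multistepping slides give the phase sequence 0 1 0 2 0 1 0 2, where phase p updates rungs 0 through p. The following is a minimal Python sketch of how such a sequence could arise, assuming rung r is updated every 2^r substeps. That power-of-two spacing, the function names and the particle counts are assumptions made for illustration; none of this is taken from the ChaNGa source.

```python
def phase_schedule(num_rungs, num_substeps):
    """Return the phase (highest active rung) for each substep.

    Assumption: rung r is updated once every 2**r substeps, so rung 0
    is active in every substep and higher rungs are active less often.
    """
    schedule = []
    for step in range(1, num_substeps + 1):
        phase = 0
        # Climb to the highest rung whose period divides this substep.
        while phase + 1 < num_rungs and step % (2 ** (phase + 1)) == 0:
            phase += 1
        schedule.append(phase)
    return schedule


def particle_updates(rung_counts, num_substeps):
    """Count particle updates when phase p updates rungs 0..p."""
    schedule = phase_schedule(len(rung_counts), num_substeps)
    return sum(sum(rung_counts[: phase + 1]) for phase in schedule)


if __name__ == "__main__":
    print(phase_schedule(3, 8))  # [0, 1, 0, 2, 0, 1, 0, 2]

    # Hypothetical particle counts on rungs 0, 1 and 2.
    counts = [100, 300, 600]
    multistepped = particle_updates(counts, 8)
    singlestepped = sum(counts) * 8
    print(multistepped, singlestepped)
```

Counting particle updates is only a rough proxy for work, since the cost of a force evaluation varies from particle to particle. It does show why the load differs from phase to phase, which is the motivation the slides give for using different balancing strategies in different phases.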
