Scaling Applications to Massively Parallel Machines
using Projections Performance Analysis Tool

Presented by Chee Wai Lee
Authors: L. V. Kale, Gengbin Zheng, Chee Wai Lee, Sameer Kumar
Motivation
• Performance optimization is increasingly challenging
  – Modern applications are complex and dynamic
  – Some may involve only a small amount of computation per step
  – Performance issues and obstacles change
• We need very good performance analysis tools
  – Feedback at the level of the application
  – Analysis capabilities
  – Scalable views
  – Automatic instrumentation
Projections
• Outline:
  – Projections: trace generation
  – Projections: views
  – Case study: NAMD, a molecular dynamics program that won a
    Gordon Bell award at SC’02 by scaling MD for biomolecules to
    3,000 processors
  – Case study: CPAIMD, a Car-Parrinello ab initio MD application
  – Performance analysis on next-generation supercomputers:
    challenges
Trace Generation
• Automatic instrumentation by the runtime system
• Detailed
  – In the log mode, each event is recorded in full detail
    (including a timestamp) in an internal buffer
• Summary
  – Reduces the size of output files and memory overhead
  – Produces (in the default mode) a few lines of output data per
    processor
  – Data is recorded in bins corresponding to intervals of 1 ms by
    default
• Flexible
  – APIs and runtime options for instrumenting user events and
    controlling data generation (see the sketch below)
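As an illustration, here is a minimal sketch of user-event instrumentation with the Charm++ tracing API; the exact calls and link-time trace modes should be checked against the Charm++ manual for the version in use, and the event name and doForceWork() routine below are made-up placeholders.

    #include "charm++.h"      // Charm++ runtime, including its trace API

    int forceEventID;         // user-event ID assigned by the runtime

    // Called once after startup (e.g., from a mainchare constructor):
    void registerTraceEvents() {
      // Register a named user event; Projections later shows it by name.
      forceEventID = traceRegisterUserEvent("my-force-kernel");
    }

    void doForceWork();       // hypothetical user computation

    void computeForces() {
      double start = CkWallTimer();   // wall-clock time in seconds
      doForceWork();
      // Record a bracketed user event spanning the computation, so it
      // appears as a distinct span in the Projections timeline view.
      traceUserBracketEvent(forceEventID, start, CkWallTimer());
    }

The program is then linked with a trace mode (for example, -tracemode projections for detailed logs or -tracemode summary for binned data).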
Post-mortem analysis: views
• Utilization Graph
  – As a function of time interval or of processor
  – Shows processor utilization
  – As well as time spent on specific parallel methods
• Timeline
  – Upshot-like, but with more detail
  – Pop-up views of method executions, message arrows, and
    user-level events
• Profile: stacked graphs
  – For a given period, a breakdown of the time spent on each
    processor (see the sketch below)
      • Includes idle time and message sending and receiving times
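To make the profile and utilization views concrete, here is a small sketch (not Projections' actual implementation) of how a per-processor busy-time breakdown can be computed for a chosen time window from detailed event logs:

    #include <vector>

    // One logged execution of an entry method on some processor.
    struct Event { int pe; double start, end; };     // times in seconds

    // Fraction of the window [t0, t1) that each processor spent busy;
    // the remainder of each entry is idle (or communication) time.
    std::vector<double> utilization(const std::vector<Event>& log,
                                    int numPEs, double t0, double t1) {
      std::vector<double> busy(numPEs, 0.0);
      for (const Event& e : log) {
        double s = e.start > t0 ? e.start : t0;      // clip event to window
        double f = e.end   < t1 ? e.end   : t1;
        if (f > s) busy[e.pe] += f - s;
      }
      for (double& b : busy) b /= (t1 - t0);         // normalize to [0, 1]
      return busy;
    }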


Projections Views: continued
• Overview
  – Like a timeline, but includes all processors and all time!
  – Each pixel (x, y) represents the utilization of processor y at
    time x
• Histogram of method execution times
  – How many method-execution instances took 0-1 ms? 1-2 ms? ...
    (see the sketch after this list)
• Performance counters
  – Associated with each entry method
  – The usual counters, via an interface to PAPI
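The histogram view essentially bins entry-method execution times; a minimal sketch of that binning (again, not the tool's actual code) looks like this:

    #include <cmath>
    #include <vector>

    // Count method executions per 1 ms bucket: bucket 0 holds 0-1 ms,
    // bucket 1 holds 1-2 ms, and so on; the final bucket collects overflow.
    std::vector<long> histogram(const std::vector<double>& durationsMs,
                                double binMs = 1.0, int numBins = 50) {
      std::vector<long> bins(numBins, 0);
      for (double d : durationsMs) {
        int b = static_cast<int>(std::floor(d / binMs));
        if (b < 0) b = 0;
        if (b >= numBins) b = numBins - 1;   // overflow bucket
        ++bins[b];
      }
      return bins;
    }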

Projections and Performance Analysis
• Identify performance bottlenecks
• Verify that performance meets expectations
Case Studies: Outline
• We illustrate the use of Projections
  – Through case studies of NAMD and CPAIMD
  – Illustrating the use of different visualization options
  – Showing a performance debugging methodology
NAMD: A Production MD Program
• Fully featured program
  – NIH-funded development
• Distributed free of charge (~5000 downloads so far)
  – Binaries and source code
• Installed at NSF centers
  – User training and support
• Large published simulations (e.g., the aquaporin simulation
  featured in the SC’02 keynote)
Collaboration with K. Schulten, R. Skeel, and co-workers
Molecular Dynamics in NAMD
• Collection of (charged) atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 - 500,000)
• At each timestep (sketched in code below)
  – Calculate forces on each atom
      • Bonded forces
      • Non-bonded: electrostatic and van der Waals
          – Short-range: every timestep
          – Long-range: using PME (3D FFT)
          – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
• Challenge: femtosecond timestep, millions of timesteps needed!
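A schematic sketch of this timestep loop with multiple time stepping (the type and function names are placeholders, not NAMD's actual routines):

    struct System;                              // placeholder for simulation state

    // Stand-in force and integration routines:
    void computeBondedForces(System&);          // bonds, angles, dihedrals
    void computeShortRangeNonbonded(System&);   // cutoff electrostatics + van der Waals
    void computeLongRangePME(System&);          // 3D-FFT-based long-range electrostatics
    void integrate(System&, double dt);         // update velocities, advance positions

    // One MD step: cheap forces every step, PME only on every 4th step.
    void mdStep(long step, System& sys) {
      computeBondedForces(sys);
      computeShortRangeNonbonded(sys);
      if (step % 4 == 0) computeLongRangePME(sys);
      integrate(sys, 1.0e-15);                  // advance by ~1 femtosecond
    }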
NAMD Parallelization using Charm++ with PME
[Figure: decomposition into 700 VPs, 30,000 VPs, and 192 + 144 VPs]
These 30,000+ Virtual Processors (VPs) are mapped to real processors
by the Charm++ runtime system.
Grainsize Issues
• A variant of Amdahl’s law, for objects:
  – The fastest time can be no shorter than the time for the biggest
    single object! (formalized below)
  – A lesson from previous efforts
• Splitting computation objects:
  – 30,000 non-bonded compute objects
  – Instead of approximately 10,000
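Stated as a bound (a standard formalization, not taken verbatim from the slides): if object i takes time g_i and P processors are available, the parallel completion time satisfies

    T \;\ge\; \max\left( \max_i g_i,\ \frac{1}{P} \sum_i g_i \right)

so shrinking the largest grain, here by splitting the non-bonded compute objects, lowers the first term of the bound.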



[Figure: distribution of execution times of non-bonded force
computation objects (over 24 steps); mode: 700 µs]
Message Packing Overhead and Multicast
[Figure: effect of the multicast optimization on integration overhead,
obtained by eliminating the overhead of message copying and allocation]
[Figure: processor utilization against time on 128 and 1024 processors,
with markers for the aggressive load balancing and refinement load
balancing steps]
On 128 processors a single load balancing step suffices, but on 1024
processors we need a “refinement” step.
Load Balancing Steps
[Figure: timeline of load balancing steps: regular timesteps,
instrumented timesteps, a detailed aggressive load balancing step
(object migration), and a refinement load balancing step]
[Figure: processor utilization across processors after (a) greedy load
balancing, which still leaves some overloaded processors, and (b)
refinement]
Note that the underloaded processors are left underloaded (as they do
not impact performance); refinement deals only with the overloaded ones.
Benefits of Avoiding Barriers
• Problem with barriers:
  – Not so much the direct cost of the operation itself
  – But that they prevent the program from adjusting to small
    variations
      • E.g., k phases, separated by barriers (or scalar reductions)
      • Load is effectively balanced, but:
          – In each phase, there may be slight non-deterministic load
            imbalance
          – Let L_{i,j} be the load on the i-th processor in the j-th
            phase

  With barrier:   \sum_{j=1}^{k} \max_i \{ L_{i,j} \}        Without barrier:   \max_i \left\{ \sum_{j=1}^{k} L_{i,j} \right\}

[Figure: timeline view; the marked interval spans about 100 milliseconds]
Handling Stretches
• Challenge
  – NAMD still did not scale well to 3,000 processors with 4
    processors per node
  – Due to stretches: inexplicable increases in compute time or
    communication gaps at random (but few) points
  – Stretches caused by operating system, file system, and resource
    management daemons interfering with the job
  – Also by a badly configured network API
      • Messages waiting for the rendezvous of the previous message
        to be acknowledged, leading to stretches in the ISends
• Managing stretches (a generic sketch follows)
  – Use blocking receives
  – Give the OS time to run daemons when the job process is idle
  – Fine-tune the network layer
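A generic illustration of the "give the OS time when idle" idea (this is not NAMD's or Charm++'s actual mechanism; onIdle() is a hypothetical hook called when no messages are pending):

    #include <unistd.h>   // usleep

    // Hypothetical idle hook: instead of busy-waiting and competing with
    // OS, file-system, and resource-manager daemons, briefly yield the
    // CPU. A blocking receive achieves a similar effect at the network layer.
    void onIdle() {
      usleep(1000);       // sleep ~1 ms so daemons can run now, rather than
                          // preempting the job in the middle of a computation
    }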
Stretched Computations
• Jitter in computes of up to 80 ms
  – On 1000+ processors using 4 processors per node
  – NAMD ATPase on 3,000 processors: timesteps of 12 ms
  – Within that time, each processor sends and receives:
      • Approximately 60-70 messages of 4-6 KB each
  – OS context switch time is 10 ms
  – The OS and communication layer can have “hiccups”
      • These “hiccups” are termed stretches
  – Stretches can be a large performance impediment
Stretch Removal
[Figure: histogram views of the number of function executions vs. their
granularity; note the log scale on the Y-axis]
Before optimizations: over 16 large stretched calls.
After optimizations: about 5 large stretched calls, the largest of them
much smaller, and almost all calls take less than 3.2 ms.
Activity Priorities
• Identified, using the Time Profile tool, a portion of CPAIMD that
  ran too early.
Serial Performance
• Performance counters helped identify serial performance issues,
  such as poor cache behavior.
• Projections uses PAPI to measure hardware performance counters
  (see the sketch below).
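For reference, here is a minimal sketch of reading a hardware counter directly through PAPI's low-level C API (independent of Projections; error handling is omitted, runSerialKernel() is a hypothetical routine, and the chosen preset event must exist on the machine):

    #include <papi.h>
    #include <cstdio>

    void runSerialKernel();   // hypothetical computation under study

    void measureKernel() {
      int eventSet = PAPI_NULL;
      long long counts[1];

      PAPI_library_init(PAPI_VER_CURRENT);    // initialize the PAPI library
      PAPI_create_eventset(&eventSet);
      PAPI_add_event(eventSet, PAPI_L2_DCM);  // level-2 data cache misses

      PAPI_start(eventSet);
      runSerialKernel();
      PAPI_stop(eventSet, counts);            // stop counting and read the value

      std::printf("L2 data cache misses: %lld\n", counts[0]);
    }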




Challenges Ahead
• Scalable performance data generation
  – Meaningful restrictions on trace data generation
  – Data compression
  – Online analysis
• Scalable performance visualization
  – Automatic identification of performance problems