Optimizing N-body Simulations for Multi-core Compute Clusters


Ammar Ahmad Awan
BIT-6

Advisor: Dr. Aamir Shafi
Co-Advisor: Mr. Ali Sajjad
Member: Dr. Hafiz Farooq
Member: Mr. Tahir Azim

        Presentation Outline

•   Introduction
•   Design & Implementation
•   Performance Evaluation
•   Conclusions and Future Work




           Introduction
• A sea change in basic computer architecture, driven by:
   – Power consumption
   – Heat dissipation
• Emergence of multiple energy-efficient processing cores
  instead of a single power-hungry core
• Moore's law will now be realized by increasing core count
  instead of increasing clock speeds
• Impact on software applications:
   – Change of focus from Instruction Level Parallelism (higher clock
     frequencies) to Thread Level Parallelism (increasing core counts)
• Huge impact on the High Performance Computing (HPC)
  community:
   – 70% of the TOP500 supercomputers are based on multi-core
     processors
[Figure omitted. Source: Google Images]

[Figure omitted. Source: www.intel.com]
              SMP vs Multicore

[Diagram: a Symmetric Multi-Processor (SMP) machine has several
single-core processors, each with its own cache and MMU, attached to
a shared main memory. A multi-core processor (here a dual-core) puts
Core 1 and Core 2 on one chip behind a shared cache and MMU, likewise
attached to main memory.]
             HPC and Multi-core
• Message Passing Interface (MPI) is the de facto standard for
  programming today's supercomputers
   – Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC)

• With the existing approaches, it is possible to run MPI applications
  on multi-core processors:
   – One MPI process per core, which we call the "Pure MPI" approach
   – OpenMP threads inside each MPI process, which we call the
     "MPI+threads" approach

• We expect the "MPI+threads" approach to perform well because
   – Communication between threads is cheaper than between processes
   – Threads are light-weight

• We have evaluated this hypothesis by comparing both approaches
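
A minimal sketch of the hybrid "MPI+threads" style, assuming a
hypothetical hello-world program rather than Gadget-2 itself: one MPI
process is started per node, and OpenMP threads cover the cores
inside it (build with, e.g., mpicc -fopenmp).

/* Hybrid MPI+OpenMP sketch (hypothetical example, not Gadget-2 code) */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request an MPI library that tolerates threads inside a process */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Cores inside the node are driven by threads, not extra processes */
    #pragma omp parallel
    printf("process %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}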
Pure MPI vs "MPI+threads" approach

[Diagram: in the "MPI+threads" approach, one MPI process runs per
node and spawns threads on its cores; in the Pure MPI approach, a
separate MPI process runs on every core.]
           Sample Application: N-body Simulations

• To demonstrate the usefulness of our "MPI+threads" approach,
  we chose an N-body simulation code

• The N-body or "many-body" method simulates the
  evolution of a system consisting of 'n' bodies

• It has found widespread use in the fields of
   – Astrophysics
   – Molecular Dynamics
   – Computational Biology
         Summation Approach to Solving N-body Problems

The most compute-intensive part of any N-body method is
the "force calculation" phase.

The simplest expression for the far-field force f(i) on particle 'i' is

  for i = 1 to n
    f(i) = sum[ j = 1,...,n, j != i ] f(i,j)
  end for

where f(i,j) is the force on particle i due to particle j.

The cost of this calculation is O(n^2).
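As a concrete illustration, a minimal C sketch of this direct
summation for gravity follows; the Body layout, the softening
parameter EPS, and compute_forces() are illustrative assumptions,
not Gadget-2's actual data structures.

#include <math.h>

#define G   6.674e-11   /* gravitational constant */
#define EPS 1e-3        /* softening length, avoids the r -> 0 singularity */

typedef struct { double pos[3], mass, force[3]; } Body;

/* Direct O(n^2) summation: f(i) = sum over j != i of f(i,j) */
void compute_forces(Body *b, int n)
{
    for (int i = 0; i < n; i++) {
        b[i].force[0] = b[i].force[1] = b[i].force[2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d[3], r2 = EPS * EPS;
            for (int k = 0; k < 3; k++) {
                d[k] = b[j].pos[k] - b[i].pos[k];
                r2 += d[k] * d[k];
            }
            /* |f(i,j)| = G m_i m_j / r^2, directed from i towards j */
            double f = G * b[i].mass * b[j].mass / (r2 * sqrt(r2));
            for (int k = 0; k < 3; k++)
                b[i].force[k] += f * d[k];
        }
    }
}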
               Barnes-Hut Tree

The Barnes-Hut algorithm is divided into 3 steps:
1. Building the tree – O(n log n)

2. Computing cell centers of mass – O(n) (see the sketch after this slide)

3. Computing forces – O(n log n)

Other popular methods are
•   Fast Multipole Method
•   Particle Mesh Method
•   TreePM Method
•   Symplectic Methods
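
A minimal sketch of step 2, computing cell centers of mass bottom-up
over the octree. The Node layout is an illustrative assumption, not
Gadget-2's actual tree structure; leaves are assumed to already carry
their particle's mass and position.

typedef struct Node {
    struct Node *child[8];   /* octants; NULL where empty */
    double       mass;       /* total mass of bodies below this cell */
    double       com[3];     /* center of mass of those bodies */
    int          is_leaf;    /* leaves carry a single particle */
} Node;

/* Post-order walk: a cell's mass and center of mass are the
 * mass-weighted combination of its children. Each node is visited
 * once, hence the O(n) cost quoted above. */
void compute_com(Node *n)
{
    if (n == NULL || n->is_leaf)
        return;
    n->mass = n->com[0] = n->com[1] = n->com[2] = 0.0;
    for (int c = 0; c < 8; c++) {
        Node *ch = n->child[c];
        if (ch == NULL)
            continue;
        compute_com(ch);
        n->mass += ch->mass;
        for (int k = 0; k < 3; k++)
            n->com[k] += ch->mass * ch->com[k];
    }
    if (n->mass > 0.0)
        for (int k = 0; k < 3; k++)
            n->com[k] /= n->mass;
}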
         Sample Application: Gadget-2

• Cosmological simulation code
• Simulates a system of "n" bodies
   – Implements the Barnes-Hut algorithm
• Written in C and parallelized with MPI
• As part of this project, we:
   – Studied the Gadget-2 code and how it is used in production mode
   – Modified the C code to use threads in the Barnes-Hut tree
     algorithm
   – Added performance counters to the code for measuring
     cache utilization
        Presentation Outline

•   Introduction
•   Design & Implementation
•   Performance Evaluation
•   Conclusions and Future Work




Gadget-2 Architecture

[Architecture diagram omitted.]
            Code Analysis

Original code: force calculation and particle export are interleaved
in a single sequential loop (the loop also stops when the export
buffer of size BufferSize fills up).

for ( i = 0 to No. of particles, while n < BufferSize )
{
     calculate_force( i );
     for ( j = 0 to No. of tasks )
     {
          export_particles( j );
     }
}

Modified code: the force calculation is hoisted into its own loop and
parallelized across threads; the export loop remains sequential.

parallel for ( i = 0 to n )
{
     calculate_force( i );
}

for ( i = 0 to No. of particles, while n < BufferSize )
{
     for ( j = 0 to No. of tasks )
     {
          export_particles( j );
     }
}
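
The "parallel for" above maps naturally onto OpenMP. A minimal sketch
follows, in which calculate_force() is a placeholder standing in for
Gadget-2's actual force routine.

/* Hypothetical OpenMP rendering of the modified force loop above */
extern void calculate_force(int i);

void force_phase(int n_particles)
{
    /* Iterations are independent once particle export is split out,
       so each thread can take its own block of particles. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_particles; i++)
        calculate_force(i);
}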
        Presentation Outline

•   Introduction
•   Design & Implementation
•   Performance Evaluation
•   Conclusions and Future Work




          Evaluation Testbed
• Our cluster, called Chenab, consists of nine nodes.
• Each node contains:
   – An Intel Xeon quad-core Kentsfield processor
       • 2.4 GHz with a 1066 MHz FSB
       • 4 MB of L2 cache shared between each pair of cores
       • 32 KB of L1 cache per core
   – 2 GB of main memory
                Performance Evaluation
• Performance evaluation is based on two main
  parameters
   – Execution time
        • Calculated directly from MPI wallclock timings
   – Cache utilization
        • We patched the Linux kernel using the perfctr patch
        • We selected the Performance API (PAPI) for hardware performance
          counting
        • Used PAPI_L2_TCM (total cache misses) and PAPI_L2_TCA (total
          cache accesses) to calculate the cache miss ratio; a sketch of
          the measurement follows this list
• Results are shown on the upcoming slides
   –   Execution time for Colliding Galaxies
   –   Execution time for Cluster Formation
   –   Execution time for Custom Simulation
   –   Cache utilization for Cluster Formation
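
A minimal sketch of the cache measurement, assuming a perfctr-patched
kernel with PAPI installed; run_force_phase() is a placeholder for the
measured code region, not a Gadget-2 symbol, and error handling is
omitted for brevity.

#include <stdio.h>
#include <papi.h>

extern void run_force_phase(void);  /* placeholder for the measured code */

int main(void)
{
    int es = PAPI_NULL;
    long long v[2];                  /* v[0] = misses, v[1] = accesses */

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_L2_TCM); /* total L2 cache misses   */
    PAPI_add_event(es, PAPI_L2_TCA); /* total L2 cache accesses */

    PAPI_start(es);
    run_force_phase();
    PAPI_stop(es, v);

    printf("L2 miss ratio: %.2f%%\n", 100.0 * (double)v[0] / (double)v[1]);
    return 0;
}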
                     Execution Time for Colliding Galaxies

[Chart: Galaxy Simulation Results. Execution time in minutes (0–20)
versus number of processors/cores (4, 8, 16, 20), comparing Gadget-2
(Optimized) against Gadget-2 (Original).]
                     Execution Time for Cluster Formation

[Chart: Cluster Simulation (276,498 particles). Execution time in
seconds (0–2500) versus number of processors (1–8), comparing the
Threaded and Original codes.]
                     Execution Time for Custom Simulation

[Chart: Custom Simulation, 2 million particles. Execution time in
minutes (0–400) versus number of cores (12, 20, 24, 28), comparing
the Threaded and Original codes.]
                 Cache Utilization for Cluster Formation

[Chart: Cluster Simulation (276,498 particles). L1 miss ratio in
percent (0–90) versus number of processors (1, 2, 4, 5), comparing
the Optimized and Original codes.]

Cache utilization has been measured using hardware counters provided
by the kernel patch (perfctr) and the Performance API (PAPI).
        Presentation Outline

•   Introduction
•   Design & Implementation
•   Performance Evaluation
•   Conclusions and Future Work




        Conclusion
• We optimized Gadget-2, which was our
  sample application
  – The "MPI+threads" approach performs better
  – The optimized code offers scalable performance

• We are witnessing dramatic changes in core
  designs for multicore systems
  – Heterogeneous and homogeneous designs
  – Targeting a 1000-core processor will require
    scalable frameworks and tools for programming
               Conclusion
• Towards many-core computing
    – Multicore: 2x cores every 2 years, so today's quad-cores grow
      to roughly 4 × 2^4 = 64 cores in 8 years
    – Manycore: 8x to 16x the core count of multicore

Source: Dave Patterson, Overview of the Parallel Laboratory
        Future Work
• Scalable frameworks that provide programmer-friendly
  high-level constructs are very important
  – PeakStream supports GPU and hybrid CPU+GPU
    programming
  – Cilk++ augments the C++ compiler with three new
    keywords (cilk_for, cilk_sync, cilk_spawn)
  – The Research Accelerator for Multiple Processors (RAMP) can
    be used to simulate a 1000-core processor
  – Gadget-2 can be ported to GPUs using Nvidia's CUDA
    framework
  – The 'xlc' compiler can be used to program the STI Cell processor
                                     The Timeline

ID  Task Name                                  Start      Finish     Duration
1   Literature Review                          1/18/2008  2/28/2008  6w
2   Evaluation of Gadget-2                     2/28/2008  3/26/2008  4w
3   Optimizations in Gadget-2 (prototype 1)    3/26/2008  4/29/2008  5w
4   Testing of prototype 1                     4/29/2008  5/12/2008  2w
5   Optimizations in prototype 1               5/12/2008  5/30/2008  3w
6   Final Version                              5/30/2008  6/12/2008  2w
7   Simulation Snapshots and Results           6/12/2008  6/25/2008  2w
8   Final Documentation and Finishing Tasks    6/25/2008  7/21/2008  3.8w
9   Improvements in Documentation              4/15/2008  7/18/2008  13.8w
Barnes-Hut Tree

[Figure omitted.]

				