					 Component Frameworks:
				Object-Based Parallel Programming

       Laxmikant (Sanjay) Kale
       Parallel Programming Laboratory
    Department of Computer Science
University of Illinois at Urbana-Champaign
         http://charm.cs.uiuc.edu


              PPL-Dept of Computer Science, UIUC
                           Motivation
• Parallel Computing in Science and Engineering
   – Competitive advantage
   – Pain in the neck
   – Necessary evil
• It is not so difficult
   – But it is tedious and error-prone
   – New issues: race conditions, load imbalances, modularity in the
     presence of concurrency, ...
   – Just have to bite the bullet, right?



                       PPL-Dept of Computer Science, UIUC
                       But wait…
• Parallel computation structures
   – The set of parallel applications is diverse and complex
   – Yet, the underlying parallel data structures and
     communication structures are small in number
      • Structured and unstructured grids, trees (AMR,..),
        particles, interactions between these, space-time
• One should be able to reuse those
   – Avoid doing the same parallel programming again and
     again
   – Domain specific frameworks

                     PPL-Dept of Computer Science, UIUC
                 A Unique Twist
• Many apps require dynamic load balancing
  – Reuse load re-balancing strategies
     • It should be possible to separate load balancing code
       from application code
• This strategy is embodied in Charm++
  – Express the program as a collection of interacting
    entities (objects).
  – Let the system control mapping to processors




                    PPL-Dept of Computer Science, UIUC
        Multi-partition decomposition
• Idea: divide the computation into a large number
  of pieces
   – Independent of the number of processors
   – Typically larger than the number of processors
   – Let the system map the pieces to processors (see the sketch below)
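
As a rough illustration of this idea (a sketch only, not the Charm++ API), the
C++ fragment below over-decomposes a problem into many more chunks than
processors and lets a trivial runtime-style mapping, here just round-robin,
decide where each chunk lives. The Chunk type, the chunk count, and the
mapping are all hypothetical.

    #include <cstdio>
    #include <vector>

    // Hypothetical "chunk": one piece of the decomposed computation.
    struct Chunk {
        int id;
        void step() { /* local computation for this piece */ }
    };

    int main() {
        const int numProcs  = 8;    // physical processors (assumed)
        const int numChunks = 128;  // many more pieces than processors

        // The "system" (here a trivial round-robin map) assigns chunks to
        // processors; the application never hard-codes this mapping.
        std::vector<std::vector<Chunk>> perProc(numProcs);
        for (int c = 0; c < numChunks; ++c)
            perProc[c % numProcs].push_back(Chunk{c});

        // Each processor iterates over whatever chunks it currently owns; a
        // real runtime (e.g. Charm++) can migrate chunks between processors
        // to rebalance load without changing this loop.
        for (int p = 0; p < numProcs; ++p) {
            for (Chunk &ch : perProc[p]) ch.step();
            std::printf("proc %d owns %zu chunks\n", p, perProc[p].size());
        }
        return 0;
    }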




                     PPL-Dept of Computer Science, UIUC
          Object-based Parallelization
• User view: the user is concerned only with the interaction between objects
• System implementation: the runtime maps the objects onto processors

[Figure: the same set of interacting objects, shown from the user's view and
 as mapped onto processors by the system]

                    PPL-Dept of Computer Science, UIUC
             Charm Component Frameworks

• Object-based decomposition enables:
   – Load balancing
   – Automatic checkpointing
   – Flexible use of clusters
   – Out-of-core execution
• Component frameworks build on it to enable reuse of specialized
  parallel structures


                           PPL-Dept of Computer Science, UIUC
            Goals for Our Frameworks
• Ease of use:
   –   C++ and Fortran versions
   –   Retain “look-and-feel” of sequential programs
   –   Provide commonly needed features
   –   Application-driven development
   –   Portability
• Performance:
   – Low overhead
   – Dynamic load balancing via Charm++
   – Cache performance

                      PPL-Dept of Computer Science, UIUC
 Current Set of Component Frameworks
• FEM / unstructured meshes:
   – “Mature”, with several applications already
• Multiblock: multiple structured grids
   – New, but very promising
• AMR:
   – Oct-trees and quad-trees




                     PPL-Dept of Computer Science, UIUC
      Using the Load Balancing Framework

[Architecture diagram: two paths onto a common runtime stack]
  • Framework path: the FEM and Structured (multiblock) frameworks, with
    cross-module interpolation between them
  • Migration path: MPI-on-Charm (automatic conversion from MPI, Irecv+)
  • Both paths sit on the load database + balancer, which runs on Charm++,
    which in turn runs on Converse
                      PPL-Dept of Computer Science, UIUC
       Finite Element Framework Goals
 • Hide parallel implementation in the runtime system
 • Allow adaptive parallel computation and dynamic
   automatic load balancing
 • Leave physics and numerics to user
 • Present clean, “almost serial” interface:

  Framework code (for one mesh partition):
    begin time loop
      compute forces
      communicate shared nodes
      update node positions
    end time loop

  Serial code (for the entire mesh):
    begin time loop
      compute forces
      update node positions
    end time loop
                       PPL-Dept of Computer Science, UIUC
FEM Framework: Responsibilities



  FEM Application
    (initialize, registration of nodal attributes, loops over elements, finalize)

  FEM Framework
    (update of nodal properties, reductions over nodes or partitions)
    Partitioner (METIS)  |  Combiner  |  I/O

  Charm++
    (dynamic load balancing, communication)

                     PPL-Dept of Computer Science, UIUC
         Structure of an FEM Program
• Serial init() and finalize() subroutines
   – Do serial I/O, read serial mesh and call FEM_Set_Mesh
• Parallel driver() main routine:
   – One driver per partitioned mesh chunk
   – Runs in a thread: time-loop looks like serial version
   – Does computation and calls FEM_Update_Field
• Framework handles partitioning, parallelization, and
  communication




                         PPL-Dept of Computer Science, UIUC
Structure of an FEM Application

[Diagram: init() runs first; then one driver() per mesh chunk runs
 concurrently, exchanging shared-node updates with its neighbors;
 finalize() runs at the end]

             PPL-Dept of Computer Science, UIUC
                        Framework Calls
• FEM_Set_Mesh
    – Called from initialization to set the serial mesh
    – Framework partitions mesh into chunks
• FEM_Create_Field
    – Registers a node data field with the framework, supports user data types
• FEM_Update_Field
    – Updates node data field across all processors
    – Handles all parallel communication
• Other parallel calls (Reductions, etc.)
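
To make the call structure of the two preceding slides concrete, here is a
schematic C++ driver. The FEM_* functions below are simplified, stubbed
stand-ins for the framework calls named above; their real argument lists
differ, so treat every signature here as an assumption.

    #include <cstdio>
    #include <vector>

    // Stand-ins for the framework calls named on this slide. The signatures
    // are simplified assumptions, stubbed out so the sketch compiles and
    // runs without the framework itself.
    void FEM_Set_Mesh()     { std::puts("register the serial mesh"); }
    int  FEM_Create_Field() { std::puts("register a nodal field"); return 1; }
    void FEM_Update_Field(int /*fid*/, double * /*nodeData*/)
                            { std::puts("exchange shared-node values"); }

    std::vector<double> nodeTemp(100, 0.0);   // one nodal field (hypothetical)

    void init() {       // serial: read the mesh and hand it to the framework
        FEM_Set_Mesh();
    }

    void driver() {     // parallel: one instance per mesh chunk, in its own thread
        int fid = FEM_Create_Field();
        for (int step = 0; step < 3; ++step) {
            // ... compute forces, update nodal values locally ...
            FEM_Update_Field(fid, nodeTemp.data());  // framework handles communication
            // ... update node positions ...
        }
    }

    void finalize() {   // serial: write out results
        std::puts("done");
    }

    int main() { init(); driver(); finalize(); return 0; }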




                            PPL-Dept of Computer Science, UIUC
                    Dendritic Growth
• Studies evolution of
  solidification
  microstructures using a
  phase-field model
  computed on an adaptive
  finite element grid
• Adaptive refinement and
  coarsening of grid involves
  re-partitioning




                         PPL-Dept of Computer Science, UIUC
                    Crack Propagation
• Explicit FEM code
• Zero-volume cohesive elements inserted near the crack
• As the crack propagates, more cohesive elements are added near the crack,
  which leads to severe load imbalance

[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right);
 the middle area contains cohesive elements. Pictures: S. Breitenfeld and
 P. Geubelle]
[Chart: number of iterations per second vs. iteration number]

                         PPL-Dept of Computer Science, UIUC
            Crack Propagation




[Figure] Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE
(right). The middle area contains cohesive elements. Both decompositions were
obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle
                  PPL-Dept of Computer Science, UIUC
“Overhead” of Multipartitioning


[Chart: time (seconds) per iteration vs. number of chunks per processor,
 for 1 to 2048 chunks per processor]


                                              PPL-Dept of Computer Science, UIUC
                                            Load balancer in action

[Chart: "Automatic Load Balancing in Crack Propagation" -- number of
 iterations per second vs. iteration number (1 to 91). Annotations mark the
 sequence: 1. elements added, 2. load balancer invoked, 3. chunks migrated]

                                                          PPL-Dept of Computer Science, UIUC
Scalability of FEM Framework

[Chart: "Speedup of Crack Propagation" -- actual vs. ideal speedup on
 1 to 32 processors]

                       PPL-Dept of Computer Science, UIUC
                            Scalability of FEM Framework
[Chart: time per step (s) vs. number of processors (1 to 1000+, log-log) for a
 3.1M-element, 1.5M-node mesh on ASCI Red.
 1 processor: 8.24 s per step; 1024 processors: 7.13 ms per step]
                                          PPL-Dept of Computer Science, UIUC
    Parallel Collision Detection
• Detect collisions (intersections) between
  objects scattered across processors




• Approach (based on Charm++ arrays):
   – Overlay a regular, sparse 3D grid of voxels (boxes)
   – Send objects to all voxels they touch
   – Collide voxels independently and collect the results
• Collision response is left to user code (see the sketch below)
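
A minimal serial sketch of the voxel idea, not the framework's distributed
implementation: objects are binned into the sparse set of voxels they touch,
and each voxel is then tested independently, which is what makes voxels
natural units for parallelism. The Box type, the voxel size, and the overlap
test are illustrative assumptions.

    #include <cstdio>
    #include <map>
    #include <tuple>
    #include <vector>

    // Hypothetical axis-aligned box standing in for an object's bounding volume.
    struct Box { double lo[3], hi[3]; int id; };

    bool overlap(const Box &a, const Box &b) {
        for (int d = 0; d < 3; ++d)
            if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
        return true;
    }

    int main() {
        const double voxel = 1.0;                       // assumed voxel edge length
        std::vector<Box> boxes = {
            {{0.1,0.1,0.1},{0.4,0.4,0.4},0},
            {{0.3,0.3,0.3},{0.6,0.6,0.6},1},
            {{2.0,2.0,2.0},{2.2,2.2,2.2},2},
        };

        // Sparse voxel grid: only voxels that some object touches get an entry.
        std::map<std::tuple<int,int,int>, std::vector<int>> grid;
        for (const Box &b : boxes)
            for (int x = int(b.lo[0]/voxel); x <= int(b.hi[0]/voxel); ++x)
                for (int y = int(b.lo[1]/voxel); y <= int(b.hi[1]/voxel); ++y)
                    for (int z = int(b.lo[2]/voxel); z <= int(b.hi[2]/voxel); ++z)
                        grid[std::make_tuple(x,y,z)].push_back(b.id);

        // Each voxel can be collided independently (and hence in parallel);
        // here it is done serially, printing the intersecting pairs.
        for (const auto &cell : grid) {
            const std::vector<int> &ids = cell.second;
            for (size_t i = 0; i < ids.size(); ++i)
                for (size_t j = i + 1; j < ids.size(); ++j)
                    if (overlap(boxes[ids[i]], boxes[ids[j]]))
                        std::printf("collision: %d - %d\n", ids[i], ids[j]);
        }
        return 0;
    }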
                 PPL-Dept of Computer Science, UIUC
         Collision Detection Speed
• O(n) serial performance
   – Single Linux PC: about 2 us per polygon, serially
• Good speedups to 1000s of processors
   – ASCI Red, with a scaling problem of 65,000 polygons per processor
     (up to 100 million polygons in total)

[Charts: serial performance, and parallel scaling on ASCI Red]

                 PPL-Dept of Computer Science, UIUC
                FEM: Future Plans
• Better support for implicit computations
   – Interfaces to solvers: e.g., ESI (PETSc), ScaLAPACK, or POOMA's
     linear solvers

• Better discontinuous Galerkin method support
• Fully distributed startup
• Fully distributed insertion
   – Eliminate serial bottleneck in insertion phase
• Abstraction to allow multiple active meshes
   – Needed for multigrid methods



                     PPL-Dept of Computer Science, UIUC
              Multiblock framework
• For collections of structured grids
   – Older versions:
      • (Gengbin Zheng, 1999-2000)
   – Recent, completely new version:
      • Motivated by ROCFLO
   – Like FEM:
      • User writes driver subroutines that deal with the life-
        cycle of a single chunk of the grid
      • Ghost arrays managed by the framework
          – Based on registration of data by the user program
      • Support for “connecting up” multiple blocks
          – makemblock processes geometry info
                       PPL-Dept of Computer Science, UIUC
Multiblock Constituents




     PPL-Dept of Computer Science, UIUC
Terminology




PPL-Dept of Computer Science, UIUC
                Multiblock structure
• Steps:
   – Feed geometry information to makemblock
      • Input: top level blocks, number of partitions desired
      • Output: block file containing list of partitions, and
        communication structure
   – Run parallel application
      • Reads the block file
      • Initialization of data
• Manual and info:
   – http://charm.cs.uiuc.edu/ppl_research/mblock/


                     PPL-Dept of Computer Science, UIUC
   Multiblock code example: main loop
do tStep=1,nSteps

  call MBLK_Apply_bc_All(grid, size, err)
  call MBLK_Update_field(fid,ghostWidth,grid,err)

  do k=sk,ek
     do j=sj,ej
        do i=si,ei
          ! Only relax along the I and J directions -- not K
          newGrid(i,j,k)=cenWeight*grid(i,j,k) &
            +neighWeight*(grid(i+1,j,k)+grid(i,j+1,k) &
            +grid(i-1,j,k)+grid(i,j-1,k))
        end do
     end do
  end do

end do   ! time-step loop




                         PPL-Dept of Computer Science, UIUC
                     Multiblock Driver
subroutine driver()
implicit none
include 'mblockf.h'
…
 call MBLK_Get_myblock(blockNo,err)
 call MBLK_Get_blocksize(size,err)
...
call MBLK_Create_field(&
     &size,1, MBLK_DOUBLE,1,&
     &offsetof(grid(1,1,1),grid(si,sj,sk)),&
     &offsetof(grid(1,1,1),grid(2,1,1)),fid,err)

! Register boundary condition functions
  call MBLK_Register_bc(0,ghostWidth, BC_imposed, err)
  … Time Loop
end

                          PPL-Dept of Computer Science, UIUC
            Multiblock: Future work
• Support other stencils
   – Currently diagonal elements are not used
• Applications
   – We need volunteers!
   – We will write demo apps ourselves




                    PPL-Dept of Computer Science, UIUC
           Adaptive Mesh Refinement
• Used in various engineering applications where
  there are regions of greater interest
   –   e.g. http://www.damtp.cam.ac.uk/user/sdh20/amr/amr.html
   –   Global Atmospheric modeling
   –   Numerical Cosmology
   –   Hyperbolic partial differential equations (M.J. Berger and
       J. Oliger)
• Problems with uniformly refined meshes for the above:
   – If the grid is too fine-grained, resources are wasted
   – If the grid is too coarse, the results are not accurate


                       PPL-Dept of Computer Science, UIUC
                 AMR Library
• Implements a distributed grid that can be dynamically adapted at runtime
• Uses the arbitrary bit indexing of chare arrays
• Requires synchronization only before refinement
  or coarsening
• Interoperability because of Charm++
• Uses the dynamic load balancing capability of the
  chare arrays



                  PPL-Dept of Computer Science, UIUC
                                    Indexing of array elements
• Question: who are my neighbors?
• Case of a 2D mesh (4x4); each cell is labeled (x, y, #bits)

[Quadtree diagram -- legend: node/root, leaf, virtual leaf]
   root:     0,0,0
   level 1:  0,0,2   0,1,2   1,0,2   1,1,2
   level 2:  0,0,4   0,1,4   1,0,4   1,1,4   0,2,4   0,3,4   1,2,4   1,3,4

                                                           PPL-Dept of Computer Science, UIUC
    Indexing of array elements (contd.)
• Mathematically (for 2D):
     if the parent is (x, y) using n bits, then
     child 1 – (2x,   2y)    using n+2 bits
     child 2 – (2x,   2y+1)  using n+2 bits
     child 3 – (2x+1, 2y)    using n+2 bits
     child 4 – (2x+1, 2y+1)  using n+2 bits
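
A small C++ sketch of this child-index rule; the Cell struct and helper are
illustrative only, not the AMR library's API.

    #include <array>
    #include <cstdio>

    // Illustrative index: (x, y) coordinates plus the number of bits used,
    // matching the (x, y, #bits) labels on the previous slide.
    struct Cell { unsigned x, y, bits; };

    // The four children of a 2D cell: coordinates double (one extra bit per
    // dimension), so the combined index grows by 2 bits.
    std::array<Cell, 4> children(const Cell &p) {
        return {{ {2*p.x,     2*p.y,     p.bits + 2},
                  {2*p.x,     2*p.y + 1, p.bits + 2},
                  {2*p.x + 1, 2*p.y,     p.bits + 2},
                  {2*p.x + 1, 2*p.y + 1, p.bits + 2} }};
    }

    int main() {
        Cell root{0, 0, 0};
        for (const Cell &c : children(children(root)[3]))  // refine child (1,1,2)
            std::printf("%u,%u,%u\n", c.x, c.y, c.bits);   // prints 2,2,4 ... 3,3,4
        return 0;
    }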




                  PPL-Dept of Computer Science, UIUC
          Pictorially




[Diagram: the 4x4 mesh with each cell labeled by its (x, y, #bits) index,
 e.g. 0,0,4]




        PPL-Dept of Computer Science, UIUC
        Communication with Nbors
• In dimension x, the two neighbors are obtained as
      - neighbor: x-1, provided x is not 0
      + neighbor: x+1, provided x is not at the high edge (2^n - 1)
• In dimension y, the two neighbors are obtained as
      - neighbor: y-1, provided y is not 0
      + neighbor: y+1, provided y is not at the high edge (2^n - 1)
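
Continuing the illustrative Cell sketch from the previous slide (again, not
the library's API), a same-level neighbor is just +/-1 in one coordinate,
guarded by the grid edges. Assuming half of the bits index each dimension, the
example reproduces the +x neighbor of 1,3,4 from the next slide.

    #include <cstdio>
    #include <optional>

    struct Cell { unsigned x, y, bits; };   // same illustrative index as before

    // Neighbor in the x direction at the same refinement level; dir is -1 or +1.
    // With bits/2 bits per dimension, coordinates run from 0 to 2^(bits/2)-1.
    std::optional<Cell> xNeighbor(const Cell &c, int dir) {
        unsigned maxCoord = (1u << (c.bits / 2)) - 1;        // high edge of the grid
        if (dir < 0 && c.x == 0)        return std::nullopt; // no -neighbor at low edge
        if (dir > 0 && c.x == maxCoord) return std::nullopt; // no +neighbor at high edge
        return Cell{(unsigned)((int)c.x + dir), c.y, c.bits};
    }

    int main() {
        Cell c{1, 3, 4};                    // the cell 1,3,4 from the example slides
        if (auto n = xNeighbor(c, +1))
            std::printf("+x neighbor: %u,%u,%u\n", n->x, n->y, n->bits);  // 2,3,4
        return 0;
    }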




                  PPL-Dept of Computer Science, UIUC
Examples (using the tree from the earlier indexing slide):
   • Case 1: neighbors of 1,1,2 -- y dimension, -neighbor: 1,0,2
   • Case 2: neighbors of 1,1,2 -- x dimension, -neighbor: 0,1,2
   • Case 3: neighbors of 1,3,4 -- x dimension, +neighbor: 2,3,4
             neighbor of  1,2,4 -- x dimension, +neighbor: 2,2,4

[Quadtree diagram repeated: root 0,0,0; level 1: 0,0,2  0,1,2  1,0,2  1,1,2;
 level 2: 0,0,4  0,1,4  1,0,4  1,1,4  0,2,4  0,3,4  1,2,4  1,3,4]

                                                           PPL-Dept of Computer Science, UIUC
           Communication (contd.)
• Assumption: the level of refinement of adjacent cells differs by at most
  one (a requirement of the indexing scheme used)
• Indexing scheme is similar for 1D and 3D cases




                  PPL-Dept of Computer Science, UIUC
                      AMR Interface
• Library Tasks
      - Creation of Tree
      - Creation of Data at cells
      - Communication between cells
      - Calling the appropriate user routines in each iteration
      - Refining: refine based on criteria specified by the user

• User Tasks
      - Writing the user data structure to be kept by each cell
      - Fragmenting + Combining of data for the Neighbors
      - Fragmenting of the data of the cell for refine
      - Writing the sequential computation code at each cell



                           PPL-Dept of Computer Science, UIUC
                          Some Related Work
• PARAMESH, Peter MacNeice et al.
    http://sdcd.gsfc.nasa.gov/RIB/repositories/inhouse_gsfc/Users_manual/amr.htm
     -   Implemented in Fortran 90
     -   Supported on the Cray T3E and SGIs
•   "Parallel Algorithms for Adaptive Mesh Refinement", Mark T. Jones and
    Paul E. Plassmann, SIAM J. on Scientific Computing, 18 (1997), pp. 686-708
    (also MCS Preprint P421-0394)
      http://www-unix.mcs.anl.gov/sumaa3d/Papers/papers.html
• DAGH: Dynamic Adaptive Grid Hierarchies
      – By Manish Parashar and James C. Browne
      – In C++ using MPI




                                    PPL-Dept of Computer Science, UIUC
                      Future work
• Specialized version for structured grids
   – Integration with multiblock
• Fortran interface
   – Current version is C++ only
      • unlike FEM and Multiblock frameworks, which
        support Fortran 90
   – Relatively easy to do




                     PPL-Dept of Computer Science, UIUC
                       Summary
• Frameworks are ripe for use
   – Well tested in some cases
• Questions and answers:
   – MPI libraries?
   – Performance issues?
• Future plans:
   – Provide all features of Charm++




                    PPL-Dept of Computer Science, UIUC
                           Charm++

• Parallel C++ with Data Driven Objects
• Object Arrays/ Object Collections
• Object Groups:
    – Global object with a “representative” on each PE
•   Asynchronous method invocation
•   Prioritized scheduling
•   Mature, robust, portable
•   http://charm.cs.uiuc.edu
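
For readers who have not seen Charm++, here is a minimal sketch of a 1D chare
array with an asynchronous method invocation. It follows the usual pattern of
an interface (.ci) file plus C++ classes, but treat the details as
approximate; the Charm++ manual is the authoritative reference for the syntax.

    // hello.ci -- Charm++ interface file
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Hello {
        entry Hello();
        entry void sayHi(int from);
      };
    }

    // hello.C
    #include "hello.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        CProxy_Hello arr = CProxy_Hello::ckNew(8);  // create 8 array elements
        arr.sayHi(0);   // asynchronous broadcast: the call returns immediately
      }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}   // allows elements to migrate for load balance
      void sayHi(int from) {
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
        if (thisIndex == 7) CkExit(); // crude termination, good enough for a sketch
      }
    };

    #include "hello.def.h"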


                        PPL-Dept of Computer Science, UIUC
     Data driven execution




[Diagram: each processor runs a scheduler that picks the next message from
 its message queue and invokes the corresponding method on the target object]

            PPL-Dept of Computer Science, UIUC
          Load Balancing Framework
• Based on object migration and measurement of load
  information
• Partition problem more finely than the number of available
  processors
• Partitions implemented as objects (or threads) and mapped
  to available processors by LB framework
• Runtime system measures actual computation times of
  every partition, as well as communication patterns
• Variety of "plug-in" LB strategies available (one simple strategy is
  sketched below)
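
For flavor, here is a sketch of what a strategy can do with the measured
per-object loads: the classic greedy "heaviest object onto the least-loaded
processor" heuristic. It is an illustration, not one of the framework's actual
plug-ins, and the load numbers are made up.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    // Measured cost of one migratable object (numbers here are invented).
    struct ObjLoad { int obj; double seconds; };

    // Greedy mapping: place objects, heaviest first, onto the currently
    // least-loaded processor. Returns the PE assignment indexed by object id.
    std::vector<int> greedyMap(std::vector<ObjLoad> loads, int numPEs) {
        std::sort(loads.begin(), loads.end(),
                  [](const ObjLoad &a, const ObjLoad &b) { return a.seconds > b.seconds; });

        using PE = std::pair<double, int>;                  // (accumulated load, pe)
        std::priority_queue<PE, std::vector<PE>, std::greater<PE>> heap;
        for (int p = 0; p < numPEs; ++p) heap.push({0.0, p});

        std::vector<int> peOf(loads.size());
        for (const ObjLoad &o : loads) {
            PE lightest = heap.top(); heap.pop();
            peOf[o.obj] = lightest.second;
            lightest.first += o.seconds;
            heap.push(lightest);
        }
        return peOf;
    }

    int main() {
        std::vector<ObjLoad> loads = {{0, 3.0}, {1, 1.0}, {2, 2.5}, {3, 0.5}, {4, 2.0}};
        std::vector<int> peOf = greedyMap(loads, 2);
        for (size_t o = 0; o < peOf.size(); ++o)
            std::printf("object %zu -> pe %d\n", o, peOf[o]);
        return 0;
    }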




                     PPL-Dept of Computer Science, UIUC
Load Balancing Framework




      PPL-Dept of Computer Science, UIUC
  Building on Object-based Parallelism
• Application-induced load imbalances
• Environment-induced performance issues:
  –   Dealing with extraneous loads on shared machines
  –   Vacating workstations
  –   Automatic checkpointing
  –   Automatic prefetching for out-of-core execution
  –   Heterogeneous clusters
• Reuse: object-based components
• But: you must use Charm++!


                     PPL-Dept of Computer Science, UIUC
                                  AMPI: Goals
• Runtime adaptivity for MPI programs
      – Based on multi-domain decomposition and dynamic load balancing features of
        Charm++
      – Minimal changes to the original MPI code
      – Full MPI 1.1 standard compliance
      – Additional support for coupled codes
      – Automatic conversion of existing MPI programs


[Diagram: AMPIzer translates the original MPI code into AMPI code, which then
 runs on the AMPI runtime]
                                   PPL-Dept of Computer Science, UIUC
                   Adaptive MPI
• A bridge between legacy MPI codes and dynamic load
  balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Based on Charm++ object arrays and Converse’s
  migratable threads
• Minimal modification needed to convert existing MPI
  programs (to be automated in future)
• Bindings for C, C++, and Fortran90
• Currently supports most of the MPI 1.1 standard
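
To illustrate "minimal modification", the fragment below is ordinary MPI code
of the kind AMPI is meant to run unchanged; under AMPI each rank becomes a
migratable user-level thread, so many more virtual ranks than physical
processors can be launched. The program itself is just a toy ring exchange.

    #include <cstdio>
    #include <mpi.h>

    // An ordinary MPI ring exchange: nothing AMPI-specific in the source.
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int token = rank;
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        // Pass a token around the ring. Under AMPI, ranks are migratable
        // threads, so the runtime may move them between processors for load
        // balance without any change to this code.
        int received;
        MPI_Status status;
        MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
                     &received, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, &status);

        std::printf("rank %d of %d received token %d\n", rank, size, received);
        MPI_Finalize();
        return 0;
    }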




                    PPL-Dept of Computer Science, UIUC
                           AMPI Features
• Over 70 common MPI routines
    – C, C++, and Fortran 90 bindings
    – Tested on the IBM SP, SGI Origin 2000, and Linux clusters
• Automatic conversion: AMPIzer
    – Based on the Polaris front-end
    – Source-to-source translator for converting MPI programs to AMPI
    – Generates supporting code for migration
• Very low "overhead" compared with native MPI

[Charts: execution time of AMPI vs. native MPI, and percent overhead, on
 1 to 128 processors]

                              PPL-Dept of Computer Science, UIUC
                      AMPI Extensions
• Integration of multiple MPI-based modules
    – Example: integrated rocket simulation
       • ROCFLO, ROCSOLID, ROCBURN, ROCFACE
• Each module gets its own MPI_COMM_WORLD
    – All COMM_WORLDs form MPI_COMM_UNIVERSE
• Point-to-point communication among different
  MPI_COMM_WORLDs using the same AMPI functions
• Communication across modules also considered for balancing load
• Automatic checkpoint-and-restart
    – Can restart on a different number of processors
    – The number of virtual processors remains the same, but they can be
      mapped to a different number of physical processors




                           PPL-Dept of Computer Science, UIUC
[Diagram: software stack -- Charm++ layered on top of Converse]

PPL-Dept of Computer Science, UIUC
  Application Areas and Collaborations
• Molecular Dynamics:
  – Simulation of biomolecules
  – Material properties and electronic structures
• CSE applications:
  – Rocket Simulation
  – Industrial process simulation
  – Cosmology visualizer
• Combinatorial Search:
  – State space search, game tree search, optimization



                    PPL-Dept of Computer Science, UIUC
              Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
   – Calculate forces on each atom
      • Bonded interactions
      • Non-bonded: electrostatic and van der Waals
   – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
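
A toy version of such a time step, using a simple explicit integrator with a
placeholder force; it is not NAMD's actual force field or integrator.

    #include <cstdio>
    #include <vector>

    struct Atom { double pos[3], vel[3], force[3], mass; };

    // Placeholder force evaluation: a real MD code computes bonded terms plus
    // non-bonded electrostatic and van der Waals interactions here.
    void computeForces(std::vector<Atom> &atoms) {
        for (Atom &a : atoms)
            for (int d = 0; d < 3; ++d) a.force[d] = -0.1 * a.pos[d];  // toy spring force
    }

    int main() {
        const double dt = 1.0e-15;                 // 1 femtosecond time step
        std::vector<Atom> atoms(1000, Atom{{1,0,0},{0,0,0},{0,0,0},1.0});

        for (int step = 0; step < 10; ++step) {    // real runs need millions of steps
            computeForces(atoms);
            for (Atom &a : atoms)
                for (int d = 0; d < 3; ++d) {      // advance velocities, then positions
                    a.vel[d] += dt * a.force[d] / a.mass;
                    a.pos[d] += dt * a.vel[d];
                }
        }
        std::printf("x of atom 0 after 10 steps: %g\n", atoms[0].pos[0]);
        return 0;
    }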


                    PPL-Dept of Computer Science, UIUC
BC1 complex: 200k atoms




      PPL-Dept of Computer Science, UIUC
                     Performance Data: SC2000

[Chart: "Speedup on ASCI Red: BC1 (200k atoms)" -- speedup vs. number of
 processors (axis from 0 to 2500 processors, speedup up to 1400)]

                                PPL-Dept of Computer Science, UIUC
                    Rocket Simulation
• Our Approach:
   – Multi-partition decomposition
   – Data-driven objects
     (Charm++)
   – Automatic load balancing
     framework
• AMPI: Migration path for
  existing MPI+Fortran90 codes
   – ROCFLO, ROCSOLID, and
     ROCFACE




                          PPL-Dept of Computer Science, UIUC
        Timeshared parallel machines
• How to use parallel machines effectively?
• Need resource management
   – Shrink and expand individual jobs to available sets of
     processors
   – Example: Machine with 100 processors
      • Job1 arrives, can use 20-150 processors
      • Assign 100 processors to it
      • Job2 arrives, can use 30-70 processors,
          – and will pay more if we meet its deadline
• We can do this with migratable objects!

                       PPL-Dept of Computer Science, UIUC
    Faucets: Multiple Parallel Machines
• The Faucets client submits a request, with a QoS contract:
   – CPU seconds, min-max CPUs, deadline, interactive?
• Parallel machines submit bids:
   – A job for 100 CPU hours may get a lower price bid if:
      • it has a less tight deadline,
      • or a more flexible PE range
   – A job that requires 15 CPU minutes with a deadline of 1 minute:
      • will generate a variety of bids
      • a machine with idle time on its hands can make a low bid
        (a toy selection rule is sketched below)
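
A sketch of how a central server might pick among such bids. The contract
fields follow the slide, but the bid structure and the selection rule are
illustrative assumptions, not the Faucets implementation.

    #include <cstdio>
    #include <vector>

    // QoS contract fields named on the slide (the values below are made up).
    struct Contract { double cpuSeconds; int minPE, maxPE; double deadline; bool interactive; };

    // Hypothetical bid from one parallel machine.
    struct Bid { const char *machine; double price; double finishTime; };

    // Pick the cheapest bid that still meets the job's deadline.
    const Bid *chooseBid(const Contract &job, const std::vector<Bid> &bids) {
        const Bid *best = nullptr;
        for (const Bid &b : bids)
            if (b.finishTime <= job.deadline && (!best || b.price < best->price))
                best = &b;
        return best;
    }

    int main() {
        Contract job{15 * 60.0, 8, 64, 60.0, true};   // 15 CPU-minutes, 1-minute deadline
        std::vector<Bid> bids = {
            {"cluster-a", 5.0, 55.0},   // idle machine: low price, meets deadline
            {"cluster-b", 2.0, 90.0},   // cheaper, but misses the deadline
        };
        if (const Bid *b = chooseBid(job, bids))
            std::printf("job goes to %s at price %.1f\n", b->machine, b->price);
        return 0;
    }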




                        PPL-Dept of Computer Science, UIUC
            Faucets QoS and Architecture
• User specifies desired job parameters such as:
   – min PE, max PE, estimated CPU-seconds, priority, etc.
• User does not specify a machine
• Planned: integration with Globus

[Diagram: the Faucets client and a web browser talk to a central server, which
 dispatches jobs to several workstation clusters]

                      PPL-Dept of Computer Science, UIUC
       How to make all of this work?
• The key: fine-grained resource management
  model
  – Work units are objects and threads
     • rather than processes
  – Data units are object data, thread stacks, ..
     • Rather than pages
  – Work/Data units can be migrated automatically
     • during a run




                   PPL-Dept of Computer Science, UIUC
Time-Shared Parallel Machines




         PPL-Dept of Computer Science, UIUC
  Appspector: Web-based Monitoring and
      Steering of Parallel Programs
• Parallel Jobs submitted via a server
   – Server maintains database of running programs
   – Charm++ client-server interface
      • Allows one to inject messages into a running application
• From any web browser:
   –   You can attach to a job (if authenticated)
   –   Monitor performance
   –   Monitor behavior
   –   Interact and steer job (send commands)


                         PPL-Dept of Computer Science, UIUC
                                            BioCoRE
• Goal: provide a web-based way to virtually bring scientists together
   – Project based
   – Workbench for modeling
   – Conferences / chat rooms
   – Lab notebook
   – Joint document preparation
• http://www.ks.uiuc.edu/Research/biocore/

                                       PPL-Dept of Computer Science, UIUC
               Some New Projects
• Load Balancing for really large machines:
   – 30k-128k processors
• Million-processor Petaflops class machines
   – Emulation for software development
   – Simulation for Performance Prediction
• Operations Research
   – Combinatorial optimization
• Parallel Discrete Event Simulation



                    PPL-Dept of Computer Science, UIUC
                          Summary
• Exciting times for parallel computing ahead
• We are preparing an object based infrastructure
   – To exploit future apps on future machines
• Charm++, AMPI, automatic load balancing
• Application-oriented research that produces enabling
  CS technology
• Rich set of collaborations
• More information: http://charm.cs.uiuc.edu



                       PPL-Dept of Computer Science, UIUC

				