MPICH2 – A High-Performance and Widely Portable Open-Source MPI Implementation




Darius Buntinas
Argonne National Laboratory
Overview


 MPICH2
   – High-performance
   – Open-source
   – Widely portable
 MPICH2-based implementations
   – IBM for BG/L and BG/P
   – Cray for XT3/4
   – Intel
   – Microsoft
   – SiCortex
   – Myricom
   – Ohio State
Outline


 Architectural overview
 Nemesis – a new communication subsystem
 New features and optimizations
   – Intranode communication
   – Optimizing non-contiguous messages
   – Optimizing large messages
 Current work in progress
   – Optimizations
   – Multi-threaded environments
   – Process manager
   – Other optimizations
 Libraries and tools
Traditional MPICH2 Developer APIs


 Two APIs for porting MPICH2 to new communication architectures
   – ADI3
   – CH3
 ADI3 – Implement a new device
   – Richer interface
      • ~60 functions
   – More work to port
   – More flexibility
 CH3 – Implement a new CH3 channel
   – Simpler interface
      • ~15 functions
   – Easier to port
   – Less flexibility
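A hedged illustration of what a porting interface of this size looks like in practice: a channel can be summarized as a small table of entry points that the device calls. The names and signatures below are invented for this sketch; they are not the actual ADI3 or CH3 functions.

#include <stddef.h>

/* Illustrative only: a made-up "channel" vtable conveying the flavor of the
 * smaller (~15-function) CH3-style porting interface.  The real CH3 and ADI3
 * interfaces use different names and signatures. */
struct channel_ops {
    int (*init)(int *argc, char ***argv);        /* bring up the channel      */
    int (*finalize)(void);                       /* tear it down              */
    int (*connect)(int rank);                    /* connect to a peer process */
    int (*isend)(int rank, const void *buf,      /* start a nonblocking send  */
                 size_t len, void **request);
    int (*progress)(int blocking);               /* advance outstanding comm. */
};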
[Architecture diagram: the application calls the MPI interface; the MPI layer sits on the ADI3 interface, which is implemented by devices such as CH3 and vendor devices (BG, Cray, MX, ...); CH3 channels (Nemesis, Sock, SSHM, SHM, SCTP, ...) plug in below the CH3 interface; Nemesis network modules (TCP, IB/iWARP, PSM, MX, GM) plug in below the Nemesis netmod interface; process managers (MPD, SMPD, Gforker) attach through the PMI interface; ROMIO provides MPI-IO over the ADIO interface (PVFS, GPFS, XFS, ...); PMPI, MPE, and Jumpshot provide profiling and visualization.]

 Support for high-speed networks
   – 10-Gigabit Ethernet (iWARP), QLogic PSM, InfiniBand, Myrinet (MX and GM)
 Supports proprietary platforms
   – BlueGene/L, BlueGene/P, SiCortex, Cray
 Distribution includes the ROMIO MPI-IO library
 Profiling and visualization tools (MPE, Jumpshot)
Nemesis


 Nemesis is a new CH3 channel for MPICH2
   – Shared-memory for intranode communication
      • Lock-free queues
      • Scalability
      • Improved intranode performance
   – Network modules for internode communication
      • New interface
 New developer API – Nemesis netmod interface
   – Simpler interface than ADI3
   – More flexible than CH3
Nemesis: Lock-Free Queues


 Atomic memory operations
 Scalable
   – One recv queue per process
 Optimized to reduce cache misses

[Figure: each process has a free queue and a receive queue in shared memory; messages are enqueued onto the receiver's recv queue]
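A minimal sketch of a lock-free multiple-producer, single-consumer queue of the kind described above, illustrative only: the actual Nemesis queues live in shared memory, use offsets rather than raw pointers so they work across address spaces, and are tuned for cache behavior; here GCC atomic builtins stand in for the hand-written assembly, and memory-ordering details are glossed over.

#include <stddef.h>

/* One queue cell; in Nemesis these are fixed-size cells drawn from a
 * per-process free queue in shared memory. */
typedef struct cell {
    struct cell *next;
    /* payload would follow */
} cell_t;

typedef struct {
    cell_t *head;   /* touched mostly by the single receiver         */
    cell_t *tail;   /* exchanged atomically by any number of senders */
} queue_t;

/* Atomic exchange built from compare-and-swap (GCC builtin). */
static cell_t *xchg(cell_t **p, cell_t *newval)
{
    cell_t *old;
    do {
        old = *p;
    } while (!__sync_bool_compare_and_swap(p, old, newval));
    return old;
}

/* Enqueue (any sender): one atomic exchange on the tail, then link the
 * previous tail to the new cell. */
static void enqueue(queue_t *q, cell_t *c)
{
    cell_t *prev;
    c->next = NULL;
    prev = xchg(&q->tail, c);
    if (prev == NULL)
        q->head = c;            /* queue was empty */
    else
        prev->next = c;
}

/* Dequeue: called only by the process that owns the queue. */
static cell_t *dequeue(queue_t *q)
{
    cell_t *c = q->head;
    if (c == NULL)
        return NULL;
    if (c->next != NULL) {
        q->head = c->next;
        return c;
    }
    /* c looks like the last cell: try to reset the queue to empty. */
    q->head = NULL;
    if (!__sync_bool_compare_and_swap(&q->tail, c, NULL)) {
        /* A sender has already exchanged the tail but has not linked
         * c->next yet; wait for the link to appear. */
        while (c->next == NULL)
            __sync_synchronize();
        q->head = c->next;
    }
    return c;
}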
Nemesis Network Modules
 Improved interface for network modules
   – Allows optimized handling of noncontiguous data
   – Allows optimized transfer of large data
   – Optimized small contiguous message path
       • < 2.5 µs over QLogic PSM
[Plot: latency (µs, 0–50) vs. message size (0–512 bytes) for QLogic PSM and Gigabit Ethernet]
 Future work
   – Multiple network modules
       • E.g., Myrinet for intra-cluster and TCP for inter-cluster
   – Dynamically loadable
Optimized Non-contiguous Message Handling


 Issues with non-contiguous data
   – Representation
   – Manipulation
        • Packing, generating other representations (e.g., an iov), etc.
 Dataloops – MPICH2’s optimized internal datatype representation
   – Efficiently describes non-contiguous data
   – Utilities to efficiently manipulate non-contiguous data
 Dataloop is passed to network module
   – Previously, an I/O vector was generated then passed
    – The netmod implementation manipulates the dataloop as needed, e.g.:
        • TCP generates an iov
        • IB and PSM pack the data into a send buffer
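For context, the kind of non-contiguous data that dataloops describe internally can be expressed at the user level with an MPI derived datatype; the dataloop interface itself is internal to MPICH2 and not shown here. The example below sends one column of a row-major matrix using standard MPI calls.

#include <mpi.h>

#define ROWS 100
#define COLS 50

void send_column(double matrix[ROWS][COLS], int col, int dest)
{
    MPI_Datatype column;

    /* ROWS blocks of 1 double, each COLS doubles apart: non-contiguous. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* MPICH2 turns the datatype into a dataloop; depending on the netmod,
     * this is translated into an iov (e.g., TCP) or packed into a send
     * buffer (e.g., IB, PSM). */
    MPI_Send(&matrix[0][col], 1, column, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);
}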
Optimized Large Message Transfer Using Rendezvous

 MPICH2 uses rendezvous to transfer large messages
   – Original implementation: the channel was oblivious to rendezvous
       • CH3 sent the RTS, CTS, and DATA packets
       • Shared memory: large messages were sent through the queue
       • Netmod: the netmod performed its own rendezvous
   – Shared memory: queues may not be the most efficient mechanism for
     transferring large data
       • E.g., network RDMA, an inter-process copy mechanism, or a copy buffer
   – Netmod: redundant rendezvous handshakes
 Developed LMT interface to support various mechanisms
   – Sender transfers data (put)
   – Receiver transfers data (get)
   – Both sender and receiver participate in data transfer
 Modified CH3 to use LMT
   – Works with rendezvous protocol
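The sketch below shows what a callback table of this kind could look like; the names and signatures are invented for illustration and are not the actual MPICH2 LMT interface. The idea is that CH3 drives the rendezvous protocol while the netmod (or the shared-memory code) supplies the data movement.

/* Hypothetical LMT-style callback table (illustrative only). */
struct vc;         /* virtual connection to a peer    */
struct request;    /* outstanding send/recv operation */

struct lmt_ops {
    /* Sender: attach a cookie (e.g., an RDMA handle or a shared-memory
     * buffer ID) to the RTS packet. */
    int (*initiate)(struct vc *vc, struct request *sreq, void **cookie);

    /* Receiver pulls the data ("get"), e.g., via an RDMA read. */
    int (*start_recv)(struct vc *vc, struct request *rreq, void *cookie);

    /* Sender pushes the data ("put") after receiving the CTS. */
    int (*start_send)(struct vc *vc, struct request *sreq, void *cookie);

    /* Completion notifications for either side. */
    int (*done_send)(struct vc *vc, struct request *sreq);
    int (*done_recv)(struct vc *vc, struct request *rreq);
};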
Optimization: LMT for Intranode Communication

 For intranode, LMT copies through buffer in shared memory
 Sender allocates shared memory region
   – Sends buffer ID to receiver in RTS packet
 Receiver attaches to memory region
 Both sender and receiver participate in transfer
   – Use double-buffering




[Figure: sender and receiver processes copying through a double-buffered shared-memory region]
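A minimal sketch of the double-buffered copy, illustrative only and not the MPICH2 code: the function names are invented, synchronization is reduced to a per-buffer length flag, and memory-barrier details are omitted. The sender fills one buffer while the receiver drains the other, so both processes participate in the transfer.

#include <string.h>
#include <stddef.h>

#define NBUFS    2
#define BUFSIZE  (64 * 1024)

typedef struct {
    volatile size_t len;          /* 0 = empty, otherwise bytes ready */
    char data[BUFSIZE];
} copybuf_t;

/* Shared between the two processes, e.g., by mapping the region whose
 * ID was carried in the RTS packet. */
typedef struct {
    copybuf_t buf[NBUFS];
} lmt_shm_region_t;

void lmt_shm_send(lmt_shm_region_t *r, const char *src, size_t total)
{
    size_t sent = 0;
    int i = 0;
    while (sent < total) {
        copybuf_t *b = &r->buf[i];
        while (b->len != 0)         /* wait until the receiver drained it */
            ;
        size_t n = total - sent < BUFSIZE ? total - sent : BUFSIZE;
        memcpy(b->data, src + sent, n);
        b->len = n;                 /* publish the filled buffer */
        sent += n;
        i = (i + 1) % NBUFS;
    }
}

void lmt_shm_recv(lmt_shm_region_t *r, char *dst, size_t total)
{
    size_t recvd = 0;
    int i = 0;
    while (recvd < total) {
        copybuf_t *b = &r->buf[i];
        size_t n;
        while ((n = b->len) == 0)   /* wait until the sender filled it */
            ;
        memcpy(dst + recvd, b->data, n);
        b->len = 0;                 /* hand the buffer back to the sender */
        recvd += n;
        i = (i + 1) % NBUFS;
    }
}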
Current Work In Progress


   Optimizations
   Multi-threaded environments
   Process manager
   Other work
   Atomic Operations Library
Current Optimization Work


 Handle common case fast: Eager contiguous messages
   – Identify this case early in the operation
   – Call netmod’s send_eager_contig() function directly
 Bypass receive queue
   – Currently: check unexp queue, post on posted queue, check network
   – Optimized: check unexp queue, check network
       • Reduced instruction count by 48%
 Eliminate function calls
   – Collapse layers where possible
 Merge Nemesis with CH3
   – Move Nemesis functionality to CH3
   – CH3 shared memory support
   – New CH3 channel/netmod interface
 Cache-aware placement of fields in structures
Fine-Grained Threading


 MPICH2 supports multi-threaded applications
   – MPI_THREAD_MULTIPLE
 Currently, thread safety is implemented with a single lock
   – Lock is acquired on entering an MPI function
   – And released on exit
   – Also released when making blocking communication system calls
 Limits concurrency in communication
   – Only one thread can be in the progress engine at one time
 New architectures have multiple DMA engines for communication
   – These can work independently of each other
 Concurrency is needed in the progress engine for maximum performance
 Even without independent network hardware
   – Internal concurrency can improve performance
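The usage model this work targets, shown with standard MPI and pthreads calls (assumes exactly two processes): both threads of each process may be inside the library at the same time, so with a single global lock their transfers are serialized, while finer-grained locking lets them progress concurrently.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static int peer;

static void *worker(void *arg)
{
    int tag = *(int *)arg;
    char sbuf[1024] = {0}, rbuf[1024];
    /* Both threads of this process may be inside MPI at the same time. */
    MPI_Sendrecv(sbuf, sizeof sbuf, MPI_CHAR, peer, tag,
                 rbuf, sizeof rbuf, MPI_CHAR, peer, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, tags[2] = {1, 2};
    pthread_t t[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available\n");

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;   /* run with exactly two processes */

    pthread_create(&t[0], NULL, worker, &tags[0]);
    pthread_create(&t[1], NULL, worker, &tags[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    MPI_Finalize();
    return 0;
}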
Multicore-Aware Collectives


 Intra-node communication is much faster than inter-node
 Take advantage of this in collective algorithms
 E.g., Broadcast
   – Send to one process per node; that process then
       broadcasts to the other processes on its node
 Step further: collectives over shared memory
   – E.g., Broadcast
        • Within a node, process writes data to shared
          memory region
        • Other processes read data
   – Issues
        • Memory traffic, cache misses, etc.
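A sketch of the two-level broadcast written with plain MPI calls. node_color() is a hypothetical helper that returns the same value for all ranks on one node (e.g., a hash of the hostname), and the data is assumed to originate at global rank 0; MPICH2's internal collectives are not implemented this way literally.

#include <mpi.h>

int node_color(void);   /* hypothetical: same value for all ranks on one node */

void hier_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Communicator of the processes sharing my node. */
    MPI_Comm_split(comm, node_color(), rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node (non-leaders get a communicator they won't use). */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : 1, rank, &leader_comm);

    /* Step 1: broadcast among the node leaders (global rank 0 is rank 0
     * of leader_comm because MPI_Comm_split orders by the key). */
    if (node_rank == 0)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: each leader broadcasts within its own node. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Comm_free(&leader_comm);
}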
Process Manager


 Enhanced support for third party process managers
   – PBS, Slurm
   – Working on others
 Replacement for existing process managers
   – Scalable to tens of thousands of nodes and beyond
   – Fault-tolerant
   – Aware of topology
Other Work


 Heterogeneous data representations
   – Different architectures use different data representations
       • E.g., big/little-endian, 32/64-bit, IEEE/non-IEEE floating point, etc.
   – Important for heterogeneous clusters and grids
   – Use existing datatype manipulation utilities
 Fault-tolerance support
   – CIFTS – fault-tolerance backplane
   – Fault detection and reporting
Atomic Operations Library


 Lock-free algorithms use atomic assembly instructions
 Assembly instructions are non-portable
   – Must be ported for each architecture and compiler

 We’re working on an atomic operations library
   – Implementations for various architectures and various compilers
   – Stand-alone library
   – Not all atomic operations are natively supported on all architectures
      • E.g., some have LL-SC but no SWAP
   – Such operations can be emulated using the operations the architecture does provide
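An example of the kind of emulation meant here: an atomic swap built from compare-and-swap for platforms that expose only CAS. GCC builtins stand in for the library's per-architecture, per-compiler primitives.

/* Emulating an atomic swap with compare-and-swap (sketch only). */
static inline int atomic_swap_int(volatile int *p, int newval)
{
    int old;
    do {
        old = *p;
    } while (!__sync_bool_compare_and_swap(p, old, newval));
    return old;
}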
Tools Included in MPICH2

  MPE library for tracing MPI and other calls
  Scalable log file format (slog2)
  Jumpshot tool for visualizing log files
    – Supports threads
  Collchk library for checking that the application calls collective operations
   correctly
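MPE hooks MPI calls through the standard PMPI profiling interface; a minimal hand-written wrapper of the same kind (using the MPI-2 binding of MPI_Send) looks like this.

#include <mpi.h>
#include <stdio.h>

/* Intercept MPI_Send, forward to PMPI_Send, and report the elapsed time.
 * Tracing libraries such as MPE do the same thing, writing events to a
 * log file instead of stderr. */
int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, type, dest, tag, comm);
    fprintf(stderr, "MPI_Send of %d elements to rank %d took %g s\n",
            count, dest, MPI_Wtime() - t0);
    return err;
}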
For more information…

 MPICH2 website
   – http://www.mcs.anl.gov/research/projects/mpich2

 SVN repository
   – svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2

 Developer pages
   – http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation

 Mailing lists
   – mpich2-maint@mcs.anl.gov
   – mpich-discuss@mcs.anl.gov

 Me
   – buntinas@mcs.anl.gov
   – http://www.mcs.anl.gov/~buntinas

				