MPICH2 – A High-Performance and Widely
Portable Open-Source MPI Implementation

Darius Buntinas
Argonne National Laboratory

   – High-performance
   – Open-source
   – Widely portable
 MPICH2-based implementations
   – IBM for BG/L and BG/P
   – Cray for XT3/4
   – Intel
   – Microsoft
   – SiCortex
   – Myricom
   – Ohio State

 Architectural overview
 Nemesis – a new communication subsystem
 New features and optimizations
   – Intranode communication
   – Optimizing non-contiguous messages
   – Optimizing large messages
 Current work in progress
   – Optimizations
   – Multi-threaded environments
   – Process manager
   – Other optimizations
 Libraries and tools
Traditional MPICH2 Developer APIs

 Two APIs for porting MPICH2 to new communication architectures
   – ADI3
   – CH3
 ADI3 – Implement a new device
   – Richer interface
      • ~60 functions
   – More work to port
   – More flexibility
 CH3 – Implement a new CH3 channel
   – Simpler interface
      • ~15 functions
   – Easier to port
   – Less flexibility
[Architecture diagram: Application (and Jumpshot) on top of the MPI Interface;
MPI Layer → ADI3 Interface → CH3 Device (alongside BG, Cray, MX, … devices) →
CH3 Interface → Nemesis netmod interface → TCP, IB/iWARP, PSM, MX, GM;
ROMIO sits under the MPI Layer on the ADIO Interface over PVFS, GPFS, XFS, …;
the MPD and Gforker process managers attach via the PMI Interface]

 Support for high-speed networks
   – 10-Gigabit Ethernet iWARP, QLogic PSM, InfiniBand, Myrinet (MX and GM)
 Supports proprietary platforms
   – BlueGene/L, BlueGene/P, SiCortex, Cray
 Distributed with the ROMIO MPI-IO library
 Profiling and visualization tools (MPE, Jumpshot)
 Nemesis is a new CH3 channel for MPICH2
   – Shared-memory for intranode communication
      • Lock-free queues
      • Scalability
      • Improved intranode performance
   – Network modules for internode communication
      • New interface
 New developer API – Nemesis netmod interface
   – Simpler interface than ADI3
   – More flexible than CH3
Nemesis: Lock-Free Queues

 Atomic memory operations
 Scalable
   – One recv queue per process
 Optimized to reduce cache misses

[Diagram: per-process free and receive queues]
Nemesis Network Modules
 Improved interface for network modules
   – Allows optimized handling of noncontiguous data
   – Allows optimized transfer of large data
   – Optimized small contiguous message path
      • < 2.5 µs over QLogic PSM
[Plot: latency (µsec) vs. message size (0–512 bytes) for QLogic PSM and
Gigabit Ethernet]
 Future work
   – Multiple network modules
       • E.g., Myrinet for intra-cluster and TCP for inter-cluster
   – Dynamically loadable
Optimized Non-contiguous Messages

 Issues with non-contiguous data
   – Representation
   – Manipulation
        • Packing, generating other representations (e.g., an iov), etc.
 Dataloops – MPICH2’s optimized internal datatype representation
   – Efficiently describes non-contiguous data
   – Utilities to efficiently manipulate non-contiguous data
 Dataloop is passed to network module
   – Previously, an I/O vector was generated then passed
   – The netmod implementation manipulates the dataloop itself. E.g.,
       • TCP builds an iov
       • IB and PSM pack the data into a send buffer
Optimized Large Message Transfer Using Rendezvous

 MPICH2 uses rendezvous to transfer large messages
   – Original implementation: channel was oblivious to rendezvous
       • CH3 sent RTS, CTS, DATA
       • Shared mem: Large messages would be sent through queue
       • Netmod: Netmod would perform its own rendezvous
   – Shm: queues may not be the most efficient mechanism for transferring
     large data
       • Better alternatives: network RDMA, inter-process copy mechanisms,
         copy buffers
   – Netmod: Redundant rendezvous
 Developed LMT interface to support various mechanisms
   – Sender transfers data (put)
   – Receiver transfers data (get)
   – Both sender and receiver participate in data transfer
 Modified CH3 to use LMT
   – Works with rendezvous protocol
Optimization: LMT for Intranode Communication

 For intranode, LMT copies through buffer in shared memory
 Sender allocates shared memory region
   – Sends buffer ID to receiver in RTS packet
 Receiver attaches to memory region
 Both sender and receiver participate in transfer
   – Use double-buffering

[Diagram: sender and receiver copying through a double-buffered
shared-memory region]
Current Work In Progress

   Optimizations
   Multi-threaded environments
   Process manager
   Other work
   Atomic Operations Library
Current Optimization Work

 Handle common case fast: Eager contiguous messages
   – Identify this case early in the operation
   – Call netmod’s send_eager_contig() function directly
 Bypass receive queue
   – Currently: check unexp queue, post on posted queue, check network
   – Optimized: check unexp queue, check network
       • Reduced instruction count by 48%
 Eliminate function calls
   – Collapse layers where possible
 Merge Nemesis with CH3
   – Move Nemesis functionality to CH3
   – CH3 shared memory support
   – New CH3 channel/netmod interface
 Cache-aware placement of fields in structures
Fine Grained Threading

 MPICH2 supports multi-threaded applications
 Currently, thread safety is implemented with a single lock
   – Lock is acquired on entering an MPI function
   – And released on exit
   – Also released when making blocking communication system calls
 Limits concurrency in communication
   – Only one thread can be in the progress engine at one time
 New architectures have multiple DMA engines for communication
   – These can work independently of each other
 Concurrency is needed in the progress engine for maximum performance
 Even without independent network hardware
   – Internal concurrency can improve performance
Multicore-Aware Collectives

 Intra-node communication is much faster than inter-node
 Take advantage of this in collective algorithms
 E.g., Broadcast
   – Send to one process per node; that process broadcasts to the other
     processes on its node
 Step further: collectives over shared memory
   – E.g., Broadcast
        • Within a node, process writes data to shared
          memory region
        • Other processes read data
   – Issues
        • Memory traffic, cache misses, etc.
Process Manager

 Enhanced support for third party process managers
   – PBS, Slurm
   – Working on others
 Replacement for existing process managers
   – Scalable to 10,000s of nodes and beyond
   – Fault-tolerant
   – Aware of topology
Other Work

 Heterogeneous data representations
   – Different architectures use different data representations
        • E.g., big/little-endian, 32/64-bit, IEEE/non-IEEE floats, etc.
   – Important for heterogeneous clusters and grids
   – Use existing datatype manipulation utilities
 Fault-tolerance support
   – CIFTS – fault-tolerance backplane
   – Fault detection and reporting
Atomic Operations Library

 Lock-free algorithms use atomic assembly instructions
 Assembly instructions are non-portable
   – Must be ported for each architecture and compiler

 We’re working on an atomic operations library
   – Implementations for various architectures and various compilers
   – Stand-alone library
   – Not all atomic operations are natively supported on all architectures
      • E.g., some have LL-SC but no SWAP
   – Such operations can be emulated using provided operations
Tools Included in MPICH2

  MPE library for tracing MPI and other calls
  Scalable log file format (slog2)
  Jumpshot tool for visualizing log files
    – Supports threads
  Collchk library for checking that the application calls collective
    operations consistently
For more information…

 MPICH2 website

 SVN repository
   – svn co mpich2

 Developer pages

 Mailing lists

 Me
