The Virtual Machine Interface (VMI): A High-Performance Communication Middleware


Distributed Parallel Processing – MPICH-VMI
Avneesh Pant

VMI
What is VMI?

– Virtual Machine Interface
– High performance communication middleware
– Abstracts underlying communication network
What is MPICH-VMI

– MPI library based on MPICH 1.2 from Argonne that uses VMI for underlying communication

Features
Communication over heterogeneous networks

– Infiniband, Myrinet, TCP, Shmem supported
– Underlying networks selected at runtime
– Enables cross-site jobs over compute grids
Optimized point-to-point communication

– Higher level MPICH protocols (eager and rendezvous) implemented over RDMA Put and Get primitives
– RDMA emulated on networks without native RDMA support (TCP)
Extensive support for profiling

– Profiling counters collect information about communication pattern of the application
– Profiling information logged to a database during MPI_Finalize
– Profile Guided Optimization (PGO) framework uses profile databases to optimize subsequent executions of the application

Features
Hiding pt-to-pt communication latency

– RDMA get protocol very useful in overlapping communication and computation (see the sketch below)
– PGO infrastructure maps MPI processes to nodes to take advantage of heterogeneity of underlying network, effectively hiding latencies
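
To make the overlap concrete, here is a minimal, generic C sketch (not taken from the slides) of the pattern this optimization targets: post non-blocking transfers, compute, then wait.

    /* overlap.c - overlapping communication and computation with
       non-blocking MPI calls. Build with: mpicc overlap.c -o overlap */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, right, left, i;
        double sendbuf[1024], recvbuf[1024], local = 0.0;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        right = (rank + 1) % size;
        left  = (rank + size - 1) % size;
        for (i = 0; i < 1024; i++) sendbuf[i] = rank + i;

        /* Post the transfers first ... */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... compute while the data moves (the RDMA get path lets the
           library progress the transfer without extra copies) ... */
        for (i = 0; i < 1024; i++) local += sendbuf[i] * 0.5;

        /* ... and only wait once the overlapping work is done. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d: local=%f recv[0]=%f\n", rank, local, recvbuf[0]);
        MPI_Finalize();
        return 0;
    }
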
Optimized Collectives

– RDMA based collectives (e.g., MPI_Barrier)
– Multicast based collectives (e.g., MPI_Bcast experimental implementation using multicast)
– Topology aware collectives (currently MPI_Bcast, MPI_Reduce, MPI_Allreduce supported)

MPI on Teragrid
MPI flavors available on Teragrid

– MPICH-GM
– MPICH-G2
– MPICH-VMI 1
• Deprecated! Was part of CTSS v1

– MPICH-VMI 2
• Available as part of CTSS v2 and v3
All are part of CTSS.

– Which one to use?
– We are biased!

MPI on Teragrid
MPI Designed for

– MPICH-GM -> Single site runs using Myrinet
– MPICH-G2 -> Running across Globus grids
– MPICH-VMI2 -> Scale out seamlessly from single site to across grid
Currently need to keep two separate executables

– Single site using MPICH-GM and Grid job using MPICH-G2
– MPICH-VMI2 allows you to use the same executable with comparable or better performance

MPI on Teragrid
[Comparison matrix of MPI flavors (MPICH-GM, MPICH-VMI 2, MPICH-G2, MPICH-VMI 1) against features: Intra Site Myrinet, Intra Site Ethernet, Inter Site Myrinet + Ethernet, Optimized MPI Collectives, Hiding Comm. Latencies.]

Using MPICH-VMI
Two flavors of MPICH-VMI2 on Teragrid

– GCC compiled library
– Intel compiled library
– Recommended not to mix them together
CTSS defines keys for each compiled library

– GCC: mpich-vmi-2.1.0-1-gcc-3-2
– Intel: mpich-vmi-2.1.0-1-intel-8.0

Setting the Environment
To use MPICH VMI 2.1

– $ soft add +mpich-vmi-2.1.0-1-{gcc-3-2 | intel-8.0}
To preserve VMI 2.1 environment across sessions, add

– “+mpich-vmi-2.1.0-1-{gcc-3-2 | intel-8.0}” to the .soft file in your home directory
– Intel 8.1 is also available at NCSA. Other sites do not have Intel 8.1 completely installed yet.
Softenv brings in the compiler wrapper scripts into your environment

– mpicc and mpiCC for C and C++ codes
– mpif77 and mpif90 for F77 and F90 codes
– Some underlying compilers such as GNU compiler suite do not support F90. Use “mpif90 –show” to determine underlying compiler being used.

Compiling with MPICH-VMI
The compiler scripts are wrappers that include all MPICH-VMI specific libraries and paths.
All underlying compiler switches are supported and passed to the compiler.

– eg. mpicc hello.c –o hello
The MPICH-VMI library by default is compiled with debug symbols.
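
For reference, a hello.c that the example command above would compile might look like this (a generic MPI program, not from the original slides):

    /* hello.c - minimal MPI program; build with: mpicc hello.c -o hello */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }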

Running with MPICH-VMI
mpirun script is available for launching jobs.
Supports all standard arguments in addition to MPICH-VMI specific arguments.
mpirun uses ssh, rsh and MPD for launching jobs. Default is MPD.
Provides automatic selection/failover.

– If MPD ring not available, falls back to ssh/rsh
Supports standard way to run jobs

– mpirun –np <# of procs> -machinefile <nodelist file> <executable> <arguments>
– -machinefile argument not needed when running within PBS or LSF environment
Can select network to use at runtime by specifying

– -specfile <network>
– Supported networks are myrinet, tcp and xsite-myrinet-tcp
Default network on Teragrid is Myrinet

– Recommend to always specify network explicitly using –specfile switch

Running with MPICH-VMI
MPICH-VMI 2.1 specific arguments related to three broad categories

– Parameters for runtime tuning
– Parameters for launching GRID jobs
– Parameters for controlling profiling of job
mpirun –help option to list all tunable parameters

– All MPICH-VMI 2.1 specific parameters are optional. GRID jobs require some parameters to be set.
To run a simple job within a Teragrid cluster

– mpirun –np 4 /path/to/hello
– mpirun –np 4 –specfile myrinet /path/to/hello
Within PBS $PBS_NODEFILE contains the path to the nodes allocated at runtime

– mpirun –np <# procs> –machinefile $PBS_NODEFILE /path/to/hello
For cross-site jobs, additional arguments required (discussed later)

For Detecting/Reporting Errors
Verbosity switches

– -v Verbose Level 1. Output VMI startup messages and make MPIRUN verbose.
– -vv Verbose Level 2. Additionally output any warning messages.
– -vvv Verbose Level 3. Additionally output any error messages.
– -vvvv Verbose Level 10. Excess Debug. Useful only for developers of MPICH-VMI and submitting crash dumps.

Running Inter Site Jobs
An MPICH-VMI GRID job consists of one or more subjobs.
A subjob is launched on each site using individual mpirun commands. The specfile selected should be one of the xsite network transports (xsite-mst-tcp or xsite-myrinet-tcp).
The higher performance SAN (Infiniband or Myrinet) is used for intra site communication. Cross site communication uses TCP automatically.
In addition to intra site parameters, all inter site runs must specify the same grid specific parameters.
A Grid CRM must be available on the network to synchronize subjobs.

– Grid CRM on Teragrid is available at tg-master2.ncsa.uiuc.edu
– No reason why any other site can’t host their own
– In fact, you can run one on your own desktop!
Grid Specific Parameters

– -grid-procs Specifies the total number of processes in the job. –np parameter to mpirun still specifies the number of processes in the subjob
– -grid-crm Specifies the host running the grid CRM to be used for subjob synchronization.

– -key Alphanumeric string that uniquely identifies the grid job. This should be the same for all subjobs!

Running Inter Site Jobs
– Running xsite across SDSC (2 procs) and NCSA (6 procs)
• @SDSC: mpirun -np 2 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master2.ncsa.teragrid.org cpi

• @NCSA: mpirun -np 6 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master2.ncsa.teragrid.org cpi

MPICH-VMI2 Support
Support

– help@teragrid.org
– Mailing lists: http://vmi.ncsa.uiuc.edu/mailingLists.php
– Announcements: vmi-announce@yams.ncsa.uiuc.edu
– Users: vmi-user@yams.ncsa.uiuc.edu
– Developers: vmi-devel@yams.ncsa.uiuc.edu

Unordered networks

• Datagrams and unbounded streams
• Single connection across multiple networks

Data and Message Model
[Diagram: a VMI connection between Process A and Process B carrying streams, slabs, buffer ops and buffers.]

• Connections across multiple wires/protocols
• Support for streams and datagrams
• Control over ordering
• Scatter/gather from disjoint regions
• Aggregated memory registration

Data Transfer Modes in VMI2
• Generalized RDMA Semantics
– Hardware assisted RDMA on networks that provide support (Infiniband, Myrinet)
– Software emulated RDMA for TCP
– Provide two distinct RDMA transfer APIs
• Receiver notification on RDMA data arrival
• No notification on RDMA data arrival. Receiver must poll
– Notification mode has slightly higher overhead due to PCI interrupt overhead on host.

Data Transfer Modes in VMI2
• Generalized RDMA Semantics
– Target publishes a buffer with a user specified context for notification
– Publish callback invoked on sender for target buffer. A VMI RDMA handle encapsulates RDMA specific fields for the underlying network to be used
– Sender performs RDMA puts using RDMA handles and a local RDMA op descriptor that provides a generalization of gather semantics
– For notification based RDMA the receiver is notified via a callback for RDMA data arrival using the user specified notification context

Which Network To Use?
• A Specification File in XML is Used
– Specifies the devices to load
– Specifies the chains to create
– The order the devices are stacked in a chain

• GUIDs are Used to Uniquely Identify a Device
• Only Devices With Similar GUIDs Can Communicate With Each Other

Which Network To Use?
• Don’t Want to Limit an Application to Use the Same Set of Devices
– Allows application to span across clusters with different networks on each cluster

• How to Determine What Devices are Available on Remote Node
– Use a node specific directory: vmieyes
– Devices register GUIDs with local vmieyes
– On connection setup query remote vmieyes to get active GUIDs

Software Architecture
[Architecture diagram: the application sits on higher-level messaging layers above the VMI managers (Connection Mgr., Chain Mgr., Buffer Mgr., Alert Mgr., Device Mgr., I/O Mgr., Slab Mgr., Stream Mgr.), which drive stacked devices D0, D1, D2 over the lower-level messaging layers; one VMIeyes daemon runs per node.]

MPI
• Most Open Source MPI are Derived from MPICH
– MPICH-GM and MPICH-G2 are both incarnations of MPICH
– MPICH-VMI2 is no different

• MPICH provides a modular architecture for research and development
– Layered approach allows selective customization of stack


MPI
• Upper MPI Layers are Network Agnostic
– MPI_Send, MPI_Barrier, etc.
– Implemented by targeting a network specific interface called ADI

• Abstract device interface (ADI)
– Consists of over 30 functions for transporting messages
– The Channel Layer: a primitive communication interface consisting of only 6 functions
– A generic ADI implementation from ANL is available that targets the Channel Layer (MPICH-P4 for Ethernet)

MPI
• Hard to Extract Performance Using Channel Layer
– Interface is simple, precluding use of advanced communication facilities that may be available
– Though easy to port MPICH to a new network

• MPICH-VMI Implements both the Channel Interface and some Performance Critical ADI Functions Selectively


MPICH-VMI1
• MPI implementation using MPICH from Argonne on VMI1
– True Channel Level Implementation
– Optimized for intra site communications
• No topology aware optimizations

– Higher overhead on data transfer path
• Potentially multiple intermediate copies in send/recv path.

MPICH-VMI2
• A complete rewrite using MPICH 1.2.5
– A full MPICH channel implementation with additional ADI2 functions implemented natively
– Performance optimization for deep communication hierarchies in CoC
• Topology aware collectives
• Profile guided communication optimizations

– Integrated support for profiling communication patterns and memory/buffer usage for MPI applications
– Scalable to large number of nodes
• ssh, MPD, PBS, LSF

MPICH-VMI2
• Support for MPICH-ADI2 Short and Rendezvous protocols
– Switchover point between Short -> Rendezvous can be changed at runtime

• Short protocol optimized for latency
– Reduce long control paths at expense of data copies

• Rendezvous protocol optimized for bandwidth
– Increased latency using handshake protocol to minimize intermediate data copies. True zero-copy implementation!

MPICH-VMI2 Software Stack
[Software stack diagram, top to bottom:]
– MPICH Core Library Functions (Collectives, Communicator Management, etc.)
– ADI2 Functions (MPI Datatypes, Request Management, Multi-protocol ADI2 Devices), with Topology Aware Collectives and the Short and Rendezvous protocols
– CH_VMI Device (inter node communication), CH_Self (intra process communication), CH_Smp (intra node communication)
– Virtual Machine Interface
– TCP (Ethernet), GM (Myrinet), VAPI (Infiniband)

MPICH-VMI2
• Most modern SANs require memory to be pinned before data transfer can take place
– This is an expensive operation requiring a transition into kernel

• We implement a pin down cache
– Deregister memory lazily
– Applications with high buffer reuse percentage can benefit from cache hits (see the sketch below)
– Tricky to implement on Linux. Requires our own memory allocator and some messy work with trapping mmap and munmaps to keep the cache consistent!
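
The idea can be illustrated with a small, self-contained sketch. The cache layout, eviction policy and the pin_memory/unpin_memory stubs below are all made up for illustration; they stand in for the expensive kernel registration calls.

    /* Illustrative pin-down (registration) cache with lazy deregistration.
       All names and the naive eviction policy are made up for this sketch;
       the real MPICH-VMI cache also traps mmap/munmap and uses its own
       allocator to stay consistent. */
    #include <stddef.h>

    #define CACHE_SLOTS 64

    struct reg_entry { void *addr; size_t len; int valid; };
    static struct reg_entry cache[CACHE_SLOTS];

    /* Stand-ins for the expensive pin/unpin calls that enter the kernel. */
    static void pin_memory(void *addr, size_t len)   { (void)addr; (void)len; }
    static void unpin_memory(void *addr, size_t len) { (void)addr; (void)len; }

    /* Ensure [addr, addr+len) is registered, pinning only on a cache miss. */
    void register_cached(void *addr, size_t len)
    {
        int i, victim = 0;                      /* naive victim choice for brevity */
        for (i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
                return;                         /* hit: buffer reuse pays off */
            if (!cache[i].valid)
                victim = i;                     /* prefer an empty slot */
        }
        if (cache[victim].valid)                /* lazy deregistration on eviction */
            unpin_memory(cache[victim].addr, cache[victim].len);
        pin_memory(addr, len);
        cache[victim].addr  = addr;
        cache[victim].len   = len;
        cache[victim].valid = 1;
    }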


MPICH-VMI2
• Short Protocol
– Uses VMI Streams and RDMA transfer modes
– RDMA preferable to VMI Streams, however…
• RDMA with no notification requires polling memory to check for incoming data. Expensive with a large number of nodes!
• We use an adaptive RDMA protocol (see the sketch below)
– All connections start out using VMI Stream mode
– Only connections considered active graduate to using RDMA. Minimizes the size of the polling set
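
A sketch of how such graduation might work; the threshold and data structure below are illustrative assumptions, not MPICH-VMI's actual values or code.

    /* Adaptive short-message path: start on VMI streams, graduate an
       "active" connection to RDMA so the receiver only polls a few slots.
       The threshold and structure are invented for illustration. */
    #include <stdbool.h>

    #define RDMA_GRADUATION_THRESHOLD 64   /* hypothetical message count */

    struct connection { unsigned msgs_seen; bool uses_rdma; };

    void send_short(struct connection *c)
    {
        if (!c->uses_rdma && ++c->msgs_seen >= RDMA_GRADUATION_THRESHOLD)
            c->uses_rdma = true;           /* receiver adds this peer to its polling set */

        if (c->uses_rdma) {
            /* write the message into the peer's RDMA short slot (omitted) */
        } else {
            /* send the message on the VMI stream path (omitted) */
        }
    }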


MPICH-VMI2
[Diagram: RDMA short slots between senders (Sender 0, Sender 1, …, Sender n) and the receiver.]

MPICH-VMI2
• Rendezvous protocol
– Employs a three way handshake (Request to Send, Ok to Send, Send Data)
– Uses RDMA with notifications to implement a zero-copy transfer path
– Sets up an RDMA pipeline to minimize pinned memory usage (see the sketch after this list)
• Important for production clusters!
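
A compact sketch of the pipelined transfer after the handshake; the helper functions and the way completions are drained are placeholders, not the library's real internals.

    /* Rendezvous data phase: move the payload in fixed-size RDMA chunks with
       at most `depth` puts in flight, bounding pinned memory at chunk * depth.
       rdma_put_chunk/wait_for_one_completion are illustrative stubs. */
    #include <stddef.h>

    static void rdma_put_chunk(const char *src, size_t len) { (void)src; (void)len; }
    static void wait_for_one_completion(void) { }

    void rendezvous_send(const char *buf, size_t len, size_t chunk, int depth)
    {
        size_t off = 0, n;
        int in_flight = 0;

        while (off < len) {
            n = (len - off < chunk) ? (len - off) : chunk;
            if (in_flight == depth) {      /* pipeline full: drain one slot */
                wait_for_one_completion();
                in_flight--;
            }
            rdma_put_chunk(buf + off, n);  /* zero-copy put of one chunk */
            off += n;
            in_flight++;
        }
        while (in_flight-- > 0)            /* drain the tail of the pipeline */
            wait_for_one_completion();
    }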


Inter Site Optimizations
• Inter Site Optimizations Use Two Complementary Approaches
– Profile Guided Optimizations (PGO)
– Topology Aware Collectives

• PGO Approach Similar to Compilers
– Run application to gather trace data to make intelligent decisions on subsequent runs

• Topology Aware Collectives
– Optimize MPI collective calls to be aware of underlying network topology at runtime

Profile Guided Optimizations
• Profile Guided Optimizations
– Try to maximize communication over fast links and avoid the slow, long ones!
– We generate the topology of the MPI applications on the fly at runtime and use it to make intelligent decisions
• Grid job consists of multiple subjobs.
– Subjob consists of multiple processes at each site

• Communication hierarchy is intra node -> intra subjob -> inter subjob

Profile Guided Optimizations
• Profile Guided Optimizations
– Initial run to capture communication graph to a profile database
• Vertices are the MPI ranks, edges signify communication events between ranks. Weight of edges is the amount of communication

– Subsequent runs query profile database to optimize inter subjob communication for current topology
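
As an illustration of what gets recorded, a communication graph can be kept as a simple weight matrix keyed by (sender, receiver); this generic sketch is not MPICH-VMI's internal code.

    /* Weighted communication graph: vertices are MPI ranks, and the edge
       weight (i, j) accumulates the bytes rank i sent to rank j. The counters
       are flushed to the profile database at MPI_Finalize. */
    #include <stddef.h>

    #define NPROCS 8                                   /* illustrative job size */
    static unsigned long comm_graph[NPROCS][NPROCS];   /* edge weights in bytes */

    void record_send(int src_rank, int dst_rank, size_t bytes)
    {
        comm_graph[src_rank][dst_rank] += (unsigned long)bytes;
    }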

Profile Guided Optimizations
• How do we minimize inter subjob communication dynamically?
– Observe that this is similar to a graph partitioning problem
– Mapping of virtual MPI ranks to physical processors
• Objective: Minimize edgecut across partitions (formalized below)
• Constraint: Specified number of vertices in each partition

– Allocation of MPI ranks to physical processors based on partitioning
• Allowed within MPI 1 & 2 specification
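
Written out as the standard graph-partitioning objective (a textbook formulation, not copied from the slides): with communication graph G = (V, E), edge weights w(u, v) equal to the communication volume between ranks u and v, and a partition map \pi assigning each rank to a subjob of prescribed size n_i,

    \min_{\pi} \sum_{(u,v)\in E,\ \pi(u)\neq\pi(v)} w(u,v)
    \qquad \text{subject to} \qquad
    \bigl|\{\, v \in V : \pi(v) = i \,\}\bigr| = n_i \quad \text{for each subjob } i.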


Profile Guided Optimizations
• Multiple algorithms for graph partitioning
– Kernighan and Lin (VLSI circuit layout)
– METIS library used in Adaptive Mesh Refinement (AMR) to balance computational loads
– Write your own!


Topology Aware Collectives
• MPI Collectives
– MPI 1.1 specification has 14 collective operations. We implement topology aware algorithms for 4 collectives (Broadcast, Barrier, Gather and Allreduce).
– The goal is to make optimal use of fast intra subjob links and minimize communication over slow inter subjob links.

Topology Aware Collectives
• Default MPI collectives
– Good for switch homogeneous networks
– Good when there is dedicated channel between any two processes

• In computation grids
– Heterogeneous network topology
– Communication channel between any two processes may not be dedicated

Topology Aware Collectives
• Default MPICH collectives send the same data items multiple times over high-latency, low-bandwidth wide area links.
• In our implementation, the fundamental concept is to use the slow inter-site link only once for any data item in order to reduce the traffic over this slow link.
• Hides communication latencies, conserves wide-area bandwidth

Topology Aware Collectives
• Topology Aware Broadcast Implementation
– At each subjob level, the process with the lowest rank is designated as the coordinator.
– If the coordinator is the root of the broadcast, it does a flat tree broadcast to all the coordinator processes in the other subjobs followed by a binomial tree broadcast to processes within its subjob. The coordinator processes in turn do a binomial tree broadcast at their own sites.
– If the root of the broadcast is not the coordinator of its subjob, it temporarily assumes the role of the coordinator, doing a flat tree broadcast to coordinator nodes (except its own) followed by a binomial tree broadcast within its own subjob. (A sketch follows below.)
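
The same structure can be approximated with stock MPI calls by splitting MPI_COMM_WORLD into per-subjob communicators plus a coordinators' communicator. The sketch below is generic C/MPI, not MPICH-VMI's implementation: it uses MPI_Bcast at both levels instead of the explicit flat-tree and binomial-tree phases, and assumes subjob_id is known to each rank (e.g., from the CRM topology).

    /* Two-level, topology-aware broadcast approximation. */
    #include <mpi.h>

    void topo_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm, int subjob_id)
    {
        int rank, sub_rank, is_coord, root_pos, my_coord_rank;
        MPI_Comm sub_comm, coord_comm;

        MPI_Comm_rank(comm, &rank);

        /* One communicator per subjob; its lowest rank acts as coordinator. */
        MPI_Comm_split(comm, subjob_id, rank, &sub_comm);
        MPI_Comm_rank(sub_comm, &sub_rank);

        /* Coordinators (plus the root, so it can feed them) form a second comm. */
        is_coord = (sub_rank == 0) || (rank == root);
        MPI_Comm_split(comm, is_coord ? 0 : MPI_UNDEFINED, rank, &coord_comm);

        if (coord_comm != MPI_COMM_NULL) {
            /* Agree on the root's position inside coord_comm, then broadcast
               the payload across subjobs (inter-site phase). */
            MPI_Comm_rank(coord_comm, &my_coord_rank);
            root_pos = (rank == root) ? my_coord_rank : -1;
            MPI_Allreduce(MPI_IN_PLACE, &root_pos, 1, MPI_INT, MPI_MAX, coord_comm);
            MPI_Bcast(buf, count, type, root_pos, coord_comm);
            MPI_Comm_free(&coord_comm);
        }

        /* Each coordinator then broadcasts inside its own subjob (intra-site). */
        MPI_Bcast(buf, count, type, 0, sub_comm);
        MPI_Comm_free(&sub_comm);
    }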


Topology Aware Collectives
• Topology aware Broadcast on MPI_COMM_WORLD

[Diagram: topology-aware broadcast on MPI_COMM_WORLD across subjobs, with processes p0, p1, p2 and q0, q1, q2.]

Topology Aware Collectives
• Topology aware Barrier on MPI_COMM_WORLD

[Diagram: topology-aware barrier on MPI_COMM_WORLD across subjobs, with processes p0, p1, p2 and q0, q1, q2.]

CRM
• In MPI a Fully Connected Mesh is Required
– Any process can talk to any process
– Requires every process to know the location of all other processes in the computation

• CRM is Used for Job Synchronization on Startup
– In VMI1 all jobs required a standalone CRM server
– With VMI2 this is relaxed. Only inter site jobs require a standalone CRM called the Grid CRM (GCRM)

• Single Site Runs Elect One of the Processes as CRM
– Improves fault tolerance with VMI2

CRM
• On Startup Each Process Registers Its Location With the CRM
• Each Registration Request Specifies the Following
– Location of process
– Number of processes expected in the job
– An alphanumeric string identifying the job to allow the CRM to synchronize multiple jobs concurrently

• When All Process Have Checked In
– The CRM allocates MPI ranks to each process
– Broadcasts the locations and ranks assigned to all processes in the job

Grid CRM
• Provides an Extensible Grid CRM Allocator Module Framework
– Allocator modules implement custom rank mappings for applications
– Grid CRM generates job topology on startup consisting of number of subjobs, processes per subjob and the physical nodes allocated for run
– Allocator modules implement MPI rank mappings for physical resources
– Multiple allocator modules may be loaded by Grid CRM on job startup
– During job launch user can specify which allocator module to use via switch to mpirun

Job Startup at Single Site
[Diagram: single site job startup at NCSA — MPIRUN, VMIEYES, CRM, APPLICATION]
• mpirun executes the job and spawns a process on each node
• Each process contacts the CRM for startup synchronization
• vmieyes daemons are queried for the active network devices on each node
• The CRM allocates ranks and broadcasts the location of all ranks
• A connection mesh is established between ranks using the available network devices

Setting the environment
• CTSS uses softenv for environment management
• Defines a set of keys and associated paths to update your environment
• Commands to add and remove packages from environment
– Temporarily (for the current session only)
• soft add, soft delete

– For environment to persist across sessions
• Add softenv keys to your .soft file in your home directory


Setting the Environment
• Two flavors of MPICH-VMI2 on Teragrid
– GCC compiled library
– Intel compiled library
– Recommended not to mix them together

• CTSS defined keys for each compiled library
– GCC: mpich-vmi-2.0-gcc-r1
– Intel: mpich-vmi-2.0-intel-r1

Setting the Environment
• To use MPICH VMI 2.0, gcc 3.2
– $ soft add +mpich-vmi-2.0-gcc-r1

• To use MPICH VMI 2.0, Intel C 8.0 Fortran 8.0
– $ soft add +mpich-vmi-2.0-intel-r1

• Add to .soft file
– +mpich-vmi-2.0-gcc-r1, or
– +mpich-vmi-2.0-intel-r1

Setting the Environment
• softenv brings in the compiler wrapper scripts into your environment
– mpicc and mpiCC for C and C++ codes
– mpif77 and mpif90 for F77 and F90 codes
– Some underlying compilers such as GNU compiler suite do not support F90. Use “mpif90 –show” to determine underlying compiler being used.

Compiling with MPICH-VMI
• The compiler scripts are wrappers that include all MPICH-VMI specific libraries and paths
• All underlying compiler switches are supported and passed to the compiler
– eg. mpicc hello.c –o hello

• The MPICH-VMI library by default is compiled with debug symbols.

Running with MPICH-VMI
• softenv also brings mpirun into the current path
• Since the MPICH-VMI environment (mpirun in the current path, etc.) must exist across all sessions for running jobs, add the MPICH-VMI keys to your .soft file
• Execute “which mpirun” to determine which MPI flavor is active

Running with MPICH-VMI
• mpirun script is available for launching jobs
• Supports all standard arguments in addition to MPICH-VMI specific arguments
• mpirun uses ssh, rsh and MPD for launching jobs. Default is MPD
• Provides automatic selection/failover
– If MPD ring not available, falls back to ssh/rsh


Running with MPICH-VMI
• Supports standard way to run jobs
– mpirun –np <# of procs> -machinefile <nodelist file> <executable> <arguments>
– -machinefile argument not needed when running within PBS or LSF environment

• Can select network to use at runtime by specifying
– -specfile <network>
– Supported networks are myrinet, tcp and xsite-myrinet-tcp

• Default network on Teragrid is Myrinet
– Recommend to always specify network explicitly using –specfile switch


mpirun - Examples
• Following are the same on Teragrid
– mpirun –np 4 /path/to/hello
– mpirun –np 4 –specfile myrinet /path/to/hello

• Within PBS $PBS_NODEFILE contains the path to the nodes allocated at runtime
– mpirun –np <# procs> –machinefile $PBS_NODEFILE /path/to/hello

• For cross-site jobs, additional arguments required (discussed later)

Running with MPICH-VMI
• MPICH-VMI 2.0 specific arguments related to three broad categories
– Parameters for runtime tuning
– Parameters for launching GRID jobs
– Parameters for controlling profiling of job

• mpirun –help option to list all tunable parameters
– All MPICH-VMI 2.0 specific parameters are optional. GRID jobs require some parameters to be set.


Runtime Parameters
• Runtime Parameters
– -specfile Specify the underlying network transport to use. This can be a shortened network name (tcp, myrinet or mst for Infiniband) or path to a VMI transport definition specification file in XML format.
– -force-shell Disables use of MPDs for launching job. GRID jobs require use of ssh/rsh for job launching.

Runtime Parameters
• Runtime Parameters
– -job-sync-timeout Maximum number of seconds allowed for all processes to start. Default is 300 seconds.
– -debugger Use the specified debugger to debug MPI application with. Supported debuggers are
• gdb – Invokes parallel gdb debugger with CLI. MPD job launch required
• tv – Support for TotalView from Etnus
• ddt – Support for DDT debugger from Streamline Computing

Runtime Parameters
• Runtime Parameters
– -v Verbose Level 1. Output VMI startup messages and make MPIRUN verbose.
– -vv Verbose Level 2. Additionally output any warning messages.
– -vvv Verbose Level 3. Additionally output any error messages.
– -vvvv Verbose Level 10. Excess Debug. Useful only for developers of MPICH-VMI and submitting crash dumps.

Performance Tuning Parameters
• Performance Tuning Runtime Parameters
– -eagerlen Specifies the message size in bytes to switch from short/eager protocol to rendezvous protocol. Default is 16KB.
– -eagerisendcopy Specifies the largest message size that can be completed immediately for asynchronous sends (MPI_Isend).
– -disable-short-rdma Disables the use of RDMA protocol for short messages.
– -short-rdma-credits Specifies the maximum number of unacknowledged short RDMA messages. Default is 32.

Performance Tuning Parameters
• Performance Tuning Runtime Parameters
– -rdmachunk Specifies the base RDMA chunk size for rendezvous protocol. All RDMA transfers for rendezvous are performed using the base RDMA chunk size. Default is 256KB.
– -rdmapipeline Specifies the maximum number of RDMA chunks in flight. The overall memory demand for RDMA is rdmachunk size * rdmapipeline length.
– -mmapthreshold Specifies the memory allocation size in bytes for which MMAP will be used to obtain memory (4MB default)

Performance Tuning Parameters
• Performance Tuning Runtime Parameters
– -enable-multicast-collectives Enables multicast implementation of MPI collectives. (Warning! Experimental)
– -disable-rdma-barrier Disables use of RDMA based optimized barrier implementation.
– -disable-shmem-comm Disables use of shared memory for intra node communications

Running Inter Site Jobs
• An MPICH-VMI GRID job consists of one or more subjobs
• A subjob is launched on each site using individual mpirun commands. The specfile selected should be one of the xsite network transports (xsite-mst-tcp or xsite-myrinet-tcp).
• The higher performance SAN (Infiniband or Myrinet) is used for intra site communication. Cross site communication uses TCP automatically

Running Inter Site Jobs
• In Addition to Intra Site Parameters all Inter Site Runs Must Specify the same Grid Specific Parameters
• A Grid CRM Must be Available on the Network to Synchronize Subjobs
– Grid CRM on Teragrid is available at tg-master.ncsa.uiuc.edu
– No reason why any other site can’t host their own
– In fact, you can run one on your own desktop!

Running Inter Site Jobs
• Grid Specific Parameters
– -grid-procs Specifies the total number of processes in the job. –np parameter to mpirun still specifies the number of processes in the subjob
– -grid-crm Specifies the host running the grid CRM to be used for subjob synchronization.
– -key Alphanumeric string that uniquely identifies the grid job. This should be the same for all subjobs!

Running Inter Site Jobs
– Running xsite across SDSC (2 procs) and NCSA (6 procs)
• @SDSC: mpirun -np 2 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master.ncsa.teragrid.org cpi
• @NCSA: mpirun -np 6 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master.ncsa.teragrid.org cpi

– This Uses the Default Rank Allocator (FIFO)

Startup for Grid Jobs
[Diagram: grid job startup across NCSA and SDSC — MPIRUN, VMIEYES, subjob CRMs, GRID CRM, APPLICATION; Myrinet at NCSA, Infiniband at SDSC, TCP for the WAN]
• At NCSA, mpirun executes the NCSA subjob and spawns processes on the NCSA nodes; each process contacts the CRM at NCSA for startup synchronization, and the vmieyes daemons are queried for the active network devices
• The CRM at NCSA acts as a proxy and registers the processes with the GRID CRM; the CRM at SDSC does the same for the SDSC subjob
• The GRID CRM generates the job topology and allocates ranks; topology and ranks are forwarded by the subjob CRM servers at NCSA and SDSC
• Each subjob CRM broadcasts the ranks to all of its processes
• A connection mesh is established between ranks using the available network devices (Myrinet at NCSA and Infiniband at SDSC in this example, TCP for the WAN)

PGO Infrastructure
• Basic Principle: Use information from previous runs of grid job to place MPI processes on the nodes intelligently.
• Components of infrastructure
– Grid CRM – Profile Daemon – Profile Database


PGO Infrastructure
• The profile daemon receives profile data from grank 0.
• The profile daemon talks to a MySQL database server to save the profile data
• The VMI-tools package contains a script to generate the vmiprofile database tables.

MPICH-VMI Profile Database
• MPI Communication Rank specific data used to build point to point communication graphs.
• Job specific information to enable optimizations to be based on a specific job.

MPICH-VMI Profile Database
• Vmiprofile tables
– job_msg_bins
– job_prof_cache_stats
– job_prof_global_colls
– job_prof_rooted_colls
– job_prof_table
– job_table
– rank_prof_table

MPICH-VMI Profile Database
JID – Job ID of application
Rank – Rank assigned to the task
NAP – Number of active processors
User – User ID of the owner of job
Exec – Location of executable
Specfile – The VMI level specfile
Eagerlen – Message size in eager protocol
Args/Argc – Arguments used in run
Start – Start time of job
Stop – End time of job

MPICH-VMI Profile Database
SBSEND – Number of short blocking sends by the process
SNBSEND – Number of short non blocking sends
SERECV – Number of expected receives
SURECV – Number of unexpected receives
SURECVMALLOC – Number of unexpected receives mallocs
SURECVGRAB – Number of unexpected receives grabs
LBSEND – Number of Rendezvous blocking sends
LNBSEND – Number of Rendezvous non blocking sends
LERECV – Number of Rendezvous expected receives
LURECV – Number of Rendezvous unexpected receives

MPICH-VMI Profile Database
• Job_prof_table gives information about the number of short and rendezvous messages.
• Need a greater degree of granularity to better characterize the network traffic.
• That is where message bins come in.

MPICH-VMI Profile Database
[Diagram: message sizes from zero bytes up to 2^14 fall in the short message range and from 2^14 up to 2^32 in the rendezvous message range; each range is subdivided into power-of-two bins (2^0, 2^1, 2^2, …) for greater granularity.]


MPICH-VMI Profile Database
• Tables for message bins in vmiprofile DB
– job_msg_bins
• Bin number -> range of messages

– Job_prof_table
• Range of messages -> number of messages in that range
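
Assuming the bins follow the power-of-two layout sketched above (an assumption; the exact mapping in the vmiprofile schema may differ), a message's bin number can be computed like this:

    /* Map a message size in bytes to its power-of-two bin number:
       bin i covers sizes up to 2^i bytes, with bin 0 for zero-byte messages.
       This layout is assumed from the slide, not read from the schema. */
    int msg_bin(unsigned long long bytes)
    {
        int bin = 0;
        unsigned long long bound = 1;
        while (bound < bytes && bin < 32) {   /* bins span 2^0 .. 2^32 */
            bound <<= 1;
            bin++;
        }
        return bin;
    }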


Profile Data
• Message bins examples
job_msg_bins (bin number → message size, bytes), Rank 0 to Rank 1: Bin0 = 512, Bin1 = 1024, Bin2 = 16384, Bin4 = 65536
job_prof_table (bin number → number of messages in that range), Rank 0 to Rank 1: Bin0 = 54, Bin1 = 233, Bin2 = 123, Bin4 = 25467

Profile Analyzer Tools
• Command line tools
– vmicollect
• Used for generating communication graphs.

– mincut
• Used to generate a partition for a communication topology.

• Graphic tools
– Pajek
• Used to display the communication graphs.

Profile Analyzer Tools
• VMIcollect
– Queries the vmiprofile database to collect the profile data for a specific job
– Outputs a communication graph in Pajek format
• Arguments
– -p <program name>
– -d <start date> <end date>
– -j <jobid>
– -a <arguments passed>
– -n <# of processors>
• If the query is not unique, the jobid, the program name and argument list of each matching entry is displayed.

Profile Analyzer Tools
• Mincut
– Creates a partition of a communication graph generated from a given jobid.

– Outputs the partition list.
• Arguments
– -j <jobid>
– -h <hostname of database server>
– -u <username>
– -p <password>

Grid CRM
• Performs two distinct functions
– Synchronization of subjobs at startup
– Mapping of virtual MPI ranks to physical processors

• Mapping of ranks achieved via allocators
– Available allocators are FIFO, LIFO, Random and METIS

• Custom allocators possible using the allocator framework
– Allocators are instantiated at CRM startup

Grid CRM
• Mapping process consists of two steps
– Topology discovery – generated automatically by the CRM
– Mapping algorithm for the topology

• Common use for mapping is to use historical profile data from database
– Conqueror allocator available to ease interacting with profile data

Grid CRM
[Diagram: grid_job "myjob" with sub_jobs = 2 and grid_procs = 6; subjob 0 has num_procs = 4 and subjob 1 has num_procs = 2. The FIFO and Random allocators map subjob ranks (srank) to global ranks (grank) differently.]

Grid CRM
• Custom allocators implemented as shared objects
• Allocators to load are listed in a configuration file for CRM
– <allocatorname>:<modulepath>:<arguments>
– Ex: queue:/opt/mpichvmi2.0/tools/random.so:host=vmiprofile.ncsa.uiuc.edu?user=root
• In order to implement custom allocators, four functions need to be implemented.

Grid CRM
• VMI allocator module implements four functions (a skeleton sketch follows)
– int VMI_Grid_AllocatorInit(int argc, char** argv)
• Initializes the allocator module and takes the arguments given in the allocator file
– int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
• Allocates ranks to subjobs
– int VMI_Grid_AllocatorTerminate()
• Cleans up the allocator module
– char* VMI_Grid_AllocatorName()
• Returns a version name for the allocator
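
A skeleton of such a module might look like the following. The four signatures are taken from the slide above; everything else (the opaque PCRM_JOB typedef, the stub bodies, the module name) is a placeholder, since the real rank-mapping logic depends on VMI headers not shown here.

    /* allocator_skel.c -- skeleton Grid CRM allocator module, built as a
       shared object and listed in the CRM allocator configuration file. */
    #include <stdio.h>

    /* Opaque here; the real definition comes from the VMI/CRM headers. */
    typedef struct crm_job *PCRM_JOB;

    int VMI_Grid_AllocatorInit(int argc, char **argv)
    {
        /* Parse the arguments supplied in the allocator configuration file
           (e.g. database host and user). */
        int i;
        for (i = 0; i < argc; i++)
            fprintf(stderr, "allocator arg: %s\n", argv[i]);
        return 0;
    }

    int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
    {
        /* Walk the job topology (subjobs, processes per subjob, physical
           nodes) and assign an MPI rank to each process here. */
        (void)job; (void)argc; (void)argv;
        return 0;
    }

    int VMI_Grid_AllocatorTerminate(void)
    {
        /* Release any resources, e.g. database connections. */
        return 0;
    }

    char *VMI_Grid_AllocatorName(void)
    {
        return "skeleton-allocator-1.0";
    }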


Selecting Allocators at Runtime
• Switch to mpirun to Specify Which Allocator to Use for Mapping
– -alloc-uri <allocator>?<allocator arguments>

• URI Specifies the Allocator to Use and any Allocator Specific Arguments
– Eg. Job ID of previous run in database – Location of database – Username and password to connect to database

• Example: -allocator-uri mincut:host=vmiprofile?user=apant?jobid=256


Results
[Chart: VMI/GM bandwidth comparison — bandwidth (MB/sec) vs. message size (bytes) for MPICH-VMI over Myrinet and MPICH-GM. Peak bandwidth: MPICH-VMI 227.65 MB/s, MPICH-GM 227.54 MB/s.]

Results
[Chart: Infiniband PCI Express latency (us) vs. message size (bytes). Small message latency = 4 us.]

Results
[Chart: Infiniband PCI Express bandwidth (MB/sec) vs. message size (bytes). Peak bandwidth = 965 MB/sec.]

Results
[Chart: HPL scaling on Itanium2 from 1 to 128 processors for MPICH-GM, MPICH-P4, MPICH-VMI2/GM and MPICH-VMI2/TCP. Peak MPICH-GM = 547 GFLOP/s; peak MPICH-VMI2/GM = 549.4 GFLOP/s.]

Results
[Chart: MILC scaling on Itanium2 from 1 to 64 processors for MPICH-GM, MPICH-P4, MPICH-VMI2/GM and MPICH-VMI2/TCP. Peak MPICH-GM = 158.5 GFLOP/s; peak MPICH-VMI2/GM = 156.98 GFLOP/s.]

Results
[Chart: MPI broadcast completion time (usecs) vs. broadcast message size (bytes) for 16 processes (2 subjobs, 8 processes per subjob), with and without topology awareness.]

Conclusions
• Comparable performance to native MPI implementations
– Abstraction does not impact performance

• Modular architecture allows ease of deployment and upgrade on the Teragrid • Possible to do a high performance MPI implementation from the machine room to the grid
– Need to be careful to understand a host of new issues with wide-area computational grids

MPICH-VMI2 CTSS Installs
• Installations on Teragrid
– NCSA: Version 2.0.1 installed
– SDSC: Version 2.0.1 installed
– Caltech: Version 2.0 installed
– ANL: Version 2.0 installed

• Caltech and ANL expected to update to latest release
– Latest release has a couple of bug fixes

• MPD job launch unavailable on Teragrid
– Security issues


Upcoming Features
• Upcoming Release at SC’04 (VMI 2.1)
– Enhanced support for overlapping computation and communication (MPI_Isend) – Additional OS support
• Mac OS X, Linux RH and SuSE

– Additional networks supported
• Myrinet, Infiniband, iWarp, Ethernet

– Improved rank mapping allocators

MPICH-VMI2 Support
• Support
– help@teragrid.org
– Mailing lists: http://vmi.ncsa.uiuc.edu/mailingLists.php
– Announcements: vmi-announce@shodan.ncsa.uiuc.edu
– Users: vmi-user@shodan.ncsa.uiuc.edu
– Developers: vmi-devel@shodan.ncsa.uiuc.edu


								