Resource Utilization in Large Scale InfiniBand Jobs

Galen M. Shipman
Los Alamos National Laboratory

The Problem

• InfiniBand specifies that receive resources are consumed in order regardless of size
• Small messages may therefore consume much larger receive buffers
• At very large scale, many applications are dominated by small message transfers
• Message sizes vary substantially from job to job and even rank to rank
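
To make the cost of in-order consumption concrete, the following minimal sketch counts the buffer bytes left unused when every incoming message takes the next fixed-size receive buffer. The function name, message sizes, and buffer size are illustrative assumptions, not measurements from the talk.

```python
def wasted_bytes(message_sizes, buffer_size):
    """Bytes of posted receive buffer left unused when each message
    consumes exactly one fixed-size buffer, in arrival order."""
    for size in message_sizes:
        if size > buffer_size:
            raise ValueError(f"message of {size} bytes does not fit in a buffer")
    return sum(buffer_size - size for size in message_sizes)


# Hypothetical workload: mostly small messages, one large one.
messages = [64, 64, 128, 4096, 64, 32]
buffer_size = 4096  # assumed: buffers sized for the largest message

used = sum(messages)
posted = buffer_size * len(messages)
print(f"used {used} of {posted} posted bytes; "
      f"wasted {wasted_bytes(messages, buffer_size)}")
```

Because the buffer must be large enough for the biggest message, each small message strands most of the buffer it lands in.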
Receive Buffer Efficiency

[Figure]
Implication for SRQ

• Flood of small messages may exhaust the shared receive queue (SRQ)
• Probability of receiver-not-ready (RNR) NAKs increases
    - Stalls the pipeline
• Performance degrades
    - Wasted resources
    - Application may not complete within allotted time slot (12+ hours for some jobs)
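
The exhaustion scenario can be illustrated with a toy model. This is a rough sketch under simplifying assumptions: time advances in discrete ticks, the receiver reposts a fixed number of buffers per tick, and a message that finds no buffer is simply counted as an RNR NAK rather than retried. The function and its parameters are hypothetical and do not model real InfiniBand timing.

```python
def count_rnr_naks(arrivals_per_tick, srq_depth, reposts_per_tick):
    """Count messages that arrive while the shared receive queue is empty.

    arrivals_per_tick: number of messages arriving in each tick
    srq_depth:         maximum number of buffers posted to the SRQ
    reposts_per_tick:  buffers the receiver can repost after each tick
    """
    available = srq_depth
    rnr_naks = 0
    for arrivals in arrivals_per_tick:
        served = min(arrivals, available)
        rnr_naks += arrivals - served      # no buffer available: RNR NAK
        available -= served
        # Receiver reposts processed buffers, never exceeding the SRQ depth.
        available = min(srq_depth, available + reposts_per_tick)
    return rnr_naks


# Hypothetical traffic: steady trickle, then a flood of small messages.
steady = [2, 2, 2]
flood = [2, 2, 40, 2, 2]
print(count_rnr_naks(steady, srq_depth=16, reposts_per_tick=4))
print(count_rnr_naks(flood, srq_depth=16, reposts_per_tick=4))
```

In this model the steady pattern never drains the queue, while the burst overruns it regardless of how large each buffer is, since every message consumes one descriptor.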
Why not just tune the buffer size?

• There is no “one size fits all” solution!
• Message size patterns differ based on:
    - Number of processes in the parallel job
    - Input deck
    - Identity / function in the parallel job
• Need to balance optimization between:
    - Performance
    - Memory footprint
• Tuning for each application run is not acceptable
What Do Users Want?

• Optimal performance is important
    - But predictability at “acceptable” performance is more important
• HPC users want a default/“good enough” solution
    - Parameter tweaking is fine for papers
    - Not for our end users
• Parameter explosion
    - OMPI OpenFabrics-related driver parameters: 48
    - OMPI other parameters: …many…
What Do Others Do?

• Portals
    - Contiguous memory region for unexpected messages (Receiver managed offset semantic)
• Myrinet GM
    - Variable size receive buffers can be allocated
    - Sender specifies which size receive buffer to consume (SIZE & PRIORITY fields)
• Quadrics Elan
    - TPORTS manages pools of buffers of various sizes
    - On receipt of an unexpected message a buffer is chosen from the relevant pool

• Inspired by standard bucket allocation
• Multiple “buckets” of receive descriptors are created in multiple SRQs
    - Each associated with a different size buffer
• A small pool of per-peer resources is also allocated
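
The bucket idea can be sketched as a lookup from message size to the smallest bucket whose buffers can hold it. This is only an illustration of standard bucket allocation: the bucket sizes and the `choose_bucket` helper are assumptions made for this example, not the sizes or code used in Open MPI, and the per-peer resource pool is not modeled.

```python
import bisect

# Hypothetical bucket buffer sizes in bytes, sorted ascending.
# Each entry stands for one SRQ whose posted buffers all have this size.
BUCKET_SIZES = [128, 1024, 4096, 65536]


def choose_bucket(message_size, bucket_sizes=BUCKET_SIZES):
    """Return the index of the smallest bucket that can hold the message."""
    index = bisect.bisect_left(bucket_sizes, message_size)
    if index == len(bucket_sizes):
        raise ValueError("message exceeds the largest bucket size")
    return index


for size in (64, 128, 129, 5000):
    bucket = choose_bucket(size)
    print(f"{size:>5} bytes -> bucket {bucket} ({BUCKET_SIZES[bucket]} bytes)")
```

With a lookup of this kind, a small message consumes a small buffer, so a flood of small messages no longer drains the pool of large buffers.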
Bucket-SRQ

[Figure]
Performance Implications

• Good overall performance
    - Decreased/no RNR NAKs from draining SRQ
        • Never trigger “SRQ limit reached” event
• Latency penalty for SRQ
    - ~1 usec
• Large number of QPs may not be efficient
    - Still investigating impact of high QP count on performance

• Evaluation applications
    - SAGE (DOE/LANL application)
    - Sweep3D (DOE/LANL application)
    - NAS Parallel Benchmarks (benchmark)
• Instrumented Open MPI
    - Measured receive buffer efficiency:
      Size of receive buffer / size of data received
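
The measured ratio is simple to compute from two running totals. The sketch below is illustrative only: the function names and sample numbers are assumptions, and it does not reproduce the instrumentation added to Open MPI. It reports the ratio as stated on the slide, along with its reciprocal (the fraction of consumed buffer space that actually held data), which may read more naturally as an efficiency.

```python
def buffer_to_data_ratio(buffer_bytes, data_bytes):
    """Ratio as stated on the slide: receive buffer size / data received."""
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    return buffer_bytes / data_bytes


def fraction_of_buffer_used(buffer_bytes, data_bytes):
    """Reciprocal view: share of consumed buffer space that held data."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return data_bytes / buffer_bytes


# Hypothetical run: ten 512-byte messages, each landing in a 4096-byte buffer.
buffer_total = 10 * 4096
data_total = 10 * 512
print(buffer_to_data_ratio(buffer_total, data_total))
print(fraction_of_buffer_used(buffer_total, data_total))
```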
SAGE: Hydrodynamics
• SAGE – SAIC’s Adaptive Grid Eulerian hydrocode
• Hydrodynamics code with Adaptive Mesh Refinement (AMR)
• Applied to: water shock, energy coupling, hydro instability problems, etc.
• Routinely run on 1,000s of processors
• Scaling characteristic: Weak
• Data Decomposition (Default): 1-D (of a 3-D AMR spatial grid)

"Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001

Courtesy: PAL Team - LANL

• Adaptive Mesh Refinement (AMR) hydro-code
• 3 repeated phases
    - Gather data (including processor boundary data)
    - Compute
    - Scatter data (send back results)
• 3-D spatial grid, partitioned in 1-D
• Parallel characteristics
    - Message sizes vary, typically 10 to 100s of Kbytes
    - Distance between neighbors increases with scale

Courtesy: PAL Team - LANL
SAGE: Receive Buffer Usage

[Figure: 256 processes]

SAGE: Receive Buffer Usage

[Figure: 4096 processes]

SAGE: Receive Buffer Efficiency

[Figure]

SAGE: Performance

[Figures]

Sweep3D: Receive Buffer Usage

[Figure: 256 processes]

Sweep3D: Receive Buffer Efficiency

[Figure]

Sweep3D: Performance

[Figure]

NPB: Receive Buffer Usage

[Figure: Class D, 256 processes]

NPB: Receive Buffer Efficiency

[Figure: Class D, 256 processes; IS benchmark not available for Class D]

NPB: Performance Results

[Figure: NPB Class D, 256 processes]

• Bucket SRQ provides
    - Good performance at scale
    - “One size fits most” solution
        • Eliminates need to custom-tune each run
    - Minimizes receive buffer memory footprint
        • No more than 25 MB was allocated for any run
    - Avoids RNR NAKs in communication patterns we examined
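
The memory footprint of a bucket configuration is the sum over buckets of buffer count times buffer size. The sketch below shows that arithmetic and a check against the 25 MB ceiling quoted above. The bucket counts and sizes are invented for illustration and are not the configuration used in the runs reported here; 25 MB is taken as 25 * 1024 * 1024 bytes, which is also an assumption.

```python
def srq_footprint_bytes(buckets):
    """Total receive buffer memory for a list of (buffer_count, buffer_size)."""
    return sum(count * size for count, size in buckets)


# Hypothetical configuration: (number of buffers, bytes per buffer) per SRQ.
buckets = [(256, 128), (256, 1024), (256, 4096), (32, 65536)]

BUDGET_BYTES = 25 * 1024 * 1024  # assumed interpretation of "25 MB"

footprint = srq_footprint_bytes(buckets)
print(f"footprint: {footprint} bytes, "
      f"within budget: {footprint <= BUDGET_BYTES}")
```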
Future Work

• Take advantage of ConnectX SRC feature to reduce the number of active QPs
• Further examine our protocol at 4K+ processor count on SNL’s ThunderBird cluster
