Operational and Application Experiences with the Infiniband Environment




           Sharon Brunett
              Caltech
            May 1, 2007
        www.openfabrics.org
Outline

 Production Environment using Infiniband
      Hardware configuration
      Software stack
      Usage model
 Infiniband particulars
      Sample application
      Benchmarks
      Issues
 A less challenging future
      A collection of hoped-for improvements


Opteron/Infiniband Cluster Configuration

 Head/login node (shc.cacr.caltech.edu)
       AMD Opteron, quad CPU, dual core, 2.2 GHz
       16 GB memory
 Interconnect
       Voltaire Infiniband switch
       Extreme Networks Black Diamond 8810, copper GigE
 124 compute nodes, each a dual CPU, dual core AMD Opteron
       86 nodes: 2.2 GHz, 8 GB memory, 256 GB scratch
       38 nodes: 2.4 GHz, 8 GB memory, 256 GB scratch
 Storage
       NFS server (Opteron dual CPU, dual core, 16 GB): ~24 TB (RAID6) at /nfs/data-store01
       PVFS: ~25 TB at /pvfs/data-store02 and ~25 TB at /pvfs/data-store03

Compute Resource Utilization Summary

 Even balance between active projects
 76% utilization for 2007, up from 64.9% in 2006
 Mix of development and production jobs
       Typically ranging in size from 4 to 32 nodes, 2 to 24 hours
 Approx. 100 user accounts, 5 partner projects

[Figure: two pie charts titled "Compute Node Utilization by Project", one for Jan. 1 to Mar. 5, 2007 and one for Jan. 1 to Dec. 31, 2006. Legend: sxs, vtf, tmx, shc-support, idle+PM. Slice values as extracted, 2007: 24%, 22%, 5%, 29.20%, 20%. Slice values as extracted, 2006: 22.1%, 35.1%, 22.4%, 1.3%, 19.1%.]

Production Environment

 Software stack impacting Golden Image
     SLES9 (security patched) kernel version 2.6.15.9
     Mellanox Infiniband drivers v3.5.5
         • No sources available to us
       Parallel Virtual File System (pvfs) v2
       OpenMPI (1.2.X)
       Torque
       Maui
 Software stack - user tools
      Plotting and Data Visualization Tool - Tecplot
      Debugger - Totalview
      Numerical Computing Environment/language - Matlab
      Portable Extensible Toolkit for Scientific Computation - PETSc
      Hierarchical Data Format (HDF) v4, 5

SCS Grains Simulation

 Highly resolved simulations of shear compression polycrystal specimen tests
 Production run stats
       LLNL’s alc, 12 hours
       118 CPUs, 900K steps, 4.4 GB of dumps

Sample Application MPI profile

 As problem size grows, the MPI impact is smaller due to better load balancing
       MPI_Waitall and AllReduce are major time consumers
       Run smaller benchmarks for tuning suggestions

[Figure: "SCS MPI scaling", 128 way run, 500 steps, mpiP tracing in main. Y axis: % of time in MPI calls (0 to 50). X axis: subdivisions (0 to 3). Series: alc, shc.]

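The "% of time in MPI calls" figure plotted for the SCS runs can be illustrated with a small sketch. It assumes per-rank application and MPI wall-clock times of the kind a profiler such as mpiP reports; the function name and the sample numbers are hypothetical and are not measurements from shc or alc.

```python
def mpi_time_summary(per_rank):
    """Summarize MPI overhead from (app_time_s, mpi_time_s) pairs, one per rank.

    Assumption: app_time_s is the total wall-clock time of the rank and
    mpi_time_s is the part of it spent inside MPI calls.
    """
    if not per_rank:
        raise ValueError("need at least one rank")
    total_app = sum(app for app, _ in per_rank)
    total_mpi = sum(mpi for _, mpi in per_rank)
    per_rank_pct = [100.0 * mpi / app for app, mpi in per_rank]
    return {
        # Aggregate share of run time spent in MPI across all ranks.
        "aggregate_pct": 100.0 * total_mpi / total_app,
        # A wide min/max spread hints at load imbalance (ranks waiting in MPI).
        "min_pct": min(per_rank_pct),
        "max_pct": max(per_rank_pct),
    }


# Illustrative values only.
summary = mpi_time_summary([(100.0, 20.0), (100.0, 30.0), (100.0, 40.0), (100.0, 30.0)])
print(summary)
```

A large gap between the minimum and maximum per-rank percentages is one sign that time in calls such as MPI_Waitall comes from waiting on slower ranks rather than from the interconnect itself.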
PMB PingPong

[Figure: PingPong latency, 2 processes (1 per node). Y axis: Latency (us), log scale, 1 to 100000. X axis: Message Size (Bytes). Series: OpenMPI 1.2.1 - IB, mpich tcp/ip - IB, mpich tcp/ip - GigE.]

PMB PingPong

[Figure: PingPong bandwidth, 2 processes (1 per node). Y axis: Bandwidth (MB/sec), 0 to 900. X axis: Message Size (Bytes). Series: OpenMPI 1.2.1 - IB, mpich tcp/ip - IB, mpich tcp/ip - GigE.]

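As a rough guide to how PingPong latency and bandwidth figures relate, the sketch below derives both from the elapsed time of a batch of round trips. It assumes the common convention that latency is half the round-trip time and that 1 MB is 2^20 bytes; the function name and the sample numbers are hypothetical, not values read from the plots.

```python
def pingpong_metrics(msg_bytes, repetitions, elapsed_s):
    """Return (latency_us, bandwidth_mb_s) for a ping-pong measurement.

    msg_bytes   -- size of the message sent in each direction
    repetitions -- number of complete round trips that were timed
    elapsed_s   -- wall-clock time for all round trips, in seconds
    """
    if repetitions <= 0 or elapsed_s <= 0:
        raise ValueError("repetitions and elapsed_s must be positive")
    # One round trip carries the message twice, so halve the per-trip time.
    one_way_s = elapsed_s / (2 * repetitions)
    latency_us = one_way_s * 1e6
    # Assumption: MB means 2**20 bytes.
    bandwidth_mb_s = (msg_bytes / 2**20) / one_way_s
    return latency_us, bandwidth_mb_s


# Illustrative input: 100 round trips of a 1 MB message in 0.25 s.
latency_us, bandwidth_mb_s = pingpong_metrics(1048576, 100, 0.25)
print(latency_us, bandwidth_mb_s)
```

For small messages the latency term dominates and bandwidth is low; for large messages the bandwidth approaches what the link and the MPI stack can sustain.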
PMB MPI_AllReduce

[Figure: MPI_AllReduce latency. Y axis: Latency (us), log scale, 1 to 100000. X axis: Message Size (Bytes). Series: 2 processes on 2 nodes, 4 processes on 4 nodes, 8 processes on 8 nodes, 16 processes on 8 nodes, 32 processes on 8 nodes.]

Tuning Tests Revealed Infiniband Issues

 The Port Management (PM) facility gives the sysadmin/user the ability to analyze and maintain the Infiniband environment
       Particular ports had high PortRcvErrors, indicative of a bad link
          • Moving cables and swapping in a new IB blade isolated the problem further
       Congestion reduced by a configurable threshold limit (HOQlife)

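One way to spot a port with unusually high PortRcvErrors is to compare two counter snapshots taken some time apart and flag the ports whose counter grew the most. The sketch below assumes the snapshots have already been loaded into dictionaries; the port names, the data layout, and the threshold are hypothetical and do not reflect the output format of any Voltaire tool.

```python
def suspect_ports(before, after, counter="PortRcvErrors", max_increase=0):
    """Return [(port, increase)] for ports whose counter grew too much.

    before, after -- {port_name: {counter_name: value}} snapshots
    max_increase  -- largest growth still considered healthy
    Ports missing from `before` are treated as starting at zero.
    Counter resets (a negative change) are ignored in this sketch.
    """
    flagged = []
    for port, counters in after.items():
        old = before.get(port, {}).get(counter, 0)
        increase = counters.get(counter, 0) - old
        if increase > max_increase:
            flagged.append((port, increase))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots.
before = {
    "switch1/port3": {"PortRcvErrors": 10},
    "switch1/port7": {"PortRcvErrors": 0},
}
after = {
    "switch1/port3": {"PortRcvErrors": 5210},
    "switch1/port7": {"PortRcvErrors": 0},
    "switch1/port9": {"PortRcvErrors": 4},
}
flagged = suspect_ports(before, after, max_increase=100)
print(flagged)
```

Looking at growth between snapshots rather than absolute values avoids repeatedly flagging a port whose errors date from before a cable or blade was replaced.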
Problem IB Blade Identified
New Challenges Arise

 Servicing the Infiniband switch, as currently installed, is no picnic
       Note how working parts need to be dismantled to access parts needing service
          • Cable tracing and stress need attention
       Line boards can take multiple re-seatings before they’re “snug”
 As Mark says… hardware should be treated like a delicate flower

Lessons Learned

 Sections of the code with MPI collective calls are sensitive to message lengths and process counts
       Run indicative benchmarks as part of the production run setup process
 Use Voltaire’s PM utility to routinely monitor the fabric for problems
       Functionality and performance
 Buy dinner for Trent and Ira
       Test out linkcheck and ibcheckfabric on our little cluster

Making our Lives Easier
 Mellanox drivers -> OpenIB?
       Locally built golden image gives flexibility but has drawbacks
 Automatic probing of PM counter report files to compare against “known good” states
       Report suspect components
 Use standard/factory benchmarks to verify that the Infiniband cluster works as well at the customer site as it did when the integrated system shipped!
       Increasingly important as the cluster expands
       Incorporate low level PM facilities into support level tools for better integrated monitoring
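The idea of automatically probing PM counter report files against a "known good" state could look roughly like the sketch below. The report format (one "component key value" triple per line), the component names, and the tolerance are all assumptions made for illustration; real PM report files would need their own parser.

```python
def parse_report(text):
    """Parse lines of 'component key value' into {component: {key: value}}.

    Blank lines and lines starting with '#' are skipped. Lines with fewer
    than three fields raise ValueError (hypothetical format, kept strict).
    """
    report = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        component, key, value = line.split(None, 2)
        report.setdefault(component, {})[key] = value
    return report


def suspect_components(current, known_good, tolerance=0):
    """Return [(component, key, expected, observed)] for deviations.

    Numeric values are suspect when they exceed the known-good value by
    more than `tolerance`; any other value must match exactly.
    """
    findings = []
    for component in sorted(known_good):
        observed = current.get(component, {})
        for key, expected in sorted(known_good[component].items()):
            value = observed.get(key)
            if value is not None and expected.isdigit() and value.isdigit():
                bad = int(value) > int(expected) + tolerance
            else:
                bad = value != expected
            if bad:
                findings.append((component, key, expected, value))
    return findings


# Hypothetical report contents.
KNOWN_GOOD = """
# component key value
blade1/port1 LinkState Active
blade1/port1 PortRcvErrors 0
blade2/port5 LinkState Active
blade2/port5 PortRcvErrors 0
"""
CURRENT = """
blade1/port1 LinkState Active
blade1/port1 PortRcvErrors 3
blade2/port5 LinkState Active
blade2/port5 PortRcvErrors 812
"""
findings = suspect_components(parse_report(CURRENT), parse_report(KNOWN_GOOD), tolerance=10)
for component, key, expected, observed in findings:
    print(f"suspect: {component} {key} expected {expected}, observed {observed}")
```

Run from a scheduled job, a check of this kind would turn routine PM monitoring into a short list of suspect components instead of a manual read of the counter reports.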

				