                    Current and Future for NT Clustering with HPVM

                    Philip M. Papadopoulos
             Department of Computer Science and Engineering
                    University of California, San Diego




7 Oct 1999                   JPC4 - Oak Ridge, TN
                        Outline

• NT Clustering - Our clusters, Software
• What’s new in the latest version, HPVM 1.9
• Looking at performance
     – Gratuitous bandwidth and latency
     – Iowa State results (Luecke, Raffin, Coyle)
• Futures for HPVM
     – Natural upgrade paths (Windows 2K, Lanai 7, …)
     – Adding Dynamics

                              Why NT?

• Technical reasons
     – Good support of SMP systems
             • System is designed to be threaded at all levels
     – User-scheduled ultra-lightweight threads (NT
       Fibers) are very powerful (see the Fiber sketch after this list)
     – Integrated/Extensible performance monitoring
       system
     – Well-supported device driver development
       environment
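
As a concrete illustration of the NT Fibers bullet above, here is a minimal sketch using the Win32 Fiber API (purely illustrative, not HPVM code): a thread converts itself to a fiber, creates a second fiber, and switches to it entirely at user level, with no kernel scheduler involved.

```c
/* Minimal sketch of user-scheduled NT Fibers via the Win32 API.
 * Purely illustrative -- this is not HPVM code. */
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;   /* fiber context of the original thread */

static VOID CALLBACK WorkerFiber(LPVOID param)
{
    printf("worker fiber says: %s\n", (const char *)param);
    SwitchToFiber(g_mainFiber);              /* cooperative yield back */
}

int main(void)
{
    LPVOID worker;

    g_mainFiber = ConvertThreadToFiber(NULL);    /* main thread becomes a fiber */
    worker = CreateFiber(0, WorkerFiber, "hello from a fiber");

    SwitchToFiber(worker);     /* user-level context switch, no kernel scheduling */
    DeleteFiber(worker);
    printf("back in the main fiber\n");
    return 0;
}
```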

  Remote Access (NT is Challenged)

• Myth: You can’t do things remotely in NT
• Fact: You can; NT just doesn’t have a unified remote
  abstraction like rsh/ssh. (Think client/server.)
     –   Remote manipulation of the registry (regini.exe)
     –   Remote administrative access to the file system
     –   Ability to create remote threads (CreateRemoteThread)
     –   Ability to start/stop services (sc.exe; see the sketch after this list)
     –   Too many interfaces! One must essentially learn new tools
         to perform (scripted) remote administration.
• NT Terminal Server and Win2K improve access, but
  still fall short of X-Windows.
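
As one concrete example of this client/server style (the start/stop services bullet above), here is a hedged sketch of starting a service on a remote node through the Win32 Service Control Manager API, the same facility sc.exe wraps. The machine name \\NODE01 and the service name "SomeService" are placeholders, and error handling is minimal.

```c
/* Illustrative sketch: start a service on a remote NT machine through the
 * Service Control Manager API. "\\NODE01" and "SomeService" are placeholders. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SC_HANDLE scm, svc;

    /* Connect to the SCM on the remote node (requires admin rights there). */
    scm = OpenSCManagerA("\\\\NODE01", NULL, SC_MANAGER_CONNECT);
    if (scm == NULL) {
        fprintf(stderr, "OpenSCManager failed: %lu\n", GetLastError());
        return 1;
    }

    svc = OpenServiceA(scm, "SomeService", SERVICE_START);
    if (svc != NULL) {
        if (!StartServiceA(svc, 0, NULL))
            fprintf(stderr, "StartService failed: %lu\n", GetLastError());
        CloseServiceHandle(svc);
    }
    CloseServiceHandle(scm);
    return 0;
}
```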

      Hardware/Software Environment

• Our clusters
     – 64 dual-processor Pentium IIs
             • 32 HP Kayak, 300 MHz, 384 MB, 100 GB disk
             • 32 HP LPr NetServer, 450 MHz, 1024 MB, 36 GB disk
     – Myrinet – Lanai 4 32-bit PCI cards in all 64 machines
     – Giganet – hardware VIA, on the NetServers only
• NT Terminal Server 4.0 on all nodes
• LSF for managing/starting parallel jobs
• HPVM is the “clusterware”
         High Performance Virtual Machines

PI: Andrew A. Chien, co-PIs: Daniel Reed, David Padua
Students:
    Scott Pakin, Mario Lauria*, Louis Giannini, Paff Liu*, Geta
    Sampemane, Kay Connelly, and Andy Lavery
Research Staff:
    Philip Papadopoulos, Greg Bruno, Caroline Papadopoulos*,
    Mason Katz*, Greg Koenig, and Qian Liu
*Funded from other sources
URL: http://www-csag.ucsd.edu/projects/hpvm.html
DARPA #E313, AFOSR F30602-96-1-0286



                       What is HPVM?

• High-performance (MPP-class) thread-safe
  communication
• A layered set of APIs (not just MPI) that allow
  applications to obtain a significant fraction of HW
  performance
• A small number of services that allow distributed
  processes to discover and communicate with each
  other
• Device driver support for Myrinet; vendor-supplied driver for VIA
• Focus/contribution has been effective layering
     – Especially short-message performance

                 Supported APIs

• FM (Fast Messages)
     – Core messaging layer. Reliable, in-order delivery
• MPI – MPICH 1.0 based
• SHMEM – put/get interface (Similar to Cray)
• BSP – Bulk Synchronous Parallel (Oxford)
• Global Arrays - Global abstraction for matrix
  operations. (PNNL)
• TCGMSG – Theoretical Chemistry Group
  Messaging
                   Libraries/Layering
• All libraries layered on top of FM
• Semantics are active-message like (illustrated after the
  diagram below)
• FM is designed for building other libraries; the FM level is
  not intended for direct application use
• Designed for efficient gather/scatter and header
  processing
  [Layering diagram: Global Arrays, SHMEM, MPI, and BSP sit on top of
   Fast Messages, which runs over Myrinet, VIA, or Shared Memory (SMP).]
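
To illustrate what "active-message like" means above, here is a toy sketch of the pattern using hypothetical am_* names (not FM's actual interface): a sender names a handler id, and the receive side runs that handler on each message it extracts. The "network" here is just a single in-memory slot.

```c
/* Toy illustration of the active-message pattern that FM-style layers expose.
 * The am_* names are hypothetical, not the real FM interface. */
#include <stdio.h>
#include <string.h>

typedef void (*am_handler_t)(const void *buf, size_t len);

#define MAX_HANDLERS 16
static am_handler_t handler_table[MAX_HANDLERS];

/* One pending message "on the wire". */
static struct { int id; char data[256]; size_t len; int full; } pending;

static void am_register(int id, am_handler_t h) { handler_table[id] = h; }

static void am_send(int handler_id, const void *buf, size_t len)
{
    pending.id = handler_id;
    memcpy(pending.data, buf, len);
    pending.len = len;
    pending.full = 1;
}

/* Analogue of extracting messages from the network: run the named handler. */
static void am_extract(void)
{
    if (pending.full) {
        handler_table[pending.id](pending.data, pending.len);
        pending.full = 0;
    }
}

static void print_handler(const void *buf, size_t len)
{
    printf("handler received %u bytes: %s\n", (unsigned)len, (const char *)buf);
}

int main(void)
{
    am_register(0, print_handler);
    am_send(0, "hello", 6);
    am_extract();          /* receiver drains the network; handlers do the work */
    return 0;
}
```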


             What’s New in HPVM 1.9

• Better performance (relative to v1.1 at NCSA)
      – 25% bandwidth increase (80 MB/s → 100+ MB/s)
      – 14% latency reduction (10 µs → 8.6 µs)
• Three transports
      – Shared Memory transport + [Myrinet, VIA]
      – Standalone desktop version uses shared memory
• Integration with NT Performance Monitor
• Improved configuration/installation
• BSP API added
             Performance Basics (Ver 1.9)

• Basics
     – Myrinet
             • FM: 100+MB/sec, 8.6 µsec latency
             • MPI: 91MB/sec @ 64K, 9.6 µsec latency
                – Approximately 10% overhead
     – Giganet (VIA)
             • FM: 81MB/sec, 14.7 µsec latency
             • MPI: 77MB/sec, 18.6 µsec latency
                 – 5% BW overhead, but 26% latency overhead (worked out after this list)
     – Shared Memory Transport
             • FM: 195MB/sec, 3.13 µsec latency
             • MPI: 85MB/sec, 5.75 µsec latency
                – Our software structure requires 2 mem copies/packet :-(
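
For clarity, the quoted overhead figures appear to be simple relative differences between the MPI and FM numbers above; for the Giganet (VIA) case:

```latex
\[
\text{BW overhead} = \frac{\mathrm{BW}_{FM} - \mathrm{BW}_{MPI}}{\mathrm{BW}_{FM}}
                   = \frac{81 - 77}{81} \approx 5\%,
\qquad
\text{latency overhead} = \frac{t_{MPI} - t_{FM}}{t_{FM}}
                        = \frac{18.6 - 14.7}{14.7} \approx 26\%.
\]
```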
                     Gratuitous Bandwidth Graphs

  [Chart: bandwidth (MB/s) vs. message size (0–16384 bytes) for FM and MPI
   over both Myrinet and VIA.]

• FM bandwidth is usually a good indicator of deliverable bandwidth
• N1/2 ~ 512 bytes (defined below)
• High bandwidth attained for small messages
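
For reference, N1/2 is the half-bandwidth point: the message size at which delivered bandwidth reaches half the asymptotic rate. Under the usual linear cost model (a simplifying assumption, not an HPVM-specific formula):

```latex
\[
T(n) = t_0 + \frac{n}{r_\infty}, \qquad
\mathrm{BW}(n) = \frac{n}{T(n)}, \qquad
\mathrm{BW}(N_{1/2}) = \tfrac{1}{2}\,r_\infty
\;\Longrightarrow\; N_{1/2} = t_0\, r_\infty .
\]
```

A small N1/2 (here roughly 512 bytes) means the curve approaches peak bandwidth at small message sizes.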
   “Nothing is more humbling or more
    revealing than having others use
             your software.”




      Iowa State Performance Results

• “Comparing the Communication Performance and
  Scalability of a Linux and a NT Cluster of PCs, a Cray
  Origin 2000, an IBM SP and a Cray T3E-600”
     – Glenn R. Luecke, Bruno Raffin and James J. Coyle, Iowa State
• Machines
     –   64-node NT SuperCluster, NCSA, dual PIII 550 MHz, HPVM 1.1
     –   64-node AltaCluster, ABQ HPCC, dual PII 450 MHz, GM
     –   O2K, 64 nodes, Eagan MN, dual 300 MHz R12000
     –   T3E-600, 512 proc, Eagan MN, Alpha EV5 300 MHz
     –   IBM SP, 250 proc, Maui (96 were 160 MHz)
• They ran MPI benchmarks with 8-byte, 10000-byte, and 1 MB messages
  (the right-shift pattern is sketched below)
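
The right-shift test is the standard ring pattern in which rank i sends to rank (i+1) mod p and receives from rank (i-1) mod p. A minimal sketch follows; the 8-byte size is illustrative (the study also used 10000 bytes and 1 MB).

```c
/* Minimal sketch of the "right shift" MPI pattern: every rank sends to
 * (rank+1) mod p and receives from (rank-1) mod p. MSG_BYTES is illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES 8

int main(int argc, char **argv)
{
    int rank, nprocs;
    char *sendbuf, *recvbuf;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = calloc(MSG_BYTES, 1);
    recvbuf = calloc(MSG_BYTES, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    /* Combined send-right / receive-from-left avoids deadlock. */
    MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_BYTE, (rank + 1) % nprocs, 0,
                 recvbuf, MSG_BYTES, MPI_BYTE, (rank + nprocs - 1) % nprocs, 0,
                 MPI_COMM_WORLD, &status);

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("right shift of %d bytes took %g ms\n", MSG_BYTES, (t1 - t0) * 1e3);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```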

                          Right Shift - 8 Byte Messages

  [Chart: time (ms) vs. number of processors.]

• FM optimization for short messages
               Right Shift - 10000 Byte Messages

  [Chart: time (ms) vs. number of processors.]

• FM starts at 25 MB/sec and drops to 12 MB/sec above 64 nodes
                            Right Shift - 1MB Messages

  [Chart: time (ms) vs. number of processors.]

• Change at 64 processors prompted the Shared Memory transport in HPVM 1.9
     – Curve flattened (better scalability)
• Recently (last week), found a fairness issue in the FM Lanai Control Program
              MPI Barrier - 8 Bytes

  [Chart: time (ms) vs. number of processors.]

• FM significantly faster at 128 procs (4x - 9x)
                 MPI Barrier - 10000 Bytes

  [Chart: time (ms) vs. number of processors.]

• FM 2.5x slower than T3E, 2x slower than O2K
             Interpreting These Numbers

• Concentration on short message
  performance puts clusters on par with
  (expensive) traditional supers
• Longer message performance not as
  competitive. Version 1.9 addresses some
  issues
• Lends some understanding of large
  application performance on NT SuperCluster

             Future HPVM Development

• (Obvious) things that will happen
     – Support of Windows 2000
     – Alpha NT -- move towards a 64-bit code base
     – Support for new Myrinet Lanai 7 Hardware
• HPVM development will move into support
  role for other projects
     – Agile Objects: High-performance OO computing
     – Federated Clusters
     – Tracking NCSA SuperCluster hardware curve
             Current State for Reference

• HPVM supports multiple processes/node, multiple
  process groups/cluster
     – Inter-group communication not supported
• In-order, reliable messaging guaranteed by
     – Credit-based flow control scheme (sketched after this list)
             • Static scheme is simple but inflexible
     – Only one route between any pair of processes
             • Even if multiple routes are available, only one is used
• Comm within cluster very fast, outside is not
• Speed comes from many static constraints
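
The sketch below illustrates the static credit-based idea described above; it is an illustrative structure, not HPVM's implementation. Each sender holds a fixed number of credits per destination, spends one per packet, and gets them back as acknowledgements arrive.

```c
/* Illustrative sketch of static credit-based flow control (not HPVM's actual
 * implementation). The sender owns CREDITS_PER_DEST receive buffers on each
 * destination; sending spends a credit, an acknowledgement returns it. With
 * static credits the scheme is simple but inflexible: the pool is fixed at
 * startup regardless of traffic pattern. */
#include <stdbool.h>

#define MAX_DESTS        256
#define CREDITS_PER_DEST 8      /* fixed at startup: the "static" constraint */

static int credits[MAX_DESTS];

void fc_init(int ndests)
{
    for (int d = 0; d < ndests; d++)
        credits[d] = CREDITS_PER_DEST;
}

/* Returns true if a packet may be injected toward `dest` right now. */
bool fc_try_acquire(int dest)
{
    if (credits[dest] == 0)
        return false;           /* must poll the network for returned credits */
    credits[dest]--;
    return true;
}

/* Called when an acknowledgement arrives: the remote buffer is free again. */
void fc_return_credit(int dest)
{
    credits[dest]++;
}
```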

     Designed and Now Implementing

• Dynamic flow control scheme for better scalability
     – Support larger clusters
• Multiple routes and out-of-order packet re-sequencing
  (sketched after this list)
     – Allow parallel paths for high-performance WAN connections
• Support inter-group communication
     – Driven by Agile Objects’ need for remote method invocation /
       client-server interactions
• Support “Federated Clusters”
     – Integration into Grid. Bring performance of cluster outside of
       the machine room
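
A hedged sketch of the re-sequencing idea mentioned above (an assumed structure, not the actual HPVM design): packets carry sequence numbers, arrivals ahead of the expected number are parked in a small window, and delivery advances in order as gaps fill.

```c
/* Illustrative sketch of per-source packet re-sequencing for multi-route
 * delivery (assumed structure, not HPVM's actual design). */
#include <stdbool.h>
#include <string.h>

#define WINDOW 64               /* assumed re-sequencing window size */

typedef struct {
    unsigned next_expected;          /* next sequence number to deliver */
    bool     present[WINDOW];        /* slot i holds packet next_expected+i? */
    char     slot[WINDOW][2048];     /* parked packet payloads */
} reseq_t;

extern void deliver_in_order(const char *pkt);   /* upper-layer callback (assumed) */

void reseq_arrival(reseq_t *r, unsigned seq, const char *pkt)
{
    unsigned offset = seq - r->next_expected;    /* wraps safely for unsigned */
    if (offset >= WINDOW)
        return;                                  /* duplicate or too far ahead: drop */

    memcpy(r->slot[offset], pkt, sizeof r->slot[offset]);
    r->present[offset] = true;

    /* Slide the window: deliver every packet that is now contiguous. */
    while (r->present[0]) {
        deliver_in_order(r->slot[0]);
        memmove(r->present, r->present + 1, sizeof r->present - sizeof r->present[0]);
        memmove(r->slot, r->slot + 1, sizeof r->slot - sizeof r->slot[0]);
        r->present[WINDOW - 1] = false;
        r->next_expected++;
    }
}
```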

             Is Linux in HPVM’s Future?

• Maybe ;-)
• Critical technical hurdle is finding a user-
  scheduled lightweight thread package
     – NT version makes use of “Fibers”
• The major impediments are time and a driving project




                      Summary

• HPVM gives good relative and absolute
  performance
• HPVM moving past the “numbers game”
     – Concentrate on overall usability
     – Integration into Grid
• Software development will continue, but HPVM takes
  on a support role for driving projects
• Check out www-csag.ucsd.edu

