How Much Commodity is Enough?
  The Red Storm Architecture


William J. Camp, James L. Tomkins & Rob Leland
       CCIM, Sandia National Laboratories
               Albuquerque, NM
                bill@sandia.gov
              Sandia MPPs (since 1987)
 1987: 1024-processor nCUBE10 [512 Mflops]
 1988--1990: 16384-processor CM-200
 1990--1992: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
 1991: 64-processor Intel IPSC-860
 1993--1996: ~3700-processor Intel Paragon [180 Gflops]
 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
 1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]
 2003: 1280-processor IA-32 Linux cluster [~7 Tflops]
 2004: Red Storm: ~11,600-processor Opteron-based MPP [>40 Tflops]
                Our rubric (since 1987)

 Complex, mission-critical, engineering & science applications
 Large systems (1000’s of PE’s) with a few processors per node
   Message passing paradigm
   Balanced architecture
   Use commodity wherever possible
   Efficient systems software
   Emphasis on scalability & reliability in all aspects
   Critical advances in parallel algorithms
   Vertical integration of technologies
     A partitioned, scalable computing architecture

[Diagram: Users and /home connect through the Service partition, which fronts the Compute, File I/O, and Net I/O partitions]
                  Computing domains at Sandia

    Domain             Volume          Mid-Range          Peak
    # Procs          1      10^1      10^2      10^3      10^4
    Red Storm                          X         X         X
    Cplant Linux
     Supercluster           X          X         X
    Beowulf clusters X      X          X
    Desktop          X

 Red Storm is targeting the highest-end market but has real
  advantages for the mid-range market (from 1 cabinet on up)
             Red Storm Architecture
 True MPP, designed to be a single system-- not a cluster
 Distributed memory MIMD parallel supercomputer
 Fully connected 3D mesh interconnect. Each compute
  node processor has a bi-directional connection to the
  primary communication network
 108 compute node cabinets and 10,368 compute node
  processors (AMD Sledgehammer @ 2.0--2.4 GHz)
 ~10 or 20 TB of DDR memory @ 333MHz
 Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
 12 Service, Visualization, and I/O cabinets on each end
  (640 S,V & I processors for each color)
 240 TB of disk storage (120 TB per color) initially
           Red Storm Architecture
 Functional hardware partitioning: service and I/O nodes,
  compute nodes, Visualization nodes, and RAS nodes
 Partitioned Operating System (OS): LINUX on Service,
  Visualization, and I/O nodes, LWK (Catamount) on
  compute nodes, LINUX on RAS nodes
 Separate RAS and system management network (Ethernet)
 Router table-based routing in the interconnect
 Less than 2 MW total power and cooling
 Less than 3,000 ft2 of floor space
              Usage Model

[Diagram: batch processing or interactive use goes through a Unix (Linux) login node with a full Unix environment, which dispatches work to the Compute resource and the I/O partition]

                User sees a coherent, single system
          Thor’s Hammer Topology

 3D-mesh Compute node topology:
    27 x 16 x 24 (x, y, z) – Red/Black split: 2,688 – 4,992 – 2,688 (arithmetic check after this slide)
 Service, Visualization, and I/O partitions
    3 cab’s on each end of each row
      • 384 full bandwidth links to Compute Node Mesh
      • Not all nodes have a processor-- all have routers
    256 PE’s in each Visualization Partition--2 per board
    256 PE’s in each I/O Partition-- 2 per board
    128 PE’s in each Service Partition-- 4 per board
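A quick arithmetic check of the mesh size and Red/Black split quoted above, using only numbers from these slides (a minimal Python sketch):

    # Mesh dimensions and Red/Black split as quoted on this slide
    x, y, z = 27, 16, 24
    print(x * y * z)                    # 10368 compute node processors

    red, switchable, black = 2688, 4992, 2688
    print(red + switchable + black)     # 10368 -- the split covers the whole mesh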
        3-D Mesh topology (Z direction is a torus)

[Diagram: the 10,368-node Compute Node Mesh (X=27, Y=16, Z=24) with a torus interconnect in Z, flanked on each end in X by 640 Service, Visualization & I/O nodes]
    Thor’s Hammer Network Chips

 3D-mesh is created by the SEASTAR ASIC:
    HyperTransport interface and 6 network router ports on each chip
    In compute partitions each processor has its own SEASTAR
    In the service partition, some boards are configured like the
     compute partition (4 PE’s per board)
    Others have only 2 PE’s per board but still have 4 SEASTARs
       • So, network topology is uniform
 SEASTAR designed by Cray to our spec’s, fabricated by IBM
    The only truly custom part in Red Storm-- complies with HT open
     standard
                       Node architecture

[Diagram: each node couples an AMD Opteron CPU with 1 (or 2) GB or more of DRAM and an ASIC combining a NIC and router, which provides six links to the neighboring nodes in X, Y, and Z]
(ASIC = Application Specific Integrated Circuit, or a “custom chip”)
               System Layout (27 x 16 x 24 mesh)

[Diagram: normally unclassified nodes at one end and normally classified nodes at the other, with switchable nodes in between, joined through disconnect cabinets]
         Thor’s Hammer Cabinet Layout

Compute Node Cabinet (2 ft x 4 ft; front: CPU boards and cables; side: fans and power supply)

Compute Node Partition
     3 Card Cages per Cabinet
     8 Boards per Card Cage
     4 Processors per Board (96 PEs per cabinet; arithmetic check after this slide)
     4 NIC/Router Chips per Board
     N + 1 Power Supplies
     Passive Backplane

Service, Viz, and I/O Node Partition
     2 (or 3) Card Cages per Cabinet
     8 Boards per Card Cage
     2 (or 4) Processors per Board
     4 NIC/Router Chips per Board
     2-PE I/O Boards have 4 PCI-X busses
     N + 1 Power Supplies
     Passive Backplane
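The 96 PEs per compute cabinet and the system total follow from the counts above together with the 108 compute cabinets listed earlier (a small check, nothing assumed beyond the slides' numbers):

    card_cages_per_cabinet = 3
    boards_per_card_cage = 8
    processors_per_board = 4

    pes_per_cabinet = card_cages_per_cabinet * boards_per_card_cage * processors_per_board
    print(pes_per_cabinet)                      # 96 PEs per compute cabinet

    compute_cabinets = 108                      # from the Red Storm Architecture slide
    print(pes_per_cabinet * compute_cabinets)   # 10368 compute node processors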
                 Performance
 Peak of 41.4 (46.6) TF based on 2 floating point instruction issues
  per clock at 2.0 GHz (a quick arithmetic check follows this slide)
 We required a 7-fold speedup versus ASCI Red, but based on our
  benchmarks we expect performance 8--10 times faster than ASCI Red
 Expected MP-Linpack performance: ~30--35 TF
 Aggregate system memory bandwidth: ~55 TB/s
 Interconnect Performance:
    Latency <2 µs (neighbor), <5 µs (full machine)
    Link bandwidth ~6.0 GB/s bi-directional
    Minimal XC bi-section bandwidth ~2.3 TB/s
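The quoted 41.4 TF peak follows from the processor count and issue rate (a quick check; the 46.6 TF figure corresponds to the higher end of the clock range and is not reworked here):

    processors = 10368        # compute node processors
    issues_per_clock = 2      # floating point instruction issues per clock
    clock_hz = 2.0e9          # 2.0 GHz

    peak_tf = processors * issues_per_clock * clock_hz / 1e12
    print(f"{peak_tf:.2f} TF")   # ~41.47 TF, matching the quoted compute-partition peak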
                      Performance
 I/O System Performance
    Sustained file system bandwidth of 50 GB/s for each color
    Sustained external network bandwidth of 25 GB/s for each color
 Node memory system
    Page miss latency to local memory is ~80 ns
    Peak bandwidth of ~5.4 GB/s for each processor
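The ~5.4 GB/s per-processor figure is consistent with the DDR memory @ 333 MHz noted earlier if one assumes the Opteron's 128-bit (dual-channel) memory interface, and the ~55 TB/s aggregate on the previous slide is simply that times the processor count. A minimal sketch, with the 16-byte bus width as the stated assumption:

    transfers_per_s = 333e6       # DDR memory @ 333 MHz (MT/s), from the architecture slide
    bytes_per_transfer = 16       # assumption: 128-bit (dual-channel) Opteron memory bus

    node_bw = transfers_per_s * bytes_per_transfer
    print(f"{node_bw / 1e9:.1f} GB/s per processor")       # ~5.3 GB/s

    print(f"{node_bw * 10368 / 1e12:.0f} TB/s aggregate")  # ~55 TB/s system-wide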
           Red Storm System Software
 Operating Systems
     LINUX on service and I/O nodes
     Sandia’s LWK (Catamount) on compute nodes
     LINUX on RAS nodes
 Run-Time System
       Logarithmic loader
       Fast, efficient node allocator
       Batch system – PBS
       Libraries – MPI, I/O, Math
 File Systems being considered include
       PVFS – interim file system
       Lustre – design intent
       Panasas – possible alternative
       …
          Red Storm System Software
 Tools
    All IA32 Compilers, all AMD 64-bit Compilers – Fortran, C, C++
    Debugger – Totalview (also examining alternatives)
    Performance Tools (was going to be Vampir until Intel bought
     Pallas-- now?)
 System Management and Administration
    Accounting
    RAS GUI Interface
                  Comparison of ASCI Red
                     and Red Storm
                                           ASCI Red                 Red Storm
Full System Operational Time Frame   June 1997 (processor and        August 2004
                                     memory upgrade in 1999)
Theoretical Peak (TF)-- compute                3.15                     41.47
partition alone
MP-Linpack Performance (TF)                    2.38                 >30 (estimated)
Architecture                         Distributed Memory MIMD     Distributed Memory
                                                                        MIMD
Number of Compute Node Processors             9,460                     10,368
Processor                              Intel P II @ 333 MHz     AMD Opteron @ 2 GHz
Total Memory                                  1.2 TB             10.4 TB (up to 80 TB)
System Memory Bandwidth                      2.5 TB/s                  55 TB/s
Disk Storage                                 12.5 TB                    240 TB
Parallel File System Bandwidth          1.0 GB/s each color      50.0 GB/s each color
External Network Bandwidth              0.2 GB/s each color       25 GB/s each color
               Comparison of ASCI Red
                  and Red Storm
                                     ASCI Red                  RED STORM
Interconnect Topology                3D Mesh (x, y, z)         3D Mesh (x, y, z)
                                     38 x 32 x 2               27 x 16 x 24
Interconnect Performance
 MPI Latency                         15 µs 1 hop, 20 µs max    2.0 µs 1 hop, 5 µs max
 Bi-Directional Bandwidth            800 MB/s                  6.0 GB/s
 Minimum Bi-section Bandwidth        51.2 GB/s                 2.3 TB/s
Full System RAS
 RAS Network                         10 Mbit Ethernet          100 Mbit Ethernet
 RAS Processors                      1 for each 32 CPUs        1 for each 4 CPUs
Operating System
 Compute Nodes                       Cougar                    Catamount
 Service and I/O Nodes               TOS (OSF1 UNIX)           LINUX
 RAS Nodes                           VX-Works                  LINUX
Red/Black Switching                  2260 – 4940 – 2260        2688 – 4992 – 2688
System Foot Print                    ~2500 ft2                 ~3000 ft2
Power Requirement                    850 KW                    1.7 MW
                    Red Storm Project
 23 months, design to First Product Shipment!
 System software is a joint project between Cray and Sandia
     Sandia is supplying Catamount LWK and the service node run-time system
     Cray is responsible for Linux, NIC software interface, RAS software, file
      system software, and Totalview port
     Initial software development was done on a cluster of workstations with a
      commodity interconnect. Second stage involves an FPGA implementation of
      SEASTAR NIC/Router (Starfish). Final checkout on real SEASTAR-based
      system
 System design is going on now
     Cabinets-- exist
     SEASTAR NIC/Router-- released to Fabrication at IBM earlier this month
 Full system to be installed and turned over to Sandia in stages culminating
  in August--September 2004
New Building for Thor’s Hammer
      Designing for scalable
        supercomputing
Challenges in:
    -Design
    -Integration
    -Management
    -Use
     SUREty for Very Large Parallel
          Computer Systems
Scalability - Full System Hardware and System Software

Usability - Required Functionality Only

Reliability - Hardware and System Software

Expense minimization - use commodity, high-volume parts

SURE poses Computer System Requirements:
     SURE Architectural tradeoffs:
•   Processor and memory sub-system balance
•   Compute vs interconnect balance
•   Topology choices
•   Software choices
•   RAS
•   Commodity vs. Custom technology
•   Geometry and mechanical design
          Sandia Strategies:
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in Red Storm there is basically one
  truly custom part!)
- Leverage experience with previous scalable supercomputers
    System Scalability Driven Requirements


Overall System Scalability - Complex scientific applications such as
molecular dynamics, hydrodynamics & radiation transport should achieve
scaled parallel efficiencies greater than 50% on the full system
(~20,000 processors). (Scaled parallel efficiency is written out just
below.)
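The slides do not spell the definition out; the conventional scaled (weak-scaling) parallel efficiency they refer to can be written as

    E_{\mathrm{scaled}}(N) \;=\; \frac{T_1(w)}{T_N(N\,w)}

where w is the fixed amount of work per processor, T_1(w) is the one-processor time, and T_N(Nw) is the time for N processors on N times the work; perfect scaling gives E = 1 (100%).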
                            Scalability
System Software:
System software performance scales nearly perfectly with the number of
processors, up to the full size of the computer (~30,000 processors).
This means that system software time (overhead) remains nearly constant
with the size of the system, or scales at most logarithmically with the
system size.

- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic with the number of processors (see the
  loader sketch after this list).
- Parallel I/O performance is not sensitive to the # of PEs doing I/O.
- Communication network software must be scalable.
         - No connection-based protocols among compute nodes.
         - Message buffer space independent of # of processors.
         - Compute node OS gets out of the way of the application.
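The logarithmic job-loading claim is the behavior of a tree-structured (fan-out) broadcast of the executable; the sketch below is illustrative only (the fan-out factor and function names are assumptions, not Sandia's actual loader):

    import math

    def broadcast_rounds(num_nodes: int, fanout: int = 2) -> int:
        """Rounds needed for one root to reach num_nodes nodes when every
        node that already holds the executable forwards it to `fanout`
        new nodes per round."""
        have, rounds = 1, 0               # only the root holds the image at first
        while have < num_nodes:
            have += have * fanout         # each holder serves `fanout` new nodes
            rounds += 1
        return rounds

    for n in (1024, 10368, 30000):
        print(f"{n:6d} nodes -> {broadcast_rounds(n)} rounds (log2(n) ~ {math.log2(n):.1f})")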
            Hardware scalability
• Balance in the node hardware:
      • Memory BW must match CPU speed
            Ideally 24 Bytes/flop (never yet done)
      • Communications speed must match CPU speed
      • I/O must match CPU speeds
• Scalable system SW (OS and libraries)
• Scalable applications
       Usability
> Application Code Support:
       Software that supports scalability of the Computer System
              Math libraries
              MPI support for full system size
              Parallel I/O library
              Compilers
       Tools that scale to the full size of the Computer System
              Debuggers
              Performance monitors
       Full-featured LINUX OS support at the user interface
                         Reliability
 Light Weight Kernel (LWK) O.S. on compute partition
    Much less code fails much less often
 Monitoring of correctable errors
    Fix soft errors before they become hard
 Hot swapping of components
    Overall system keeps running during maintenance
 Redundant power supplies & memories
 Completely independent RAS System monitors virtually
  every component in system
   Economy

1. Use high-volume parts where possible
2. Minimize power requirements
      Cuts operating costs
      Reduces need for new capital investment
3. Minimize system volume
      Reduces need for large new capital
      facilities
4. Use standard manufacturing processes where
   possible-- minimize customization
5. Maximize reliability and availability/dollar
6. Maximize scalability/dollar
7. Design for integrability
                             Economy
 Red Storm leverages economies of scale
      AMD Opteron microprocessor & standard memory
      Air cooled
      Electrical interconnect based on Infiniband physical devices
      Linux operating system
 Selected use of custom components
    System chip ASIC
        • Critical for communication intensive applications
    Light Weight Kernel
        • Truly custom, but we already have it (4th generation)
                                        Cplant on a slide
Goal: MPP “look and feel”
• Start ~1997, upgrade ~1999--2001
• Alpha & Myrinet, mesh topology
• ~3000 procs (3 Tf) in 7 systems
• Configurable to ~1700 procs
• Red/Black switching
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.

[Diagrams: the Cplant partition model (Users and /home connecting through Service to Compute, File I/O, Net I/O, and System Support / Sys Admin partitions), shown alongside the ASCI Red layout (Compute Nodes, Service Nodes, and I/O Nodes feeding ATM, HiPPI, Ethernet, and other networks, plus System and Operator stations)]
                                  IA-32 Cplant on a slide
Goal: Mid-range capacity
• Started 2003, upgrade annually
• Pentium-4 & Myrinet, Clos network
• 1280 procs (~7 Tf) in 3 systems
• Currently configurable to 512 procs
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.

[Diagrams: same partition model and ASCI Red comparison layout as on the previous Cplant slide]
Observation:
For most large scientific and engineering applications, performance is
determined more by parallel scalability than by the speed of individual
CPUs.

There must be balance between processor, interconnect,
and I/O performance to achieve overall performance.

To date, only a few tightly-coupled, parallel computer
systems have been able to demonstrate a high level of
scalability on a broad set of scientific and engineering
applications.
        Let’s Compare Balance in Parallel Systems

 Machine        Node Speed        Network Link BW    Communications Balance
                Rating (MFlops)   (MBytes/s)         (Bytes/flop)
 ASCI RED          400               800 (533)          2 (1.33)
 T3E              1200              1200                1
 ASCI RED**        666               800 (533)          (1.2) 0.67
 Cplant           1000               140                0.14
 Blue Mtn*         500               800                1.6
 Blue Mtn**      64000              1200 (9600*)        0.02 (0.16*)
 Blue Pacific     2650               300 (132)          0.11 (0.05)
 White           24000              2000                0.083
 Q*               2500               650                0.2
 Q**             10000               400                0.04

(A quick calculation of the Bytes/flop column follows the table.)
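A quick check of the Communications Balance column: balance is just link bandwidth divided by node floating-point rate (a sketch using a few rows from the table; rounding follows the slide):

    machines = {                # (node MFlops, link MBytes/s), from the table above
        "ASCI RED": (400, 800),
        "T3E":      (1200, 1200),
        "Cplant":   (1000, 140),
        "White":    (24000, 2000),
    }

    for name, (mflops, mbytes_s) in machines.items():
        balance = mbytes_s / mflops       # Bytes transferred per flop
        print(f"{name:10s} {balance:5.3f} Bytes/flop")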
      Comparing Red Storm and BGL

                     Blue Gene Light**   Red Storm*
Node Speed              5.6 GF              5.6 GF         (1x)
Node Memory             0.25--0.5 GB        2 (1--8) GB    (4x nom.)
Network latency         7 µsecs             2 µsecs        (2/7 x)
Network BW              0.28 GB/s           6.0 GB/s       (22x)
BW Bytes/Flops          0.05                1.1            (22x)
Bi-Section B/F          0.0016              0.038          (24x)
# nodes/problem         40,000              10,000         (1/4 x)

*  100 TF version of Red Storm
** 360 TF version of BGL
                Fixed problem performance

[Charts: fixed-size speedup for a molecular dynamics problem (LJ liquid) and for a parallel Sn neutronics code (provided by LANL)]
                 Scalable computing works

[Chart: ASCI Red scaled parallel efficiency (%) vs. processors (1 to ~10,000) for major codes: QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (1B cells, 17M, 80M, 168M, 532M), Finite Element, Zapotec, Reactive Fluid Flow, Salinas, and CTH]
              Balance is critical to scalability

[Chart: Basic Parallel Efficiency Model -- parallel efficiency (0.0 to 1.2) vs. communication/computation load (0.0 to 1.0) for machines of differing balance B: Red Storm (B=1.5), ASCI Red (B=1.2), Ref. Machine (B=1.0), Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), Std. Linux Cluster (B=0.04); a band marks typical scientific & engineering codes. An illustrative sketch of such a model follows.]
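The functional form behind the "Basic Parallel Efficiency Model" curves is not given on the slide. As a minimal sketch, one might assume efficiency falls off with the ratio of an application's communication/computation load x to the machine balance B; the formula below (E = 1 / (1 + x/B)) is an illustrative assumption, not the slide's actual model:

    def parallel_efficiency(x: float, balance: float) -> float:
        """Toy model (assumed form): E = 1 / (1 + x / B)."""
        return 1.0 / (1.0 + x / balance)

    balances = {"Red Storm": 1.5, "ASCI Red": 1.2, "Cplant": 0.25,
                "Blue Gene Light": 0.05}     # B values from the chart legend

    x = 0.3   # illustrative communication/computation load
    for name, b in balances.items():
        print(f"{name:16s} B={b:<5} E ~ {parallel_efficiency(x, b):.2f}")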
                Relating scalability and cost

[Chart: efficiency ratio (ASCI Red / Cplant), averaged over the five codes that consume >80% of Sandia's cycles, vs. processor count from 1 to 4096 (measured data plus an extrapolation). Where the efficiency ratio falls below the line "Efficiency ratio = Cost ratio = 1.8" the cluster is more cost effective; where it rises above, the MPP is more cost effective. A small decision sketch follows.]
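The chart's decision rule reduces to comparing the measured efficiency ratio with the cost ratio of 1.8; a minimal sketch of that comparison (the example ratios are made-up placeholders):

    COST_RATIO = 1.8    # MPP cost / cluster cost, from the chart annotation

    def better_buy(efficiency_ratio: float) -> str:
        """Which platform is more cost effective for a workload whose
        MPP-to-cluster efficiency ratio is given."""
        return "MPP" if efficiency_ratio > COST_RATIO else "cluster"

    for eff_ratio in (1.2, 1.8, 3.5):       # illustrative values only
        print(f"efficiency ratio {eff_ratio:>4} -> {better_buy(eff_ratio)}")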
              Scalability determines cost effectiveness

[Chart: Sandia's top-priority computing workload -- total node-hours of jobs vs. job size in nodes (1 to 10,000, with 256 marked). Jobs in the region where the cluster is more cost effective account for 55M node-hours; jobs in the region where the MPP is more cost effective account for 380M node-hours.]
               Scalability also limits capability

[Chart: ITS speedup curves -- speedup (0 to ~1200) vs. processors (0 to 1408) for ASCI Red and Cplant, each with a polynomial fit; the annotation "~3x processors" marks how many more Cplant processors are needed to match ASCI Red's speedup.]
        Commodity nearly everywhere--
          Customization drives cost
• Earth Simulator and Cray X-1 are fully custom Vector
  systems with good balance
   • This drives their high cost (and their high performance).
• Clusters are nearly entirely high-volume with no truly custom
  parts
   • Which drives their low cost (and their low scalability)
• Red Storm uses custom parts only where they are critical to
  performance and reliability
   • High scalability at minimal cost/performance
           Scaling data for some key engineering codes

[Chart: performance on engineering codes -- scaled parallel efficiency vs. processors (1 to 1024) for ITS and ACME on ASCI Red and on Cplant. There is random variation at small processor counts and a large differential in efficiency at large processor counts.]
             Scaling data for some key physics codes

[Charts: PARTISN (Los Alamos' radiation transport code) sizeup studies, S6P2, 12 groups, 13,800 cells/PE -- parallel efficiency vs. number of processor elements (1 to 2048) for the diffusion solver and the transport solver on ASCI Red, Blue Mountain, White, and QSC.]

				