Bill Camp, Jim Tomkins
& Rob Leland
How Much Commodity is Enough?
The Red Storm Architecture
William J. Camp, James L. Tomkins & Rob Leland
CCIM, Sandia National Laboratories
Albuquerque, NM
bill@sandia.gov
Sandia MPPs (since 1987)
1987: 1024-processor nCUBE10 [512 Mflops]
1990--1992 + +: 2 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
1988--1990: 16384-processor CM-200
1991: 64-processor Intel iPSC/860
1993--1996: ~3700-processor Intel Paragon [180 Gflops]
1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]
2003: 1280-processor IA-32 Linux cluster [~7 Tflops]
2004: Red Storm: ~11600 processor Opteron-based MPP [>40 Tflops]
Our rubric (since 1987)
Complex, mission-critical, engineering & science applications
Large systems (1000’s of PE’s) with a few processors per node
Message passing paradigm
Balanced architecture
Use commodity wherever possible
Efficient systems software
Emphasis on scalability & reliability in all aspects
Critical advances in parallel algorithms
Vertical integration of technologies
A partitioned, scalable computing architecture
[Diagram labels: Compute, File I/O, Service, Users, /home, Net I/O]
Computing domains at Sandia
Domain: Peak, Mid-Range, Volume
# Procs: 1, 10^1, 10^2, 10^3, 10^4
Red Storm: X X X
Cplant Linux Supercluster: X X X
Beowulf clusters: X X X
Desktop: X
Red Storm is targeting the highest-end market but has real
advantages for the mid-range market (from 1 cabinet on up)
Red Storm Architecture
True MPP, designed to be a single system-- not a cluster
Distributed memory MIMD parallel supercomputer
Fully connected 3D mesh interconnect. Each compute
node processor has a bi-directional connection to the
primary communication network
108 compute node cabinets and 10,368 compute node
processors (AMD Sledgehammer @ 2.0--2.4 GHz)
~10 or 20 TB of DDR memory @ 333MHz
Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
12 Service, Visualization, and I/O cabinets on each end
(640 S,V & I processors for each color)
240 TB of disk storage (120 TB per color) initially
Red Storm Architecture
Functional hardware partitioning: service and I/O nodes,
compute nodes, Visualization nodes, and RAS nodes
Partitioned Operating System (OS): LINUX on Service,
Visualization, and I/O nodes, LWK (Catamount) on
compute nodes, LINUX on RAS nodes
Separate RAS and system management network (Ethernet)
Router table-based routing in the interconnect
Less than 2 MW total power and cooling
Less than 3,000 ft^2 of floor space
Usage Model
[Diagram labels: Unix (Linux) Login Node with Unix environment; Compute Resource; I/O; Batch Processing; "or"]
User sees a coherent, single system
Thor’s Hammer Topology
3D-mesh Compute node topology:
27 x 16 x 24 (x, y, z) – Red/Black split: 2,688 – 4,992 – 2,688
Service, Visualization, and I/O partitions
3 cab’s on each end of each row
• 384 full bandwidth links to Compute Node Mesh
• Not all nodes have a processor-- all have routers
256 PE’s in each Visualization Partition--2 per board
256 PE’s in each I/O Partition-- 2 per board
128 PE’s in each Service Partition-- 4 per board
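The counts on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the mesh dimensions and partition sizes come from the slide, while the 7 + 13 + 7 cut along x is an assumption that happens to reproduce the quoted 2,688 – 4,992 – 2,688 split. The names `MESH`, `X_SPLIT`, and `END_PARTITION_PES` are illustrative only.

```python
# Mesh dimensions quoted on the slide (x, y, z).
MESH = (27, 16, 24)

# ASSUMPTION: the Red/Black split cuts the mesh into slabs of 7, 13, and 7
# planes along x. The slide gives only the resulting processor counts.
X_SPLIT = (7, 13, 7)

# PEs in each end partition, as listed on the slide.
END_PARTITION_PES = {"visualization": 256, "io": 256, "service": 128}


def node_count(x, y, z):
    """Number of node positions in an x-by-y-by-z mesh."""
    return x * y * z


def split_sizes(x_split, y, z):
    """Processor count of each slab when the mesh is cut along x."""
    return [node_count(x, y, z) for x in x_split]


if __name__ == "__main__":
    print(node_count(*MESH))                        # 10368
    print(split_sizes(X_SPLIT, MESH[1], MESH[2]))   # [2688, 4992, 2688]
    print(sum(END_PARTITION_PES.values()))          # 640
```

The 640 total agrees with the "640 S,V & I processors for each color" figure given on the architecture slide.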
3-D Mesh topology (Z direction is a torus)
[Diagram: compute node mesh of 10,368 processors (X=27, Y=16, Z=24) with torus interconnect in Z; 640 Service, Visualization & I/O nodes on each end]
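A mesh that is open in X and Y but wraps around in Z gives each node between four and six neighbors, which fits the six router ports per SEASTAR mentioned later. The sketch below enumerates neighbors under that reading. The function name and coordinate convention are hypothetical, and this is not the routing logic of the actual system.

```python
def neighbors(node, dims=(27, 16, 24)):
    """Return the neighbors of node = (x, y, z) in a 3D mesh.

    X and Y are treated as open mesh directions (no wraparound);
    Z is treated as a torus, so it wraps modulo the Z dimension.
    """
    x, y, z = node
    nx, ny, nz = dims
    result = []
    for dx in (-1, 1):
        if 0 <= x + dx < nx:          # no wraparound in X
            result.append((x + dx, y, z))
    for dy in (-1, 1):
        if 0 <= y + dy < ny:          # no wraparound in Y
            result.append((x, y + dy, z))
    for dz in (-1, 1):
        result.append((x, y, (z + dz) % nz))  # wraparound in Z
    return result


if __name__ == "__main__":
    print(len(neighbors((5, 5, 5))))   # interior node: 6 neighbors
    print(neighbors((0, 0, 0)))        # corner node: 4 neighbors, Z wraps to 23
```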
Thor’s Hammer Network Chips
3D-mesh is created by SEASTAR ASIC:
HyperTransport interface and 6 network router ports on each chip
In compute partitions each processor has its own SEASTAR
In service partition, some boards are configured like compute
partition (4 PE’s per board)
Others have only 2 PE’s per board; but still have 4 SEASTARS
• So, network topology is uniform
SEASTAR designed by Cray to our specs, fabricated by IBM
The only truly custom part in Red Storm-- complies with HT open
standard
Node architecture
[Diagram: AMD Opteron CPU with 1 (or 2) Gbyte or more of DRAM, attached to an ASIC (NIC + Router) with six links to other nodes in X, Y, and Z]
ASIC = Application Specific Integrated Circuit, or a “custom chip”
System Layout
(27 x 16 x 24 mesh)
[Diagram: Normally Unclassified nodes, Switchable Nodes, and Normally Classified nodes, separated by Disconnect Cabinets]
Thor’s Hammer Cabinet Layout
[Diagram: Compute Node Cabinet, 2 ft x 4 ft, front and side views, showing CPU boards, cables, power supply, and fans]
Compute Node Partition
3 Card Cages per Cabinet
8 Boards per Card Cage
4 Processors per Board (96 PE per cabinet)
4 NIC/Router Chips per Board
N + 1 Power Supplies
Passive Backplane
Service, Viz, and I/O Node Partition
2 (or 3) Card Cages per Cabinet
8 Boards per Card Cage
2 (or 4) Processors per Board
4 NIC/Router Chips per Board
2-PE I/O Boards have 4 PCI-X busses
N + 1 Power Supplies
Passive Backplane
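The 96 PE per compute cabinet follows directly from the packaging numbers above, and multiplying by the 108 compute cabinets quoted on the architecture slide recovers the 10,368-processor mesh. A minimal check, with constant names chosen here for illustration:

```python
# Packaging figures for the compute node partition, from the slides.
CARD_CAGES_PER_CABINET = 3
BOARDS_PER_CARD_CAGE = 8
PROCESSORS_PER_BOARD = 4
COMPUTE_CABINETS = 108  # from the Red Storm Architecture slide


def pes_per_cabinet(cages, boards, processors):
    """Processing elements housed in one cabinet."""
    return cages * boards * processors


if __name__ == "__main__":
    per_cabinet = pes_per_cabinet(
        CARD_CAGES_PER_CABINET, BOARDS_PER_CARD_CAGE, PROCESSORS_PER_BOARD
    )
    print(per_cabinet)                      # 96
    print(per_cabinet * COMPUTE_CABINETS)   # 10368
```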
Performance
Peak of 41.4 (46.6) TF based on 2 floating point
instruction issues per clock at 2.0 GHz.
We required a 7-fold speedup versus ASCI Red but,
based on our benchmarks, expect performance to be
8-10 times faster than ASCI Red.
Expected MP-Linpack performance: ~30--35 TF
Aggregate system memory bandwidth: ~55 TB/s
Interconnect Performance:
Latency 30 (estimated)
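The peak figures can be reproduced from the processor count, issue rate, and clock. In this sketch the compute-only result is about 41.47 TF, which the slide quotes as 41.4. The parenthesized 46.6 TF appears consistent with also counting the 2 x 640 service, visualization, and I/O processors; that reading is an assumption, not something the slide states.

```python
FLOPS_PER_CLOCK = 2          # floating point instruction issues per clock
CLOCK_HZ = 2.0e9             # 2.0 GHz
COMPUTE_PROCESSORS = 10_368
# ASSUMPTION: the larger peak figure includes both 640-processor end partitions.
SVIO_PROCESSORS = 2 * 640


def peak_tflops(processors, flops_per_clock=FLOPS_PER_CLOCK, clock_hz=CLOCK_HZ):
    """Theoretical peak in Tflops: processors x flops per clock x clock rate."""
    return processors * flops_per_clock * clock_hz / 1e12


if __name__ == "__main__":
    print(round(peak_tflops(COMPUTE_PROCESSORS), 2))                    # 41.47
    print(round(peak_tflops(COMPUTE_PROCESSORS + SVIO_PROCESSORS), 2))  # 46.59
```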
Comparison of ASCI Red and Red Storm
                                     ASCI Red                  Red Storm
Architecture                         Distributed Memory MIMD   Distributed Memory MIMD
Number of Compute Node Processors    9,460                     10,368
Processor                            Intel P II @ 333 MHz      AMD Opteron @ 2 GHz
Total Memory                         1.2 TB                    10.4 TB (up to 80 TB)
System Memory Bandwidth              2.5 TB/s                  55 TB/s
Disk Storage                         12.5 TB                   240 TB
Parallel File System Bandwidth       1.0 GB/s each color       50.0 GB/s each color
External Network Bandwidth           0.2 GB/s each color       25 GB/s each color
Interconnect Topology                3D Mesh (x, y, z)         3D Mesh (x, y, z)
                                     38 x 32 x 2               27 x 16 x 24
Interconnect Performance
  MPI Latency                        15 µs 1 hop, 20 µs max    2.0 µs 1 hop, 5 µs max
  Bi-Directional Bandwidth           800 MB/s                  6.0 GB/s
  Minimum Bi-section Bandwidth       51.2 GB/s                 2.3 TB/s
Full System RAS
  RAS Network                        10 Mbit Ethernet          100 Mbit Ethernet
  RAS Processors                     1 for each 32 CPUs        1 for each 4 CPUs
Operating System
  Compute Nodes                      Cougar                    Catamount
  Service and I/O Nodes              TOS (OSF1 UNIX)           LINUX
  RAS Nodes                          VX-Works                  LINUX
Red/Black Switching                  2260 – 4940 – 2260        2688 – 4992 – 2688
System Foot Print                    ~2500 ft^2                ~3000 ft^2
Power Requirement                    850 KW                    1.7 MW
Red Storm Project
23 months, design to First Product Shipment!
System software is a joint project between Cray and Sandia
Sandia is supplying Catamount LWK and the service node run-time system
Cray is responsible for Linux, NIC software interface, RAS software, file
system software, and Totalview port
Initial software development was done on a cluster of workstations with a
commodity interconnect. Second stage involves an FPGA implementation of
SEASTAR NIC/Router (Starfish). Final checkout on real SEASTAR-based
system
System design is going on now
Cabinets-- exist
SEASTAR NIC/Router-- released to Fabrication at IBM earlier this month
Full system to be installed and turned over to Sandia in stages culminating
in August--September 2004
New Building for Thor’s Hammer
Designing for scalable
supercomputing
Challenges in:
-Design
-Integration
-Management
-Use
SUREty for Very Large Parallel
Computer Systems
Scalability - Full System Hardware and System Software
Usability - Required Functionality Only
Reliability - Hardware and System Software
Expense minimization- use commodity, high-volume parts
SURE poses Computer System Requirements:
SURE Architectural tradeoffs:
• Processor and memory sub-system balance
• Compute vs interconnect balance
• Topology choices
• Software choices
• RAS
• Commodity vs. Custom technology
• Geometry and mechanical design
Sandia Strategies:
-build on commodity
-leverage Open Source (e.g., Linux)
-Add to commodity selectively (in RS
there is basically one truly custom
part!)
-leverage experience with previous
scalable supercomputers
System Scalability Driven Requirements
Overall System Scalability - Complex
scientific applications such as molecular
dynamics, hydrodynamics & radiation
transport should achieve scaled parallel
efficiencies greater than 50% on the full
system (~20,000 processors).
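The slides do not define scaled parallel efficiency. A common definition, assumed here, keeps the work per processor fixed as the machine grows and compares the one-processor run time with the N-processor run time on the N-times-larger problem. The timings in the usage lines are hypothetical, chosen only to exercise the 50% threshold.

```python
def scaled_parallel_efficiency(t_one, t_n):
    """Scaled (weak-scaling) efficiency under the ASSUMED definition.

    t_one: run time on 1 processor for the base problem.
    t_n:   run time on N processors for a problem N times larger.
    Perfect scaling keeps the run time constant, giving 1.0.
    """
    if t_one <= 0 or t_n <= 0:
        raise ValueError("run times must be positive")
    return t_one / t_n


def meets_requirement(t_one, t_n, threshold=0.5):
    """True when efficiency exceeds the 50% target stated on the slide."""
    return scaled_parallel_efficiency(t_one, t_n) > threshold


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(scaled_parallel_efficiency(100.0, 160.0))  # 0.625
    print(meets_requirement(100.0, 160.0))           # True
    print(meets_requirement(100.0, 250.0))           # False
```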
Scalability
System Software:
System Software Performance scales nearly perfectly with the
number of processors to the full size of the computer (~30,000
processors). This means that System Software time (overhead)
remains nearly constant with the size of the system or scales at
most logarithmically with the system size.
- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic with the number of processors.
- Parallel I/O performance is not sensitive to # of PEs doing I/O
- Communication Network software must be scalable.
- No connection-based protocols among compute nodes.
- Message buffer space independent of # of processors.
- Compute node OS gets out of the way of the application.
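To see why logarithmic scaling matters for re-boot and job loading, consider a tree-structured launch in which every node that already holds the job image forwards it to others in each round. The slides state only the logarithmic requirement; the forwarding scheme, the function name, and the `fanout` parameter below are assumptions made for illustration.

```python
def launch_rounds(num_nodes, fanout=1):
    """Rounds needed for a tree-structured launch to reach num_nodes nodes.

    ASSUMED scheme: one node starts with the job image, and in every round
    each node holding the image forwards it to `fanout` nodes that lack it,
    so the number of holders grows by a factor of (fanout + 1) per round.
    """
    if num_nodes < 1 or fanout < 1:
        raise ValueError("num_nodes and fanout must be at least 1")
    rounds, reached = 0, 1
    while reached < num_nodes:
        reached *= fanout + 1
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(launch_rounds(10_368))   # 14
    print(launch_rounds(30_000))   # 15
```

Under this model, growing the machine from about 10,000 to about 30,000 processors adds a single round, whereas a sequential launch would take roughly three times as long.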
Hardware scalability
• Balance in the node hardware:
  • Memory BW must match CPU speed
    Ideally 24 Bytes/flop (never yet done)
  • Communications speed must match CPU speed
  • I/O must match CPU speeds
• Scalable System SW (OS and Libraries)
• Scalable Applications
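The memory balance point can be made concrete by dividing aggregate memory bandwidth by peak floating point rate. The sketch uses figures quoted elsewhere in these slides (55 TB/s and 41.4 TF for Red Storm, 2.5 TB/s and 3.2 Tflops for ASCI Red). Pairing them this way is an illustration, not a calculation the slides perform.

```python
IDEAL_BYTES_PER_FLOP = 24.0  # the "ideal" quoted on the slide


def bytes_per_flop(bandwidth_bytes_per_s, peak_flops):
    """Memory balance: bytes of memory bandwidth available per peak flop."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return bandwidth_bytes_per_s / peak_flops


if __name__ == "__main__":
    red_storm = bytes_per_flop(55e12, 41.4e12)   # about 1.33
    asci_red = bytes_per_flop(2.5e12, 3.2e12)    # about 0.78
    for name, value in (("Red Storm", red_storm), ("ASCI Red", asci_red)):
        share = value / IDEAL_BYTES_PER_FLOP
        print(f"{name}: {value:.2f} bytes/flop ({share:.1%} of the ideal)")
```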
Usability
>Application Code Support:
Software that supports scalability of the
Computer System
Math Libraries
MPI Support for Full System Size
Parallel I/O Library
Compilers
Tools that Scale to the Full Size of the
Computer System
Debuggers
Performance Monitors
Full-featured LINUX OS support at the
user interface
Reliability
Light Weight Kernel (LWK) OS on compute partition
Much less code fails much less often
Monitoring of correctable errors
Fix soft errors before they become hard
Hot swapping of components
Overall system keeps running during maintenance
Redundant power supplies & memories
Completely independent RAS System monitors virtually
every component in system
Economy
1. Use high-volume parts where possible
2. Minimize power requirements
Cuts operating costs
Reduces need for new capital investment
3. Minimize system volume
Reduces need for large new capital
facilities
4. Use standard manufacturing processes where
possible-- minimize customization
5. Maximize reliability and availability/dollar
6. Maximize scalability/dollar
7. Design for integrability
Economy
Red Storm leverages economies of scale
AMD Opteron microprocessor & standard memory
Air cooled
Electrical interconnect based on InfiniBand physical devices
Linux operating system
Selected use of custom components
System chip ASIC
• Critical for communication intensive applications
Light Weight Kernel
• Truly custom, but we already have it (4th generation)
Cplant on a slide
Goal: MPP “look and feel”
• Start ~1997, upgrade ~1999--2001
• Alpha & Myrinet, mesh topology
• ~3000 procs (3 Tf) in 7 systems
• Configurable to ~1700 procs
• Red/Black switching
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
[Diagram labels: Compute, Service, File I/O, Users, Net I/O, /home, System Support, Sys Admin; I/O Nodes, Service Nodes, Compute Nodes; ATM, HiPPI, Ethernet, other; System Operator(s); ASCI Red]
IA-32 Cplant on a slide
Goal: Mid-range capacity
• Started 2003, upgrade annually
• Pentium-4 & Myrinet, Clos network
• 1280 procs (~7 Tf) in 3 systems
• Currently configurable to 512 procs
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
[Diagram labels: Compute, Service, File I/O, Users, Net I/O, /home, System Support, Sys Admin; I/O Nodes, Service Nodes, Compute Nodes; ATM, HiPPI, Ethernet, other; System Operator(s); ASCI Red]
Observation:
For most large scientific and engineering applications the
performance is more determined by parallel scalability
and less by the speed of individual CPUs.
There must be balance between processor, interconnect,
and I/O performance to achieve overall performance.
To date, only a few tightly-coupled, parallel computer
systems have been able to demonstrate a high level of
scalability on a broad set of scientific and engineering
applications.
Let’s Compare Balance In Parallel Systems
Machine        Node Speed Rating   Network Link BW   Communications Balance
               (MFlops)            (Mbytes/s)        (Bytes/flop)
ASCI RED       400                 800 (533)         2 (1.33)
T3E            1200                1200              1
ASCI RED**     666                 800 (533)         (1.2) 0.67
Cplant         1000                140               0.14
Blue Mtn*      500                 800               1.6
Blue Mtn**     64000               1200 (9600*)      0.02 (0.16*)
Blue Pacific   2650                300 (132)         0.11 (0.05)
White          24000               2000              0.083
Q*             2500                650               0.2
Q**            10000               400               0.04
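The balance column is the network link bandwidth divided by the node speed rating. A short sketch reproduces several rows of the table. Only rows with a single unambiguous value pair are included; the rows with starred or parenthesized alternatives are left out rather than guessed at.

```python
# (node speed in MFlops, link bandwidth in Mbytes/s), copied from the table.
MACHINES = {
    "ASCI RED": (400, 800),
    "T3E": (1200, 1200),
    "Cplant": (1000, 140),
    "Blue Mtn*": (500, 800),
    "White": (24000, 2000),
}


def communications_balance(node_speed_mflops, link_bw_mbytes_per_s):
    """Bytes of link bandwidth per flop of node speed."""
    if node_speed_mflops <= 0:
        raise ValueError("node speed must be positive")
    return link_bw_mbytes_per_s / node_speed_mflops


if __name__ == "__main__":
    for name, (speed, bandwidth) in MACHINES.items():
        print(f"{name:10s} {communications_balance(speed, bandwidth):.3f}")
```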
Comparing Red Storm and BGL
                   Blue Gene Light**   Red Storm*
Node Speed         5.6 GF              5.6 GF (1x)
Node Memory        0.25--0.5 GB        2 (1--8) GB (4x nom.)
Network latency    7 µsecs             2 µsecs (2/7 x)
Network BW         0.28 GB/s           6.0 GB/s (22x)
BW Bytes/Flops     0.05                1.1 (22x)
Bi-Section B/F     0.0016              0.038 (24x)
#nodes/problem     40,000              10,000 (1/4 x)
* 100 TF version of Red Storm
** 360 TF version of BGL
Fixed problem performance
Molecular dynamics problem
(LJ liquid)
Parallel Sn Neutronics (provided by LANL)
Scalable computing works
ASCI Red efficiencies for major codes
[Figure: scaled parallel efficiency (%), 0 to 100, versus processors, 1 to 10000. Series: QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port-1B Cells, Rad x-port - 17M, Rad x-port - 80M, Rad x-port - 168M, Rad x-port - 532M, Finite Element, Zapotec, Reactive Fluid Flow, Salinas, CTH]
Balance is critical to scalability
Basic Parallel Efficiency Model
[Figure: parallel efficiency, 0.00 to 1.20, versus communication/computation load, 0.0 to 1.0, for scientific & eng. codes. Curves: Red Storm (B=1.5), ASCI Red (B=1.2), Ref. Machine (B=1.0), Earth Sim. (B=.4), Cplant (B=.25), Blue Gene Light (B=.05), Std. Linux Cluster (B=.04)]
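The slides do not give the formula behind these curves. The sketch below is not the authors' model. It assumes one simple, commonly used form in which communication is not overlapped with computation, so efficiency is E = 1 / (1 + load / B), where `load` is the communication/computation load and B is the balance value labeled on each curve. It is offered only to show how a smaller B pulls efficiency down as load rises.

```python
# Balance values B as labeled on the chart.
MACHINE_BALANCE = {
    "Red Storm": 1.5,
    "ASCI Red": 1.2,
    "Ref. Machine": 1.0,
    "Earth Sim.": 0.4,
    "Cplant": 0.25,
    "Blue Gene Light": 0.05,
    "Std. Linux Cluster": 0.04,
}


def assumed_efficiency(load, balance):
    """ASSUMED model, not taken from the slides: E = 1 / (1 + load / balance).

    load:    communication/computation load of the application (>= 0).
    balance: machine balance B (> 0).
    """
    if load < 0 or balance <= 0:
        raise ValueError("load must be >= 0 and balance must be > 0")
    return 1.0 / (1.0 + load / balance)


if __name__ == "__main__":
    for name, b in MACHINE_BALANCE.items():
        print(f"{name:20s} E(load=0.5) = {assumed_efficiency(0.5, b):.2f}")
```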
Relating scalability and cost
[Figure: efficiency ratio (Red/Cplant), 0.00 to 6.00, versus processors, 1 to 4096. Series: Eff. Ratio and Extrapolation. The line "Efficiency ratio = Cost ratio = 1.8" separates the region where the cluster is more cost effective from the region where the MPP is more cost effective.]
Average efficiency ratio over the five codes that consume >80% of Sandia’s cycles
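The break-even rule on this chart reduces to a comparison: the MPP wins once its efficiency advantage exceeds its cost premium. A minimal sketch follows, reading the 1.8 cost ratio as MPP cost relative to cluster cost, which is an assumption since the slide does not spell out the direction. The efficiency ratios in the usage lines are hypothetical and are not read off the chart.

```python
COST_RATIO = 1.8  # break-even value shown on the slide


def more_cost_effective(efficiency_ratio, cost_ratio=COST_RATIO):
    """Pick the more cost-effective platform at a given processor count.

    efficiency_ratio: MPP efficiency divided by cluster efficiency.
    cost_ratio:       ASSUMED to be MPP cost divided by cluster cost.
    """
    if efficiency_ratio <= 0 or cost_ratio <= 0:
        raise ValueError("ratios must be positive")
    if efficiency_ratio > cost_ratio:
        return "MPP"
    if efficiency_ratio < cost_ratio:
        return "cluster"
    return "break-even"


if __name__ == "__main__":
    # Hypothetical efficiency ratios, for illustration only.
    for ratio in (1.1, 1.8, 3.0):
        print(ratio, more_cost_effective(ratio))
```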
Scalability determines cost effectiveness
Sandia’s top priority computing workload:
[Figure: total node-hours of jobs, 0 to 80,000,000, versus number of nodes, 1 to 10000, with a division marked at 256 nodes. Annotations: cluster more cost effective, 55M node-hrs; MPP more cost effective, 380M node-hrs]
Scalability also limits capability
ITS Speedup curves
[Figure: speedup, 0 to 1200, versus processors, 0 to 1408. Series: Red Speedup, Cplant Speedup, each with a polynomial fit. Annotation: ~3x processors]
Commodity nearly everywhere--
Customization drives cost
• Earth Simulator and Cray X-1 are fully custom Vector
systems with good balance
• This drives their high cost (and their high performance).
• Clusters are nearly entirely high-volume with no truly custom
parts
• Which drives their low cost (and their low scalability)
• Red Storm uses custom parts only where they are critical to
performance and reliability
• High scalability at minimal cost/performance
Scaling data for some key engineering codes
Performance on Engineering Codes
[Figure: scaled parallel efficiency, 0.00 to 1.20, versus processors, 1 to 1024. Series: ITS, Red; ITS, Cplant; ACME, Red; ACME, Cplant. Annotations: random variation at small proc. counts; large differential in efficiency at large proc. counts]
Scaling data for some key physics codes
Los Alamos’ radiation transport code
[Figure: PARTISN Diffusion Solver Sizeup Study, S6P2, 12 Groups, 13,800 cells/PE. Parallel efficiency, 0% to 120%, versus number of processor elements, 1 to 2048. Series: ASCI Red, Blue Mountain, White, QSC]
[Figure: PARTISN Transport Solver Sizeup Study, S6P2, 12 Groups, 13,800 cells/PE. Parallel efficiency, 0% to 120%, versus number of processor elements, 1 to 2048. Series: ASCI Red, Blue Mountain, White, QSC]