Bill Camp, Jim Tomkins

& Rob Leland

How Much Commodity is Enough?

the Red Storm Architecture





William J. Camp, James L. Tomkins & Rob Leland

CCIM, Sandia National Laboratories

Albuquerque, NM

bill@sandia.gov

Sandia MPPs (since 1987)

1987: 1024-processor nCUBE10 [512 Mflops]

 1990--1992 + +: 2 1024-processor nCUBE-2 machines [2 @ 2 Gflops]

 1988--1990: 16384-processor CM-200

 1991: 64-processor Intel IPSC-860

 1993--1996: ~3700-processor Intel Paragon [180 Gflops]

 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]

 1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]

 2003: 1280-processor IA32- Linux cluster [~7 Tflops]

 2004: Red Storm: ~11600 processor Opteron-based MPP [>40 Tflops]

Our rubric (since 1987)



 Complex, mission-critical, engineering & science applications

 Large systems (1000’s of PE’s) with a few processors per node

 Message passing paradigm

 Balanced architecture

 Use commodity wherever possible

 Efficient systems software

 Emphasis on scalability & reliability in all aspects

 Critical advances in parallel algorithms

 Vertical integration of technologies

A partitioned, scalable computing architecture

[Figure: system partitions: Compute, Service, File I/O, Net I/O, Users, /home]

Computing domains at Sandia

[Figure: computing domains (Volume, Mid-Range, Peak) versus number of processors (1, 10^1, 10^2, 10^3, 10^4) for four system classes: Red Storm, Cplant Linux Supercluster, Beowulf clusters, and Desktop]


 Red Storm is targeting the highest-end market but has real

advantages for the mid-range market (from 1 cabinet on up)

Red Storm Architecture

 True MPP, designed to be a single system-- not a cluster

 Distributed memory MIMD parallel supercomputer

 Fully connected 3D mesh interconnect. Each compute

node processor has a bi-directional connection to the

primary communication network

 108 compute node cabinets and 10,368 compute node

processors (AMD Sledgehammer @ 2.0--2.4 GHz)

 ~10 or 20 TB of DDR memory @ 333MHz

 Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)

 12 Service, Visualization, and I/O cabinets on each end

(640 S,V & I processors for each color)

 240 TB of disk storage (120 TB per color) initially

Red Storm Architecture

 Functional hardware partitioning: service and I/O nodes,

compute nodes, Visualization nodes, and RAS nodes

 Partitioned Operating System (OS): LINUX on Service,

Visualization, and I/O nodes, LWK (Catamount) on

compute nodes, LINUX on RAS nodes

 Separate RAS and system management network (Ethernet)

 Router table-based routing in the interconnect

 Less than 2 MW total power and cooling

 Less than 3,000 ft^2 of floor space

Usage Model

[Figure: Unix (Linux) login node with Unix environment, or batch processing; compute resource; I/O]


User sees a coherent, single system

Thor’s Hammer Topology



 3D-mesh Compute node topology:

 27 x 16 x 24 (x, y, z) – Red/Black split: 2,688 – 4,992 – 2,688

 Service, Visualization, and I/O partitions

 3 cab’s on each end of each row

• 384 full bandwidth links to Compute Node Mesh

• Not all nodes have a processor-- all have routers

 256 PE’s in each Visualization Partition--2 per board

 256 PE’s in each I/O Partition-- 2 per board

 128 PE’s in each Service Partition-- 4 per board
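
The mesh dimensions, the Red/Black split, and the cabinet counts quoted on these slides can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slides; the one assumption (labeled in the code) is that the Red/Black split is made along the X dimension, so that each section is a whole number of Y-Z planes.

```python
# Mesh dimensions and Red/Black split as quoted on the topology slide.
X, Y, Z = 27, 16, 24
compute_nodes = X * Y * Z

red_black_split = (2688, 4992, 2688)

# Cabinet figures quoted elsewhere in the deck:
# 108 compute cabinets, 3 card cages x 8 boards x 4 processors per cabinet.
cabinets = 108
pe_per_cabinet = 3 * 8 * 4

# Assumption (not stated on the slide): the split is made along X,
# so each section is a whole number of Y-Z planes.
plane = Y * Z
x_planes = [n // plane for n in red_black_split]

print(compute_nodes)              # 10368
print(sum(red_black_split))       # 10368
print(cabinets * pe_per_cabinet)  # 10368
print(x_planes)                   # [7, 13, 7]
```

All three routes give 10,368 compute node processors, and under the stated assumption the split corresponds to 7, 13, and 7 planes along X.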

3-D Mesh topology (Z direction is a torus)

[Figure: compute node mesh of 10,368 processors with X=27, Y=16, Z=24 and a torus interconnect in Z; 640 Visualization, Service & I/O nodes on each end]


Thor’s Hammer Network Chips



 3D-mesh is created by SEASTAR ASIC:

 HyperTransport interface and 6 network router ports on each chip

 In compute partitions each processor has its own SEASTAR

 In service partition, some boards are configured like compute

partition (4 PE’s per board)

 Others have only 2 PE’s per board; but still have 4 SEASTARS

• So, network topology is uniform

 SEASTAR designed by CRAY to our spec’s, Fabricated

by IBM

 The only truly custom part in Red Storm-- complies with HT open

standard

Node architecture

[Figure: node consisting of an AMD Opteron CPU with DRAM of 1 (or 2) Gbyte or more, attached to an ASIC (NIC + Router) with six links to other nodes in X, Y, and Z. ASIC = Application Specific Integrated Circuit, or a “custom chip”]


System Layout (27 x 16 x 24 mesh)

[Figure: cabinet rows divided into Normally Unclassified, Switchable Nodes, and Normally Classified sections, separated by Disconnect Cabinets]


Thor’s Hammer Cabinet Layout

 Compute Node Partition
  3 Card Cages per Cabinet
  8 Boards per Card Cage
  4 Processors per Board (96 PE per cabinet)
  4 NIC/Router Chips per Board
  N + 1 Power Supplies
  Passive Backplane

 Service, Viz, and I/O Node Partition
  2 (or 3) Card Cages per Cabinet
  8 Boards per Card Cage
  2 (or 4) Processors per Board
  4 NIC/Router Chips per Board
  2-PE I/O Boards have 4 PCI-X busses
  N + 1 Power Supplies
  Passive Backplane

[Figure: compute node cabinet, 2 ft x 4 ft, front side, showing CPU boards, cables, fans, and power supply]


Performance

 Peak of 41.4 (46.6) TF based on 2 floating point

instruction issues per clock at 2.0 Gigahertz .

 We required a 7-fold speedup versus ASCI Red but,

based on our benchmarks, expect performance will be

8--10 times faster than ASCI Red.

 Expected MP-Linpack performance: ~30--35 TF

 Aggregate system memory bandwidth: ~55 TB/s

 Interconnect Performance:

 Latency 30 (estimated)
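
The peak figures above can be reproduced from the stated issue rate and clock. The sketch below is a check of the arithmetic only. Reading the parenthesised 46.6 TF as the peak when the 2 x 640 Service, Visualization, and I/O processors are counted as well is an assumption; the slide does not say so, but the numbers are consistent with it.

```python
FLOPS_PER_CLOCK = 2        # floating point instruction issues per clock (from the slide)
CLOCK_HZ = 2.0e9           # 2.0 GHz (from the slide)
COMPUTE_PROCS = 10_368     # compute node processors
SVI_PROCS_PER_COLOR = 640  # Service, Visualization & I/O processors per color


def peak_tflops(procs: int) -> float:
    """Theoretical peak in Tflops for `procs` processors."""
    return procs * FLOPS_PER_CLOCK * CLOCK_HZ / 1e12


compute_peak = peak_tflops(COMPUTE_PROCS)

# Assumption: the parenthesised figure includes both colors of S, V & I processors.
all_procs_peak = peak_tflops(COMPUTE_PROCS + 2 * SVI_PROCS_PER_COLOR)

print(f"{compute_peak:.3f} TF")    # 41.472 TF
print(f"{all_procs_peak:.3f} TF")  # 46.592 TF
```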

Comparison of ASCI Red and Red Storm

                                     ASCI Red                     Red Storm
Architecture                         Distributed Memory MIMD      Distributed Memory MIMD
Number of Compute Node Processors    9,460                        10,368
Processor                            Intel P II @ 333 MHz         AMD Opteron @ 2 GHz
Total Memory                         1.2 TB                       10.4 TB (up to 80 TB)
System Memory Bandwidth              2.5 TB/s                     55 TB/s
Disk Storage                         12.5 TB                      240 TB
Parallel File System Bandwidth       1.0 GB/s each color          50.0 GB/s each color
External Network Bandwidth           0.2 GB/s each color          25 GB/s each color
Interconnect Topology                3D Mesh (x, y, z)            3D Mesh (x, y, z)
                                     38 x 32 x 2                  27 x 16 x 24
Interconnect Performance
  MPI Latency                        15 µs 1 hop, 20 µs max       2.0 µs 1 hop, 5 µs max
  Bi-Directional Bandwidth           800 MB/s                     6.0 GB/s
  Minimum Bi-section Bandwidth       51.2 GB/s                    2.3 TB/s
Full System RAS
  RAS Network                        10 Mbit Ethernet             100 Mbit Ethernet
  RAS Processors                     1 for each 32 CPUs           1 for each 4 CPUs
Operating System
  Compute Nodes                      Cougar                       Catamount
  Service and I/O Nodes              TOS (OSF1 UNIX)              LINUX
  RAS Nodes                          VX-Works                     LINUX
Red/Black Switching                  2260 – 4940 – 2260           2688 – 4992 – 2688
System Foot Print                    ~2500 ft^2                   ~3000 ft^2
Power Requirement                    850 kW                       1.7 MW

Red Storm Project

 23 months, design to First Product Shipment!

 System software is a joint project between Cray and Sandia

 Sandia is supplying Catamount LWK and the service node run-time system

 Cray is responsible for Linux, NIC software interface, RAS software, file

system software, and Totalview port

 Initial software development was done on a cluster of workstations with a

commodity interconnect. Second stage involves an FPGA implementation of

SEASTAR NIC/Router (Starfish). Final checkout on real SEASTAR-based

system

 System design is going on now

 Cabinets-- exist

 SEASTAR NIC/Router-- released to Fabrication at IBM earlier this month

 Full system to be installed and turned over to Sandia in stages culminating

in August--September 2004

New Building for Thor’s Hammer

Designing for scalable

supercomputing

Challenges in:

-Design

-Integration

-Management

-Use

SUREty for Very Large Parallel

Computer Systems

Scalability - Full System Hardware and System Software



Usability - Required Functionality Only



Reliability - Hardware and System Software



Expense minimization- use commodity, high-volume parts



SURE poses Computer System Requirements:

SURE Architectural tradeoffs:

• Processor and memory sub-

system balance

• Compute vs interconnect balance

• Topology choices

• Software choices

• RAS

• Commodity vs. Custom technology

• Geometry and mechanical design

Sandia Strategies:

-build on commodity

-leverage Open Source (e.g., Linux)

-Add to commodity selectively (in RS

there is basically one truly custom

part!)

-leverage experience with previous

scalable supercomputers

System Scalability Driven Requirements





Overall System Scalability - Complex

scientific applications such as molecular

dynamics, hydrodynamics & radiation

transport should achieve scaled parallel

efficiencies greater than 50% on the full

system (~20,000 processors).



Scalability - System Software:

System Software Performance scales nearly perfectly with the

number of processors to the full size of the computer (~30,000

processors). This means that System Software time (overhead)

remains nearly constant with the size of the system or scales at

most logarithmically with the system size.



- Full re-boot time scales logarithmically with the system size.

- Job loading is logarithmic with the number of processors.

- Parallel I/O performance is not sensitive to # of PEs doing I/O

- Communication Network software must be scalable.

- No connection-based protocols among compute nodes.

- Message buffer space independent of # of processors.

- Compute node OS gets out of the way of the application.
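
One common way to obtain logarithmic job-load or boot time is a tree fan-out, in which every node that already holds the job image forwards it to further nodes in each round. The slides do not say how Red Storm implements loading, so the sketch below is a generic illustration; `launch_rounds` and `sends_per_round` are hypothetical names introduced here.

```python
def launch_rounds(num_nodes: int, sends_per_round: int = 1) -> int:
    """Rounds needed to reach `num_nodes` nodes when, in each round, every node
    that already holds the job image forwards it to `sends_per_round` new nodes."""
    if num_nodes < 1 or sends_per_round < 1:
        raise ValueError("num_nodes and sends_per_round must be >= 1")
    holders, rounds = 1, 0
    while holders < num_nodes:
        holders *= 1 + sends_per_round  # every holder recruits new holders
        rounds += 1
    return rounds


# A single loader sending to each node in turn needs num_nodes - 1 steps.
sequential_steps = 10_368 - 1

print(sequential_steps)          # 10367
print(launch_rounds(10_368))     # 14
print(launch_rounds(10_368, 3))  # 7
```

Doubling the number of nodes adds only one round, which is the behaviour the requirement asks for.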

Hardware scalability

•Balance in the node hardware:

•Memory BW must match CPU speed

Ideally 24 Bytes/flop (never yet done)

•Communications speed must match CPU speed

•I/O must match CPU speeds

•Scalable System SW( OS and Libraries)

•Scalable Applications
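
To put the 24 Bytes/flop ideal in context, the sketch below divides the aggregate memory bandwidth by the peak floating point rate for ASCI Red and Red Storm, using figures quoted elsewhere in this deck (2.5 TB/s and 3.2 Tflops; 55 TB/s and 41.4 Tflops). Dividing aggregate bandwidth by peak rate is a simplification chosen for illustration, not a measurement method taken from the slides.

```python
IDEAL_BYTES_PER_FLOP = 24  # the ideal quoted on the slide

# (aggregate memory bandwidth in bytes/s, peak rate in flops/s), from the deck
systems = {
    "ASCI Red": (2.5e12, 3.2e12),
    "Red Storm": (55e12, 41.4e12),
}


def bytes_per_flop(bandwidth: float, flops: float) -> float:
    """Memory bytes that can be moved per peak floating point operation."""
    return bandwidth / flops


for name, (bw, flops) in systems.items():
    ratio = bytes_per_flop(bw, flops)
    share = ratio / IDEAL_BYTES_PER_FLOP
    print(f"{name}: {ratio:.2f} bytes/flop ({share:.1%} of the ideal)")
```

Both machines sit far below the ideal, which is consistent with the slide's remark that it has never yet been achieved.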

Usability

>Application Code Support:

Software that supports scalability of the

Computer System

Math Libraries

MPI Support for Full System Size

Parallel I/O Library

Compilers

Tools that Scale to the Full Size of the

Computer System

Debuggers

Performance Monitors

Full-featured LINUX OS support at the

user interface

Reliability

 Light Weight Kernel (LWK) O. S. on compute partition

 Much less code fails much less often

 Monitoring of correctible errors

 Fix soft errors before they become hard

 Hot swapping of components

 Overall system keeps running during maintenance

 Redundant power supplies & memories

 Completely independent RAS System monitors virtually

every component in system

Economy



1. Use high-volume parts where possible

2. Minimize power requirements

Cuts operating costs

Reduces need for new capital investment

3. Minimize system volume

Reduces need for large new capital

facilities

4. Use standard manufacturing processes where

possible-- minimize customization

5. Maximize reliability and availability/dollar

6. Maximize scalability/dollar

7. Design for integrability

Economy

 Red Storm leverages economies of scale

 AMD Opteron microprocessor & standard memory

 Air cooled

 Electrical interconnect based on Infiniband physical devices

 Linux operating system

 Selected use of custom components

 System chip ASIC

• Critical for communication intensive applications

 Light Weight Kernel

• Truly custom, but we already have it (4th generation)

Cplant on a slide

Goal: MPP “look and feel”

• Start ~1997, upgrade ~1999--2001

• Alpha & Myrinet, mesh topology

• ~3000 procs (3Tf) in 7 systems

• Configurable to ~1700 procs

• Red/Black switching

• Linux w/ custom runtime & mgmt.

• Production operation for several yrs.

[Figure: partition diagram (Compute, Service, File I/O, Net I/O, Users, /home, System Support, Sys Admin) above a system diagram, labeled ASCI Red, of compute nodes, service nodes, and I/O nodes connected to ATM, HiPPI, Ethernet, and other networks, with system operator(s)]

IA-32 Cplant on a slide

Goal: Mid-range capacity

• Started 2003, upgrade annually

• Pentium-4 & Myrinet, Clos network

• 1280 procs (~7 Tf) in 3 systems

• Currently configurable to 512 procs

• Linux w/ custom runtime & mgmt.

• Production operation for several yrs.

[Figure: partition diagram (Compute, Service, File I/O, Net I/O, Users, /home, System Support, Sys Admin) above a system diagram, labeled ASCI Red, of compute nodes, service nodes, and I/O nodes connected to ATM, HiPPI, Ethernet, and other networks, with system operator(s)]

Observation:

For most large scientific and engineering applications the

performance is more determined by parallel scalability

and less by the speed of individual CPUs.



There must be balance between processor, interconnect,

and I/O performance to achieve overall performance.



To date, only a few tightly-coupled, parallel computer

systems have been able to demonstrate a high level of

scalability on a broad set of scientific and engineering

applications.

Let’s Compare Balance In Parallel Systems

Machine         Node Speed Rating   Network Link BW   Communications Balance
                (MFlops)            (Mbytes/s)        (Bytes/flop)
ASCI RED        400                 800 (533)         2 (1.33)
T3E             1200                1200              1
ASCI RED**      666                 800 (533)         (1.2) 0.67
Cplant          1000                140               0.14
Blue Mtn*       500                 800               1.6
BlueMtn**       64000               1200 (9600*)      0.02 (0.16*)
Blue Pacific    2650                300 (132)         0.11 (0.05)
White           24000               2000              0.083
Q*              2500                650               0.2
Q**             10000               400               0.04
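
The Communications Balance column is the network link bandwidth divided by the node speed rating. A minimal sketch of that calculation for a few rows of the table follows; the function name `comm_balance` is introduced here for illustration.

```python
# (node speed rating in MFlops, network link bandwidth in Mbytes/s), from the table
machines = {
    "ASCI RED": (400, 800),
    "T3E": (1200, 1200),
    "Cplant": (1000, 140),
    "White": (24000, 2000),
    "Q**": (10000, 400),
}


def comm_balance(node_mflops: float, link_mbytes_per_s: float) -> float:
    """Communications balance in bytes of link bandwidth per flop."""
    return link_mbytes_per_s / node_mflops


for name, (speed, link_bw) in machines.items():
    print(f"{name:10s} {comm_balance(speed, link_bw):.3f} bytes/flop")
```

For these rows the computed values round to the table entries (2, 1, 0.14, 0.083, and 0.04).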

Comparing Red Storm and BGL

                   Blue Gene Light**   Red Storm*
Node Speed         5.6 GF              5.6 GF (1x)
Node Memory        0.25--0.5 GB        2 (1--8) GB (4x nom.)
Network latency    7 µsecs             2 µsecs (2/7 x)
Network BW         0.28 GB/s           6.0 GB/s (22x)
BW Bytes/Flops     0.05                1.1 (22x)
Bi-Section B/F     0.0016              0.038 (24x)
#nodes/problem     40,000              10,000 (1/4 x)

*  100 TF version of Red Storm
** 360 TF version of BGL

Fixed problem performance

[Figure: molecular dynamics problem (LJ liquid)]

[Figure: Parallel Sn Neutronics (provided by LANL)]

Scalable computing works

[Figure: ASCI Red efficiencies for major codes. Scaled parallel efficiency (%), 0 to 100, versus Processors, 1 to 10000 (log scale). Series: QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port-1B Cells, Rad x-port - 17M, Rad x-port - 80M, Rad x-port - 168M, Rad x-port - 532M, Finite Element, Zapotec, Reactive Fluid Flow, Salinas, CTH]

Balance is critical to scalability

[Figure: Basic Parallel Efficiency Model. Parallel Efficiency, 0.00 to 1.20, versus Communication/Computation Load, 0.0 to 1.0, with the region occupied by scientific & eng. codes marked. Curves: Red Storm (B=1.5), ASCI Red (B=1.2), Ref. Machine (B=1.0), Earth Sim. (B=.4), Cplant (B=.25), Blue Gene Light (B=.05), Std. Linux Cluster (B=.04)]
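
The slide does not give the formula behind the Basic Parallel Efficiency Model. The sketch below is therefore a hypothetical stand-in, not the authors' model: it assumes that communication time relative to computation time equals the communication/computation load divided by the balance B, so that efficiency is 1 / (1 + load / B). The B values are those shown in the figure legend.

```python
# Balance values B as given in the figure legend.
balance = {
    "Red Storm": 1.5,
    "ASCI Red": 1.2,
    "Ref. Machine": 1.0,
    "Earth Sim.": 0.4,
    "Cplant": 0.25,
    "Blue Gene Light": 0.05,
    "Std. Linux Cluster": 0.04,
}


def parallel_efficiency(load: float, b: float) -> float:
    """Hypothetical model (an assumption, not taken from the slide):
    communication time / computation time = load / b, hence
    efficiency = computation / (computation + communication) = 1 / (1 + load / b)."""
    return 1.0 / (1.0 + load / b)


load = 0.5  # illustrative communication/computation load on the reference machine
for name, b in balance.items():
    print(f"{name:20s} {parallel_efficiency(load, b):.2f}")
```

Under this assumed form the ordering of the machines follows their balance values, which is the qualitative point of the figure; the actual curves on the slide may differ.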

Relating scalability and cost

[Figure: Efficiency ratio (Red/Cplant), 0.00 to 6.00, versus Processors, 1 to 4096, showing the average efficiency ratio over the five codes that consume >80% of Sandia’s cycles, with an extrapolation. The crossover is marked where Efficiency ratio = Cost ratio = 1.8: the cluster is more cost effective below it, the MPP more cost effective above it]


Scalability determines cost effectiveness

[Figure: Sandia’s top priority computing workload. Total Node-Hours of Jobs, 0 to 80,000,000, versus Number of Nodes, 1 to 10000, with a dividing line at 256 nodes. Cluster more cost effective: 55M node-hrs. MPP more cost effective: 380M node-hrs]

Scalability also limits capability

[Figure: ITS Speedup curves. Speedup, 0 to 1200, versus Processors, 0 to 1408. Series: Red Speedup and Cplant Speedup, each with a polynomial fit. Annotation: ~3x processors]

Commodity nearly everywhere--

Customization drives cost

• Earth Simulator and Cray X-1 are fully custom Vector

systems with good balance

• This drives their high cost (and their high performance).

• Clusters are nearly entirely high-volume with no truly custom

parts

• Which drives their low-cost (and their low scalability)

• Red Storm uses custom parts only where they are critical to

performance and reliability

• High scalability at minimal cost/performance

Scaling data for some key engineering codes

[Figure: Performance on Engineering Codes. Scaled Parallel Efficiency, 0.00 to 1.20, versus Processors, 1 to 1024. Series: ITS, Red; ITS, Cplant; ACME, Red; ACME, Cplant. Annotations: random variation at small proc. counts; large differential in efficiency at large proc. counts]

Scaling data for some key physics codes

Los Alamos’ Radiation transport code

[Figure: PARTISN Diffusion Solver Sizeup Study. S6P2, 12 Groups, 13,800 cells/PE. Parallel Efficiency, 0% to 120%, versus Number of Processor Elements, 1 to 2048. Series: ASCI Red, Blue Mountain, White, QSC]

[Figure: PARTISN Transport Solver Sizeup Study. S6P2, 12 Groups, 13,800 cells/PE. Parallel Efficiency, 0% to 120%, versus Number of Processor Elements, 1 to 2048. Series: ASCI Red, Blue Mountain, White, QSC]


