Cray Roadmap (2004-2010)
John M. Levesque
Senior Technologist (Virtual Steve Scott Chief Architect for X1/X1E/BW)
Cray Proprietary
Cray's Computing Vision
Scalable High-Bandwidth Computing
[Roadmap figure. Labels as extracted: 2010 "Cascade" (Sustained Petaflops); "Black Widow" 2006; "Black Widow 2"; X1E 2004; X1; Product Integration 2006; Red Storm (RS); "Strider 2"; "Strider 3"; "Strider X"; 2004, 2005.]
Cray X1 Overview
Cray Proprietary
Slide 2
Cray X1
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
• Modernized the ISA
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
• Improved via vectors
High bandwidth, scalable shared memory supercomputer
Key Architectural Features
New vector instruction set architecture (ISA)
– Much larger register set (32x64 vector, 64+64 scalar)
– 64- and 32-bit memory and IEEE arithmetic
– Based on 25 years of experience compiling with Cray1 ISA
Decoupled Execution
– Scalar unit runs ahead of vector unit, doing addressing and control
– Hardware dynamically unrolls loops, and issues multiple loops concurrently
– Special sync operations keep pipeline full, even across barriers
Allows the processor to perform well on short nested loops
Scalable, distributed shared memory (DSM) architecture
– Memory hierarchy: caches, local memory, remote memory
– Low latency, load/store access to entire machine (tens of TBs)
– Processors support 1000s of outstanding refs with flexible addressing
– Very high bandwidth network
– Coherence protocol, addressing and synchronization optimized for DM
Cray X1 Node
[Node diagram: 16 processors (P), each with a cache ($); 16 memory sections (M, mem); two I/O ports (IO).]
51 Gflops, 200 GB/s
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
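The node figures above follow from simple arithmetic. A minimal sketch, using only numbers quoted in this deck (12.8 Gflops per MSP, four MSPs per node, 200 GB/s per node, up to 1024 nodes). The variable names are illustrative, and reading the "64 Processor" system as 64 MSPs is an assumption, though it is consistent with the ~820 Gflops quoted for that system:

```python
# Peak-rate arithmetic using only figures quoted on the slides.
MSP_GFLOPS = 12.8       # peak per multistream processor (MSP)
MSPS_PER_NODE = 4       # MSPs in one Cray X1 node
NODE_MEM_GBS = 200.0    # local memory bandwidth per node, GB/s
MAX_NODES = 1024        # largest configuration named in the deck

node_gflops = MSP_GFLOPS * MSPS_PER_NODE
bytes_per_flop = NODE_MEM_GBS / node_gflops

# Assumption: "64 processors" means 64 MSPs.
system_64_gflops = 64 * MSP_GFLOPS
max_system_tflops = MAX_NODES * node_gflops / 1000.0

print(f"node peak:         {node_gflops:.1f} Gflops")
print(f"memory balance:    {bytes_per_flop:.2f} bytes/flop")
print(f"64-MSP system:     {system_64_gflops:.1f} Gflops")
print(f"1024-node system:  {max_system_tflops:.1f} Tflops")
```

The slide's "51 Gflops" is the rounded node peak, and the memory balance works out to a little under 4 bytes per flop.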
NUMA Scalable up to 1024 Nodes
Interconnection Network
• 16 parallel networks for bandwidth
• Global shared memory across machine
Network Topology (16 CPUs)
[Figure: four nodes (node 0 to node 3), each showing four processors (P) and memory sections M0, M1, ... M15; network sections labeled Section 0, Section 1, ... Section 15.]
Network Topology (128 CPUs)
[Figure: eight routers (R); label: 16 links.]
Network Topology (512 CPUs)
Cray X1 Node Module
Cray X1 Chassis
64 Processor Cray X1 System
~820 Gflops
Cray X1E Product Enhancement
Cray X1E Mid-life Enhancement
• Technology refresh of the X1 (0.13 µm)
– ~50% faster processors
– Scalar performance enhancements
– Doubling processor density
– Modest increase in memory system bandwidth
– Same interconnect and I/O
• Machine upgradeable
– Can replace Cray X1 nodes with X1E nodes
• Shipping at the end of this year
Cray BlackWidow System
• Second generation Vector MPP
– Upward compatible with the Cray X1
– Shipping in 2006
• Major improvement (>> Moore’s Law rate) in:
– Single thread scalar performance
– Price performance
• BlackWidow features:
– Single chip vector microprocessor
– Globally addressable memory with 4-way SMP nodes
– Scalable to tens of thousands of processors
– Even more bandwidth per flop than the X1
– Innovative fault tolerance features
– Configurable memory capacity, memory BW and network BW
System Goals
• Balanced Performance between CPU, Memory, Interconnect, and I/O
• Highly scalable system hardware and software
• High speed, high bandwidth 3D mesh interconnect
• Run a set of applications 7 times faster than ASCI Red
• Run an ASCI Red application on full system for 50 hours
• Flexible partitioning for classified and non-classified computing
• High performance I/O subsystem (File system and storage)
Red Storm System Overview
• 40TF peak performance
• 108 compute node cabinets, 16 service and I/O node cabinets, and 16 Red/Black switch cabinets
– 10,368 compute processors - 2.0 GHz AMD Opteron™
– 512 service and I/O processors (256P for red, 256P for black)
– 10 TB DDR memory
• 240 TB of disk storage (120TB for red, 120TB for black)
• MPP System Software
– Linux + lightweight compute node operating system
– Managed and used as a single system
– Easy to use programming environment
– Common programming environment
– High performance file system
– Low overhead RAS and message passing
• Approximately 3,000 ft² including disk systems
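The headline numbers can be cross-checked with a few lines of arithmetic. A rough sketch: the processor count, clock rate, cabinet count, and memory size are from the slide, while the figure of 2 floating-point operations per clock per Opteron is an assumption that the deck does not state.

```python
# Cross-check of the Red Storm headline figures.
COMPUTE_PROCS = 10_368    # compute processors (from the slide)
CLOCK_GHZ = 2.0           # AMD Opteron clock (from the slide)
FLOPS_PER_CLOCK = 2       # ASSUMPTION: not stated in the deck
COMPUTE_CABINETS = 108    # compute node cabinets (from the slide)
MEMORY_TB = 10.0          # DDR memory (from the slide)

# Gflops per processor = GHz * flops per clock; divide by 1000 for Tflops.
peak_tflops = COMPUTE_PROCS * CLOCK_GHZ * FLOPS_PER_CLOCK / 1000.0
procs_per_cabinet = COMPUTE_PROCS // COMPUTE_CABINETS
mem_per_proc_gb = MEMORY_TB * 1000.0 / COMPUTE_PROCS

print(f"peak:               {peak_tflops:.1f} Tflops")
print(f"procs per cabinet:  {procs_per_cabinet}")
print(f"memory per proc:    {mem_per_proc_gb:.2f} GB")
```

Under that assumption the peak comes out slightly above the quoted 40TF, with 96 compute processors per cabinet and roughly 1 GB of memory per processor.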
Typical Architecture
[Diagram labels: Intel Xeon™ Processor (x2), 6.4 GB/sec, Northbridge, Southbridge or PCI-X Bridge, PCI-X Slot (x3), 1 GB/sec, "I/O SPEED LIMIT".]
• Memory latency is ~160 ns, and bandwidth is shared between multiple processors
• Northbridge chip is the 2nd most complex chip on the board; a typical chip uses about 11 Watts
• Any interconnect is limited by the speed of PCI-X, since it’s the fastest place to “plug in”
• Best place to tie in a high performance interconnect would be through the Northbridge, but this is difficult to do legally without an Intel bus license
AMD Opteron Strider PE, CRAY Generic System
[Diagram labels: AMD Opteron, DDR Memory Controller, HyperTransport (HT), 6.4 GB/sec, PCI-X Bridge, PCI-X Slot (x3), Cray Router (SeaStar); six network links, each >3 GB/s x 2.]
• SDRAM memory controller and function of Northbridge are pulled onto the Opteron die
• Memory latency reduced to 60-90 ns
• No Northbridge chip results in savings in heat, power, and complexity, and an increase in performance
• Interface off the chip is an open standard (HyperTransport)
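To put the two latency figures side by side, the sketch below converts them into processor clock cycles. The latencies (about 160 ns for the Northbridge design, 60-90 ns with the on-die controller) are from these slides. The 2.0 GHz clock is borrowed from the Red Storm Opteron figure, and applying it to both cases is an assumption made purely for comparison.

```python
def latency_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles spent waiting on one memory access (ns * cycles per ns)."""
    return latency_ns * clock_ghz

CLOCK_GHZ = 2.0  # ASSUMPTION: same clock used for both designs

northbridge_cycles = latency_cycles(160, CLOCK_GHZ)
on_die_cycles = [latency_cycles(ns, CLOCK_GHZ) for ns in (60, 90)]

print(f"via Northbridge:   {northbridge_cycles:.0f} cycles")
print(f"on-die controller: {on_die_cycles[0]:.0f} to {on_die_cycles[1]:.0f} cycles")
```

At the same clock, moving the memory controller onto the die cuts the stall per access to roughly half or less.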
Using HyperTransport to Interface With System Interconnect
[Diagram labels: AMD Opteron, DDR Memory Controller, HyperTransport, Cray SeaStar 6 port Router (repeated for multiple processing elements).]
• 6 GB/sec (3 GB/sec bi-directional)
• 3D Torus Interconnect
Sound Familiar?
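A 6-port router fits a 3D torus because each node has two neighbors along each of the three dimensions. A minimal sketch of that addressing, assuming nodes are labeled by (x, y, z) coordinates with wraparound at the edges. The function name and the example dimensions are hypothetical and not taken from the deck:

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbors (-1 and +1 along x, y, z) of a node in a 3D torus.

    With a dimension of size 1 or 2 some neighbors coincide, so the list
    may then contain repeated coordinates.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wraparound link that closes each ring.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors

# Hypothetical 4 x 4 x 4 torus: a corner node still has six distinct neighbors.
for neighbor in torus_neighbors((0, 0, 0), (4, 4, 4)):
    print(neighbor)
```

The wraparound is what distinguishes a torus from a mesh: in a mesh, the corner node in this example would have only three neighbors.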
Cray BlackWidow
The Next Generation HPC System From Cray Inc.
Cascade
Toward Sustained Petaflop Computing
HPCS Phases
Phase I: Concept Development (1 Year, 2H 2002 – 1H 2003, $3M/year)
Vendors: Cray, SGI, Sun, HP, IBM
– Forecast available technology
– Propose HPCS hw/sw concepts
– Explore productivity metrics
– Develop research plan for Phase II
Phase II: Concept Validation (3 Years, 2H 2003 – 1H 2006, $17M/year)
Vendors: Cray, Sun, IBM
– Focused R&D
– Hardware and software prototyping
– Experimentation and simulation
– Risk assessment and mitigation
Phase III: Full Scale Product Development (4 Years, 2H 2006 – 2010, $?/year)
Vendors: ?, ?
– Commercially available system by 2010
– Outreach and cooperation in software and applications areas
The HPCS program lets us explore technologies we otherwise couldn’t. A three year head start on typical development cycle.
Cray's Approach to HPCS
• High system efficiency at scale
– Bandwidth is the most critical and expensive part of scalability
– Enable very high (but configurable) global bandwidth
– Design processor and system to use this bandwidth wisely
– Reduce bandwidth demand architecturally
• High human productivity and portability
– Support legacy and emerging languages
– Provide strong compiler, tools and runtime support
– Support a mixed UMA/NUMA programming model
– Develop higher-level programming language and tools
• System robustness
– Provide excellent fault detection and diagnosis
– Implement automatic reconfiguration and fault containment
– Make all resources virtualized and dynamically reconfigurable
Using Bandwidth Wisely
• Implement global shared memory
– Lowest latency communication
– Lowest overhead communication
– Fine-grained overlap of computation and communication
• Tolerate latency with processor concurrency
– Message passing concurrency is constraining and hard to program
– Vectors and streaming provide concurrency within a thread
– Multithreading provides concurrency between threads
• Exploit locality to reduce bandwidth demand
– “Heavyweight” processors (HWPs) to exploit temporal locality
– “Lightweight” processors (LWPs) to exploit spatial locality
• Use other techniques to reduce network traffic
– Atomic memory operations
– Single word network transfers when no locality is present
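The need for processor concurrency can be quantified with Little's law: the number of memory references that must be in flight equals bandwidth times latency, divided by the size of each reference. A rough sketch; the bandwidth and latency values below are hypothetical illustrations, not figures from the deck, and the 8-byte reference size assumes 64-bit words:

```python
def refs_in_flight(bandwidth_gb_s: float, latency_ns: float,
                   bytes_per_ref: int = 8) -> float:
    """Outstanding references needed to sustain a bandwidth at a given latency.

    GB/s multiplied by ns gives bytes (1e9 * 1e-9 = 1), so no unit
    conversion factor is needed.
    """
    return bandwidth_gb_s * latency_ns / bytes_per_ref

# HYPOTHETICAL values, chosen only to show how the requirement scales.
local = refs_in_flight(bandwidth_gb_s=50.0, latency_ns=100.0)
remote = refs_in_flight(bandwidth_gb_s=50.0, latency_ns=1000.0)

print(f"local memory:  {local:.0f} references in flight")
print(f"remote memory: {remote:.0f} references in flight")
```

A tenfold increase in latency requires ten times as many outstanding references to sustain the same bandwidth, which is why vectors, streaming, and multithreading are listed as the sources of concurrency.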
Questions?
File Name: BWOpsReview081103.ppt