Insight into
Sun Microsystems' Computer Architecture Development Strategies
CSC, May 22nd 2008
Søren Steenberg
Sun Microsystems
Drive Innovation at Every Level
Processor, System, Datacenter
The Sun Systems Family & Branches
Expanded! Breakthrough! Enhanced! New!
Sun Fire™ x64 Servers
Sun SPARC CoolThreads™ T-series Servers
Sun Fire™ UltraSPARC Servers
Sun SPARC Enterprise Servers
25 years: 100,000 to 1 million times more MIPS/$
[Figure: MIPS per 1000 USD (log scale, 10^-15 to 10^10) versus year (1880 to 2020), with marked levels at 0.01 MIPS/$1K and 10,000 MIPS/$1K. Source: Hermann Brunner, Max-Planck-Institut fuer Extraterrestrische Physik, Germany]
The “Brick Wall”* of Performance
“In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.” *
* “The Landscape of Parallel Computing Research: A View from Berkeley”, December 2006 (emphasis added) http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
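To put the quoted doubling times in annual terms, the short Python sketch below converts a doubling period into a yearly growth factor. The function name is illustrative and does not come from the Berkeley report; only the 18-month and 5-year figures are taken from the quote.

```python
def annual_growth_factor(doubling_time_years: float) -> float:
    """Yearly performance multiplier implied by a given doubling time."""
    return 2.0 ** (1.0 / doubling_time_years)


# Doubling periods taken from the quote above.
historic = annual_growth_factor(1.5)  # doubling every 18 months
current = annual_growth_factor(5.0)   # doubling every 5 years

print(f"18-month doubling: {100 * (historic - 1):.1f}% per year")
print(f"5-year doubling:   {100 * (current - 1):.1f}% per year")
```

Under this simple exponential model, the historic trend corresponds to roughly 59% growth per year and the slower trend to roughly 15% per year.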
ILP vs. TLP
[Figure: use of the execution units. ILP (Instruction Level Parallelism): a single thread. TLP (Thread Level Parallelism): threads 1 to 4.]
The Memory Bottleneck
[Figure: relative performance (log scale, 1 to 10000) versus year, 1980 to 2005. CPU frequency: 2x every two years. DRAM speeds: 2x every six years. The gap between the two curves widens over time.]
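The size of the gap follows directly from the two doubling rates on the chart. The sketch below is a worked calculation under the assumption that both trends are clean exponentials starting from the same point in 1980; the function name is illustrative.

```python
def growth(years: float, doubling_time_years: float) -> float:
    """Total growth factor after `years`, given a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)


YEARS = 2005 - 1980          # span of the chart's x-axis

cpu = growth(YEARS, 2.0)     # CPU frequency: 2x every two years
dram = growth(YEARS, 6.0)    # DRAM speeds: 2x every six years
gap = cpu / dram             # how far the CPU has pulled ahead of DRAM

print(f"CPU growth:  {cpu:,.0f}x")
print(f"DRAM growth: {dram:,.1f}x")
print(f"Gap:         {gap:,.0f}x")
```

With these assumptions the CPU curve grows by a factor of several thousand while DRAM grows by less than 20, leaving a gap of a few hundred times.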
Today's Bad News – The Challenges
• ILP exhausted, new tricks running short
> Larger caches in 3 levels
> Out of Order execution
> Deeper superpipelines
> Superscalar design
> EPIC (explicitly parallel instruction computing)
> Branch prediction
> Speculative prefetches
• Physical constraints, speed of electrons
• Heat
• Power
• Memory latency gap
• Complexity
• Number of processor architects/engineers
• Time-to-market, chip respins
• Cost of R&D plus cost of manufacturing
CMT – Multiple Multithreaded Cores
[Figure: eight cores (Core 1 to Core 8), each running four threads (Thread 1 to Thread 4), plotted over time. Each thread alternates between compute and memory latency.]
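The idea behind the figure can be sketched with a very simple utilization model. It assumes each thread alternates a fixed compute phase with a fixed memory stall and that the core switches threads at no cost. The cycle counts below are hypothetical and chosen only for illustration; they are not measurements of any Sun processor.

```python
def core_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of time the core does useful work under ideal thread interleaving."""
    # One thread alone keeps the core busy only during its compute phase.
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # Extra threads fill the stall time until the core is fully busy.
    return min(1.0, threads * per_thread)


COMPUTE, STALL = 1, 3  # hypothetical: 1 cycle of compute per 3 cycles of memory latency

for n in (1, 2, 4):
    print(f"{n} thread(s): {core_utilization(n, COMPUTE, STALL):.0%} busy")
```

In this toy model, four threads are enough to cover the memory latency of one another, which is the effect the compute and memory latency bars in the figure illustrate.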
UltraSPARC T2 Plus
SMP System on a Chip
• Unprecedented throughput and integration
> Leveraged UltraSPARC® T2 processor
> 8 cores, 64 threads
> Added 4x Coherence Channels
• 2-socket design
• Built-in wirespeed security
• Fast, integrated I/O
• Virtualization built in
3rd Generation CMT
Multithreaded Multicore in the Curriculum
• It has become clear only recently, but has become astonishingly clear:
• The future performance growth in microprocessors, at least for the next five years, will almost certainly come from exploitation of thread-level parallelism (TLP) through multicore processors rather than through exploiting more instruction level parallelism (ILP).
• The Sun T1 is a step in this direction.
Computer Architecture: A Quantitative Approach, 4th ed. by John Hennessy and David Patterson
Sun is Leading the Way
“We applaud Sun for being the first to enter this emerging market for highly multithreaded servers.”
- Ideas International
"We're at a historic point in computing, moving away from sequential processing to multicore designs...we need to invent new ways to evaluate these new parallel systems. Our initial experiments suggest that Niagara 2 has the highest performance, is the most power efficient and is the most 'software friendly' of the processors we've tested."
- Professor Dave Patterson, Pardee Chair of Computer Science for the University of California at Berkeley
Source 1: “Sun Gets it Right with Niagara 2 Servers,” 10/12/07, http://ideasint.blogs.com/ideasinsights/2007/10/sun-gets-it-rig.html
Source 2: "Sun Microsystems Enters Commercial Silicon Market With World's Fastest Commodity Microprocessor," 08/07/07, http://www.sun.com/aboutsun/pr/2007-08/sunflash.20070807.1.xml
Slide 8-9 from Gartner Figures p. 3 – 2 from Microprocessor Report
UltraSPARC T2:
True System on a Chip
• Up to 8 cores @1.2GHz or 1.4GHz
• Up to 64 threads per CPU
• Up to 16 FB-DIMMs, 4 memory controllers
> Up to 64GB memory (4GB DIMMs)
> 270 GB/s crossbar bandwidth
• 8 fully pipelined floating point units, 1 per core
• Dual 10Gbit Ethernet and PCI-E integrated onto chip
• 4MB L2 (8 banks), 16-way associative
• Enhanced MAU/Security coprocessor per core
> DES, 3DES, AES, RC4, SHA1, SHA256, MD5, RSA to 2048 key, ECC, CRC32
• Advanced power saving features
• 65nm process technology
[Block diagram: FB DIMMs attach to four memory controllers (MCU); eight L2$ banks connect through a full crossbar to cores C0 to C7, each with an FPU and an MAU; the I/O side shows the NIU (E-net+, 2x 10GE Ethernet), DMA sub-system, Sys I/F buffer switch core, and PCI-E (x8 @2.5GHz, 4GBytes/s bi-directional). Power ~95W, <1.5W/thread.]
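The headline numbers on this slide are consistent with one another, which a few lines of Python can confirm. All constants are copied from the slide; the threads-per-core value is derived here and is not stated on the slide itself.

```python
# Figures quoted on the UltraSPARC T2 slide.
CORES = 8
THREADS_PER_CHIP = 64
MAX_DIMMS = 16
DIMM_SIZE_GB = 4
POWER_WATTS = 95  # the slide says "~95W", so treat this as approximate

threads_per_core = THREADS_PER_CHIP // CORES          # derived, not on the slide
max_memory_gb = MAX_DIMMS * DIMM_SIZE_GB              # matches "Up to 64GB memory"
watts_per_thread = POWER_WATTS / THREADS_PER_CHIP     # matches "<1.5W/thread"

print(f"Threads per core: {threads_per_core}")
print(f"Maximum memory:   {max_memory_gb} GB")
print(f"Power per thread: {watts_per_thread:.2f} W")
```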
UltraSPARC T2 Plus:
Multi-Socket US T2
• Per socket
> Up to 8 cores @1.2GHz or 1.4GHz
> Up to 64 threads per socket
> 4MB L2$, 8 banks x 16-way SA
> Up to 8 FPUs, 1 per core
> Up to 8 crypto cores, 1 per core
> DES, 3DES, AES, RC4, SHA1, SHA256, MD5, RSA to 2048 key, ECC, CRC32
> 2 memory controllers, 16 DIMMs
> 25 GB/s read + 13 GB/s write memory bandwidth
> One x8 PCIe interface
• Per system
> 2 sockets glue-less
> 4 sockets using UST2 Plus XBR
> 16 or 32 PCIe lanes on 2 or 4 socket systems
• 4x coherency links
> Each coherency link: 6.4GBytes/s per direction, 12.8GB/s total
> 4 x coherency links give snoop bandwidth of 51.2GB/s
• 65nm process technology
[Block diagram: FB-DIMMs attach to memory controllers (MCU) and coherence units (CU); eight L2$ banks connect through a full crossbar to cores C0 to C7, each with an FPU and an MAU; the I/O side shows the System I/F, SSI Bus, and PCI-Express (x8 @2.5GHz, 4GBytes/s bi-directional). Power ~105W, <1.5W/thread.]
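The coherency bandwidth figures on this slide follow from the per-direction link rate. The sketch below reproduces that arithmetic using only the numbers quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the UltraSPARC T2 Plus slide.
LINK_GB_PER_S_PER_DIRECTION = 6.4
COHERENCY_LINKS = 4

# Each link carries traffic in both directions at the same time.
per_link_total = LINK_GB_PER_S_PER_DIRECTION * 2
# The snoop bandwidth figure is the sum over all four links.
snoop_bandwidth = per_link_total * COHERENCY_LINKS

print(f"Per link: {per_link_total:.1f} GB/s")
print(f"Snoop:    {snoop_bandwidth:.1f} GB/s")
```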
UltraSPARC T2 Plus 2-Socket System
[Block diagram: two UltraSPARC T2 Plus chips (each with cores, crossbar, and L2$; 8 cores, 64 threads, 4MB L2$) joined by coherency links. On each chip, memory controllers with coherence units attach to dual-channel FBDIMMs, and an NCU, DMU, and NCX block connects through PCI-Express to system IO (network, disk, etc.).]
UltraSPARC T2 Servers
Scale More with Less
The World’s Fastest, Most Energy-Efficient Virtualization Servers
World’s First Dual-Socket CMT Servers
Sun SPARC Enterprise T5240
First True Modular Design
World’s First Eco-Responsible Servers
World’s First 64-Thread Servers with “System on a Chip”
Sun SPARC Enterprise T5220
NEW
Sun SPARC Enterprise T5140
Sun Blade™ T6320
Sun Fire™/Sun SPARC® Enterprise T2000
Sun SPARC Enterprise T5120
Sun Fire/Sun SPARC Enterprise T1000
Sun Blade T6300
Sun SPARC Enterprise T5140 Server
Sun SPARC Enterprise T5240 Server
• Up to 128 threads
• Up to 128GB of memory
• Up to 2.3TB of storage
• Up to 4.6GB/second delivered bandwidth
• Tightly coupled thread, memory and interconnects for high scalability
• Open source and FREE virtualization capabilities built in
• Sun ILOM service processor supports industry-standard management interfaces
CoolThreads Power Saving Technology
• Power management at both core and memory
> Reduce instruction issue rates
> Control clocks in both core and memory and reduce power consumption
• Highly efficient 80 Plus and Climate Savers compliant power supplies
• System power consumption can be reported to management applications up to every 5 seconds
> Vital for effective datacenter power management and chargeback
Logical Domains: UltraSPARC Virtualization
[Diagram, layers from top to bottom]
Application: File Server, Web Server, Mail Server
OS: Solaris control domain; Solaris or Linux guest domains
Ultra lightweight hypervisor in the firmware
Server: SPARC Enterprise CoolThreads Servers
Solaris Containers for Virtualization
[Diagram, layers from top to bottom]
Application: Calendar Server, Database, Web Server (strong isolation between apps)
OS: virtualization built into the kernel; very lightweight and scales with any Solaris system
Server
Faster can be cooler. Better can be cleaner. Cheaper can be greener.