The Cray XD1 Computer and its Reconfigurable Architecture
Dave Strenski stren@cray.com July 11, 2005
Outline
• XD1 overview
    Architecture
    Interconnect
    Active Manager
• XD1 FPGAs
    Architecture
    Example execution
    Core development strategy
• FORTRAN to VHDL considerations
    Memory allocation
    Unrolling
    One versus many cores
• XD1 FPGA running examples
    MTA kernel and Ising Model
    FFT kernel from DSPlogic
    Smith-Waterman kernel from Cray
    LANL Traffic simulation code
    Other works in progress
Cray Today
Nasdaq: CRAY
• Formed on April 1, 2000 as Cray Inc.
• Headquartered in Seattle, WA
• Roughly 900 employees across 30 countries
Four Major Development Sites:
• Chippewa Falls, WI
• Mendota Heights, MN
• Seattle, WA
• Vancouver, Canada
Significant Progress in the market
• X1 sales and Sandia National Laboratory Red Storm contract
• Oak Ridge National Laboratory Leadership Class system
• DARPA HPCS Phase II funding of $50M through 2006 for Cascade
• Acquired OctigaBay – 70+ Cray XD1s sold to date
Cray XD1 Overview
Cray XD1 System Architecture
• Compute: 12 AMD Opteron 32/64 bit, x86 processors; High Performance Linux
• RapidArray Interconnect: 12 communications processors; 1 Tb/s switch fabric
• Active Management: dedicated processor
• Application Acceleration: 6 co-processors
Processors directly connected via integrated switch fabric
Cray XD1 Chassis
• Six Two-way Opteron Blades
• Fans
• Six SATA Hard Drives
• Three I/O Slots (e.g. JTAG)
• Four 133 MHz PCI-X Slots
• Six FPGA Modules
Chassis Front
0.5 Tb/s Switch
• 12 x 2 GB/s Ports to Fabric
• Connector for 2nd 0.5 Tb/s Switch and 12 More 2 GB/s Ports to Fabric
Chassis Rear
Compute Blade
4 DIMM Sockets for DDR 400 Registered ECC Memory
RapidArray Communications Processor
AMD Opteron 2XX Processor
AMD Opteron 2XX Processor
Connector to Main Board
4 DIMM Sockets for DDR 400 Registered ECC Memory
Cray Innovations
Balanced Interconnect
Active Management
Application Acceleration
Cray XD1
Performance and Usability
Architecture
[Figure: a conventional dual Intel Xeon server compared with a Cray XD1 node. Xeon side: two processors behind a northbridge at 6.4 GB/sec, with PCI-X slots at 1 GB/sec behind a southbridge or PCI-X bridge, labeled "I/O speed limit". XD1 side: AMD Opteron with its own DDR memory controller at 6.4 GB/sec and HyperTransport (HT) links at 3.2 GB/sec to RapidArray.]
Removing the Bottleneck
[Figure: memory (GigaBytes), processor (GFLOPS), and I/O or interconnect (GigaBytes per second) balance for four systems]
System        Memory bandwidth       I/O or interconnect bandwidth
Xeon Server   5.3 GB/s (DDR 333)     1 GB/s PCI-X, 0.25 GB/s GigE
Cray XD1      6.4 GB/s (DDR 400)     8 GB/s (RA)
Cray XT3      6.4 GB/s (DDR 400)     31 GB/s (SS)
Cray X1       34.1 GB/s              102 GB/s
Communications Optimizations
Cray Communications Libraries
• MPI 1.2 library
• TCP/IP
• PVM
• Shmem
• Global Arrays
• System-wide process & time synchronization
RapidArray Communications Processor
• HT/RA tunnelling with bonding
• Routing with route redundancy
• Reliable transport
• Short message latency optimization
• DMA operations
• System-wide clock synchronization
AMD Opteron 2XX Processor
RapidArray Communications Processor
2 GB/s
3.2 GB/s 2 GB/s
Direct Connected Processor Architecture
Synchronized Linux Scheduler
[Figure: timelines of three processors (Proc 1, Proc 2, Proc 3) passing three barriers. Not synchronized: system overhead hits each processor at a different time, so the other processors waste CPU cycles waiting at every barrier. Synchronized: system overhead runs on all processors at the same time and the barriers complete sooner. Key: compute cycles, system cycles, wasted cycles.]
Reducing OS Jitter
[Figure: Linux synchronization speedup (% speedup, 0% to 50%) versus number of processors (1 to 64)]
• Cray XD1 Linux synchronization increases application scaling
• Improves efficiency by 42%
• Lowers application license fees for equivalent processor count
Direct Connect Topology
• 1 Cray XD1 chassis: 12 AMD Opteron processors, 58 GFLOPS, 8 GB/s between SMPs, 1.8 µs interconnect, integrated switching
• 3 Cray XD1 chassis: 36 AMD Opteron processors, 173 GFLOPS, 8 GB/s between SMPs, 2.0 µs interconnect, integrated switching
• 25 Cray XD1 chassis, two racks: 300 AMD Opteron processors, 1.4 TFLOPS, 2 - 8 GB/s between SMPs, 2.0 µs interconnect, integrated switching
Fat Tree Topology
• 12 Cray XD1 chassis
• 144 AMD Opteron processors
• 691 GFLOPS
• 4/8 GB/s between SMPs
• 2.0 µs interconnect
• Fat tree switching, integrated first & third order
• 6/12 RapidArray spine switches (24-ports)
[Figure: chassis linked through RapidArray spine switches]
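The peak figures on the topology slides follow from simple arithmetic. The sketch below assumes 2.4 GHz Opterons that retire 2 floating point operations per clock; neither value is stated on the slides, they are chosen because they reproduce the quoted totals.

```python
CLOCK_MHZ = 2400        # assumed Opteron clock, not stated on the slides
FLOPS_PER_CLOCK = 2     # assumed floating point operations per cycle
OPTERONS_PER_CHASSIS = 12


def peak_gflops(processors):
    """Peak GFLOPS for a given number of Opteron processors."""
    return processors * CLOCK_MHZ * FLOPS_PER_CLOCK / 1000


for chassis in (1, 3, 12, 25):
    procs = OPTERONS_PER_CHASSIS * chassis
    print(f"{chassis:2d} chassis, {procs:3d} processors: "
          f"{peak_gflops(procs):7.1f} GFLOPS")
```

Rounded, these give 58, 173, and 691 GFLOPS and 1.4 TFLOPS for 1, 3, 12, and 25 chassis.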
MPI Latency
[Figure: MPI latency (microseconds, 0 to 35) versus message length (0 to 4096 bytes) for Cray XD1 (RapidArray), Quadrics (Elan 4), 4x Infiniband, and Myrinet (D card)]
• RapidArray short message latency is 4 times lower than Infiniband
• The Cray XD1 has sent 2 KB before others have sent their first byte
MPI Throughput
[Figure: bandwidth (MB/s, 0 to 1400) versus data length (bytes, up to 1 MB) for Cray XD1 (1/2 RapidArray Fabric), Quadrics Elan 4, 4x Infiniband, and Myrinet (D card)]
The Cray XD1 Delivers 2X the Bandwidth of Infiniband (1 KB Message Size)
Active Manager System
Usability
Single System Command and Control
Resiliency
Dedicated management processors, real-time OS and communications fabric. Proactive background diagnostics with self-healing.
CLI and Web Access
Active Management Software
Automated management for exceptional reliability, availability, serviceability
Active Manager GUI: SysAdmin
GUI provides quick access to status info and system functions
Automated Management
Users & Administrators
Compute Partition 1
File Services Partition
Front End Partition
Compute Partition 2
Compute Partition 1
• Partition management
• Linux configuration
• Hardware monitoring
• Software upgrades
• File system management
• Data backups
• Network configuration
• Accounting & user management
• Security
• Performance analysis
• Resource & queue management
Single System Command and Control
Self-Monitoring
Parity Heartbeat Temperature Fan speed Diagnostics Air Velocity Voltage Current Hard Drive
Thermals Processors Memory
Fans
Power supply
Active Manager
Power Supply
Interconnect
Dedicated Management Processor, OS, Fabric
Thermal Management
File Systems: Local Disks
One S-ATA HD per SMP; Local Linux directory per HD
[Figure: a Cray XD1 with six SMPs, each with its own local EXT2/3 file system, connected by RapidArray]
File Systems: SAN
SMP acting as a File Server for the SAN
[Figure: a Cray XD1 in which one SMP is the file server (NFS over EXT2/3) attached through an FC HBA to the SAN, and the other SMPs are compute nodes]
Programming Environment
Operating System
Cray HPC Enhanced Linux Distribution (derived from SuSe 8.2)
System Management
Active Manager for system administration & workload management
Application Acceleration Kit
IP Cores, Reference Designs, Command-line tools, API, JTAG interface card
Scientific Libraries
AMD Core Math Library (ACML)
Shared Memory Access
Shmem, Global Arrays, OpenMP
3rd Party Tools
Fortran 77/90/95, HPF, C/C++, Java, Etnus TotalView
Communications Libraries
MPI 1.2
Cray XD1 is standards-based for ease of programming – Linux, x86, MPI
Cray XD1's FPGA Architecture
The Rebirth of Co-processing
1976
8086 Processor
8087 Coprocessor
2004 AMD Opteron Xilinx Virtex II Pro FPGA
Application Acceleration
Application Accelerator
Application Acceleration
RAP
Reconfigurable Computing
• Tightly coupled to Opteron
• FPGA acts like a programmable coprocessor
• Performs vector operations
• Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation.
RAP
SuperLinear speedup for key algorithms
4 configurations
One Switch
One Switch
Two Switches
Two Switches
Application Acceleration FPGA
[Figure: the compute processor hands a loop of the form "do for each array element ... end" and its data set to the Application Acceleration FPGA]
Fine-grained parallelism applied for 100x potential speedup
Compute Blade
Expansion Module
RapidArray Processor
DDR 400 DRAM Application Acceleration FPGA
Opteron Processor
Interconnections
HT
Neighbor Module Expansion Module
HT
RT
HT
Neighbor Module RapidArray RapidArray
Module Detail
HyperTransport
Neighbor Compute Module
3.2 GB/s
2 GB/s
QDR II SRAM
QDR II SRAM
RAP
3.2 GB/s
Acceleration FPGA
2 GB/s
2 GB/s
2 GB/s
3.2 GB/s
QDR II SRAM
QDR II SRAM
RapidArray
Neighbor Compute Module
Virtex II Pro FPGA
XC2VP30 – XC2VP50
• 422 MHz max. clock rate
• 30,000 – 53,000 LEs
• 3 – 5 million 'system gates'
• 136 – 232 Block RAM
• 136 – 232 18x18 Multipliers
• 8 – 16 Rocket I/O
[Figure: Virtex-II Series Fabric with Block RAM, a 300 MHz PowerPC, and Multi-Gigabit Transceivers (MGT, Rocket I/O) along the edges]
Virtex II Family Logic Blocks
[Figure: Virtex-II family logic blocks. A slice holds two LUTs (F and G), carry logic (CY), and registers; a LUT can also be used as RAM16 or SRL16.]
• 1 LE = LUT + Register
• 1 Slice = 2 LEs
• 1 CLB = 4 Slices
XC2VP30-6 Examples
Size is given in LEs, Block RAMs (BRAM), and 18x18 multipliers (Mult.).
Function                     f (MHz)   LEs    BRAM   Mult.   Number Possible
64 bit Adder                 194       66     0      0       450
64 bit Accumulator           198       64     0      0       450
18 x 18 Multiplier           259       88     0      1       136
SP FP Multiplier             188       252    0      4       34
1024 FFT (16 bit complex)    140       5526   22     12      5
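The "Number Possible" column appears to be set by whichever resource runs out first. The sketch below checks that reading against the XC2VP30 resource counts quoted in this deck (30,816 LEs, 136 Block RAMs, 136 multipliers); the function name and the min-over-resources rule are assumptions, not something the slides state.

```python
# XC2VP30 resources as quoted in this deck
XC2VP30 = {"le": 30816, "bram": 136, "mult": 136}


def number_possible(le, bram, mult, device=XC2VP30):
    """Copies of a function that fit, limited by the scarcest resource."""
    limits = [device["le"] // le]
    if bram:
        limits.append(device["bram"] // bram)
    if mult:
        limits.append(device["mult"] // mult)
    return min(limits)


examples = {
    "64 bit Adder": (66, 0, 0),
    "64 bit Accumulator": (64, 0, 0),
    "18 x 18 Multiplier": (88, 0, 1),
    "SP FP Multiplier": (252, 0, 4),
    "1024 FFT (16 bit complex)": (5526, 22, 12),
}
for name, size in examples.items():
    print(f"{name:28s} {number_possible(*size)}")
```

The multiplier-bound and FFT rows reproduce the table (136, 34, 5). The adder and accumulator come out a little above 450, which suggests the table rounds down to leave room for routing and interface logic.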
Module Variants
A variety of Application Acceleration variants can be manufactured by populating different pin compatible FPGAs and QDR II RAMs.
FPGA       Speed   Logic Elements   PowerPC   18x18 Multipliers
XC2VP30    -6      30,816           2         136
XC2VP40    -6      43,632           2         192
XC2VP50    -7      53,136           2         232

RAMs                 Speed     Dimensions   Quantity   Module Memory
K7R163682            200 MHz   512K x 36    4          8 MByte
K7R323682            200 MHz   1M x 36      4          16 MByte
K7R643682 (future)   200 MHz   2M x 36      4          32 MByte
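The module memory sizes can be checked from the RAM dimensions. The sketch assumes that 32 of each word's 36 bits carry data, with the remaining 4 used for parity; that split is an assumption which happens to reproduce the quoted sizes.

```python
def module_mbytes(words, chips=4, data_bits=32):
    """Module memory in MBytes for `chips` QDR II RAMs of `words` words each.

    data_bits=32 assumes 4 of the 36 bits per word are parity (assumption).
    """
    bytes_per_chip = words * data_bits // 8
    return bytes_per_chip * chips // 2**20


K = 1024
for part, words in (("K7R163682", 512 * K),
                    ("K7R323682", 1024 * K),
                    ("K7R643682", 2048 * K)):
    print(part, module_mbytes(words), "MByte")
```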
Processor to FPGA
FPGA
Processor
RAP
Req Resp
Req Resp
HyperTransport
RapidArray Transport
• Since the Acceleration FPGA is connected to the local processing node through its HyperTransport I/O bus, the FPGA can be accessed directly using reads and writes.
• Additionally, a node can also transfer large blocks of data to and from the Acceleration FPGA using a simple DMA engine in the FPGA's RapidArray Transport Core.
FPGA to Processor
FPGA
Processor
RAP
Req Resp
Req Resp
• The Acceleration FPGA can also directly access the memory of a processor. Read and write requests can be performed in bursts of up to 64 bytes.
• The Acceleration FPGA can access processor memory without interrupting the processor.
• Memory coherency is maintained by the processor.
FPGA to Neighbor
2-3 GB/s
SMP 2
SMP 4
SMP 6
SMP 1
SMP 3
SMP 5
• Each Acceleration FPGA is connected to its neighbors in a ring using the Virtex II Pro MGT (Rocket I/O) transceivers.
• The XC2VP40 FPGAs provide a 2 GB/s link to each neighbor FPGA.
• The XC2VP50 FPGAs provide a 3 GB/s link to each neighbor FPGA.
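A small sketch of the ring addressing, assuming the six modules in a chassis are numbered 0 to 5 in ring order. The numbering and the helper names are hypothetical; the slide only states that the FPGAs form a ring and gives the per-link bandwidths.

```python
# Per-neighbor link bandwidth in GB/s, as quoted on the slide
LINK_GB_PER_S = {"XC2VP40": 2, "XC2VP50": 3}


def ring_neighbors(index, modules=6):
    """Return the (previous, next) module indices on the ring."""
    return ((index - 1) % modules, (index + 1) % modules)


def ring_hops(src, dst, modules=6):
    """Fewest neighbor-to-neighbor hops between two modules."""
    forward = (dst - src) % modules
    return min(forward, modules - forward)


print(ring_neighbors(0))   # module 0 wraps around to module 5
print(ring_hops(0, 3))     # opposite side of a six-module ring
```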
Cray XD1 FPGA Programming
Hard, but it could be worse!
Application Acceleration Interfaces
RapidArray Transport Core User Logic QDR RAM Interface Core
ADDR(20:0) D(35:0) Q(35:0) ADDR(20:0) D(35:0) Q(35:0)
TX
RAP
RX
ADDR(20:0) D(35:0) Q(35:0)
ADDR(20:0) D(35:0) Q(35:0)
QDR II SRAM
RapidArray Transport
• XC2VP30-50 running at up to 200 MHz.
• 4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s).
• 16 bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s).
• QDR and HT I/F take up <20 % of XC2VP30. The rest is available for user applications.
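The transfer rates above translate into bandwidth as width times transfers per second. The sketch below works this through; the 32 data bits per 36-bit QDR port is an assumption (the rest taken as parity), and the helper name is hypothetical.

```python
def link_gb_per_s(width_bits, mtransfers_per_s):
    """Bandwidth of one unidirectional link in GB/s."""
    return width_bits * mtransfers_per_s / 8 / 1000


# 16-bit HyperTransport at 800 MTransfers/s, one direction and both
ht_one_way = link_gb_per_s(16, 800)
ht_both_ways = 2 * ht_one_way

# One QDR II RAM at 400 MTransfers/s; assumes 32 data bits per 36-bit port.
# QDR II has separate write (D) and read (Q) ports that run concurrently.
qdr_port = link_gb_per_s(32, 400)
qdr_read_plus_write = 2 * qdr_port

print(ht_one_way, ht_both_ways, qdr_port, qdr_read_plus_write)
```

Under these assumptions HyperTransport comes to 1.6 GB/s each way, 3.2 GB/s combined, which appears to match the 3.2 GB/s HyperTransport figure quoted on the architecture slides.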
FPGA Linux API
Administration Commands
    fpga_open – allocate and open fpga
    fpga_close – close allocated fpga
    fpga_load – load binary into fpga
Operation Commands
    fpga_start – start fpga (release from reset)
    fpga_reset – soft-reset the FPGA
Mapping Commands
    fpga_set_ftrmem – map application virtual address to allow access by fpga
    fpga_memmap – map fpga ram into application virtual space
Control Commands
    fpga_wrt_appif_val – write data into application interface (register space)
    fpga_rd_appifval – read data from application interface (register space)
Status Commands
    fpga_status – get status of fpga
DMA Commands
    fpga_put – send data to FPGA
    fpga_get – receive data from fpga
Interrupt/Blocking Commands
    fpga_intwait – blocks process, waits for fpga interrupt
Additional High Level Tools
[Figure: tool flows from source code to the FPGA binary]
High Level Flow
• C Synthesis (SystemC, ANSI C/C++): Adelante, Celoxica, Forte Design Systems, Mentor Graphics, Prosilog, Synopsys
    int mask(a, m) { return(a & m); }
• MATLAB/Simulink: The MathWorks, with Xilinx System Generator for DSP
• DSPlogic RCIO Lib
Standard Flow
• VHDL, Verilog source
    process(a, m) is begin z <= a and m; end process;
• VHDL/Verilog Synthesis: Mentor Graphics, Synopsys, Synplicity, Xilinx
• Gate Level EDIF File (inputs a and m, output z)
• Place and Route: Xilinx
• Binary File for FPGA
Standard Development Flow
[Figure: standard development flow]
• Sources: HDL (VHDL, Verilog, C)
• Cores: RAP I/F, QDR RAM I/F, DSPLogic RCIO Core
• Simulate: ModelSim
• Synthesize and Implement: Xilinx ISE
• Merge: Binary File plus Metadata
• Download to XD1
• Load/Run on the Acceleration FPGA, from command line or application
• Verify: Xilinx ChipScope, ModelSim
On Target Debugging
• Integrated Logic Analyzer (ILA) blocks are used to capture and store internal logic events based on user defined triggers.
• Trapped events can then be read out and displayed on a PC by the ChipScope Software.
Acceleration FPGA
User Function 1 User Function 2
ILA
ILA
JTAG
Parallel or USB
JTAG OctigaBay JTAG I/O Card
Xilinx ChipScope Plus Software
Xilinx Parallel Cable III/IV or MultiLINX
FORTRAN to VHDL ideas
      program test
      integer xyz
      integer a, b, c, n(1000), temp(1000)
      do i = 1, 1000
         n(i) = xyz (a, b, c, temp)
      end do
      end

The variable temp is allocated once, outside the loop that calls the function. This is efficient FORTRAN code because the space is allocated only once.
With an FPGA design you would want to allocate the temporary space on the FPGA.
FORTRAN to VHDL ideas
Original, with a real delta:

      program test
      integer xyz
      integer a, b, n(1000)
      real delta
      delta = 0.01
      do i = 1, 1000
         n(i) = xyz (a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a .gt. b*delta) then
         xyz = a
      else
         xyz = b
      endif
      return
      end

Converted, with an integer delta:

      program test
      integer xyz
      integer a, b, n(1000)
      integer delta
      delta = 100 ! 1/delta
      do i = 1, 1000
         n(i) = xyz (a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a*delta .gt. b) then
         xyz = a
      else
         xyz = b
      endif
      return
      end

Convert real variables to integers where possible.
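The two forms of the comparison can be checked against each other in software before committing to hardware. This is a minimal Python sketch of the same test; the function names are hypothetical, and the range swept is arbitrary.

```python
def xyz_real(a, b, delta=0.01):
    """Original form: compare against a real-valued scale factor."""
    return a if a > b * delta else b


def xyz_int(a, b, inv_delta=100):
    """Converted form: multiply the other side by 1/delta, integers only."""
    return a if a * inv_delta > b else b


# Sweep a range of integer inputs and count any disagreement.
mismatches = sum(
    xyz_real(a, b) != xyz_int(a, b)
    for a in range(0, 20)
    for b in range(0, 2000)
)
print("mismatches:", mismatches)
```

The integer form needs only an integer multiplier and comparator, which is much cheaper in FPGA logic than a floating point multiply.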
FORTRAN to VHDL ideas
      function xyz (i,j,mode)
      integer i,j,mode
      do i = 1, 1000
         do j = 1, 1000
            if (mode .eq. 2) then
               if (a(i,j,k) .gt. b(i,j,k)) then
                  xyz = a
               else
                  xyz = b
               end if
            else
               xyz = 0
            end if
         end do
      end do
      return
      end

Move code that doesn't change outside the function. Maybe make multiple cores, one for each mode.
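One way to read this advice, sketched in Python with hypothetical function names: test the loop-invariant mode once and dispatch to a core specialized for it, instead of testing it on every element.

```python
def xyz_generic(a, b, mode):
    """Mode is tested for every element, as in the slide's loop."""
    out = []
    for x, y in zip(a, b):
        if mode == 2:
            out.append(x if x > y else y)
        else:
            out.append(0)
    return out


def core_mode2(a, b):
    """Core specialized for mode 2: element-wise larger value."""
    return [x if x > y else y for x, y in zip(a, b)]


def core_other(a, b):
    """Core for every other mode: all zeros."""
    return [0] * len(a)


def xyz_specialized(a, b, mode):
    """Pick the core once, outside the loop."""
    core = core_mode2 if mode == 2 else core_other
    return core(a, b)


a, b = [1, 5, 3], [4, 2, 3]
print(xyz_specialized(a, b, 2), xyz_specialized(a, b, 1))
```

On an FPGA each specialized core would be a separate, smaller design, so the mode test costs no logic inside the pipeline.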
Mixing FPGAs and MPI
It gets a bit tricky mixing FPGAs with an MPI code. The XD1 has 2 or 4 Opterons per node and only one FPGA. Only one Opteron is able to grab the FPGA at a time.
[Figure: three example placements of Job1 and Job2 processes across nodes, showing which job holds each node's FPGA. Where processes from both jobs share a node, only one of them can have the FPGA, and it is marked "Not Available" to the other.]
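One possible policy for an MPI code, sketched here as an assumption rather than anything the slide prescribes: let the lowest rank on each node own that node's FPGA, and have the other ranks on the node hand FPGA work to it or fall back to the CPU. The function name and the rank-to-node mapping are hypothetical.

```python
def fpga_owners(rank_to_node):
    """Map each node name to the lowest MPI rank running on it."""
    owners = {}
    for rank in sorted(rank_to_node):
        # setdefault keeps the first (lowest) rank seen for each node
        owners.setdefault(rank_to_node[rank], rank)
    return owners


# Hypothetical placement: two ranks per node on three nodes
placement = {0: "node0", 1: "node0", 2: "node1",
             3: "node1", 4: "node2", 5: "node2"}
owners = fpga_owners(placement)
for rank, node in sorted(placement.items()):
    role = "owns FPGA" if owners[node] == rank else "CPU only"
    print(f"rank {rank} on {node}: {role}")
```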
Cray XD1 FPGA Examples
Random Number Example
Processor RAP Mersenne Twister RNG
pseudo-random numbers
• FPGA implements “Mersenne Twister” RNG algorithm often used for Monte Carlo analysis. The algorithm generates integers with a uniform distribution and won't repeat for 2^19937 - 1 values.
• FPGA automatically transfers generated numbers into two buffers located in the processor's local memory.
• Processor application alternately reads the pseudo-random numbers from two buffers. As processor marks the buffers as 'empty', the FPGA refills them with new numbers.
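The two-buffer handshake can be modeled in software. The sketch below is a sequential simulation, not the FPGA design: Python's `random` module (itself a Mersenne Twister) stands in for the FPGA generator, and the class name, buffer size, and seed are all hypothetical.

```python
import random

BUFFER_WORDS = 8  # hypothetical buffer size, kept small for the demo


class PingPong:
    """Two buffers with full/empty flags, standing in for processor memory."""

    def __init__(self, seed=2005):
        self.rng = random.Random(seed)  # stand-in for the FPGA generator
        self.buffers = [[], []]
        self.full = [False, False]

    def producer_step(self):
        """'FPGA' side: refill every buffer currently marked empty."""
        for i in range(2):
            if not self.full[i]:
                self.buffers[i] = [self.rng.getrandbits(32)
                                   for _ in range(BUFFER_WORDS)]
                self.full[i] = True

    def consumer_read(self, i):
        """'Processor' side: take buffer i and mark it empty again."""
        assert self.full[i], "consumer must wait for a full buffer"
        data = self.buffers[i]
        self.full[i] = False
        return data


def draw(n_buffers, seed=2005):
    """Alternate between the two buffers, as the slide describes."""
    pp = PingPong(seed)
    out = []
    for k in range(n_buffers):
        pp.producer_step()
        out.extend(pp.consumer_read(k % 2))
    return out


print(len(draw(4)), "numbers drawn")
```

Because the consumer alternates buffers and the producer refills only empty ones, the numbers come out in the same order the generator produced them. On the real system the two sides run asynchronously, so the processor can work on one buffer while the FPGA fills the other.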
MTA Example
[Figure: execution flow between the Opteron and the Application Accelerator through the RAP, with Buffer A and Buffer B in Opteron memory]
• Load/Start a.out in Opteron's memory
• Call FPGA_OPEN
• Call FPGA_LOAD
• Call FPGA_SET_FTRMEM (allocate memory)
• Call FPGA_START
• FPGA checks buffer flags
• FPGA generates random numbers
• FPGA toggles buffer flag
• Opteron consumes random numbers
• Opteron/FPGA run asynchronously
• Call FPGA_CLOSE
• Opteron exits
Random Number Results
Source            Platform                     Speed (32 bit integers/second)   Size
Original C Code   2.2 GHz Opteron              ~101 Million                     N/A
VHDL Code         FPGA (XC2VP30-6) @ 200 MHz   ~319 Million                     ~25% of chip (includes RapidArray Core)

• FPGA provides 3X performance of fastest available Opteron
• Algorithm takes up a small portion of the smallest FPGA.
• Performance is limited by speed at which numbers can be written into processor memory, not by FPGA logic. The logic could easily produce 1.6 billion integers/second by increasing parallelism.
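The figures above can be turned into a speedup and a memory write rate with a few lines of arithmetic. The only assumption is 4 bytes per 32 bit integer with no transfer overhead.

```python
BYTES_PER_INT = 4  # 32 bit integers


def speedup(fpga_rate, cpu_rate):
    """Ratio of FPGA to Opteron generation rate."""
    return fpga_rate / cpu_rate


def write_gb_per_s(ints_per_s):
    """Memory write bandwidth needed to deliver the numbers, in GB/s."""
    return ints_per_s * BYTES_PER_INT / 1e9


measured = write_gb_per_s(319e6)     # rate reported for the FPGA design
projected = write_gb_per_s(1.6e9)    # rate the logic alone could reach
print(f"speedup {speedup(319e6, 101e6):.2f}x, "
      f"{measured:.3f} GB/s now, {projected:.1f} GB/s projected")
```

The measured rate needs about 1.3 GB/s of writes into processor memory, while the projected 1.6 billion integers/second would need 6.4 GB/s, which is consistent with the link rather than the logic being the limit.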
Ising Model with Monte Carlo
Code was developed by Martin Siegert at Simon Fraser University
Uses the MTA random number generation design
Runs 2.5 times faster with the FPGA
Should run faster with the newest MTA design, which returns floating point random numbers instead of integers.
Tar file available for the Cray XD1
FFT design from DSPlogic
Code was developed by Mike Babst and Rod Swift at DSPlogic
Uses 16-bit fixed point data as input and 32-bit fixed point as output, which yields an accuracy similar to single precision results posted at FFTW web site (www.fftw.org)
A one dimensional complex FFT of length 65536 on the FPGA is about 5 times faster than on the 2.2 GHz Opteron using FFTW.
Packing the data more can double the performance to 10x.
Performance depends on the size of the data.
Smith-Waterman
Code was developed internally by Cray
CUPS = Cell Updates Per Second
Rate = FPGA frequency * cells/clock * number of S-W processing elements
• Current: 80 MHz * 1 * 32 = 2.6 Billion CUPS, 60% of the chip
• Optimization: 100 MHz * 1 * 50 = 5 Billion CUPS
• Virtex 4 FPGA: 100 MHz * 1 * 150 = 15 Billion CUPS
• Opteron using SSEARCH34 = 100 Million CUPS
The current version runs 25 times faster than a 2.2 GHz Opteron. The nucleotide (4-bit) version is running in house. The amino acid (8-bit) version was just finished and is being incorporated into SSEARCH to make it easier to use.
Smith-Waterman on the FPGA is about 10 times faster than BLAST on the Opteron.
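The rate formula can be evaluated directly for the three FPGA configurations. In this sketch the function name is hypothetical, and the Opteron baseline is the 100 Million CUPS figure quoted for SSEARCH34.

```python
OPTERON_CUPS = 100e6  # SSEARCH34 on the Opteron, as quoted


def cups(freq_mhz, cells_per_clock, elements):
    """Cell updates per second for a given clock and number of elements."""
    return freq_mhz * 1e6 * cells_per_clock * elements


configs = {
    "Current": (80, 1, 32),
    "Optimization": (100, 1, 50),
    "Virtex 4 FPGA": (100, 1, 150),
}
for name, cfg in configs.items():
    rate = cups(*cfg)
    print(f"{name:14s} {rate / 1e9:5.2f} Billion CUPS, "
          f"{rate / OPTERON_CUPS:5.1f}x the Opteron")
```

The current design works out to 2.56 Billion CUPS, which the slide rounds to 2.6, and the ratio to the Opteron is where the "25 times faster" figure comes from.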
Los Alamos Traffic Simulation
Code was developed by Justin Tripp, Henning Mortveit, Anders Hansson, and Maya Gokhale at Los Alamos National Labs
Uses FPGA for straight road sections and Opteron for everything else.
Runs 34.4 times faster with the FPGA relative to a 2.2 GHz Opteron
System integration issues must be optimized to exploit this speedup in the overall simulation.
Other XD1 FPGA Projects
Financial company using the random number generation core for a Monte Carlo simulation.
Seismic companies using FPGAs for FFT and convolutions.
Pharmaceutical companies using FPGAs for searching and sorting.
NCSA is working on a civil engineering “dirt” code.
University of Illinois is working on porting part of NAMD to an FPGA.
Other Useful FPGA designs
JPEG2000 developed by Barco Silex, currently runs on Virtex FPGAs. Working with them on a real time, high resolution compression project.
64-bit floating point Matrix Multiplication by Ling Zhuo and Viktor Prasanna at the University of Southern California. Gets 8.3 Gflops on an XC2VP125, compared to 5.5 Gflops on a 3.2 GHz Xeon.
Finite-Difference Time-Domain (FDTD) by Ryan Schneider, Laurence Turner, and Michal Okoniewski at University of Calgary.
Questions