The Future of Vector Processors
M. Valero, R. Espasa and J. Corbal UPC, Barcelona
Kyoto, May 28th, 1999
TOP-500 and Vector Processors
[Figure: number of vector systems in the TOP-500 list (# Systems) and their share of peak performance (% Peak Perf.), June 1993 to 1999. Values labeled in the figure: 310, 96, 65, 43, 15.]
November 98: Fujitsu 27, NEC 18, SGI 15, Hitachi 5
The Future of Vector ISA’s
• Cross-Pollination of Vector/Superscalar/VLIW
– MMX, Embedded...
• Very-high Performance Architectures
– ILP techniques, IRAM, SDRAM
• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications
Talk Outline
• The Past:
– Initial Motivation for Vector ISA
– Evolution of Vector Processors
• The Present:
– Recent Announcements
– The Decline of Vector Processors
– Cross-Pollination of Vector/Superscalars/VLIW
• The Future:
– Very-high Performance Architectures
– Vector Microprocessors (Numerical Accelerators, Multimedia Applications)
• Conclusions
Characteristics of Numerical Applications
• Examples: Weather prediction, mechanical engineering
• Data structures: Huge matrices (dense, sparse)
• Data types: 64 bits, floating point
• Highly repetitive loops
• Compute-intensive
• Data-Level Parallel
Initial Motivations for Vector Processors
      real*8 x(9992), y(9992), u(9984)
      subroutine loop
      integer I
      real*8 q
      do I=1,9984
         q = u(I) * y(I)
         y(I) = x(I) + q
         x(I) = q - u(I) * x(I)
      enddo
      end

[Figure: dependence graph of one iteration. Loads of y(I), u(I) and x(I) feed two multiplies (one producing q), one add and one subtract. For I = 1 to 9984.]
Execution of scalar code
Loop:  ld    R1,0(R10)      ; y(I)
       ld    R2,0(R11)      ; u(I)
       ld    R3,0(R12)      ; x(I)
       mulf  R4,R1,R2       ; q = u(I) * y(I)
       mulf  R5,R2,R3       ; u(I) * x(I)
       add   R11,R11,#8
       addf  R6,R4,R3       ; x(I) + q
       subf  R7,R4,R5       ; q - u(I) * x(I)
       st    0(R10),R6      ; store y(I)
       add   R10,R10,#8
       st    0(R12),R7      ; store x(I)
       sub   R13,R13,#1
       bne   Loop
       add   R12,R12,#8

[Pipeline diagram: each instruction flows through the stages IF, D/L, ALU, M, W; floating-point operations occupy the ALU for three cycles.]
14 cycles / iteration. Perfect memory !!!
Generation of Vector Code
       ld.w    #9984,s2
       ld.w    #0,a2
       ld.w    #8,vs
Loop:  mov     s2,vl          ; vl <- min(s2,128)
       ld.l    -y(a2),v0      ; v0 <- y(I:I+127)
       ld.l    -u(a2),v1      ; v1 <- u(I:I+127)
       mul.d   v1,v0,v2       ; q(I:I+127) <- u(I:I+127) * y(I:I+127)
       ld.l    -x(a2),v3      ; v3 <- x(I:I+127)
       add.d   v3,v2,v0       ; v0 <- x(I:I+127) + q(I:I+127)
       st.l    v0,-y(a2)      ; y(I:I+127) <- x(I:I+127) + q(I:I+127)
       mul.d   v1,v3,v1       ; v1 <- u(I:I+127) * x(I:I+127)
       sub.d   v2,v1,v0       ; v0 <- q( ) - u( ) * x( )
       st.l    v0,-x(a2)      ; x(I:I+127) <- q( ) - u( ) * x( )
       add.w   #1024,a2       ; increment index (128 * 8)
       add.w   #-128,s2       ; 128 iterations less to process
       lt.w    #0,s2
       jbrs.t  loop

A vector iteration is equivalent to 128 scalar iterations.
[Figure: the 128 elements (0, 1, 2, ..., 127) of a vector register, all processed by one instruction.]
DLP !!!
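The strip-mined structure of the vector code can be sketched in Python. This is only an illustration: the function names are hypothetical, plain lists stand in for vector registers, and the strip length of 128 is taken from the slide's comments.

```python
MVL = 128  # maximum vector length, as in the slide's 128-element strips


def scalar_loop(x, y, u):
    """Reference version: the Fortran loop, one element at a time."""
    x, y = list(x), list(y)
    for i in range(len(u)):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]
    return x, y


def strip_mined_loop(x, y, u, mvl=MVL):
    """Same computation, processed in strips of at most mvl elements."""
    x, y = list(x), list(y)
    n = len(u)
    strips = 0
    for start in range(0, n, mvl):
        vl = min(n - start, mvl)                  # vl <- min(remaining, 128)
        s = slice(start, start + vl)
        v0, v1, v3 = y[s], u[s], x[s]             # three vector loads
        v2 = [a * b for a, b in zip(v1, v0)]      # q = u * y
        y[s] = [a + b for a, b in zip(v3, v2)]    # y = x + q
        ux = [a * b for a, b in zip(v1, v3)]      # u * x
        x[s] = [a - b for a, b in zip(v2, ux)]    # x = q - u * x
        strips += 1
    return x, y, strips
```

For the 9984 iterations of the example, the strip-mined version executes 9984 / 128 = 78 passes through the loop body.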
Execution of vector code
Loop:  mov     s2,vl
       ld.l    -y(a2),v0
       ld.l    -u(a2),v1
       mul.d   v1,v0,v2
       ld.l    -x(a2),v3
       add.d   v3,v2,v0
       st.l    v0,-y(a2)
       mul.d   v1,v3,v1
       sub.d   v2,v1,v0
       st.l    v0,-x(a2)
       add.w   #1024,a2
       add.w   #-128,s2
       lt.w    #0,s2
       jbrs.t  loop

A vector iteration is equivalent to 128 scalar iterations
14 vector instructions = 1792 scalar instructions
5.1 cycles / iteration
Memory latency = 24 cycles !!!
One L/S Port One Adder, One Multiplier
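The figures on this slide and on the scalar slide can be cross-checked with a few lines of Python. The constants are the ones quoted on the slides; the function names are illustrative only.

```python
VL = 128                        # elements per vector instruction
VECTOR_INSTRUCTIONS = 14        # instructions in the vector loop body
SCALAR_CYCLES_PER_ITER = 14.0   # scalar pipeline, perfect memory
VECTOR_CYCLES_PER_ITER = 5.1    # vector machine, 24-cycle memory latency


def scalar_equivalent(vector_instructions=VECTOR_INSTRUCTIONS, vl=VL):
    """Scalar instructions replaced by one pass of the vector loop."""
    return vector_instructions * vl


def speedup(scalar=SCALAR_CYCLES_PER_ITER, vector=VECTOR_CYCLES_PER_ITER):
    """Ratio of cycles per original loop iteration."""
    return scalar / vector
```

With these numbers, 14 x 128 = 1792 scalar instructions, and the vector code is roughly 2.7 times faster per iteration even though it pays a 24-cycle memory latency.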
Vector Processor
[Block diagram: main memory holds instructions (scalar + vector) and data. A control unit fetches instructions and handles conditional branches. Scalar data go to the scalar registers, where Ri := Rj op Rk; vector data go to the vector registers, where VR[i] := VR[j] op VR[k].]
Why Vector ISA ?
• Natural way to express Data-Level Parallelism
– Fewer instructions (3)
• Easy way to convey this information to the hardware
• Good hardware implementation
– Affordable/incremental parallelism
– Simple control/faster clock (2) (1)
• Mechanism to deal with memory latency
• Problem: Memory Bandwidth...
Vector versus Scalar Architectures
[Figure: number of instructions executed (in millions, y-axis 0 to 120) on the R10k and on the Convex C3, per benchmark.]
Vector instruction semantics “encode” many different scalar instructions:
- Loop counters
- Branch computations
- Address generation
Rate from 140 to 2
F. Quintana, R. Espasa and M. Valero, “A case for merging the ILP..”, PDP-98
Easy to convey information to the hardware
• Data path:
– No pressure at fetch, decode and issue
– Decentralized control
– Faster cycle times
• Vector memory instructions:
– Spatial locality can be made clearly visible to the hardware through “strides”
– No overhead and good prefetching
– Reduction of memory latency overhead
– Memory uses facts, not guesses
Key parameters for vector processors
• Cycle time
• Scalar processor:
– # of registers and FU’s – Cache
• Vector processor
– # of vector registers – # of FU’s and # of pipes/ FU
• Connection to memory:
– # of busses and width
• Number of processors
Cray Y-MP Architecture
[Diagram: 8 processors (P0 to P7) connected through 4*4 and 8*8 switches to 256 memory modules (numbered 0 to 255), plus a synchronization unit.]
tc = 6 ns.
256 modules, ta = 30 ns.
333 Mflops / processor
Vector Processors (1 of 2)
Year   Machine            Tc (ns)
1972   TI-ASC             60
1973   STAR-100           40
1975   Cray-1             12.5
1982   Fujitsu VP 2000    7
1983   Cray-XMP           9.5
1983   Hitachi S810/20    19/14
1984   NEC-SX2            6
1985   Cray-2             4.1
1987   Hitachi S820/80    4

Remaining columns (#FPU’s, Flops/cycle, LD/ST path, words/cycle, #regs, Elements/register), values in slide order:
2 4 2 2 2 2 2 2 6?? 4 2 3 | LS L,L,S LS | 4(32) 3 1 4 2+1 | 8 64 1024-32 64
4 LS,LS 2 L,L,S
8-256 8
12?? L,L,L,LS 8 or 2 16 L,LS 8 or 4 2 LS 1 12 L,LS 8 or 4
32 256 8+8k 256/64-256 8 64 32 512
Vector Processors (2 of 2 )
Year   Machine            Tc (ns)
1987   Convex C2          40
1988   Cray Y-MP          6.3
1989   Fujitsu VP 2600    3.2
1990   NEC SX-3           2.9
1992   Cray C90           4
1993   Hitachi S-3800     2
1994   Convex C4          7.4
1996   NEC SX-4           8
1998   NEC SX-5           4

Remaining columns (#FPU’s, Flops/cycle, LD/ST path, words/cycle, #regs, Elements/register), values in slide order:
2 2 LS 2 4 4 2 2 L,L,S 16 LS,LS 16 L,L,S 4 L,L,S | 1 2+1 | 8 128 8 64
8 2048-64 64-2048 8+4 8+16k 256/64-256 4+2 8 128 -
2(?) 16(?) L,L,L,LS 8 or 2 2 2 LS 1 2 16 LS,LS 16 2 32 LS,LS 32
8 128 8+16k 256/64-256 8+16k 256/64-256
Evolution of Cray Machines
Machine     Year   Tc (MHz)   Mflops/CPU   # CPU's   Memory BW/CPU   Load latency (ns)
Cray-1      1976   80         160          1         640 MB/s        150
Cray-XMP    1982   105        210          2         2.5 GB/s        123
Cray-2      1982   243        486          4 or 8    1.9 GB/s        200
Cray-YMP    1989   167        334          8         4 GB/s          100
Cray-C90    1992   243        970          16        12 GB/s         95
Cray-J90    1995   100        200          32        1.6 GB/s        340
Cray-T90    1994   450        1800         32        21 GB/s         70/116
Cray-SV-1   1998
Tc : x6
ILP : x2
# of proc. x32
Total : x400
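The rounded factors above can be recomputed from the Cray-1 and Cray-T90 rows of the table. The sketch below uses only those table values; the dictionary layout and function name are illustrative.

```python
# Values taken from the Cray-1 and Cray-T90 rows of the table above.
CRAY_1 = {"mhz": 80, "mflops_per_cpu": 160, "cpus": 1}
CRAY_T90 = {"mhz": 450, "mflops_per_cpu": 1800, "cpus": 32}


def improvement_factors(old, new):
    """Split the total peak improvement into clock, per-cycle and CPU-count parts."""
    clock = new["mhz"] / old["mhz"]
    per_cpu = new["mflops_per_cpu"] / old["mflops_per_cpu"]
    ilp = per_cpu / clock                  # flops per cycle gained per CPU
    cpus = new["cpus"] / old["cpus"]
    return {"clock": clock, "ilp": ilp, "cpus": cpus, "total": per_cpu * cpus}


factors = improvement_factors(CRAY_1, CRAY_T90)
```

The exact ratios are 5.625 (clock), 2 (operations per cycle) and 32 (processors), for a total of 360, which the slide rounds to x6, x2, x32 and about x400.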
Courtesy of SGI/Cray
Vector Innovations (1 of 2 )
• Star-100/Cyber-200 had many of them:
– Gather/scatter
– Masked operations for conditionals
• Cray-1 introduced vector registers
• BSP had instructions for recurrences and multioperand
• Instructions to optimize masked vector operations
• Instructions to handle Index and Bit sequence on mask register
• Flexible addressing of subvector registers (C4)
Vector Innovations ( 2 of 2 )
• Multi-pipes (Star/Cyber)
• Vector with Virtual Memory
• Flexible chaining (multi-ported register-file)
• Multilevel register-file (NEC)
• Scalar units sharing vector FU’s (Fujitsu)
• Combined vector and scalar instructions (Titan)
• Short vectors (CS-2 and CM-5)
• Scalar processor: LIW (Fujitsu), SS (NEC)
Automatic vectorization
• Compiler technology for vectorization: over 25 years of development
– Dependence analysis
– Elimination of false dependences
– Strip mining
– Loop interchange
– Partial vectorization
– Idiom recognition
– IF conversion
– Vector parallelization
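As one illustration of the list above, IF conversion replaces a branch inside the loop body by a mask plus a masked merge. The sketch below is a hypothetical example (the loop and function names are not from the talk), with Python lists standing in for vector and mask registers.

```python
def branchy(a, b):
    """Original loop: a data-dependent branch in every iteration."""
    out = list(b)
    for i in range(len(a)):
        if a[i] > 0:
            out[i] = 2 * a[i]
    return out


def if_converted(a, b):
    """IF-converted form: compare into a mask, compute everywhere, merge under mask."""
    mask = [v > 0 for v in a]          # vector compare -> mask register
    doubled = [2 * v for v in a]       # computed for every element
    return [d if m else old for m, d, old in zip(mask, doubled, b)]
```

Both versions return the same result; the second has no control dependence inside the loop, so each of its three steps maps to one vector instruction.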
Vector Architectures : Present
• New announcements (NEC, Cray, Fujitsu) • The decline of vector processors • Cross-pollination of Vector/ Superscalar/ VLIW processors
NEC SX-5
• Announced on June 5th, 1998
• 8 Gflops, CMOS, tc = 4 ns
• Superscalar processor at 500 Mflops
• 32 results/cycle (2 FPU, 16-pipe)
• 32 data memory accesses/cycle (2 ports, 16 data/port). Memory bandwidth of 64 GB/s
• System composed of 32 nodes of 128 Gflops, providing 4 Tflop/s
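The SX-5 figures are mutually consistent, which a short calculation shows. The sketch assumes 64-bit (8-byte) data words, which the slide does not state explicitly; the function names are illustrative.

```python
WORD_BYTES = 8  # assumption: each memory access moves one 64-bit word


def clock_mhz(tc_ns):
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / tc_ns


def peak_gflops(tc_ns, results_per_cycle):
    return clock_mhz(tc_ns) * results_per_cycle / 1000.0


def memory_bandwidth_gb_s(tc_ns, accesses_per_cycle, word_bytes=WORD_BYTES):
    return clock_mhz(tc_ns) * accesses_per_cycle * word_bytes / 1000.0


def system_tflops(nodes, gflops_per_node):
    return nodes * gflops_per_node / 1000.0
```

With tc = 4 ns (250 MHz), 32 results per cycle give 8 Gflops, 32 accesses per cycle give 64 GB/s, and 32 nodes of 128 Gflops give about 4.1 Tflop/s.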
Cray SV1
• Announced on June 16th, 1998
• CMOS, 250 MHz and 4 Gigaflops/proc.
• Vector cache memory
• 2 FU’s of 8 operations/cycle
• “Multi-Streaming” Processor
• Scalable vector architecture (32 nodes of 32 processors…4 Teraflops)
• Future processor enhancements !!!
Fujitsu VP5000
• Announced on April 20th, 1999
• 9.2 Gflop/s, CMOS, 0.22 µm, 33 M transistors/chip
• Linpack 1000*1000 gives 8758 Mflop/s
• Crossbar provides 2*1.6 GB/s per processor
• System composed of up to 512 PE’s, or 4.9 Teraflops
• Maximum of 16 GB/PE, or 8 TB for 512 PE’s
The decline of vector processors
• Why have vector machines declined so fast in popularity?
– Cost (scalar parallel machines use commodity parts)
– Too restricted in applications (lack of vectorization in many programs)
• Massive use of computers to run so-called “Non-numerical Applications”
Characteristics of non-numerical Applications
• Examples: OLTP, DSS, simulators, games…
• General data structures: Lists, trees, tables…
• Data types: Scalar integers of 8 to 64 bits
• Frequent control flow change … Speculation
• Short distance data dependencies … Forwarding
• Instruction/data locality … Caches
• Fine-grain ILP … Out-of-order
Micro Killers ???
Peak performance = Tc * ILP (clock rate in MHz times operations per cycle)

Year   Machine       Tc (MHz)   #op/cycle   Peak Perf. (Mflops)
1976   Cray-1        80         2           160
1978   I-8086        10
1992   Cray C-90     243        4           970
1992   Alpha 21064   150        1           150
1994   Pentium       100        1           100
1996   NEC SX-4      125        16          2000
1997   IBM P2SC      160        4*          640
1997   Alpha 21164   500        2           1000
1998   HP PA8200     240        4*          960
1998   NEC SX-5      250        32          8000
1998   Pentium       400        1           400
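The peak figures follow directly from the formula, as the following sketch checks. The rows are copied from the slide; entries marked 4* are treated as 4 operations per cycle, which is an assumption since the slide does not explain the asterisk.

```python
# (year, machine, clock in MHz, operations per cycle, peak Mflops quoted on the slide)
ROWS = [
    (1976, "Cray-1", 80, 2, 160),
    (1992, "Cray C-90", 243, 4, 970),
    (1992, "Alpha 21064", 150, 1, 150),
    (1994, "Pentium", 100, 1, 100),
    (1996, "NEC SX-4", 125, 16, 2000),
    (1997, "IBM P2SC", 160, 4, 640),
    (1997, "Alpha 21164", 500, 2, 1000),
    (1998, "HP PA8200", 240, 4, 960),
    (1998, "NEC SX-5", 250, 32, 8000),
    (1998, "Pentium", 400, 1, 400),
]


def peak_mflops(clock_mhz, ops_per_cycle):
    """Peak performance = clock rate (MHz) * operations per cycle."""
    return clock_mhz * ops_per_cycle


def relative_error(row):
    """Relative gap between the computed peak and the value quoted on the slide."""
    _, _, mhz, ops, quoted = row
    return abs(peak_mflops(mhz, ops) - quoted) / quoted
```

Every quoted value matches the product to within rounding (the Cray C-90 gives 972 against the quoted 970).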
Bandwidth and Performance
Alpha 21264 (500 MHz), IBM Power chip (160 MHz), HP-8200 (240 MHz)
Main memory L2 cache size
Cray T90 (450 MHz), NEC SX-4 (125 MHz)
2 GB/s 16 MB 5 Gb/s
2 GB/s 768 MB/s 128 KB 2 MB 3.84 GB/s 704 bytes
24 GB/S 16 GB/S
L1 cache size
64 KB 8 GB/s
24 GB/s 8 KB
16 GB/s 128 KB
Register file size 576 bytes Functional Units
16 GB/s 5.12 GB/s 15.3 GB/s 43.2GB/s 48 GB/s 2 (2 pipe) 2 (2 pipe) 2 (2 pipe) 2 (8 pipes) 2 FPU 1 Gflops 640 Mflops 960Mflops 1.8Gflops 2 Gflops
Peak performance and Bandwidth
[Figure: efficiency (%) versus vector length (0 to 4000) for the VPP500 and the IBM RS6000*, running the kernel below.]
Z(I)=C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))
Courtesy of Fujitsu
* Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2
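The kernel is a nested (Horner-style) expression over ten input arrays. A minimal Python sketch, with hypothetical function names and lists in place of Fortran arrays, makes its compute-to-memory ratio explicit.

```python
def kernel(c, arrays):
    """Evaluate Z(I) for every I.

    c      -- the ten constants C0..C9
    arrays -- ten equal-length lists, in slide order A, B, C, D, E, F, G, H, K, L
    """
    n = len(arrays[0])
    z = []
    for i in range(n):
        acc = c[9] + arrays[9][i]            # innermost term: C9 + L(I)
        for k in range(8, -1, -1):           # then C8 + K(I)*(...), up to C0 + A(I)*(...)
            acc = c[k] + arrays[k][i] * acc
        z.append(acc)
    return z


def flops_and_words_per_element():
    """Count floating-point operations and memory words touched per element."""
    flops = 1 + 9 * 2        # one add, then nine multiply-add pairs
    words = 10 + 1           # ten loads and one store
    return flops, words
```

Each element needs 19 floating-point operations but also 11 memory words, so sustained performance on this kernel depends on memory bandwidth as much as on peak arithmetic rate.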
Vector ideas used in SS’s/VLIW processors
• Address prediction and prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicated execution (VLIW)
• Multiply and add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading (2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination
Predictions for ALL instructions
[Figure: prediction accuracy (%, 0 to 100) per benchmark (compress, gcc, go, ijpeg, m88ksim, perl, xlisp) for four predictors: Last value, Stride, Context 1, Context 3.]
Y. Sazeides and J.E. Smith, “The predictability of data values”, MICRO-30, 1997
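Two of the predictors named in the figure can be sketched in a few lines. These are simplified textbook forms written for illustration; they are not the exact predictors evaluated in the cited paper, and the context-based predictors are omitted.

```python
def last_value_accuracy(values):
    """Predict that each value repeats the previous one; return the hit rate."""
    hits = sum(1 for prev, cur in zip(values, values[1:]) if cur == prev)
    return hits / (len(values) - 1)


def stride_accuracy(values):
    """Predict previous value plus the last observed stride; return the hit rate."""
    hits = 0
    for i in range(2, len(values)):
        stride = values[i - 1] - values[i - 2]
        if values[i] == values[i - 1] + stride:
            hits += 1
    return hits / (len(values) - 2)
```

On a sequence such as the addresses of a stride-8 array walk, the last-value predictor never hits while the stride predictor always does, which is the behaviour vector memory instructions expose explicitly through their stride operand.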
Characterization of Vector Programs
[Figure: % vector access, % vectorization and average vector length (Avg. VL) per benchmark, y-axis 0 to 100.]
R. Espasa, “Advanced Vector Architectures”, PhD Thesis, Feb. 97
SS’s ideas usable in vector processors
• Decoupled Vector Architectures
• Multithreaded Vector Architectures
• Out-of-order Vector Architectures
• Simultaneous Multithreaded Vector Architecture
• Victim Register File
R. Espasa, M. Valero and J.E. Smith, HPCA96, HPCA97, MICRO97, ICS97...
ILP+DLP: Out-of-order Vector
[Block diagram: Fetch and Decode & Rename stages feeding the A registers, S registers, V registers and the LD/ST unit, with a Reorder Buffer and Memory.]
R. Espasa, M. Valero, J.E. Smith, “Out-of-order Vector Architecture”, MICRO30, 1997.
OOO Vector Performance
R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.
Vector Processors : The Future
• Very high-performance architectures
• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications
Architectures for a Billion Transistors
• Advanced/Superspeculative Architectures
• Trace Processors
• Simultaneous Multithreading
• Multiprocessor on a chip
• RAW processors
• IRAM
Billion-Transistor Architectures, IEEE Computer, Sept. 1997
SMV
• Simultaneous Multithreaded Vector Arch.
• Mixes three paradigms
– DLP: vector unit
– ILP: O-o-O execution
– TLP: multithreaded fetch unit
• Requires a memory system with
– high performance at low cost
– low pin-count
R. Espasa and M. Valero, “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO, Sep. 1997
Billion Trans. Vector Architecture
[Block diagram of the pipeline (Inst fetch, Inst decode, Instruction issue, Execution pipeline): I cache and fetch unit with 8 program counters (one/thread) and a thread ID; decode with 8 rename tables (one/thread); floating-point, integer, memory and vector queues (64 entries each); FPRF, IRF and vector register file (128 registers each); FPU 1-2, ALU 1-2, two address generators and VFU 1-4; reorder buffer; memory.]
R. Espasa and M. Valero, “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO, Sep. 1997
SMV Performance
R. Espasa and M. Valero ¨Exploiting Instruction and Data-Level Parallelism¨IEEE MICRO Sep. 1997
V-IRAM1
0.18 µm, 200 MHz, 1.6GFLOPS(64b)/6.4GOPS(16b)/32MB
[Block diagram: 2-way superscalar processor with 8K I cache and 8K D cache; vector instruction queue; vector registers; vector functional units (+, x, ÷, load/store), each operating as 4 x 64, 8 x 32 or 16 x 16 bits; memory crossbar switch connecting the on-chip memory banks (M) over 4 x 64 paths; serial I/O and I/O ports.]
D.A. Patterson, “New directions in Computer Architecture”, Berkeley, June 1998
Conflict-free access to vectors
Idea: Out-of-order access
[Diagram: processors P1, P2, P3, ..., Pn connected through an interconnection network to memory modules grouped in sections.]
M. Valero et al., ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...
Command Memory System
Command = <@, Length, Stride, size>
Break commands into bursts at the section controller
[Diagram: processors P1, P2, P3, ..., Pn send commands through an interconnection network to the section controllers, which drive the memory modules of each section.]
J. Corbal, R. Espasa and M. Valero, “Command-Vector Memory System”, PACT98
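To make the command format concrete, the sketch below expands one command into element addresses and groups them by memory section. It is a hypothetical illustration, not the mechanism of the cited paper: it assumes the stride is counted in elements, `size` is the element size in bytes, and sections are word-interleaved.

```python
from collections import namedtuple

# Fields follow the slide's <@, Length, Stride, size>.
# Assumptions: stride is counted in elements, size is the element size in bytes.
Command = namedtuple("Command", ["address", "length", "stride", "size"])


def element_addresses(cmd):
    """Byte address of every element described by a single command."""
    return [cmd.address + i * cmd.stride * cmd.size for i in range(cmd.length)]


def bursts_per_section(cmd, num_sections):
    """Group the addresses by section (hypothetical word-interleaved mapping)."""
    bursts = {s: [] for s in range(num_sections)}
    for addr in element_addresses(cmd):
        section = (addr // cmd.size) % num_sections
        bursts[section].append(addr)
    return bursts
```

For example, `Command(0, 8, 1, 8)` with 4 sections describes eight addresses in one request, and each section controller receives a burst of two of them.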
System configuration in 2009
[Diagram: each node holds 32 chips of 200 GF (6.4 TFLOPS) and 5 TB of memory, connected by crossbars (X-bar) labeled 100 GB/sec and 800 GB/sec.]
Sustained per chip: Scalar 250 GFLOPS? Vector 1 TFLOPS?
32 SMP (cc-NUMA) nodes: 200 TFLOPS / 160 TB
T. Watanabe, SC98, Orlando.
Vector Microprocessors
• Ways of reducing the design impact
• Short Vectors (64 x 16 words = 8 Kbytes)
• Vector functional units shared with INT/FP units
• Vector register renaming to allow precise exceptions
• Cache hierarchy tuned to vector execution
• Vector data locality allows large data transactions
• Very large bandwidth between cache and vector registers
• High performance for numerical and multimedia applications
General Architecture
[Block diagram: Fetch and Decode stages with an I-Cache; INT and FP units; a vector register file (VRF) connected to a Vector Cache (path widths labeled 1024 and 8); a Rambus controller driving four RDRAM channels.]
Vector PC Vs SuperScalar
[Figure: bar chart (y-axis 0 to 25) for Hydro2D, Dyfesm, Swm256 and Tomcatv. Legend: OoO-SS 1x2, VEC 16 1x2, VEC 16 16x32.]
Cache Hierarchy
• Where should the Vector Cache be placed?
[Diagram: two alternatives, both backed by Direct Rambus memory: the vector cache (VC) beside the L1 cache, next to the CPU, or the vector cache beside the L2 cache.]
Performance of the cache hierarchies
[Figure: FLOPS/cycle for BDNA, FLO52 and HYDRO2D at x-axis values 2, 8, 16 and 32, comparing VECTOR CACHE on L1, VECTOR CACHE on L2 and PERFECT CACHE.]
Importance of media Applications
“On the next five years, (1998-2002), we believe that media processing will become the dominant force in computer architecture” (K. Diefendorf and P. K. Dubey in IEEE Computer Journal, Sep.97, pp. 43-45)
“90% of Desktop Cycles will Be Spent on Media Applications by 2000” ( Scott Kirkpatrick of IBM )
Characteristics of media Applications
• Examples: Image/speech processing, communications, virtual reality, graphics…
• Data structures: matrices and vectors
• Data types: Integer (8-32 bits), FP (32-64)
• Demand for high memory bandwidth
• Low data locality and latency problem
• No critical data-dependences
• Real time necessity
• Fine/coarse grain parallelism
Multimedia Applications and Architectures
[Diagram: scientific and multimedia applications mapped onto three architectures.]
• Superscalar + MMX: re-discover the parallelism at run-time using a lot of hardware
• VLIW: simple hardware, but loss of parallelism; as many instructions as the SS approach
• Vector Architectures: natural way to express and execute DLP applications
MMX-like processors
• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment
• The highlights of multimedia extensions are:
– Single Instruction, Multiple Data (SIMD) techniques
– New data types (Multimedia Vectors, 32/64 bits)
– Multimedia registers
– SIMD-like instructions, over small integer data types
MMX instruction example
• PADDW: Parallel ADD of 4x16-bit data type with Wrap Around (No Saturation)
[Diagram: two 64-bit registers, each holding four 16-bit elements (bits 0-15, 16-31, 32-47, 48-63), are added element by element: A1+B1, A2+B2, A3+B3, and xFFFF + x0006 = x0005 (wrap-around).]
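The wrap-around behaviour in the diagram can be reproduced with integer masks. The sketch models a 64-bit register as a Python integer with four 16-bit lanes; the function name simply mirrors the instruction mnemonic and is not an Intel API.

```python
MASK16 = 0xFFFF


def paddw(a, b):
    """Lane-wise add of four 16-bit elements packed in 64-bit integers, with wrap-around."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = ((a >> shift) & MASK16) + ((b >> shift) & MASK16)
        result |= (lane_sum & MASK16) << shift   # drop the carry out of the lane
    return result
```

The carry out of each lane is discarded, so `paddw(0xFFFF, 0x0006)` gives `0x0005` and never disturbs the neighbouring lane.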
Superscalar Multimedia Processors
                  PowerPC    Intel      Sun        MIPS V/    HP         Alpha
                  Altivec    MMX        VIS        MDMX       MAX2       MVI
Register File     32*128     8*64       32*64      32*64      32*64      32*64
Mapped Onto       Separate   FP         FP         FP         Integer    Integer
Integer Support   8/16/32    8/16/32    8/16/32    8/16 bit   16/32      8 bit
FP Support        Yes        MMX2       No         MIPS V     No         No
Usual stuff+      Lots       Lots       Lots       Lots       Some       None
Multiply/MAC      Lots       Mult       Mult       Lots       Some       None
Min/Max/Avg       Yes        No         No         Min/Max    Avg        Min/Max
Pack/Unpack       Yes        Yes        Yes        Yes        Yes        Yes
Byte Reordering   All        Some       Some       Many       All        None
Unaligned Data    3 Inst.    No         2 Inst.    Yes        No         No
Announced         2Q98       2Q96       4Q94       4Q96       4Q95       4Q96

Microprocessor Report, Vol. 12, No. 6, May 11, 1998
Multimedia Embedded Systems
• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
• ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS
Widen is better…(?)
• Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.
[Diagram: element-wise addition A + B = C on 16-bit elements for three register widths: 16 bits (A1 + B1 = C1), 64 bits (A1..A4 + B1..B4 = C1..C4) and 128 bits (A1..A8 + B1..B8 = C1..C8).]
VLIW : Widening vs Replication
Bus configurations:
[Diagram: four ways of connecting memory to the register file: one bus of 1 word; two buses of 1 word each; one bus of 2 words; two buses of 2 words each.]
D. López et al., “Increasing Memory Bandwidth with Wide Busses”, ICS-97
Widening and Replication Performance
[Figure: performance (y-axis 1 to 8) for x-axis values 2, 4, 8 and 16, with one curve each for Wide 1, Wide 2 and Wide 4.]
D. López et al., “Widening versus replicating...”, ICS98, MICRO98
Torrent T0 Microprocessor
• The first single-chip vector microprocessor.
• Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle
• Features:
– 16 vector registers (32 32-bit elements each)
– 2 vector arithmetic units (8 pipes each)
– Reconfigurable composite operation pipelines
– 128-bit wide external memory interface
– MIPS-II 32-bit instruction set scalar unit
K. Asanovic et al., “The T0 vector microprocessor”, Hot Chips VII, 1995
Vector versus Superscalar Processors
• Comparison of Die Area
– Processor die area (in mm², scaled to 0.25 µm)
[Figure: stacked bars (Control, Registers, Datapath), y-axis 0 to 250, for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO, Rob128. Values labeled in the figure: 14.73, 21.86, 37.77, 66.92, 67.77, 69.81, 250.0.]
C. G. Lee and D. J. DeVries, “Initial Results on …”, MICRO-30, 1997.
Vector versus Superscalar Processors
• Component Percentages
[Figure: percentage of die area (0 to 100) devoted to Datapath, Registers and Control for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO, Rob128.]
C. G. Lee and D. J. DeVries, “Initial Results on …”, MICRO-30, 1997.
Imagine project
• Focused on developing a programmable architecture that achieves performance similar to special-purpose hardware on graphics and image processing
• Matches the demands of media applications to current VLSI capabilities by using a stream-based programming model
• Most multimedia kernels exhibit a streaming nature
• Individual stream elements can be operated on in parallel, thus exploiting data parallelism
Bill Dally, “Tomorrow's Computing Engines”, Keynote, HPCA98
Imagine architecture
• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster operates on a single stream element under VLIW control
[Block diagram: streaming memory system with four SDRAM channels, a stream register file, arithmetic clusters 0 to 7 and a controller.]
Bill Dally, “Tomorrow's Computing Engines”, Keynote, HPCA98
Matrix extensions for Multimedia
• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.
[Diagram: with 16-bit elements, a scalar add handles one pair (A1 + B1 = C1), an MMX-like 64-bit register handles four (A1..A4 + B1..B4 = C1..C4), and a matrix register made of four 64-bit rows handles sixteen (A1..A16 + B1..B16 = C1..C16) in a single instruction.]
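A matrix-style add can be sketched as a vector loop whose element operation is itself a packed SIMD add. The code below is an illustration under stated assumptions, not the actual instruction set: rows are 64-bit integers with four 16-bit lanes, and all helper names are hypothetical.

```python
LANES = (0, 16, 32, 48)   # bit offsets of the four 16-bit lanes in a 64-bit row


def pack(elements):
    """Pack four 16-bit values into one 64-bit row (element 0 in the low bits)."""
    return sum((e & 0xFFFF) << s for e, s in zip(elements, LANES))


def unpack(row):
    return [(row >> s) & 0xFFFF for s in LANES]


def packed_add16(a, b):
    """MMX-like add: four independent 16-bit wrap-around sums."""
    return sum(((((a >> s) & 0xFFFF) + ((b >> s) & 0xFFFF)) & 0xFFFF) << s for s in LANES)


def matrix_add(rows_a, rows_b, vl):
    """Vector dimension on top of the SIMD one: one packed add per row, vl rows."""
    return [packed_add16(a, b) for a, b in zip(rows_a[:vl], rows_b[:vl])]
```

With a vector length of 4, a single `matrix_add` covers the sixteen element additions shown in the diagram.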
Relative Performance
[Figure: relative performance of MMX, MDMX and MOM at issue widths way 1, way 2, way 4 and way 8, on three kernels: Inverse DCT transform, MPEG-2 motion estimation and RGB-YCC color conversion.]
Applications and Architectures
Numerical Applications
[Diagram: three processor configurations.]
• Integer unit + subroutines: Very Slow
• Integer unit + FPU: Very Big Improvement !!!
• Integer unit + FPU + VFPU: Additional Speed
Future Applications
• Integer SPEC-like
• Commercial (OLTP, DSS)
• Numerical
• Multimedia
[Diagram: application areas labeled Integer, Commercial, Numerical and Multimedia.]
Acknowledgments
• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguade
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace
The End