The Future of Vector Processors

M. Valero, R. Espasa and J. Corbal
UPC, Barcelona
Kyoto, May 28th, 1999

TOP-500 and Vector Processors
[Figure: number of vector systems (# Systems) and their share of peak performance (% Peak Perf.) across successive TOP-500 lists. Data labels on the plot: 310, 43, 15, 96, 65.]
November 98: Fujitsu 27, NEC 18, SGI 15, Hitachi 5

The Future of Vector ISA's
• Cross-Pollination of Vector/Superscalar/VLIW
  – MMX, Embedded...
• Very-high Performance Architectures
  – ILP techniques, IRAM, SDRAM
• Vector Microprocessors
  – Numerical Accelerators
  – Multimedia Applications

Talk Outline
• The Past:
  – Initial Motivation for Vector ISA
  – Evolution of Vector Processors
• The Present:
  – Recent Announcements
  – The Decline of Vector Processors
  – Cross-Pollination of Vector/Superscalars/VLIW
• The Future:
  – Very-high Performance Architectures
  – Vector Microprocessors (Numerical Accelerators, Multimedia Applications)
• Conclusions

Characteristics of Numerical Applications
• Examples: Weather prediction, mechanical engineering
• Data structures: Huge matrices (dense, sparse)
• Data types: 64 bits, floating point
• Highly repetitive loops
• Compute-intensive
• Data-Level Parallel

Initial Motivations for Vector Processors
    subroutine loop
    real*8 x(9992), y(9992), u(9984)
    integer I
    real*8 q
    do I=1,9984
      q = u(I) * y(I)
      y(I) = x(I) + q
      x(I) = q - u(I) * x(I)
    enddo
    end
[Figure: dependence graph of the loop body for I = 1 to 9984. Inputs y(I), u(I), x(I); operations *, +, *, -; intermediate value q.]

Execution of scalar code
    Loop: ld    R1,0(R10)
          ld    R2,0(R11)
          ld    R3,0(R12)
          mulf  R4,R1,R2
          mulf  R5,R2,R3
          add   R11,R11,#8
          addf  R6,R4,R3
          subf  R7,R4,R5
          st    0(R10),R6
          add   R10,R10,#8
          st    0(R12),R7
          sub   R13,R13,#1
          bne   Loop
          add   R12,R12,#8
[Figure: pipeline diagram of one iteration through the stages IF, D/L, ALU, M, W.]
14 cycles / iteration, assuming perfect memory.

Generation of Vector Code
          ld.w   #9984,s2
          ld.w   #0,a2
          ld.w   #8,vs
    Loop: mov    s2,vl        ; vl <- min(s2,128)
          ld.l   -y(a2),v0    ; v0 <- y(I:I+127)
          ld.l   -u(a2),v1    ; v1 <- u(I:I+127)
          mul.d  v1,v0,v2     ; q(I:I+127) <- u(I:I+127) * y(I:I+127)
          ld.l   -x(a2),v3    ; v3 <- x(I:I+127)
          add.d  v3,v2,v0     ; v0 <- x(I:I+127) + q(I:I+127)
          st.l   v0,-y(a2)    ; y(I:I+127) <- x(I:I+127) + q(I:I+127)
          mul.d  v1,v3,v1     ; v1 <- u(I:I+127) * x(I:I+127)
          sub.d  v2,v1,v0     ; v0 <- q( ) - u( ) * x( )
          st.l   v0,-x(a2)    ; x(I:I+127) <- q( ) - u( ) * x( )
          add.w  #1024,a2     ; increment index (128 * 8)
          add.w  #-128,s2     ; 128 iterations less to process
          lt.w   #0,s2
          jbrs.t loop
A vector iteration is equivalent to 128 scalar iterations.
[Figure: elements 0, 1, 2, ..., 127 handled by a single vector instruction. DLP !!!]

Execution of vector code
(Loop body as listed under "Generation of Vector Code".)
• One L/S port, one adder, one multiplier
• Memory latency = 24 cycles
• 14 vector instructions = 1792 scalar instructions
• A vector iteration is equivalent to 128 scalar iterations
• 5.1 cycles / iteration

Vector Processor
[Figure: block diagram. Main memory holds instructions (scalar + vector) and data. A control unit processes instructions and conditional branches. Scalar registers execute Ri := Rj op Rk on scalar data; vector registers execute VR[i] := VR[j] op VR[k] on vector data.]

Why Vector ISA ?
• Natural way to express Data-Level Parallelism
  – Fewer instructions
• Easy way to convey this information to the hardware
• Good hardware implementation
  – Affordable/incremental parallelism
  – Simple control/faster clock
• Mechanism to deal with memory latency
• Problem: Memory Bandwidth...
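The strip-mined loop shown under "Generation of Vector Code" can be sketched in Python to make the vl <- min(s2,128) step concrete. This is a minimal illustration under stated assumptions, not the authors' code: the names `scalar_loop`, `strip_mined_loop` and `MAX_VL` are introduced here, and Python list slices stand in for vector registers.

```python
MAX_VL = 128  # vector register length used on the slides


def scalar_loop(x, y, u):
    """Reference: the Fortran loop, one element per iteration."""
    for i in range(len(u)):
        q = u[i] * y[i]
        y[i] = x[i] + q
        x[i] = q - u[i] * x[i]


def strip_mined_loop(x, y, u):
    """Same computation in strips of at most MAX_VL elements.

    Returns the number of vector iterations executed.
    """
    remaining = len(u)   # plays the role of s2
    start = 0            # plays the role of the index held in a2
    vector_iterations = 0
    while remaining > 0:
        vl = min(remaining, MAX_VL)                   # mov s2,vl
        strip = slice(start, start + vl)
        v0 = y[strip]                                 # v0 <- y
        v1 = u[strip]                                 # v1 <- u
        v2 = [a * b for a, b in zip(v1, v0)]          # q <- u * y
        v3 = x[strip]                                 # v3 <- x
        y[strip] = [a + b for a, b in zip(v3, v2)]    # y <- x + q
        v1 = [a * b for a, b in zip(v1, v3)]          # v1 <- u * x
        x[strip] = [a - b for a, b in zip(v2, v1)]    # x <- q - u * x
        start += vl
        remaining -= vl
        vector_iterations += 1
    return vector_iterations
```

For the 9984-element loop of the slides, this sketch performs 78 vector iterations in place of 9984 scalar ones, and a final short strip is handled by the same `min` step when the length is not a multiple of 128.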
Vector versus Scalar Architectures
[Figure: number of instructions (in millions), R10k versus Convex C3, for the benchmarks swim, hydro2d, nasa7, su2cor, tomcatv, wave5 and mdljdp.]
Vector instruction semantics "encode" many different scalar instructions:
  - Loop counters
  - Branch computations
  - Address generation
Rate from 140 to 2
F. Quintana, R. Espasa and M. Valero, "A case for merging the ILP..", PDP-98

Easy to convey information to the hardware
• Data path:
  – No pressure at fetch, decode and issue
  – Decentralized control
  – Faster cycle times
• Vector memory instructions:
  – Spatial locality can be made clearly visible to the hardware through "strides"
  – No overhead and good prefetching
  – Reduction of memory latency overhead
  – Memory uses facts, not guesses

Key parameters for vector processors
• Cycle time
• Scalar processor:
  – # of registers and FU's
  – Cache
• Vector processor:
  – # of vector registers
  – # of FU's and # of pipes/FU
• Connection to memory:
  – # of busses and width
• Number of processors

Cray Y-MP Architecture
[Figure: processors P0 to P7 connected through 4*4 and 8*8 switches to 256 memory modules (numbered 0 to 255), plus a synchronization unit.]
• tc = 6 ns
• 256 modules, ta = 30 ns
• 333 Mflops / processor

Vector Processors (1 of 2)
[Table columns: Year, Machine, Tc (ns), #FPU's, Flops/cycle, LD/ST path, words/cycle, #regs, Elements/register]
    Year  Machine           Tc (ns)
    1972  TI-ASC            60
    1973  STAR-100          40
    1975  Cray-1            12.5
    1982  Fujitsu VP 2000   7
    1983  Cray-XMP          9.5
    1983  Hitachi S810/20   19/14
    1984  NEC-SX2           6
    1985  Cray-2            4.1
    1987  Hitachi S820/80   4
Remaining cell values as extracted (row and column positions lost): 2 4 2 2 2 2 2 2 6?? 4 2 3; LS L,L,S LS; 4(32) 3 1 4 2+1; 8 64 1024-32 64 4; LS,LS 2 L,L,S 8-256 8 12?? L,L,L,LS 8 or 2 16 L,LS 8 or 4 2 LS 1 12 L,LS 8 or 4 32 256 8+8k 256/64-256 8 64 32 512
Vector Processors (2 of 2)
[Table columns: Year, Machine, Tc (ns), #FPU's, Flops/cycle, LD/ST path, words/cycle, #regs, Elements/register]
    Year  Machine           Tc (ns)
    1987  Convex C2         40
    1988  Cray Y-MP         6.3
    1989  Fujitsu VP 2600   3.2
    1990  NEC SX-3          2.9
    1992  Cray C90          4
    1993  Hitachi S-3800    2
    1994  Convex C4         7.4
    1996  NEC SX-4          8
    1998  NEC SX-5          4
Remaining cell values as extracted (row and column positions lost): 2 2 LS 2 4 4 2 2 L,L,S 16 LS,LS 16 L,L,S 4 L,L,S; 1 2+1; 8 128 8 64 8 2048-64 64-2048 8+4 8+16k 256/64-256 4+2 8 128 - 2(?) 16(?) L,L,L,LS 8 or 2 2 2 LS 1 2 16 LS,LS 16 2 32 LS,LS 32 8 128 8+16k 256/64-256 8+16k 256/64-256

Evolution of Cray Machines
    Machine     Year  Tc (MHz)  Mflops/CPU  # CPU's  Memory BW/CPU  Load latency (ns)
    Cray-1      1976    80         160      1        640 MB/s       150
    Cray-XMP    1982   105         210      2        2.5 GB/s       123
    Cray-2      1982   243         486      4 or 8   1.9 GB/s       200
    Cray-YMP    1989   167         334      8        4 GB/s         100
    Cray-C90    1992   243         970      16       12 GB/s        95
    Cray-J90    1995   100         200      32       1.6 GB/s       340
    Cray-T90    1994   450        1800      32       21 GB/s        70/116
    Cray-SV-1   1998
Tc: x6, ILP: x2, # of proc.: x32, Total: x400
Courtesy of SGI/Cray

Vector Innovations (1 of 2)
• Star-100/Cyber-200 had many of them:
  – Gather/scatter
  – Masked operations for conditionals
• Cray-1 introduced vector registers
• BSP had instructions for recurrences and multioperand
• Instructions to optimize masked vector operations
• Instructions to handle Index and Bit sequence on mask register
• Flexible addressing of subvector registers (C4)

Vector Innovations (2 of 2)
• Multi-pipes (Star/Cyber)
• Vector with Virtual Memory
• Flexible chaining (multi-ported register-file)
• Multilevel register-file (NEC)
• Scalar units sharing vector FU's (Fujitsu)
• Combined vector and scalar instructions (Titan)
• Short vectors (CS-2 and CM-5)
• Scalar processor: LIW (Fujitsu), SS (NEC)
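Two of the innovations named on the Vector Innovations slides, gather/scatter and masked operations for conditionals, can be illustrated with a small Python sketch. It models the semantics only, under assumptions made here: the helper names `gather`, `scatter` and `masked_add` are hypothetical and do not correspond to the instructions of any particular machine.

```python
def gather(memory, indices):
    """Load memory[indices[k]] into element k of a vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store element k of a vector to memory[indices[k]]."""
    for i, v in zip(indices, values):
        memory[i] = v


def masked_add(dest, a, b, mask):
    """Element-wise a + b where the mask bit is set; dest is kept elsewhere.

    This is how a loop body such as `if (a(i) > 0) c(i) = a(i) + b(i)`
    can run as vector code without branches.
    """
    return [x + y if m else d for d, x, y, m in zip(dest, a, b, mask)]
```

A conditional loop would first compute the mask with a vector compare, for example `mask = [v > 0 for v in a]`, and then issue the masked operation over the whole vector.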
Automatic vectorization
• Compiler technology for vectorization: over 25 years of development
  – Dependence analysis
  – Elimination of false dependences
  – Strip mining
  – Loop interchange
  – Partial vectorization
  – Idiom recognition
  – IF conversion
  – Vector parallelization

Vector Architectures : Present
• New announcements (NEC, Cray, Fujitsu)
• The decline of vector processors
• Cross-pollination of Vector/Superscalar/VLIW processors

NEC SX-5
• Announced on June 5th, 1998
• 8 Gflops, CMOS, tc = 4 ns
• Superscalar processor at 500 Mflops
• 32 results/cycle (2 FPU, 16-pipe)
• 32 data memory accesses/cycle (2 ports, 16 data/port). Memory bandwidth of 64 GB/s
• System composed of 32 nodes of 128 Gflops each, providing 4 Tflop/s

Cray SV1
• Announced on June 16th, 1998
• CMOS, 250 MHz and 4 Gigaflops/proc.
• Vector cache memory
• 2 FU's of 8 operations/cycle
• "Multi-Streaming" Processor
• Scalable vector architecture (32 nodes of 32 processors ... 4 Teraflops)
• Future processor enhancements !!!

Fujitsu VP5000
• Announced on April 20th, 1999
• 9.2 Gflop/s, CMOS, 0.22 micron, 33 Mtrs/chip
• Linpack 1000*1000 gives 8758 Mflop/s
• Crossbar provides 2*1.6 GB/s per processor
• System composed of 512 PE's, or 4.9 Teraflops
• Maximum of 16 GB/PE, or 8 TB/512 PE's

The decline of vector processors
• Why have vector machines declined so fast in popularity?
  – Cost (scalar parallel machines use commodity parts)
  – Too restricted in applications (lack of vectorization in many programs)
• Massive use of computers to run so-called "Non-numerical Applications"

Characteristics of non-numerical Applications
• Examples: OLTP, DSS, simulators, games...
• General data structures: Lists, trees, tables...
• Data types: Scalar integers of 8 to 64 bits
• Frequent control flow changes ... Speculation
• Short-distance data dependencies ... Forwarding
• Instruction/data locality ... Caches
• Fine-grain ILP ... Out-of-order

Micro Killers ???
Peak performance = Tc * ILP
    Year  Machine       Tc (MHz)  #op/cycle  Peak Perf. (Mflops)
    1976  Cray-1           80        2          160
    1978  I-8086           10
    1992  Cray C-90       243        4          970
    1992  Alpha 21064     150        1          150
    1994  Pentium         100        1          100
    1996  NEC SX-4        125       16         2000
    1997  IBM P2SC        160        4*         640
    1997  Alpha 21164     500        2         1000
    1998  HP PA8200       240        4*         960
    1998  NEC SX-5        250       32         8000
    1998  Pentium         400        1          400

Bandwidth and Performance
[Table: capacity and bandwidth at each level of the hierarchy (main memory, L2 cache, L1 cache, register file, functional units) for five processors.]
                                  Alpha 21264  IBM Power chip  HP-8200     Cray T90    NEC SX-4
    Clock                         500 MHz      160 MHz         240 MHz     450 MHz     125 MHz
    Bandwidth (functional units)  16 GB/s      5.12 GB/s       15.3 GB/s   43.2 GB/s   48 GB/s
    Peak performance              1 Gflops     640 Mflops      960 Mflops  1.8 Gflops  2 Gflops
Remaining cell values as extracted (row and column positions lost): main memory and L2 cache size: 2 GB/s, 16 MB, 5 Gb/s, 2 GB/s, 768 MB/s, 128 KB, 2 MB, 3.84 GB/s, 704 bytes, 24 GB/s, 16 GB/s; L1 cache size: 64 KB, 8 GB/s, 24 GB/s, 8 KB, 16 GB/s, 128 KB; register file size: 576 bytes; functional units: 2 (2 pipe), 2 (2 pipe), 2 (2 pipe), 2 (8 pipes), 2 FPU

Peak performance and Bandwidth
[Figure: efficiency (%) versus vector length (0 to 4000) for the VPP500 and the IBM RS6000*.]
Kernel:
    Z(I)=C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))
* Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2
Courtesy of Fujitsu

Vector ideas used in SS's/VLIW processors
• Address prediction and prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicated execution (VLIW)
• Multiply and add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading (2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination

Predictions for ALL instructions
[Figure: prediction accuracy (%) for the benchmarks compress, gcc, go, ijpeg, m88k, perl and xlisp with four predictors: Last value, Stride, Context 1, Context 3.]
Y. Sazeides and J.E. Smith, "The predictability of data values", MICRO-30, 1997

Characterization of Vector Programs
[Figure: % vectorization, % vector access and average vector length (Avg. VL) for the programs SPEC77, DYFESM, TRFD, BDNA, FLO52 and ARC2D.]
R. Espasa, "Advanced Vector Architectures", PhD Thesis, Feb. 97

SS's ideas usable in vector processors
• Decoupled Vector Architectures
• Multithreaded Vector Architectures
• Out-of-order Vector Architectures
• Simultaneous Multithreaded Vector Architecture
• Victim Register File
R. Espasa, M. Valero and J.E. Smith, HPCA96, HPCA97, MICRO97, ICS97...

ILP+DLP: Out-of-order Vector
[Figure: pipeline with Fetch, Decode & Rename, S registers, A registers, V registers, LD/ST unit, Reorder Buffer and Memory.]
R. Espasa, M. Valero, J.E. Smith, "Out-of-order Vector Architecture", MICRO-30, 1997

OOO Vector Performance
[Figure: performance results.]
R. Espasa, M. Valero, J.E. Smith, "Out-of-order Vector Architecture", MICRO-30, 1997

Vector Processors : The Future
• Very high-performance architectures
• Vector Microprocessors
  – Numerical Accelerators
  – Multimedia Applications

Architectures for a Billion Transistors
• Advanced/Superspeculative Architectures
• Trace Processors
• Simultaneous Multithreading
• Multiprocessor on a chip
• RAW processors
• IRAM
Billion-Transistor Architectures, IEEE Computer, Sept. 1997

SMV
• Simultaneous Multithreaded Vector Arch.
• Mixes three paradigms
  – DLP: vector unit
  – ILP: O-o-O execution
  – TLP: multithreaded fetch unit
• Requires a memory system with
  – high performance at low cost
  – low pin-count
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997

Billion Trans. Vector Architecture
[Figure: block diagram. Inst fetch with I cache and 8 program counters (one/thread); inst decode with 8 rename tables (one/thread); instruction slots in a floating-point queue (64), integer queue (64), memory queue (64) and vector queue (64); register files FPRF (128 reg), IRF (128 reg) and a vector register file (128 reg); execution pipeline with FPU 1-2, ALU 1-2, 2 address generators, VFU 1-4 and data memory; reorder buffer.]
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997

SMV Performance
[Figure: performance results.]
R. Espasa and M. Valero, "Exploiting Instruction and Data-Level Parallelism", IEEE MICRO, Sep. 1997

V-IRAM1
0.18 µm, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Figure: 2-way superscalar processor with 8K I cache and 8K D cache; vector instruction queue; vector registers; units +, x, ÷ and Load/Store, each 4 x 64 or 8 x 32 or 16 x 16; memory crossbar switch connecting 4 x 64 paths to the DRAM banks (M); I/O and serial I/O.]
D.A. Patterson, "New directions in Computer Architecture", Berkeley, June 1998

Conflict-free access to vectors
Idea: Out-of-order access
[Figure: processors P1, P2, P3, ..., Pn connected through an interconnection network to memory modules grouped in sections.]
M. Valero et al., ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...

Command Memory System
• Command = <@, Length, Stride, size>
• Break commands into bursts at the section controller
[Figure: processors P1, P2, P3, ..., Pn send commands through an interconnection network to section controllers in front of the memory modules.]
J. Corbal, R. Espasa and M. Valero, "Command-Vector Memory System", PACT98

System configuration in 2009
[Figure: 32 SMP (cc-NUMA) nodes, 200 TFLOPS / 160 TB in total. Each node: 32 chips of 200 GF (6.4 TFLOPS) and 5 TB of memory joined by an X-bar. Labels: X-bar 100 GB/sec, X-bar 800 GB/sec.]
Sustained: scalar 250 GFLOPS? Vector 1 TFLOPS?
T. Watanabe, SC98, Orlando

Vector Microprocessors
• Ways of reducing the design impact
  – Short vectors (64 x 16 words = 8 Kbytes)
  – Vector functional units shared with INT/FP units
  – Vector register renaming to allow precise exceptions
• Cache hierarchy tuned to vector execution
  – Vector data locality allows large data transactions
  – Very large bandwidth between cache and vector registers
• High performance for numerical and multimedia applications

General Architecture
[Figure: Fetch, Decode and I-Cache feeding FP and INT units and a vector register file (VRF); a vector cache; a Rambus controller connected to four RDRAM channels. Path widths shown: 1024 and 8.]

Vector PC Vs SuperScalar
[Figure: bar chart (scale 0 to 25) for Hydro2D, Dyfesm, Swm256 and Tomcatv. Legend as extracted: OoO-SS, 1x2 VEC 16, 1x2 VEC 16, 16x32.]

Cache Hierarchy
• Where should the vector cache be placed?
[Figure: two configurations built from Direct Rambus, L2, vector cache (VC), L1 and CPU.]

Performance of the cache hierarchies
[Figure: three panels (BDNA, FLO52, HYDRO2D) with axis labels 2, 8, 16, 32 and FLOPS/CYCLE. Series: vector cache on L1, vector cache on L2, perfect cache.]

Importance of media Applications
"On the next five years, (1998-2002), we believe that media processing will become the dominant force in computer architecture" (K. Diefendorff and P. K. Dubey in IEEE Computer, Sep. 97, pp. 43-45)
"90% of Desktop Cycles will Be Spent on Media Applications by 2000" (Scott Kirkpatrick of IBM)

Characteristics of media Applications
• Examples: Image/speech processing, communications, virtual reality, graphics...
• Data structures: matrices and vectors
• Data types: Integer (8-32 bits), FP (32-64)
• Demand for high memory bandwidth
• Low data locality and latency problem
• No critical data dependences
• Real-time necessity
• Fine/coarse grain parallelism
Multimedia Applications and Architectures
[Figure: scientific and multimedia applications mapped onto three architecture styles.]
• Superscalar + MMX: re-discover the parallelism at run-time using a lot of hardware
• VLIW: simple hardware, but loss of parallelism; as many instructions as the SS approach
• Vector Architectures: natural way to express and execute DLP applications

MMX-like processors
• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment
• The highlights of multimedia extensions are:
  – Single Instruction, Multiple Data (SIMD) techniques
  – New data types (Multimedia Vectors, 32/64 bits)
  – Multimedia registers
  – SIMD-like instructions, over small integer data types

MMX instruction example
• PADDW: Parallel ADD of 4x16-bit data type with Wrap Around (No Saturation)
[Figure: two 64-bit registers (bit positions 0, 15, 31, 47, 63) added lane by lane: A1+B1, A2+B2, A3+B3, and xFFFF + x0006 = x0005.]

Superscalar Multimedia Processors
                     PowerPC   Intel    Sun      MIPS V/   HP       Alpha
                     Altivec   MMX      VIS      MDMX      MAX2     MVI
    Register File    32*128    8*64     32*64    32*64     32*64    32*64
    Mapped Onto      Separate  FP       FP       FP        Integer  Integer
    Integer Support  8/16/32   8/16/32  8/16/32  8/16 bit  16/32    8 bit
    FP Support       Yes       MMX2     No       MIPS V    No       No
    Usual stuff+     Lots      Lots     Lots     Lots      Some     None
    Multiply/MAC     Lots      Mult     Mult     Lots      Some     None
    Min/Max/Avg      Yes       No       No       Min/Max   Avg      Min/Max
    Pack/Unpack      Yes       Yes      Yes      Yes       Yes      Yes
    Byte Reordering  All       Some     Some     Many      All      None
    Unaligned Data   3 Inst.   No       2 Inst.  Yes       No       No
    Announced        2Q98      2Q96     4Q94     4Q96      4Q95     4Q96
Microprocessor Report, Vol 12, N 6, May 11, 1998
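The wrap-around behaviour in the PADDW example (xFFFF + x0006 = x0005 within one 16-bit lane) can be reproduced with a short Python model. This is a sketch of the semantics only: `paddw` is a helper name chosen here, and Python integers stand in for the 64-bit multimedia registers.

```python
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def paddw(a, b):
    """Add two 64-bit values as four independent 16-bit lanes.

    Each lane wraps around on overflow (no saturation) and no carry
    propagates into the neighbouring lane.
    """
    result = 0
    for lane in range(4):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift
    return result
```

A plain 64-bit addition of the same operands would carry out of the low lane into the next one, which is exactly what the lane-wise masking prevents.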
Multimedia Embedded Systems
• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
• ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS

Widen is better...(?)
• Most multimedia algorithms exhibit vectors no longer than 8/16 elements => widening the multimedia registers could provide diminishing returns.
[Figure: the same add performed on a 16-bit register (A1 + B1 = C1), a 64-bit register (A1..A4 + B1..B4 = C1..C4) and a 128-bit register (A1..A8 + B1..B8 = C1..C8).]

VLIW : Widening vs Replication
[Figure: bus configurations between memory and register file: one bus of 1 word, two buses of 1 word, one bus of 2 words, two buses of 2 words.]
D. López et al., "Increasing Memory Bandwidth with Wide Busses", ICS-97

Widening and Replication Performance
[Figure: performance (scale 1 to 8) versus configurations 2, 4, 8, 16 for the series Wide 1, Wide 2 and Wide 4.]
D. López et al., "Widening versus replicating...", ICS98, MICRO98

Torrent T0 Microprocessor
• The first single-chip vector microprocessor.
• Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle
• Features:
  – 16 vector registers (32 32-bit elements each)
  – 2 vector arithmetic units (8 pipes each)
  – Reconfigurable composite operation pipelines
  – 128-bit wide external memory interface
  – MIPS-II, 32-bit instruction set, scalar unit
K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995

Torrent T0 Microprocessor
[Figure: Torrent T0 chip.]
K. Asanovic et al., "The T0 vector microprocessor", Hot Chips VII, 1995

Vector versus Superscalar Processors
• Comparison of Die Area
  – Processor die area (in mm2, scaled to 0.25 µm)
[Figure: stacked bars (Control, Registers, Datapath; scale 0 to 250) for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO with Rob128. Data labels on the plot: 14.73, 21.86, 37.77, 66.92, 67.77, 69.81, 250.0.]
C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997

Vector versus Superscalar Processors
• Component Percentages
[Figure: share (%) of Datapath, Registers and Control for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO with Rob128.]
C. G. Lee and D. J. DeVries, "Initial Results on ...", MICRO-30, 1997

Imagine project
• Focused on developing a programmable architecture that achieves performance similar to special-purpose hardware on graphics and image processing.
• Matches the demands of media applications to current VLSI capabilities by using a stream-based programming model.
• Most multimedia kernels exhibit a streaming nature.
• Individual stream elements can be operated on in parallel, thus exploiting data parallelism.
Bill Dally, "Tomorrow Computing Engines", Keynote, HPCA98
Imagine architecture
• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster unit operates on a single element under VLIW control
[Figure: streaming memory system with four SDRAM banks and their controllers (C), a stream register file, clusters 0 to 7, and a controller.]
Bill Dally, "Tomorrow Computing Engines", Keynote, HPCA98

Matrix extensions for Multimedia
• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.
[Figure: a 4 x 4 matrix of 16-bit elements A1..A16 added to B1..B16, giving C1..C16, compared with a single-element add (A1 + B1 = C1) and a 4-element MMX-like add (A1..A4 + B1..B4 = C1..C4).]

Relative Performance
[Figure: three panels (Inverse DCT transform, MPEG-2 motion estimation, RGB-YCC color conversion) showing relative performance for way 1, way 2, way 4 and way 8 with MMX, MDMX and MOM.]

Applications and Architectures
Numerical applications:
• Integer + subroutines: very slow
• Integer + FPU: very big improvement !!!
• Integer + FPU + VFPU: additional speed

Future Applications
• Integer SPEC-like
• Commercial (OLTP, DSS)
• Numerical
• Multimedia
[Figure: application classes labelled Integer, Commercial, Numerical and Multimedia.]

Acknowledgments
• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguade
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace

The End
