					                 VLSI Architecture
             Past, Present, and Future

                    William J. Dally
              Computer Systems Laboratory
                  Stanford University

                    March 23, 1999

3/23/99: 1
             Past, Present, and Future

• The last 20 years have seen a 1000-fold
  increase in grids per chip and a 20-fold
  reduction in gate delay
• We expect this trend to continue for the next
  20 years
• For the past 20 years, these devices have
  been applied to implicit parallelism
• We will see a shift toward explicit parallelism
  over the next 20 years

3/23/99: 2
                                         Technology Evolution

[Figure: two log-scale plots of technology scaling versus year, 1960-2010: gate length (µm) and wire pitch (µm) on the left, gate delay (ns) on the right]




     3/23/99: 3
             Technology Evolution (2)

 Parameter     1979        1999        2019        Units
 Gate Length   5           0.2         0.008       µm
 Gate Delay    3000        150         7.5         ps
 Clock Cycle   200         2.5         0.08        ns
 Gates/Clock   67          17          10
 Wire Pitch    15          1           0.07        µm
 Chip Edge     6           15          38          mm
 Grids/Chip    1.6×10^5    2.3×10^8    3.0×10^11
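
The rows of this table are mutually consistent: Clock Cycle is Gate Delay times Gates/Clock, and Grids/Chip is (Chip Edge / Wire Pitch) squared, assuming a 'grid' is one wire-pitch square of die area. A short Python check of those two identities against the table's own values:

```python
# Recompute the derived rows of the technology-evolution table from the
# primary ones. Values are transcribed from the table above; the identities
# used (clock = gate delay * gates/clock, grids = (edge/pitch)^2) are assumed.
years           = [1979, 1999, 2019]
gate_delay_ps   = [3000, 150, 7.5]
gates_per_clock = [67, 17, 10]
wire_pitch_um   = [15, 1, 0.07]
chip_edge_mm    = [6, 15, 38]

for y, gd, gpc, pitch, edge in zip(years, gate_delay_ps, gates_per_clock,
                                   wire_pitch_um, chip_edge_mm):
    clock_ns = gd * gpc / 1000.0           # ps -> ns
    grids = (edge * 1000.0 / pitch) ** 2   # (chip edge / wire pitch)^2
    print(f"{y}: clock ~ {clock_ns:.2f} ns, grids/chip ~ {grids:.1e}")
# 1979: clock ~ 201 ns,  grids/chip ~ 1.6e5
# 1999: clock ~ 2.55 ns, grids/chip ~ 2.3e8
# 2019: clock ~ 0.08 ns, grids/chip ~ 2.9e11
```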

3/23/99: 4
             Architecture Evolution

    Year   Microprocessor                        High-end Processor
    1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
    1999   Compaq 21264: 500 MIPS, 500 MFLOPS (× 4?)
    2019   X: 10000 MIPS, 10000 MFLOPS           MP with 1000 Xs
3/23/99: 5
                           Incremental Returns
[Figure: performance versus processor cost (die area), showing diminishing returns as the design moves from a pipelined RISC to dual-issue in-order to quad-issue out-of-order]


3/23/99: 6
                                Efficiency and Granularity


[Figure: peak performance versus system cost (die area) for nodes built as P+M, 2P+M, and 2P+2M, showing how granularity affects efficiency]

3/23/99: 7
             VLSI in 1979




3/23/99: 8
              VLSI Architecture in 1979

•     5 µm NMOS technology
•     6 mm die size
•     100,000 grids per chip, 10,000 transistors
•     8086 microprocessor
        – 0.5 MIPS




3/23/99: 9
       1979-1989: Attack of the Killer Micros

• 50% per year improvement in performance
• Transistors applied to implicit parallelism
        – pipeline the processor (10 CPI --> 1 CPI)
        – shorten the clock cycle
          (67 gates/clock --> 30 gates/clock)
• In 1989 a 32-bit processor with floating point
  and caches fit on one chip
        – e.g., i860: 40 MIPS, 40 MFLOPS
        – 5,000,000 grids, 1M transistors (mostly memory)


3/23/99: 10
1989-1999: The Era of Diminishing Returns

• 50% per year increase in performance through 1996, but
        – projects delayed, performance below expectations
        – 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
        – 2-way to 6-way issue, out-of-order issue, branch prediction
        – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
        – Intel Pentium II (3-way o-o-o)
        – Compaq 21264 (4-way o-o-o)



3/23/99: 11
        1979-1999: Why Implicit Parallelism?

• Opportunity
        – large gap between micros and fastest processors
• Compatibility
        – software pool ready to run on implicitly parallel
          machines
• Technology
        – not available for fine-grain explicitly parallel
          machines



3/23/99: 12
1999-2019: Explicit Parallelism Takes Over

• Opportunity
        – no more processor gap
• Technology
        – interconnection, interaction, and shared memory
          technologies have been proven




3/23/99: 13
Technology for Fine-Grain Parallel Machines

• A collection of workstations does not make a
  good parallel machine. (BLAGG)
     –    Bandwidth - large fraction (0.1) of local memory BW
     –    LAtency - small multiple (3) of local memory latency
     –    Global mechanisms - sync, fetch-and-op
     –    Granularity - of tasks (100 inst) and memory (8MB)




3/23/99: 14
              Technology for Parallel Machines
                    Three Components
• Networks
        – 2 clocks/hop latency
        – 8GB/s global bandwidth
• Interaction mechanisms
        – single-cycle communication and synchronization
• Software




3/23/99: 15
                                          k-ary n-cubes

[Figure: latency versus dimension for a 4K-node network with L = 256 and Bs = 16K, showing a minimum at low dimension]

• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select radix to trade off diameter, D, against B (see the sketch below):

      T ≈ D + L/B                T ≈ L/(Ck) + nk/4
  Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
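
The trade-off in the plot can be reproduced numerically for the 4K-node, L = 256, Bs = 16K case. In the sketch below, the channel width W = Bs·k / 2N (so C = Bs/2N in the expression above) and the nk/4 average-distance term are modeling assumptions in the spirit of the referenced paper, not values quoted from the slide:

```python
# Zero-load latency T ~ D + L/W for 4096-node k-ary n-cubes under a fixed
# wire bisection. All modeling choices are assumptions, noted inline.
N, L, Bs = 4096, 256, 16384           # nodes, message length (bits), bisection wires

for n in range(2, 13):                # dimension
    k = round(N ** (1.0 / n))         # radix with k**n == N
    if k ** n != N:
        continue                      # skip dimensions with no integral radix
    W = Bs * k / (2 * N)              # channel width from bisection (assumed)
    D = n * k / 4                     # average torus distance in hops (assumed)
    T = D + L / W                     # hop latency plus serialization, in cycles
    print(f"n={n:2d}  k={k:3d}  D={D:5.1f}  W={W:5.1f}  T={T:5.1f}")
# Low dimension: many hops but wide channels; high dimension: few hops but
# narrow channels. The minimum here falls at n = 3 (k = 16), T ~ 20.
```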
Delay of Express Channels
                           The Torus Routing Chip

• k-ary n-cube topology
        – 2-D torus network
        – 8-bit × 20 MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture



Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
                                 The Reliable Router

• Fault-tolerant
        – Adaptive routing (adaptation of Duato’s algorithm)
        – Link-level retry
        – Unique token protocol
• 32-bit × 200 MHz channels
        – Simultaneous bidirectional signalling
        – Low-latency plesiochronous synchronizers
• Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, “Architecture and Implementation of the Reliable Router”, Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, “Low-Latency Plesiochronous Data Retiming, “ ARVLSI 1995
Dennison, Lee, and Dally, “High Performance Bidirectional Signalling in VLSI Systems,” SIS 1993
              Equalized 4Gb/s Signaling




3/23/99: 20
              End-to-End Latency
• Software sees ~10 µs latency with a 500 ns network (see the budget sketch below)
• Heavy compute load is associated with sending a message
        – system call
        – buffer allocation
        – synchronization
• Solution: treat the network like memory, not like an I/O device
        – hardware formatting, addressing, and buffer allocation

[Figure: message path from the sender's registers through send and the network to receive-side buffering and dispatch, Tx node to Rx node]
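
A back-of-the-envelope budget of where the ~10 µs goes. The individual software costs below are illustrative assumptions chosen only to show the shape of the problem; only the 500 ns network figure comes from the slide:

```python
# Illustrative end-to-end latency budget for one message (software costs assumed).
costs_ns = {
    "system call (send)":           2000,
    "buffer allocation":            1500,
    "formatting / copying":         1500,
    "network traversal":             500,   # the only cost the network itself adds
    "receive interrupt / dispatch": 2500,
    "synchronization":              2000,
}
total_us = sum(costs_ns.values()) / 1000
print(f"total ~ {total_us:.0f} us; the network alone is 0.5 us")
# Hardware formatting, addressing, and buffer allocation attack every term
# except the 500 ns of actual network traversal.
```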
                     Network Summary

• We can build networks with 2-4 clocks/hop latency (12-24
  clocks for a 512-node 3-cube; see the check below)
     – networks faster than main memory access of modern machines
     – need end-to-end hardware support to see this, no ‘libraries’
• With high-speed signaling, bandwidth of 4 GB/s or more
  per channel (512 GB/s bisection) is easy to achieve
     – nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection
  constraints to the packaging technology
     – it’s hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)
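
A quick check of the 512-node figure, reusing the same torus average-distance assumption (nk/4 hops) as the earlier sketch:

```python
# 512 nodes arranged as an 8-ary 3-cube: 8**3 == 512.
n, k = 3, 8
avg_hops = n * k / 4                   # 6 hops on average (assumed torus model)
for clocks_per_hop in (2, 4):
    print(f"{clocks_per_hop} clocks/hop -> {clocks_per_hop * avg_hops:.0f} clocks")
# 2 clocks/hop -> 12 clocks, 4 clocks/hop -> 24 clocks
```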

 3/23/99: 22
              The Importance of Mechanisms

[Diagram: tasks A and B executed one after the other (serial execution)]




3/23/99: 23
              The Importance of Mechanisms

[Diagram: serial execution of A and B, compared with two-processor parallel execution in which per-task overhead (OVH), communication (COM), and synchronization take up half the time (overhead 0.5)]




3/23/99: 24
              The Importance of Mechanisms

[Diagram: serial execution of A and B; parallel execution with high overhead (0.5), which barely helps; and parallel execution with low overhead (0.062), which approaches a 2x speedup]
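
The gap between the two parallel cases can be quantified with a minimal model: split the work into two equal halves and charge each half a lumped overhead (OVH + COM + Sync) expressed as a fraction of the serial run time. The model itself is an assumption of this sketch:

```python
# Speedup of two-way parallel execution under a lumped per-task overhead,
# where 'overhead' is a fraction of the total serial run time.
def speedup(overhead_fraction):
    parallel_time = 0.5 + overhead_fraction   # half the work plus its overhead
    return 1.0 / parallel_time

for ovh in (0.5, 0.062):
    print(f"overhead {ovh:5.3f}: speedup {speedup(ovh):.2f}x on 2 processors")
# overhead 0.500: speedup 1.00x -- parallelism buys nothing
# overhead 0.062: speedup 1.78x -- close to the ideal 2x
```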
3/23/99: 25
                  Granularity and Cost Effectiveness
•   Parallel computers are built for
     – Capability - run problems that are too big or take too long to solve any other way
             • absolute performance at any cost
     – Capacity - get throughput on lots of small problems
             • performance/cost
•   A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
     – sublinear speedup
     – economies of scale
•   A parallel computer with less memory per node can have better perf/cost than a workstation (see the sketch below)

[Diagram: a single workstation node (P, $, M) compared with a machine of many smaller nodes, each a P and $ with a smaller M]
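
A toy cost model of the comparison in the diagram. Every number below (the areas and the 0.7 parallel efficiency) is a made-up assumption; only the shape of the argument, performance per unit of total die area, comes from the slide:

```python
# Perf/cost of one workstation-class node versus machines built from many
# nodes of the same or smaller memory. Areas are in arbitrary units (assumed).
P_AREA, CACHE_AREA = 1.0, 0.5     # processor and cache ($)
BIG_MEM, SMALL_MEM = 8.0, 1.0     # workstation-sized vs. reduced per-node memory
EFFICIENCY = 0.7                  # assumed parallel efficiency (sublinear speedup)

def perf_per_cost(nodes, mem_area, efficiency=1.0):
    cost = nodes * (P_AREA + CACHE_AREA + mem_area)   # total die area
    return nodes * efficiency / cost

print(f"1 workstation node:        {perf_per_cost(1, BIG_MEM):.3f}")
print(f"16 workstation-size nodes: {perf_per_cost(16, BIG_MEM, EFFICIENCY):.3f}  (worse)")
print(f"16 small-memory nodes:     {perf_per_cost(16, SMALL_MEM, EFFICIENCY):.3f}  (better)")
```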

    3/23/99: 26
              MIT J-Machine (1991)




3/23/99: 27
                    Exploiting fine-grain threads

• Where will the parallelism come
  from to keep all of these processors
  busy?
    – ILP - limited to about 5
    – Outer-loop parallelism
            • e.g., domain decomposition
            • requires big problems to get lots of
              parallelism
• Fine threads
    – make communication and
      synchronization very fast (1 cycle)
    – break the problem into smaller
      pieces
    – more parallelism

   3/23/99: 28
       Mechanism and Granularity Summary

• Fast communication and synchronization mechanisms
  enable fine-grain task decomposition
        – simplifies programming
        – exposes parallelism
        – facilitates load balance
• Have demonstrated
        – 1-cycle communication and synchronization locally
        – 10-cycle communication, synchronization, and task dispatch
          across a network
• Physically fine-grain machines have better
  performance/cost than sequential machines

3/23/99: 29
               A 2009 Multicomputer




[Figure: one tile, containing a processor and 8 MB of memory]

System: 16 Chips     Chip: 64 Tiles     Tile: P + 8 MB
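
The totals this hierarchy implies, computed from the caption above:

```python
# 16 chips x 64 tiles/chip, 8 MB of memory per tile (numbers from the caption).
chips, tiles_per_chip, mem_per_tile_mb = 16, 64, 8
processors = chips * tiles_per_chip
memory_gb = processors * mem_per_tile_mb / 1024
print(f"{processors} processors, {memory_gb:.0f} GB of memory")  # 1024 processors, 8 GB
```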




 3/23/99: 30
    Challenges for the Explicitly Parallel Era

• Compatibility
• Managing locality
• Parallel software




3/23/99: 31
                        Compatibility

• Almost no fine-grain parallel software exists
• Writing parallel software is easy
        – with good mechanisms
• Parallelizing sequential software is hard
        – needs to be designed from the ground up
• An incremental migration path
        – run sequential codes with acceptable performance
        – parallelize selected applications for considerable
          speedup


3/23/99: 32
              Performance Depends on Locality

• Applications have data/time-
  dependent graph structure
        – Sparse-matrix solution
              • non-zero and fill-in structure
        – Logic simulation
              • circuit topology and activity
        – PIC codes
              • structure changes as particles move
        – ‘Sort-middle’ polygon rendering
              • structure changes as viewpoint
                moves




3/23/99: 33
                Fine-Grain Data Migration
                    Drift and Diffusion
• Run-time relocation based on pointer use
    – move data at both ends of pointer
    – move control and data
• Each ‘relocation cycle’ (see the sketch below)
    – compute drift vector based on pointer use
    – compute diffusion vector based on
      density potential (Taylor)
    – need to avoid oscillations
• Should data be replicated?
    – not just update vs. invalidate
    – need to duplicate computation to avoid
      communication
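
A heavily simplified sketch of one such relocation cycle on a 2-D array of tiles. The data layout (objects carrying their tile and the tiles their pointers touched last cycle), the drift/diffusion weights, and the one-tile-per-cycle damping are all hypothetical choices made for illustration, not a published algorithm:

```python
# One 'relocation cycle': each object drifts toward the tiles its pointers
# touched and diffuses away from crowded tiles. Illustrative model only.
import numpy as np

GRID = 8                         # machine modeled as an 8 x 8 array of tiles
ALPHA, BETA = 0.6, 0.4           # drift vs. diffusion weights (assumed)

def relocation_cycle(objects, density):
    """objects: dicts with 'tile' (x, y) and 'targets', the tiles referenced
    through pointers since the last cycle. density: GRID x GRID occupancy."""
    gy, gx = np.gradient(density.astype(float))   # density-potential gradient
    for obj in objects:
        here = np.array(obj["tile"], dtype=float)
        # Drift vector: toward the centroid of recent pointer targets.
        drift = (np.mean(np.array(obj["targets"], dtype=float), axis=0) - here
                 if obj["targets"] else np.zeros(2))
        # Diffusion vector: down the local density gradient.
        x, y = obj["tile"]
        diffusion = -np.array([gx[y, x], gy[y, x]])
        # Damped, clipped step (at most one tile per cycle) to avoid oscillation.
        step = np.clip(np.round(ALPHA * drift + BETA * diffusion), -1, 1)
        obj["tile"] = tuple(int(v) for v in np.clip(here + step, 0, GRID - 1))
```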



  3/23/99: 34
                                     Migration and Locality
[Figure: average distance (in tiles) versus migration period, comparing NoMigration, OneStep, Hierarchy, and Mixed migration policies]

3/23/99: 35
             Parallel Software:
         Focus on the Real Problems
• Almost all demanding problems have ample
  parallelism
• Need to focus on fundamental problems
  – extracting parallelism
  – load balance
  – locality
     • load balance and locality can be covered by excess parallelism
• Avoid incidental issues
     – aggregating tasks to avoid overhead
     – manually managing data movement and replication
     – oversynchronization
3/23/99: 36
                          Parallel Software:
                           Design Strategy
• A program must be designed for parallelism
  from the ground up
        – no bottlenecks in the data structures
              • e.g., arrays instead of linked lists
• Data parallelism
        – many for loops (over data, not time) can be forall
        – break dependencies out of the loop
        – synchronize on natural units (no barriers)
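
A small illustration of the transformation described above: the per-element work becomes an independent forall (here a process-pool map), and the loop-carried accumulation is hoisted into a single reduction, so the only synchronization is on the natural unit of one time step. The particle example and names are made up for illustration:

```python
# Before: one loop both updates particles and accumulates momentum, so every
# iteration depends on the last. After: independent 'forall' + one reduction.
from concurrent.futures import ProcessPoolExecutor

def update_particle(p):
    """Independent per-element work: advance (position, velocity) by dt = 1."""
    pos, vel = p
    return pos + vel, vel

def step(particles):
    with ProcessPoolExecutor() as pool:
        new_particles = list(pool.map(update_particle, particles))  # forall over data
    total_momentum = sum(v for _, v in new_particles)               # reduction, no barrier
    return new_particles, total_momentum

if __name__ == "__main__":
    parts = [(float(i), 0.1 * i) for i in range(1000)]
    parts, momentum = step(parts)
    print(f"momentum after one step: {momentum:.1f}")
```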




3/23/99: 37
Conclusion: We are on the threshold of the
           explicitly parallel era
• As in 1979, we expect a 1000-fold increase in ‘grids’
  per chip in the next 20 years
• Unlike in 1979, these ‘grids’ are best applied to explicitly
  parallel machines
        – Diminishing returns from sequential processors (ILP) - no
          alternative to explicit parallelism
        – Enabling technologies have been proven
              • interconnection networks, mechanisms, cache coherence
        – Fine-grain machines are more efficient than sequential
          machines
• Fine-grain machines will be constructed from multi-
  processor/DRAM chips
• Incremental migration to parallel software
3/23/99: 38