          Presentation to the 2004 Workshop on Extreme Supercomputing Panel:

                              Roadmap and Change:
                             How Much and How Fast

                               Thomas Sterling
                      California Institute of Technology
                                      and
                       NASA Jet Propulsion Laboratory
                               October 12, 2004



29 Years Ago Today

[Image-only slide; no text content survives.]
Linpack Zettaflops in 2032

[Chart: extrapolation of TOP500 Linpack performance, 1993 to ~2043.
Series: SUM, N=1, N=500. Y-axis: 100 Mflops to 10 Zflops, log scale.
The trend lines reach 1 Zettaflops around 2032. Courtesy of Thomas
Sterling.]
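The 2032 crossover is consistent with simple compound growth. A minimal
sanity check, assuming the mid-2004 TOP500 #1 system (the Earth Simulator,
roughly 35.9 Tflops Linpack Rmax) and the list's historical growth of
roughly 1.9x per year; both parameters are assumptions supplied here, not
read off the chart:

    import math

    RMAX_2004 = 35.9e12   # assumed #1 Linpack Rmax in mid-2004, flops
    GROWTH = 1.9          # assumed historical growth factor per year

    years = math.log(1e21 / RMAX_2004) / math.log(GROWTH)
    print(f"~{years:.0f} years -> Linpack Zettaflops around {2004 + years:.0f}")
    # ~27 years -> Linpack Zettaflops around 2031, within a year of the chart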
The Way We Were: 1974

- IBM 370: the market mainstream
  - approx. 1 Mflops
- DEC PDP-11: the geek's delight
- Seymour Cray started working on the Cray-1
  - approx. 100 Mflops
- 2nd-generation microprocessors
  - e.g. the Intel 8008
- Core memory
- 1103 1K x 1 DRAM chips
- Punch cards, paper tape, teletypes, Selectrics
What Will Be Different

- Moore's Law will have flatlined
- Nano-scale, atomic-level devices
  - assuming we solve the lithography problem
- Local clock rates ~100 GHz
  - the fastest today is > 700 GHz
- Local actions strongly preferential to global actions
- Non-conventional technologies may be employed
  - optical
  - quantum dots
  - Rapid Single Flux Quantum (RSFQ) gates

[Diagram: RSFQ gate schematic with inductor L1 and Josephson junctions
JJ1 and JJ2.]
What We Will Need

- 1 nanowatt per Megaflops
  - the energy received from Tau Ceti (per m²)
- Approximately 1 square meter of ALUs for 1 Zettaflops
  - 10 billion execution sites
- > 10 billion-way parallelism
- Including memory and communications: 2000 m²
- 3-D packaging: (4 m)³
- Global latency of ~10,000 cycles
- Including average latency: => 1 trillion-way parallelism

(A worked check of these numbers appears below.)
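Each figure above follows from unit arithmetic. A minimal worked check;
the ~100 GHz clock comes from the previous slide, and the ~100-cycle
average latency in the last step is an assumption chosen to show how a
trillion-way concurrency requirement can arise via Little's law:

    # Worked check of the Zettaflops budget above.
    ZFLOPS = 1e21            # target sustained rate, flops/s
    CLOCK_HZ = 100e9         # ~100 GHz local clock (previous slide)

    # Power: 1e21 flops/s = 1e15 Mflops/s at 1 nW/Mflops -> 1 MW.
    power_watts = (ZFLOPS / 1e6) * 1e-9
    print(f"power: {power_watts / 1e6:.1f} MW")               # 1.0 MW

    # Execution sites, assuming one flop per site per cycle.
    print(f"execution sites: {ZFLOPS / CLOCK_HZ:.0e}")        # 1e+10

    # Light-speed bound across the 4 m package, in clock cycles;
    # real, routed signaling is slower, so ~10,000 cycles is plausible.
    print(f"one-way light time: {(4.0 / 3e8) * CLOCK_HZ:.0f} cycles")  # 1333

    # Little's law: operations in flight = throughput x average latency.
    avg_latency_s = 100 / CLOCK_HZ   # assumed ~100-cycle average latency
    print(f"concurrency: {ZFLOPS * avg_latency_s:.0e}")       # 1e+12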
Parcel Simulation Latency Hiding Experiment

[Diagram: two configurations over a flat network of nodes. Control
experiment: process-driven nodes, each an ALU with local memory that
issues blocking remote memory requests. Test experiment: parcel-driven
nodes, each an ALU with local memory that consumes input parcels and
emits output parcels in place of blocking remote requests.]
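The experiment can be approximated with a toy event-driven model; a
minimal sketch, not the actual simulator, whose parameters mirror the axes
of the charts that follow (remote access fraction, remote latency, and
pending parcels per node at t=0). The process-driven node stalls for the
full round trip on every remote reference; the parcel-driven node ships
the reference off as a parcel and keeps executing whatever else is
runnable, idling only when nothing is:

    import heapq, random

    def simulate(n_ops, remote_frac, latency, parallelism, blocking):
        """Idle cycles at one node while it retires n_ops one-cycle operations."""
        runnable = parallelism        # parcels executable at t=0
        returning = []                # min-heap of parcel return times
        clock = idle = done = 0
        while done < n_ops:
            while returning and returning[0] <= clock:
                heapq.heappop(returning)
                runnable += 1         # a reply parcel arrived
            if runnable == 0:
                idle += 1             # exposed latency: nothing to run
            else:
                done += 1             # execute one operation this cycle
                if random.random() < remote_frac:
                    if blocking:      # process-driven: whole node waits
                        idle += latency
                        clock += latency
                    else:             # parcel-driven: only this parcel waits
                        runnable -= 1
                        heapq.heappush(returning, clock + latency)
            clock += 1
        return idle

    random.seed(0)
    for blocking, name in [(True, "process-driven"), (False, "parcel-driven")]:
        idle = simulate(100_000, 0.01, 1000, 16, blocking)
        print(f"{name}: {idle} idle cycles")

With 1% remote references and a 1000-cycle round trip, roughly ten parcels
per node are in flight on average, so 16 pending parcels at t=0 hide nearly
all of the latency, matching the trend in the following charts.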
Latency Hiding with Parcels, with Respect to System Diameter in Cycles

[Chart: sensitivity to remote latency and remote access fraction on 16
nodes. X-axis: remote memory latency (cycles). Y-axis: total
transactional work done / total process work done, log scale, 0.1 to
1000. One curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%),
with the degree of parallelism (pending parcels per node at t=0) marked
in red along each curve.]
Latency Hiding with Parcels: Idle Time with Respect to Degree of Parallelism

[Chart: idle time per node versus parallelism level (parcels per node
at t=0), with the number of nodes (1 to 256) marked in black. Y-axis:
idle time per node, 0 to 8e5 cycles. Two series: Process and
Transaction.]
Architecture Innovation

- Extreme memory bandwidth
- Active latency hiding
- Extreme parallelism
- Message-driven split-transaction computations (parcels; sketched below)
- PIM (processor-in-memory)
  - e.g. Kogge, Draper, Sterling, ...
  - very high memory bandwidth
  - lower memory latency (on chip)
  - higher execution parallelism (banks and row-wide)
- Streaming
  - e.g. Dally, Keckler, ...
  - very high functional parallelism
  - low latency (between functional units)
  - higher execution parallelism (high ALU density)
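As a concrete reading of "message-driven split-transaction computation," a
parcel can carry a destination address, an action to perform where that
datum lives, and a continuation to receive the result, so the requester
never blocks. A minimal sketch; all names here are invented for
illustration:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Parcel:
        dest: int                            # address of the remote operand
        action: Callable[[Any], Any]         # executed at the destination
        continuation: Callable[[Any], None]  # receives the result elsewhere

    def deliver(memory, parcel):
        """Destination side of the split transaction: perform the action
        on local data, then forward the result as a reply."""
        parcel.continuation(parcel.action(memory[parcel.dest]))

    memory = {7: 41}
    deliver(memory, Parcel(dest=7,
                           action=lambda x: x + 1,
                           continuation=lambda r: print("reply parcel:", r)))
    # prints: reply parcel: 42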
Continuum Computer Architecture

- Merges state, logic, and communication in a single building block
- Parcel-driven computation
  - fine-grain split-transaction computing
  - move data through vectors of instructions in store
  - move an instruction stream through a vector of data
  - gather-scatter as an intrinsic
  - very efficient Futures for producer/multi-consumer computing (see
    the sketch after this list)
- Combines the strengths of PIM and Streaming
  - all-register architecture (fully associative)
  - functional units within a cycle of their neighbors
  - extreme parallelism
  - intrinsic latency hiding
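The "very efficient Futures" bullet names write-once synchronization
between one producer and many consumers. A minimal single-threaded sketch
of the semantics, rendered as callbacks rather than hardware (names are
illustrative): consumers that arrive before the value queue their
continuations, and the producer's single write releases all of them.

    class Future:
        """Write-once cell for producer/multi-consumer synchronization."""
        _EMPTY = object()

        def __init__(self):
            self.value = Future._EMPTY
            self.waiters = []                  # continuations queued early

        def read(self, consumer):
            if self.value is Future._EMPTY:
                self.waiters.append(consumer)  # suspend until the write
            else:
                consumer(self.value)           # already produced: run now

        def write(self, value):
            self.value = value                 # the single write...
            for consumer in self.waiters:
                consumer(value)                # ...releases every waiter
            self.waiters.clear()

    f = Future()
    f.read(lambda v: print("consumer A:", v))
    f.read(lambda v: print("consumer B:", v))
    f.write(3.14)                              # A and B both fire here
    f.read(lambda v: print("consumer C:", v))  # later reads see the value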
[Diagram: the CCA building block: an associative memory with an
instruction register, a control unit, and an ALU, replicated into a
dense grid of ALUs.]
Conclusions

- Zettaflops at nano-scale technology is possible
  - size requirements are tolerable, though packaging is a challenge
  - the latency challenge does not sink the idea
- Major obstacles
  - power
  - latency
  - parallelism
  - reliability
  - programming
- Architecture can address many of these
- Continuum Computer Architecture
  - combines the advantages of PIM and streaming
  - a strong candidate for a future Zettaflops computer