The Shape of Things to Come:
The Future of Multi-core Microprocessors in HPC

Peter M. Kogge
Associate Dean for Research
McCourtney Prof. of CS & Engr
University of Notre Dame
IBM Fellow (retired)



                                    1
Thesis
• Multi-core microprocessors are taking over
  – For good technical reasons
• With multi-threading increasingly common
• But they still do not solve the memory wall
• This talk: there may be a better way
  – “Lighten up” the thread state
  – Make the cores “anonymous” and multi-threaded
  – Place them in large numbers on or close to memory




                                                          2
The Moore’s Law We Know & Love
• ~4X functionality every 3 years
• Underlying technology improvements:
  – Growth in transistor density: Yes
  – Growth in transistor switching speed: Yes
  – Growth in size of producible die: Not in commercial volumes
• Microprocessors: Functionality = IPS
  – ~1/2 from higher clock rate: No (heat)
  – ~1/2 from more complex microarchitectures: No (complexity)
• Memory: Functionality = storage capacity
  – ~2X from smaller transistors: Yes
  – Shrinkage in architecture of the basic bit cell: Yes, but ..
  – Increase in die size: Not at commercially viable prices
And Moore’s Law is silent on inter-chip I/O
                                                                         3
Technology Trends Forcing Explicit Parallelism
• Classical single-thread performance is flattening
• Answer: programmer-visible parallelism:
  – Break the program into independent threads
  – Chip-level multi-processing (CMP): multiple cores on the same die
  – Multi-thread parallelism: executing multiple threads on the same core (“virtual multi-core”)
• Both are possible, on the same die

                                                           4
How Many Can We Fit on a cm²?
Assume we scale the entire current single-core chip & replicate it to fill a 280 mm² die
[Chart: number of µP per square centimeter, log scale 1–1000, plotted 1970–2020]
Answer: Potentially 1000’s!!!!
                                                                                                                          5
Multi-core Power and Clock

Chip Power = Cap/device × #devices/core × cores/chip × Clock × Voltage²

  – Max chip power limit: will grow only slightly
  – Cap/device: decreasing ~linearly with technology
  – #devices/core: assume constant for multi-core
  – cores/chip: increasing as the square with technology
  – Clock: max rate grows rapidly with technology
  – Voltage: reaching an asymptotic limit

But all factors must fit the total equation!!
                                                               6
Rewriting for Clock

Clock = Max_chip_power(T) × Reduction_in_core_area
        -------------------------------------------
               Cap_per_device × V²

This is what governs core frequency:
       Not Faster Transistors!!!
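The relation above can be sketched numerically. This is a minimal sketch, assuming a fixed 100 W power cap and made-up per-generation scaling factors for capacitance, Vdd, and core count; the real equation also carries switching-activity factors the slides fold into the constants.

```python
# Sketch of the slide's point: under a fixed chip power budget, the
# sustainable clock falls out of the power equation, not transistor speed.
# Every number below (power cap, capacitances, Vdd, core counts) is an
# illustrative assumption, not measured technology data.

def power_limited_clock(max_chip_power, cap_per_device, devices_per_core,
                        cores_per_chip, vdd):
    """Solve Chip_Power = Cap/device * devices/core * cores/chip * f * V^2
    for the frequency f (in Hz)."""
    return max_chip_power / (
        cap_per_device * devices_per_core * cores_per_chip * vdd ** 2)

# Hypothetical generations: capacitance and Vdd shrink slowly while the
# core count grows as the square of the feature-size scaling.
generations = [(1.00e-15, 1.10, 4),
               (0.70e-15, 1.00, 8),
               (0.50e-15, 0.95, 16),
               (0.35e-15, 0.90, 32)]
for cap, vdd, cores in generations:
    f = power_limited_clock(100.0, cap, 50e6, cores, vdd)
    print(f"{cores:3d} cores -> power-limited clock {f / 1e9:.2f} GHz")
# The trend, not the absolute numbers, is the message: as cores tile the
# die under a fixed power cap, the sustainable clock DECREASES.
```

The point of the sketch is the direction of the trend, which matches the later slide "If All We Do is Tile Cores": capacitance and Vdd fall slowly while core count grows quadratically, so the power-limited clock goes down.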


                                                                   7
Relative Change In Factors
[Chart: factors relative to 2004, plotted 2004–2020: Max Power (numerator), Area (numerator), Cap per Device (denominator), Vdd (denominator), and the resulting Power-Limited Clock]
                                                                                         8
If All We Do is Tile Cores …..
[Chart: GHz, log scale 1–100, plotted 2004–2020: Peak Clock vs. Power-Limited Clock]
Clock Rates Will DECREASE
                                                             9
What Kind of Core Should We Replicate?
[3D charts: Relative IPS, Relative Area, and IPS per Unit Area, each plotted against Relative Clock (1.00–4.00) and Issue Width (1–8)]
• Complex designs give the most performance
• But they also have the largest area
• Simpler designs give better performance per unit area
Simpler is Better
                                                                                                                                                                                                                        10
Multi-Threading
• Thread: execution of a series of inter-dependent instructions in support of a single program
• Today’s single-threaded CPUs:
  – Dependencies reduce the ability to keep function units busy
  – Limited support for memory operations “in flight”
• Multi-threading: allowing multiple threads to take turns using the same CPU logic
  – Typical requirement: multiple register sets
  – Valuable fallout: reduction in microarchitectural complexity
• Variations:
  – Coarse-grained MT: change threads only at some major event
  – Fine-grained MT: change threads every few instructions
  – Simultaneous MT: interleave instructions from multiple threads
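The variations above can be illustrated with a toy simulator. This is a minimal sketch of fine-grained MT, in which one core switches to a different ready thread every cycle so memory stalls in one thread are overlapped with work from others; the two-op mini-ISA ('alu'/'load') and the 3-cycle latency are invented for illustration.

```python
from collections import deque

def run_fine_grained(threads, mem_latency=3):
    """Cycles needed by one multithreaded core running `threads`, switching
    to a different ready thread every cycle (fine-grained MT).  Each op is
    'alu' (1 cycle) or 'load' (issue, then wait mem_latency cycles)."""
    ready = deque((tid, deque(ops)) for tid, ops in enumerate(threads))
    waiting = []                       # (wake_cycle, tid, remaining ops)
    cycle = 0
    while ready or waiting:
        cycle += 1
        # Wake any thread whose memory access has completed.
        woken = [w for w in waiting if w[0] <= cycle]
        waiting = [w for w in waiting if w[0] > cycle]
        for _, tid, ops in woken:
            ready.append((tid, ops))
        if not ready:
            continue                   # every thread stalled: a dead cycle
        tid, ops = ready.popleft()
        op = ops.popleft()
        if op == "load" and ops:
            waiting.append((cycle + mem_latency, tid, ops))
        elif ops:                      # 'alu': thread goes back in the queue
            ready.append((tid, ops))
    return cycle

# One thread alone eats the full memory latency ...
print(run_fine_grained([["load", "alu"]]))                    # 4 cycles
# ... but three such threads hide most of it by taking turns.
print(run_fine_grained([["load", "alu"] for _ in range(3)]))  # 6 cycles, not 12
```

The usage lines show the payoff the slide is after: with enough threads, the core rarely sits idle waiting on memory, which is exactly the property the rest of the talk exploits.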



                                                                     11
A Brief History of Multi-threaded Processors
[Timeline chart, 1960–2010, ranking machines by number of relevant features: CDC 6600 I/O processors, Space Shuttle IOP, HEP, J-Machine, Horizon, MTA, P5/U4, Hyper-Threading, HTMT, Niagara, PIM Lite, Eldorado]
                                                                                                             12
What is Today’s Multi-Core Design Space
[Diagrams of three organizations: (a) Hierarchical Designs: multiple cores, each with its own cache, sharing higher-level cache/memory; (b) Pipelined Designs: a chain of core–memory stages; (c) Array Designs: a grid of core + cache/memory tiles under common interconnect & control]
(a) Hierarchical: Intel Core Duo, IBM Power5, Sun Niagara, …
(b) Pipelined: IBM Cell, most router chips, many video chips
(c) Array: Terasys, Execube, Yukon


                                                                                              13
EXECUBE: 1st True Multi-Core
(1st Silicon 1993)
• First DRAM-based multi-core with memory
• Designed from the outset for “glueless” one-part-type scalability
• On-chip bandwidth: 6.2 GB/s; utilization modes > 4 GB/s
[Die diagram: 8 compute nodes on ONE chip, each pairing MEMORY with CACHE + CPU; “high-bandwidth” features included in the ISA; a 3D binary hypercube; SIMD/MIMD on a chip]
Kogge, “EXECUBE,” ICPP, 1994.


                                                                                      14
Micron Yukon
[Block diagram: a remote host over an SDRAM-like interface; FIFOs, Task Dispatch Unit, HMI, synchronisation logic, M16 PE sequencer, DRAM Control Unit, register files, 256 Processing Elements, and 16 MBytes of embedded DRAM]
• 0.15 µm eDRAM / 0.18 µm logic process
• 128 Mbits DRAM
  – 2048 data bits per access
• 256 8-bit integer processors
  – Configurable in multiple topologies
• On-chip programmable controller
• Operates like an SDRAM
G. Kirsch, “Active Memory: Micron’s Yukon,” IPDPS 2003.


                                                                                                                                         15
Bluegene/L
• Two simple cores with dense embedded DRAM technology
• Included 4 MB of on-chip embedded DRAM
• Designed to scale simply to bigger systems
• Basis for several of the world’s TOP500 machines
[Chip diagram: two PPC 440 cores, each with L1I/L1D caches and a DP FPU, sharing a 4 MB eDRAM L2 cache, plus node-node and memory interfaces]
S. S. Iyer, et al., “Embedded DRAM: Technology platform for the Blue Gene/L chip,” IBM J. R&D, Volume 49, Number 2/3, Page 333 (2005)
                                                                                                                          16
CELL
(A Pipelined, Array, Hierarchical MC Chip)
[Chip diagram: one PPC core plus 8 SPEs, each SPE with 256 KB of local memory; XDR memory and I/O interfaces]
Roadrunner system: 16K MC Opterons + 16K Cell chips
http://www.research.ibm.com/cell/cell_chip.html
                                                                         17
Intel Experimental Teraflop Chip
• http://inteldeveloperforum.com.edgesuite.net/event3/092606_jr/msl.htm


                                                                             18
Our Work: Climbing the Memory Wall a “Different Way”
• Enabling concepts:
  – Processing In Memory
     • Lowering the wall: improving both bandwidth & latency
  – Relentless multi-threading with lightweight threads
     • Changing the number of times we must climb it
     • Reducing the state we must keep behind, or move!
  – Enhancing the semantics of a memory location
     • Allowing incorporation of metadata “at the memory”
• Architectures & execution models to support both
• Emphasis on “Data Intensive” systems
• And technology to provide high-density, high-volume, scalable, single-part-type systems

                                                               19
“Processing-In-Memory”
• High-density memory on the same chip with high-speed logic
• Very fast access from logic to memory
• Very high bandwidth
• ISA/microarchitecture designed to utilize the high bandwidth
• Tile the chip with “memory+logic” nodes
[Diagrams: a 1993-era view of on-chip memory units with nearby processing logic, and a 2004 memory/logic node: row-decode logic and a memory array (one “full word” per row, one column per full-word bit) feeding sense amplifiers/latches and column multiplexing into a “wide word” interface; a wide register file, broadcast bus, wide ALUs, permutation network, thread-state package, global address translation, performance monitor, and parcel decode/assembly connected to the interconnect; such nodes tile a chip]
Parcel = Object Address + Method_name + Parameters
                                                                                                                      20
Merging Multi-Core with PIM
• “Fill the die”: add cores to fill the die
  – Contacts for external memory bandwidth will dominate the die area
• “Processing in Memory”: merge the cores with memory
  – Lots of local bandwidth; a single-part-type design
[Charts, 2003–2017: (left) “Fill the Die Just with Cores”: extra cores based on area only vs. available pins, and the total implementable number of cores; (right) “Merge into Memory”: % utilization of core area, memory area, and available pins for the 1:1 and 1:20 cases]
See P. Kogge, “An Exploration of the Technology Space for Multi-Core Memory/Logic Chips for Highly Scalable Parallel Systems,” IEEE Int. Workshop on Innovative Architectures, Turtle Bay, Hawaii, Jan. 2005.




                                                                                                                                                                                                                              21
A Short History of PIM@ND: Our IBM Origins
[Gallery of project chip diagrams]
• EXECUBE: 1st DRAM MIMD PIM; 8-way hypercube
• RTAIS: FPGA ASAP prototype; early multi-threading
• EXESPHERE: 9 PIMs + FPGA interconnect; place PIM in PC memory space
• PIM Fast: PIM for spacecraft; parcels & scalability
• PIM Lite: 1st mobile, multithreaded PIM; demo of all lessons learned
• PIM Macros (SRAM and DRAM): fabbed in advanced technologies; explore key layout issues
• HTMT: 2-level PIM for a petaflop; ultra-scalability
• DIVA: multi-PIM memory module; model potential PIM programs
• Cascade (as a Cray Inc. HPCS partner): world’s first trans-petaflop design; PIM-enhanced memory system
Coming soon to a computer near you, on a conventional motherboard
www.cse.nd.edu/~pim
22
EXECUBE: 1st True PIM with Dense Memory
• First DRAM-based multi-core with memory
• Designed from the outset for “glueless” one-part-type scalability
• On-chip bandwidth: 6.2 GB/s; utilization modes > 4 GB/s
[Die diagram: 8 compute nodes on ONE chip, each pairing MEMORY with CACHE + CPU; “high-bandwidth” features included in the ISA; a 3D binary hypercube; SIMD/MIMD on a chip]
Kogge, “EXECUBE,” ICPP, 1994.


                                                                                      23
One Step Further: Reducing the Thread State
• “Overprovision” memory with huge numbers of anonymous processors
  – Each multi-threaded
• Reduce the state of a thread to ~ a cache line
• Allow thread states to reside in memory
• Make creating a new thread “near” some memory a cheap operation
• Use enhanced memory-state semantics to control thread scheduling
Latency reduced by huge factors
                                               24
PIM Lite: The First Such Design
• “Looks like memory” at its interfaces
• ISA: 16-bit multithreaded MIMD/SIMD
  – “Thread” = IP/FP pair
  – “Registers” = wide words in frames
• Multiple nodes per chip
• 1 node’s logic area ~ 10.3 KB of SRAM (comparable to a MIPS R3000)
• TSMC 0.18 µm 1-node chip in fab now
• 3.2 million transistors (4-node)
[Diagrams: PIM devices sit on the memory interconnect network beside conventional memory and the CPU; node pipeline: Thread Queue → Instruction Memory → Frame Memory → ALU → Data Memory → Write-Back Logic, with parcels in and out via the chip data bus]
                                                                                                            25
New Idea: Make Even Memory
  References into Threads
Memory Request Format = Threadlet State:
Target Address | Operands & Working Registers | PC | Code | Additional Data Payload
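As a rough illustration, that request format can be written down as a packed record whose total size is one 64-byte cache line, so the whole thread state travels inside the memory request itself. Every field width here is an assumption chosen for the sketch, not the actual parcel layout.

```c
#include <stdint.h>

/* Hypothetical threadlet: the entire thread state rides inside the
   memory request packet (parcel).  Field order follows the slide's
   request format; the widths are illustrative, chosen so the whole
   state fits exactly one 64-byte cache line with no padding. */
typedef struct {
    uint64_t target_address;  /* where the parcel is routed next */
    uint64_t regs[2];         /* operands & working registers    */
    uint16_t pc;              /* offset into the carried code    */
    uint8_t  code[6];         /* short instruction sequence      */
    uint8_t  payload[32];     /* additional data payload         */
} threadlet;
```

Because the state is this small, “sending a thread to the data” costs no more than an ordinary memory reference.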

(Diagram: a “classical” host CPU node — heavyweight ISA processing, cache,
and threadlet processing — connects through a parcel network interconnect
to many PIM nodes, each containing local ISA/MT processing, memory, local
address management, and threadlet (“piglet”) processing.)
                                                                                26
      Our Vision: “PIM-Centric” Multi-Core,
        Relentlessly Multi-Threaded HPC
                     Systems
(Left: system diagram — PIM chips aggregated into a “PIM DIMM,” and DIMMs
into a “PIM Cluster” with its SPD; multiple PIM clusters, a conventional
“host,” and I/O all attach to an interconnection network.)

(Right: chart of projected chip count, 2003–2018, for a petabyte of
commodity DRAM under several scaling models — today’s approach, shrink,
shrink-and-merge, constant-die, and PIM.)

Up to 10X reduction in Chip Count!
                                                                                                                                                                                               27
               Our Work
• Start with Memory, Not Logic
• Add Just Enough Cores to Balance the Memory
• Position Them at the Memory Interface
• Make the Cores Simple!!
• Make the Cores Multi-threaded
• Add Local Semantics to the Memory
• Reduce the Thread State Down to Fitting in a Memory Reference
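A minimal sketch of what the scheduler of such a simple, anonymous, multi-threaded core might look like: thread states are ordinary memory-resident records (an IP/frame pair, as in PIM Lite), and the core just round-robins over whichever are ready. The structure and names are illustrative assumptions, not the actual hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4   /* illustrative number of hardware thread slots */

/* A lightweight thread: just an instruction pointer and a frame
   pointer ("registers" live in a wide word of memory). */
typedef struct {
    uint32_t pc;
    uint32_t frame;
    bool     ready;
} thread_state;

/* An anonymous core: a small queue of memory-resident thread states
   and a round-robin cursor. */
typedef struct {
    thread_state slots[QUEUE_DEPTH];
    size_t       next;
} core;

/* Pick the next ready thread, round-robin; NULL if none is ready
   (e.g. all are parked on empty memory words). */
thread_state *schedule(core *c) {
    for (size_t i = 0; i < QUEUE_DEPTH; i++) {
        size_t idx = (c->next + i) % QUEUE_DEPTH;
        if (c->slots[idx].ready) {
            c->next = (idx + 1) % QUEUE_DEPTH;  /* advance past winner */
            return &c->slots[idx];
        }
    }
    return NULL;
}
```

Because a slot holds so little state, swapping threads in and out of memory is cheap, which is what lets the memory be “overprovisioned” with such cores.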


                                            28