     Intel Multi-Core Technology
• New energy efficiency through parallel processing
  – Multiple cores in a single package
  – Second-generation high-k + metal-gate 32 nm
    process technology
• Intel Turbo Boost technology
  – Changing frequency depending on workload
• Intel Hyper-Threading Technology
  – Two threads on a single core
• Tera-scale computing
  – Intel intends to scale multi-core to 100 cores and beyond
      Multi-Core and Hyper-Threading
• Multi-core chips put two or more cores in a
  single processor package.
• Multi-core chips do more work per clock cycle
  while running at a lower clock frequency.
• Hyper-Threading makes more efficient use of a
  single core
  – by allowing multiple threads to share the core’s
    resources
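
As a minimal sketch (an illustration, not part of the original slides) of how software sees this hardware, standard C++11 can report the number of logical processors, typically cores times hyper-threads per core, through std::thread::hardware_concurrency():

    // Minimal C++11 sketch: ask the runtime how many logical processors
    // (hardware threads) are available; the result may be 0 if unknown.
    #include <iostream>
    #include <thread>

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "Logical processors visible to this program: " << n << '\n';
        return 0;
    }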
           Interaction with the
            Operating System
• OS perceives each core as a separate processor

• OS scheduler maps threads/processes
  to different cores

• Most major operating systems support multi-core
  today: Windows, Linux, Mac OS X, …
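
A hedged, Linux-specific sketch of this interaction: the scheduler normally picks the core, but a program can pin its thread to logical CPU 0 with the GNU extension pthread_setaffinity_np() (not portable to Windows or Mac OS X; build with g++ -pthread):

    // Pin the calling thread to logical CPU 0 (Linux/glibc only).
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <iostream>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 // allow only logical CPU 0
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        std::cout << (rc == 0 ? "pinned to CPU 0\n" : "pinning failed\n");
        return 0;
    }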
                    Why multi-core?
• Difficult to make single-core
  clock frequencies even higher
• Deeply pipelined circuits:
   –   heat problems
   –   speed of light problems
   –   difficult design and verification
   –   large design teams necessary
   –   server farms need expensive
       air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift
  towards more parallelism)
     Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order, pipeline
  instructions, split them into microinstructions,
  do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the last 15
  years
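
A rough sketch (not from the slides) of why instruction-level parallelism matters: the first loop is one long dependency chain, while the second gives an out-of-order, superscalar core independent additions it can overlap. Timing the two loops (with compiler optimizations kept modest so neither is optimized away) usually shows the second running faster per element:

    #include <cstdint>
    #include <iostream>

    int main() {
        const std::int64_t N = 100000000;

        // One accumulator: each add must wait for the previous one.
        std::int64_t a = 0;
        for (std::int64_t i = 0; i < N; ++i) a += i;

        // Four independent accumulators: the adds can issue in parallel.
        std::int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::int64_t i = 0; i < N; i += 4) {
            s0 += i; s1 += i + 1; s2 += i + 2; s3 += i + 3;
        }
        std::cout << a << ' ' << (s0 + s1 + s2 + s3) << '\n';
        return 0;
    }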
      Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate thread
  (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot fully
  exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
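
A minimal sketch of the thread-level parallelism described above: three independent tasks (stand-ins for the AI, graphics, and physics work mentioned in the game example) each run in their own std::thread, and the OS is free to schedule them onto different cores:

    #include <iostream>
    #include <thread>

    void ai()       { std::cout << "AI update\n"; }
    void graphics() { std::cout << "graphics frame\n"; }
    void physics()  { std::cout << "physics step\n"; }

    int main() {
        std::thread t1(ai), t2(graphics), t3(physics);  // coarse-grained tasks
        t1.join(); t2.join(); t3.join();                // wait for all three
        return 0;
    }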
          What applications benefit
             from multi-core?
•   Database servers
•   Web servers (Web commerce)
•   Compilers
•   Multimedia applications
•   Scientific applications, CAD/CAM
•   In general, applications with
    Thread-level parallelism
    (as opposed to instruction-level
    parallelism)
  A technique complementary to multi-core:
        Simultaneous multithreading
• Problem addressed: the processor pipeline can get
  stalled:
  – waiting for the result of a long floating-point
    (or integer) operation
  – waiting for data to arrive from memory
  While the pipeline is stalled, the other execution
  units sit unused.
[Figure: processor pipeline block diagram showing the L1 D-cache and D-TLB,
 integer and floating-point units, schedulers, uop queues, rename/alloc,
 BTB, trace cache, uCode ROM, decoder, BTB and I-TLB, L2 cache and control,
 and the bus]
  Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core
   – Example: if one thread is waiting for a floating-
     point operation to complete, another thread can
     use the integer units
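
A hedged illustration of that example: one thread does mostly floating-point work while the other does mostly integer work, so on a hyper-threaded core the two can keep different execution units busy at the same time (the workload split is an assumption made for the sketch):

    #include <iostream>
    #include <thread>

    int main() {
        double fp_sum  = 0.0;
        long   int_sum = 0;

        std::thread fp_thread([&] {             // floating-point heavy work
            for (long i = 1; i < 50000000; ++i) fp_sum += 1.0 / i;
        });
        std::thread int_thread([&] {            // integer heavy work
            for (long i = 0; i < 50000000; ++i) int_sum += i & 7;
        });

        fp_thread.join();
        int_thread.join();
        std::cout << fp_sum << ' ' << int_sum << '\n';
        return 0;
    }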
Without SMT, only a single thread can
       run at any given time
[Figure: the pipeline block diagram occupied only by Thread 1
 (floating point)]
Without SMT, only a single thread can
       run at any given time
[Figure: the pipeline block diagram occupied only by Thread 2
 (integer operation)]
SMT processor: both threads can run
          concurrently
[Figure: the pipeline block diagram with Thread 1 (floating point) and
 Thread 2 (integer operation) active in the pipeline at the same time]
   SMT is not a “true” parallel processor
• Enables better threading (performance gains of
  up to roughly 30%)
• OS and applications perceive each simultaneous
  thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources
                                  Multi-core:
                       threads can run on separate cores
[Figure: two complete cores side by side, each with its own L1 D-cache and
 D-TLB, integer and floating-point units, schedulers, uop queues,
 rename/alloc, BTB, trace cache, uCode ROM, decoder, BTB and I-TLB,
 L2 cache and control, and bus interface]
   Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “Hyper-Threads” (HT
  Technology)
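
A Linux-specific sketch (an assumption, not part of the slides) of telling hyper-threads apart from physical cores: sysfs lists which logical CPUs are SMT siblings of the same core, so an output such as "0,4" means logical CPUs 0 and 4 share one physical core:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream f("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        std::string siblings;
        if (std::getline(f, siblings))
            std::cout << "SMT siblings of CPU 0: " << siblings << '\n';
        else
            std::cout << "topology information not available\n";
        return 0;
    }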
                   SMT dual-core: all four threads can
                          run concurrently
[Figure: two SMT-enabled cores side by side, each with the full pipeline
 shown earlier, so two threads can run on each core]
  Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores,
    each is smaller and not as powerful
    (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level
    parallelism
      The memory hierarchy for
            threading
• If simultaneous multithreading only:
  – all caches shared
• Multi-core chips:
  – L1 caches private
  – L2 caches private in some architectures
    and shared in others
• Memory is always shared
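
A glibc/Linux-specific sketch of inspecting this hierarchy from software; the _SC_LEVEL* names are glibc extensions (an assumption for this sketch) and may return 0 or -1 on systems that do not report cache sizes:

    #include <unistd.h>
    #include <iostream>

    int main() {
        std::cout << "L1 data cache: " << sysconf(_SC_LEVEL1_DCACHE_SIZE)     << " bytes\n"
                  << "L2 cache:      " << sysconf(_SC_LEVEL2_CACHE_SIZE)      << " bytes\n"
                  << "L3 cache:      " << sysconf(_SC_LEVEL3_CACHE_SIZE)      << " bytes\n"
                  << "Cache line:    " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";
        return 0;
    }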
          Intel Xeon dual-core
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache
[Figure: two cores (CORE0 and CORE1), each running hyper-threads and
 holding its own L1 cache, sharing a single L2 cache in front of memory]
     Designs with private L2 caches
[Figure, left: two cores (CORE0 and CORE1), each with its own L1 and L2
 cache, in front of memory; both L1 and L2 are private.
 Examples: AMD Opteron, AMD Athlon, Intel Pentium D]
[Figure, right: two cores, each with its own L1, L2, and L3 cache, in
 front of memory; a design with L3 caches.  Example: Intel Itanium 2]
       Private vs shared caches
• Advantages of private caches:
  – They are closer to the core, so access is faster
  – Less contention between cores
• Advantages of shared caches:
  – Threads on different cores can share the same
    cached data
  – More cache space is available when only one (or a
    few) high-demand threads run on the system
     The cache coherence problem
• Since we have private caches:
  How to keep the data consistent across caches?
• Each core should perceive the memory as a
  monolithic array, shared by all the cores



• The MESI cache coherence protocol keeps the
  private caches consistent
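
A small sketch of the coherence traffic MESI has to manage, known as false sharing: two threads increment different counters that happen to sit on the same cache line, so the line ping-pongs between the cores' private caches. Padding the counters onto separate lines (the commented-out alignas(64) variant) typically removes the slowdown:

    #include <iostream>
    #include <thread>

    struct Shared {
        long a;    // written only by thread 1
        long b;    // written only by thread 2, but on the same cache line
    };
    // struct Padded { alignas(64) long a; alignas(64) long b; };  // the fix

    int main() {
        Shared s{0, 0};
        std::thread t1([&] { for (long i = 0; i < 100000000; ++i) ++s.a; });
        std::thread t2([&] { for (long i = 0; i < 100000000; ++i) ++s.b; });
        t1.join();
        t2.join();
        std::cout << s.a + s.b << '\n';
        return 0;
    }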
The Core i3 500 series processors are dual-core parts with Hyper-Threading
and virtualization support, but without Turbo Boost.

The Core i5 600 series processors are dual-core parts with Hyper-Threading,
Turbo Boost, virtualization, and the AES instruction set.
(TDP: Thermal Design Power)
     The Turbo Boost Technology
• When fewer cores are in use, transistors built into
  the chip disconnect the idle cores from the power bus
• When a program needs only a single thread, the
  active core is automatically given extra voltage
  and overclocked for a short period of time until
  the job is done.
• Turbo Boost decides when to do this in order to
  maximize performance
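
A Linux-specific sketch (an assumption, not part of the slides) of watching this happen: the cpufreq interface in sysfs reports the current frequency of a logical CPU, so reading it while a single-threaded job runs should show the boosted frequency when the feature kicks in:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        std::string khz;
        if (std::getline(f, khz))
            std::cout << "CPU 0 current frequency: " << khz << " kHz\n";
        else
            std::cout << "cpufreq interface not available\n";
        return 0;
    }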
    The Turbo Boost Technology
• Of course, constantly overclocking a machine can
  overheat the chip and render it useless fairly
  quickly
• Intel has ensured that its mobile Nehalem parts
  (codenamed Clarksfield) protect themselves through
  self-monitoring, shutting down if temperature
  limits are breached. (Though that can mean the
  cores repeatedly shut down!)
Teaching a new course at UCCS, fall 2012

ECE5990/4990 Power Electronics


Graduate students are welcome to take this course.

				