Introduction to the Intel Microarchitecture
How to Write Powerful Parallel Applications
08:30-09:00   Welcome and Coffee
09:00-09:45   Introduction to the Intel Microarchitecture and Software Implications
09:45-10:15   Introduction to the Software Design Cycle - From Serial to Parallel Applications
10:15-10:30   Break
10:30-11:30   How to Optimize Applications and Identify Areas for Parallelization
11:30-12:30   Introduction to Parallel Programming Methods
12:30-13:30   Lunch
13:30-14:30   Expressing Parallelism: Using Intel® C++ and Fortran Compilers, Professional Editions 10.1, for Performance and Multi-threading
14:30-15:15   Expressing Parallelism: Introducing Threading Through Libraries
15:15-15:30   Break
15:30-16:00   Pinpointing Program Inefficiencies and Threading Bugs - Data Races and Deadlocks
16:00-16:45   Performance Tuning of Threaded Software Using Intel VTune Performance Analyzer and Thread Profiler
16:45-17:15   Parallel Programming Techniques and Program Testing in Cluster Environments
Intel® Core™ Microarchitecture

Edmund Preiss
EMEA Software Solutions Group

Core™ Architecture

Moore's Law and Processor Evolution
Introduction to the Core architecture
– New features added in 2007:
  – Intro to 45nm technology -> shrink
New Core™ Advanced Features
Selected Software Implications
Implications of Moore's Law

[Chart: as the number of transistors goes UP (roughly 10^3 to 10^10), the cost per transistor goes DOWN (roughly 10 to 10^-7). Scaling + wafer size + volume = lower costs. Source: WSTS/Dataquest/Intel; Fortune Magazine]
New Microarchitecture History

Architecture families (examples): EPIC* (Itanium®), x86, IXA* (xScale)

x86 microarchitecture lineage (examples):
– P5: Pentium®
– P6: Pentium® Pro, Pentium® II/III
– Intel NetBurst®: Pentium® 4, Pentium® D, Xeon®
– Banias: Pentium® M, Core® Duo
– Intel® Core™: Conroe, Woodcrest, Merom

* IXA - Intel Internet Exchange Architecture / EPIC - Explicitly Parallel Instruction Computing
Intel Processor Family Design Cycles

The design cycle alternates every 2 years between a shrink/derivative and a new microarchitecture:
– Shrink/Derivative (65nm): Presler, Yonah, Dempsey
– New Microarchitecture: Intel® Core™ Microarchitecture - increase performance per given clock cycle, increase processor frequencies, extend energy efficiency
– Shrink/Derivative (45nm): Penryn family
– New Microarchitecture: Nehalem - deliver lead product for 45nm High-k + metal gate process technology
– Shrink/Derivative (32nm)
– New Microarchitecture: deliver optimized processors across each product segment and power envelope
Details of the Intel Core Architecture

Intel Core Innovations

[Diagram: two cores (Core 1, Core 2) sharing a bus and L2 cache, surrounded by the labels Wider, Deeper, Smarter, Faster]

– Intel® Wide Dynamic Execution
– Intel® Intelligent Power Capability
– Intel® Advanced Digital Media Boost
– Intel® Smart Memory Access
– Intel® Advanced Smart Cache
Core™ vs. NetBurst™ µ-arch: Overview

Processor component          | Intel NetBurst™       | Intel Core™
Pipeline stages              | 31                    | 14
Threads per core             | 2                     | 1
L1 cache org.                | 12K uop I / 16K data  | 32K I / 32K data
L2 cache org.                | 2 x 2MB               | 1 x 4MB (shared)
Instruction decoders         | 1                     | 4
Integer units                | 2 (2x core freq)      | 3 (1x core freq)
SIMD units                   | 2 x 64-bit            | 3 x 128-bit
SIMD instr. issued per clock | 1                     | 3
FP units                     | 3 (Add/Mul/Div)       | 3 (Add/Mul/Div)
FP instr. issued per clock   | 1                     | Up to 2 (Add + Mul or Div)
Power                        | 135W                  | 80W

(Unit counts are per core.)
45nm Technology

Penryn - code name for an enhanced Intel® Core™ microarchitecture at 45 nm:
– Industry's first 45 nm High-k process technology
– ~2x transistor density
– >20% gain in transistor switching speed
– ~30% decrease in transistor switching power
– Dual core, quad core
– Shared L2 cache
– Intel 64 architecture
– 128-bit SSE

[Diagram: "Penryn"/"Wolfdale"/"Wolfdale DP" dual-core package - two cores, each with a 32K I-cache and 32K D-cache, sharing a 6M L2 cache and the bus. 2 threads, 1 package (similar to the Intel® Core™ 2 Duo processor)]
Core™ Microarchitecture

[Diagram: two identical cores. Each core: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (fed by a uCode ROM) -> Rename/Alloc -> Retirement Unit (Reorder Buffer) -> Schedulers -> execution units (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE and FPMove, plus Load and Store units) -> L1 D-Cache and D-TLB. The cores share a 2MB/4MB L2 cache and an FSB delivering up to 10.4 GB/s.]
Intel® Core™ Microarchitecture

Primary interfaces:
– Front end
– Execution
– Memory

[Diagram: one core's pipeline - Instruction Fetch and PreDecode (6 wide) -> Instruction Queue (5 wide) -> Decode (4 wide, with uCode ROM) -> Rename/Alloc -> Retirement Unit (4 wide) -> Schedulers -> ALU/Branch, ALU/FAdd, ALU/FMul (each with MMX/SSE and FPmove), Load and Store through the Memory Order Buffer -> L1 D-Cache and D-TLB; 2M/4M shared L2 cache; FSB up to 10.6 GB/s]
                                                        Intel® Core™
          Instruction Fetch                           Microarchitecture
           And PreDecode
                    6                                           Front End
                                                                Front End
          Instruction Queue           2M/4M            Up to 6 instructions per cycle can
                                                        Up to 6 instructions per cycle can
                    5                Shared L2        be sent to the IQ
                                                       be sent to the IQ
uCode
 ROM           Decode                  Cache
                                                       Typical programs average
                                                        Typical programs average
                    4                                 slightly less than 4 bytes per
                                                       slightly less than 4 bytes per
                                       Up to          instruction
                                                       instruction
           Rename/Alloc
                                     10.6 GB/s
                                                       4 decoders:1 “large” and 3
                                                        4 decoders:1 “large” and 3
                                        FSB           “small”.
Retirement Unit               4
                                                       “small”.
                                                       – All decoders handle “simple” 1-uop
                                                           All decoders
                                                        – instructions. handle “simple” 1-uop
                                                           instructions.
                                                       – Larger handles instructions up to 4
                                                           Larger handles instructions up to 4
                                                        – uops
                                                           uops
            Schedulers
                                                      Detects short loops and locks
                                                       Detects short loops and locks
   ALU       ALU       ALU                            them in the instruction queue
                                                       them in the instruction queue
 Branch     FAdd      FMul                   Memory   (IQ)
                              Load   Store   Order     (IQ)
MMX/SSE   MMX/SSE   MMX/SSE                            – Reduced front end power
                                             Buffer         Reduced front end power
                                                        – consumption - total saving of up to
 FPmove    FPmove    FPmove
                                                           consumption - total saving of up to
                                                          14%
                                                           14%
   L1 D-Cache and D-TLB
Instruction Queue - Without Macro-Fusion

Instruction queue contents (youngest to oldest):
    inc   esp
    store [mem3], ebx
    jne   targ
    cmp   eax, [mem2]
    load  eax, [mem1]

Read five instructions from the instruction queue; each instruction gets decoded separately by dec0-dec3:
– Cycle 1: load eax, [mem1] | cmp eax, [mem2] | jne targ | store [mem3], ebx
– Cycle 2: inc esp
Instruction Queue - With Intel's New Macro-Fusion

Same five instructions. Read five instructions from the instruction queue and send the fusable pair (cmp + jne) to a single decoder - all in one cycle:
– Cycle 1: load eax, [mem1] | cmpjne eax, [mem2], targ | store [mem3], ebx | inc esp
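Macro-fusion applies to a compare (or test) immediately followed by a conditional jump, which is exactly the code shape compilers emit for loop exits and guarded accumulations. As a minimal sketch (the function name and data are ours, not from the slides), both the loop bound check and the `if` below typically compile to fusable cmp+jcc pairs, as long as no instruction is scheduled between the compare and the branch:

```c
/* The compare-and-branch at the bottom of the loop and the guard
 * inside it each typically compile to a `cmp` immediately followed
 * by a conditional jump -- a pair the Core decoder can fuse into a
 * single uop (macro-fusion). */
long sum_below(const int *a, int n, int limit)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {   /* cmp i,n + jl: fusable pair */
        if (a[i] < limit)           /* cmp a[i],limit + jge: fusable pair */
            sum += a[i];
    }
    return sum;
}
```

Whether fusion actually happens depends on the compiler's instruction scheduling, so this is an illustration of the favorable pattern rather than a guarantee.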
(66% improvement due to macro-fusion and the additional decoder.)
Intel® Core™ Microarchitecture - Execution (Out-of-Order)

4 uops renamed / retired per clock.

Uops written to RS and ROB:
– RS waits for sources to arrive, allowing OOO execution
– ROB waits for results to show up for retirement

6 dispatch ports from the RS:
– 3 execution ports (integer / fp / simd)
– load
– store (address)
– store (data)

128-bit SSE implementation:
– Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
– Port 1 has packed add (3 cycles, all precisions)

FP data has one additional cycle of bypass latency:
– Do not mix SSE FP and SSE integer ops on the same register

[Same pipeline diagram as above, execution stages highlighted]
Intel® Advanced Digital Media Boost

[Diagram: in each core, a full 128-bit SSE operation (SSE/SSE2/SSE3) - X4..X1 op Y4..Y1 across bits 127..0 - decodes and executes in a single cycle: X4opY4 X3opY3 X2opY2 X1opY1 in clock cycle 1. Others need two clock cycles: X2opY2 X1opY1 in cycle 1, X4opY4 X3opY3 in cycle 2.]

Advantage (performance and energy):
• Increased performance
• 128-bit single-cycle execution in each core
• Improved energy efficiency
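The packed operation in the diagram can be exercised from C through SSE intrinsics; a minimal sketch, assuming an x86 compiler with SSE support (the function name `add4` is ours):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds four single-precision floats with one packed SSE instruction
 * (addps) -- the full 128-bit X op Y operation the slide shows
 * completing in a single cycle on the Core microarchitecture. */
void add4(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);            /* X4 X3 X2 X1 */
    __m128 vy = _mm_loadu_ps(y);            /* Y4 Y3 Y2 Y1 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy)); /* X4opY4 .. X1opY1 */
}
```

The unaligned load/store intrinsics keep the sketch simple; aligned variants (`_mm_load_ps`) require 16-byte-aligned data.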
Intel® Core™ Microarchitecture - Memory Sub-system

Loads & stores:
– 128-bit load and 128-bit store per cycle

Data prefetching
Memory disambiguation
Shared cache

L1D cache prefetching:
– Data Cache Unit prefetcher (aka streaming prefetcher):
  – Recognizes ascending access patterns in recently loaded data
  – Prefetches the next line into the processor's cache
– Instruction-based stride prefetcher:
  – Prefetches based upon a load having a regular stride
  – Can prefetch forward or backward 2 KBytes (1/2 default page size)

L2 cache prefetching: Data Prefetch Logic (DPL):
– Prefetches data to the 2nd-level cache before the DCU requests the data
– Maintains 2 tables for tracking loads:
  – Upstream - 16 entries
  – Downstream - 4 entries

[Same pipeline diagram as above, memory stages highlighted]
Intel Smart Memory Access: Prefetchers

[Animation: a queue of loads (Load1 oldest ... Load4 youngest) next to the L1 data cache and the shared L2 data cache]

– Memory is too far away
– Caches are closer when they have the data
– Prefetchers detect the application's data reference patterns
– ...and bring the data closer to the data consumer
– Solving the problem of where
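The prefetchers described above recognize ascending or constant-stride streams, so the same data can be fast or slow to sum depending purely on traversal order. A minimal sketch (array size and function names are ours, chosen for illustration):

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-major traversal touches memory with a constant 4-byte stride,
 * a pattern the streaming and stride prefetchers can detect and run
 * ahead of. */
long sum_row_major(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];          /* stride = 4 bytes: prefetch-friendly */
    return s;
}

/* Column-major traversal of the same array jumps COLS*4 bytes per
 * access, defeating next-line prefetch and touching a new cache line
 * on almost every load. */
long sum_col_major(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];          /* stride = 1 KB: prefetch-hostile */
    return s;
}
```

Both functions compute the same sum; on large arrays only the row-major version keeps the prefetchers engaged.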
Some Implications of Core 2 Architecture for Developers Who Want to Thread Their Apps

Advanced Smart Cache benefits:
– Two threads which "communicate" frequently should be scheduled onto two cores that share an L2 cache
  – Use the thread/processor affinity feature in your applications

[Diagram: quad-core processor - Core 1 and Core 2 share one L2 cache, Core 3 and Core 4 share another; both L2 caches connect to the FSB]
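On Linux, thread affinity can be set with `pthread_setaffinity_np` (a GNU extension). A minimal sketch, assuming that logical CPUs 0 and 1 share an L2 cache - which CPU ids share a cache is platform-specific and should be verified (e.g. under /sys/devices/system/cpu/ on Linux):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pins a thread to one logical CPU.  Two communicating threads pinned
 * to two cores that share an L2 can exchange data through the shared
 * cache instead of going over the FSB.  Returns 0 on success. */
int pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```

Typical use: `pin_to_cpu(pthread_self(), 0)` for the first thread and `pin_to_cpu(worker, 1)` for its partner, assuming CPUs 0 and 1 share an L2 on the machine at hand.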
Memory Related: Avoid False Sharing

What is false sharing? Multiple threads repeatedly write to the same cache line, which is shared between processors:
– Usually to different data within the line
– Cache lines get invalidated
  – Forces additional reads from memory
– Severe performance impact, especially in tight loops
  – Threads read/write the same cache line very rapidly
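The standard fix is to pad per-thread data out to separate cache lines. A minimal sketch, assuming the Core's 64-byte line size (struct names are ours):

```c
#include <stddef.h>

/* Two per-thread counters packed together land in the same 64-byte
 * cache line: each thread's writes invalidate the line in the other
 * core's cache even though the data is logically independent
 * (false sharing). */
struct counters_bad {
    long a;                          /* same cache line as b */
    long b;
};

/* Padding pushes b into the next cache line, so each thread owns its
 * line exclusively and the invalidation traffic disappears. */
struct counters_good {
    long a;
    char pad[64 - sizeof(long)];     /* fill out the rest of a's line */
    long b;
};
```

The cost is memory: every padded counter consumes a full line, which is usually a good trade for hot, heavily-written data.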
Some Words on Pipelines (1)
Modern CPUs may be understood by considering their basic design
paradigm, the so-called pipeline. The pipeline breaks the processing of a
single instruction into independent parts that ideally are executed in
identical time windows.

The independent parts of the processing are called pipeline stages.

Since identical processing time in each stage can't be guaranteed, most
pipeline stages control a buffer or queue that supplies instructions if the
previous stage is still busy, or in which instructions can be stored if the next
stage is still busy.

Underflow or overflow of a queue will cause the respective stage to run idle
and will cause a pipeline stall.
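The buffer-between-stages idea can be sketched as a bounded queue between a producer stage and a consumer stage (the names and capacity below are ours, for illustration):

```c
#include <stddef.h>

#define QCAP 4  /* illustrative queue depth between two stages */

struct stage_queue {
    int slots[QCAP];
    size_t head, tail, count;
};

/* Producer stage (e.g. fetch) inserting into the queue.
 * Returns -1 when the queue is full: overflow, producer must stall. */
int stage_push(struct stage_queue *q, int instr)
{
    if (q->count == QCAP) return -1;
    q->slots[q->tail] = instr;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    return 0;
}

/* Consumer stage (e.g. decode) draining the queue.
 * Returns -1 when the queue is empty: underflow, consumer runs idle. */
int stage_pop(struct stage_queue *q, int *instr)
{
    if (q->count == 0) return -1;
    *instr = q->slots[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 0;
}
```

The two -1 paths correspond exactly to the overflow and underflow conditions in the text: a full buffer stalls the stage before it, an empty buffer idles the stage after it.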



[Diagram: Fetch -> Decode -> Allocate -> Execute -> Retire, with a buffer between each pair of stages. A busy stage lets the buffer in front of it fill up; a full buffer stalls the stage before it, while an empty buffer leaves the stage after it idle.]
Some Words on Pipelines (2)
In order to achieve the best performance,
         Pipeline stalls must be avoided
Since the Core 2 microarchitecture makes use of speculative
execution, a mispredicted branch may lead to a
pipeline flush to keep the executed instructions consistent.
        Pipeline flushes must be avoided
Understanding the Core 2 pipeline and being able
to detect pipeline problems will greatly improve the
performance of your software.
Knowledge of the pipeline and its counter registers increases
the understanding and efficient usage of the VTune
Performance Analyzer
– E.g. look for cache misses and branch mispredictions
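A data-dependent branch in a hot loop is a typical source of the flushes described above. One common remedy, sketched below with illustrative function names, is to rewrite the select as arithmetic so there is no branch to mispredict; VTune's branch-misprediction events show where this matters:

```c
#include <stddef.h>

/* Data-dependent branch: on random data the predictor is wrong about
 * half the time, and each mispredict flushes the 14-stage pipeline. */
long sum_positive_branchy(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0)
            s += a[i];
    return s;
}

/* Branchless form: compilers typically turn the conditional select
 * into a cmov or masking sequence, removing the mispredict penalty
 * at the cost of always executing the add. */
long sum_positive_branchless(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i] > 0 ? a[i] : 0;
    return s;
}
```

Whether the compiler actually emits a branchless sequence depends on its optimization choices, so profile with the branch event counters rather than assuming.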
Uop Flow - Refer to VTune Event Counters

[Block diagram. Fetch/Decode: 32 KB instruction cache, Next IP, branch target buffer, instruction decode (4 issue), microcode sequencer, Register Allocation Table (RAT). Then the Reservation Stations (RS, 32 entries) feed the scheduler/dispatch ports: integer arithmetic, SIMD, FP Add, FP Div/Mul, shift/rotate, plus load, store address, and store data through the Memory Order Buffer (MOB). Retire: Re-Order Buffer (ROB, 96 entries) and the IA register set; 32 KB data cache and bus unit to the L2 cache.

– RESOURCE_STALLS measures here: transfer from decode
– RS_UOPS_DISPATCHED measures at execution
– UOPS_RETIRED measures at retirement]

Detailed description in the processor manuals:
http://www.intel.com/products/processor/manuals/
Backup

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:6
posted:3/11/2011
language:English
pages:32