					Features that you (most probably) didn’t
    know your Microprocessor had
             Joseph B. Manzano
                Spring 2009
Outline
 The Powerful and the Fallen
 The Mutualists
 The Just Passing
 The Olympic Sprinters
 The Threads’ Commune
 Breaking the Despotic Rule of the Lock
         The Powerful and The Fallen
        Multiple Issue Architectures: Increase your IPC / Take advantage of ILP

Common name                 Issue structure  Hazard detection  Scheduling                Distinguishing characteristic              Examples
Superscalar (static)        Dynamic          Hardware          Static                    In-order execution                         Sun UltraSPARC II and III
Superscalar (dynamic)       Dynamic          Hardware          Dynamic                   Some out-of-order execution                IBM Power2
Superscalar (speculative)   Dynamic          Hardware          Dynamic with speculation  Speculative out-of-order execution         Pentium 3 and 4
VLIW / LIW                  Static           Software          Static                    No hazards between issue packets           Trimedia, i860
EPIC                        Mostly static    Mostly software   Mostly static             Explicit dependences marked by compiler    Itanium

Key techniques: Register Renaming, Tomasulo Algorithm, Reorder Buffer, Scoreboarding
The Powerful and The Fallen
 Scoreboarding
   Based on the CDC 6000 architecture
   Important feature: the scoreboard
   Issue: checks WAW; Decode: checks RAW; Execute and write results: check WAR
 Tomasulo Algorithm
   Implemented in the IBM 360/91’s floating point unit
   Important features: reservation stations and the Common Data Bus (CDB)
   Issue: tag the operands if they are not available, copy them if they are
   Execute: stall on RAW hazards while monitoring the CDB
   Write results: send results to the CDB and dump the store buffer contents
   Exception handling: no instructions can be issued until a branch is resolved
 Reorder Buffer
 Register Renaming
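Register renaming, the last technique named on this slide, can be illustrated with a small software model. This is only a sketch under stated assumptions: the three-operand instruction format, the register names r0..r3 and p0, p1, ... and the function rename are all hypothetical, and real hardware also recycles physical registers, which is omitted here.

```python
from itertools import count

def rename(instructions, num_arch_regs=4):
    """Give every register write a fresh physical register.

    `instructions` is a list of (dest, src1, src2) architectural register names.
    Returns the same program expressed over physical registers p0, p1, ...
    """
    fresh = count()
    # Every architectural register starts out in its own physical register.
    mapping = {f"r{i}": f"p{next(fresh)}" for i in range(num_arch_regs)}
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources use the latest mapping, which preserves true (RAW) dependences.
        s1, s2 = mapping[src1], mapping[src2]
        # A new physical destination removes WAW and WAR hazards.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r2", "r1", "r3"),  # r2 = r1 op r3  (RAW on r1, WAR on r2)
    ("r1", "r0", "r3"),  # r1 = r0 op r3  (WAW on r1)
]
renamed = rename(program)
```

After renaming, no two instructions write the same physical register, so only the RAW dependence between the first two instructions remains.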
The Powerful and The Fallen
 Power5
   Dual-core, two-way SMT IBM PowerPC superscalar architecture
   Picture courtesy of IBM, from “Power5 System Microarchitecture”
The Powerful and The Fallen
 Intel Xeon out-of-order engine pipeline
   Picture courtesy of Intel, from “Hyper-Threading Technology Architecture and Microarchitecture”
The Mutualists
 Vector Processing
   The supercomputers of the past
   SIMD type of design
   All elements of a data stream are processed by a single type of
    instruction
   Simplifies hardware design
   Moving toward more “general” purpose vector processing
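The SIMD idea above can be sketched in software: one vector instruction applies the same operation to a whole group of elements, so fewer instructions are issued than in a scalar loop. The names VECTOR_LENGTH and vector_add are hypothetical, and the lane count of 4 is an arbitrary assumption rather than a property of any real machine.

```python
import operator

VECTOR_LENGTH = 4  # hypothetical number of lanes per vector register

def vector_add(a, b):
    """Add two equal-length lists one vector at a time.

    Returns the result and the number of vector instructions issued.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result, issued = [], 0
    for start in range(0, len(a), VECTOR_LENGTH):
        lanes_a = a[start:start + VECTOR_LENGTH]
        lanes_b = b[start:start + VECTOR_LENGTH]
        # One instruction: the same add is applied to every lane.
        result.extend(map(operator.add, lanes_a, lanes_b))
        issued += 1
    return result, issued

data = list(range(10))
total, issued = vector_add(data, [10] * 10)
```

Ten elements take three vector instructions in this model, where a scalar loop would issue ten adds.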
The Mutualists
 The Cell Broadband Engine
   Created by STI (Sony, Toshiba and IBM)
   Composed of nine computing elements: one PPE and eight SPEs
 PPE
   The brain of the system
   Organizer
   Runs Linux
   PowerPC dual-issue architecture
 SPE
   A modified vector architecture
   Limited memory: 256 KiB
   All accesses are to and from this local memory
   Main memory accesses → DMA transfers
 MFC
   Each SPE has an MFC unit
   Issues and receives DMA transfers to and from main memory
   Gatekeeper of the bus
 Bus
   Four rings
   Has QoS in a limited fashion
 Other blocks in the diagram: PPSS, BEI, Flex IO, memory interface (RAM)
 Coherency and consistency are maintained between all memory units (the MFC,
  main memory and PPE caches, but not across the local memories of the SPEs)
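The SPE programming model described above (compute only on a small local memory, reach main memory only through explicit DMA transfers) can be sketched as follows. The class LocalStore and the methods dma_get and dma_put are hypothetical names for this illustration and are not the Cell SDK API; only the 256 KiB size comes from the slide.

```python
LOCAL_STORE_BYTES = 256 * 1024  # the 256 KiB limit quoted on the slide

class LocalStore:
    """Toy SPE local store: code may only touch data that was copied in explicitly."""

    def __init__(self, main_memory):
        self.main = main_memory                  # bytearray standing in for main memory
        self.data = bytearray(LOCAL_STORE_BYTES)

    def _check(self, local_off, main_off, size):
        if local_off < 0 or local_off + size > LOCAL_STORE_BYTES:
            raise ValueError("transfer does not fit in the local store")
        if main_off < 0 or main_off + size > len(self.main):
            raise ValueError("transfer falls outside main memory")

    def dma_get(self, local_off, main_off, size):
        """Copy `size` bytes from main memory into the local store."""
        self._check(local_off, main_off, size)
        self.data[local_off:local_off + size] = self.main[main_off:main_off + size]

    def dma_put(self, local_off, main_off, size):
        """Copy `size` bytes from the local store back to main memory."""
        self._check(local_off, main_off, size)
        self.main[main_off:main_off + size] = self.data[local_off:local_off + size]

main_memory = bytearray(range(16))
spe = LocalStore(main_memory)
spe.dma_get(0, 8, 4)          # pull bytes 8..11 into the local store
for i in range(4):
    spe.data[i] *= 2          # compute on local data only
spe.dma_put(0, 8, 4)          # push the results back to main memory
```

The get, compute, put pattern is the point of the sketch: nothing in the compute loop ever addresses main memory directly.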
The Just Passing
 Cache → “invisible” architecture component
 Not so much in recent years
   PowerPC and other architectures provide instructions to
    control it
   dcbf[e], dcbst[e], dcbz[e], icbi[e], isync
 Instructions are available to touch, zero out, reserve, or
  lock a line in place.
 But for some interesting designs look no further than …
The Just Passing
 Xbox 360 Xenon architecture
   Picture courtesy of IBM, from “Xbox 360 System Architecture”
The Olympic Sprinters
 The Hertz race is over; however …
   Some processors are still at it …
   Power 6 and 7 running at 4 and 5 GHz
   Intel Polaris: 3.6 to 6 GHz
 Many hardware re-designs are in order
   Make pipelines shorter, simpler
   Get rid of “extra” hardware features
The Olympic Sprinters
 Power6
   Running at frequencies from 4 to 5 GHz
   13 FO4 versus 23 FO4 pipeline
   Pictures courtesy of IBM, from “IBM Power6 Microarchitecture”
The Threads’ Commune
 Large shared memory systems are becoming scarce
   Scalability issues due to synchronization
   Contention
   Coherency and Consistency
 Novel Solutions have emerged
   Explicit memory hierarchies with very weak memory models
   Massive Multithreading on chip
   Synchronization in memory
The Threads’ Commune
 Cray XMT
   128 Hardware streams
     A stream is 31 64-bit registers, 8 target registers, and a control register
   Three functional units: M, A and C
   500 MHz
   Full and Empty bits per word (2-bits)
 An example of a very high SMT design
The Threads’ Commune
 SMT / HT designs
   [Figure: issue slots over time for four designs: Super Scalar, Coarse MT, Fine MT, SMT]
   http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm
The Threads’ Commune
 [Figure: programs running in parallel (1 to 4) are broken into concurrent threads of computation (serial code, subproblems A and B, loop iterations i = 0 … n); the threads map onto the 128 hardware streams, some of which stay unused; the streams feed the instruction ready pool, which feeds the pipeline of executing instructions]
 Cray MTA2 picture from John Feo’s “Can programmers and machines ever be friends?”
The Threads’ Commune
 Data Race or Race Condition
   “There is an anomaly of concurrent accesses by two or more
    threads to a shared memory and at least one of the accesses is a
    write”
 Synchronization
   The orchestration of two or more threads (or processes) to
    complete a task in a correct manner and to avoid any data
    races
 Problems
   Separation of lock and guarded data
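A data race is easiest to see as a lost update. The sketch below is a deterministic simulation, not real threads: the function run and the two schedules are hypothetical, and each simulated thread increments a shared counter in three separate steps (load, add, store).

```python
def run(schedule, shared):
    """Execute (thread, step) pairs in order; each thread does load, add 1, store."""
    local = {}
    for thread, step in schedule:
        if step == "load":
            local[thread] = shared["x"]       # read the shared word
        elif step == "add":
            local[thread] += 1                # modify the private copy
        elif step == "store":
            shared["x"] = local[thread]       # write the private copy back
    return shared["x"]

# Thread A finishes before thread B starts: both increments are kept.
serial = [("A", "load"), ("A", "add"), ("A", "store"),
          ("B", "load"), ("B", "add"), ("B", "store")]

# Both threads load before either stores: one increment is lost.
racy = [("A", "load"), ("B", "load"), ("A", "add"),
        ("B", "add"), ("A", "store"), ("B", "store")]

serial_result = run(serial, {"x": 0})
racy_result = run(racy, {"x": 0})
```

With the serial schedule the counter ends at 2. With the interleaved schedule it ends at 1, because B overwrites A's update. Synchronization exists to rule out the second schedule.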
The Threads’ Commune
 Coherency and Consistency
   Cache elements while making sure that everyone sees the latest
    copy
   If an element is written by processor A, how will processors B
    and C know that they have the latest copy?
   Very difficult problem!
   One of the scalability problems of shared memory
The Threads’ Commune
 How does the Cray XMT solve these problems?
   For synchronization: join the lock with each data word and put
    the synchronization requirement on the memory instead of on
    the processor
   For coherence and consistency: DO NOT cache remote data
    (outside the local 8 GiB)
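Joining the lock with the data word can be modeled in software as a word that carries a full/empty flag. This is a sketch only: the class SyncWord and its method names are hypothetical, and the blocking is emulated here with a condition variable, whereas on the XMT the memory system enforces it.

```python
import threading

class SyncWord:
    """One memory word with a full/empty bit; the word itself acts as the lock."""

    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Block until the word is empty, store the value, mark the word full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, take the value, mark the word empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value, self._full = self._value, False
            self._cond.notify_all()
            return value

word = SyncWord()
received = []

def consumer():
    for _ in range(3):
        received.append(word.read_when_full())

worker = threading.Thread(target=consumer)
worker.start()
for item in (10, 20, 30):
    word.write_when_empty(item)   # waits until the previous item was consumed
worker.join()
```

No separate lock object guards the data, which addresses the separation of lock and guarded data listed earlier as a problem.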
Breaking the Despotic Rule of the Lock
 Synchronization
   Atomicity and Serializability
     Locks and Barriers
     Cost ranges from hundreds to tens of thousands of cycles and grows linearly (in the
      best cases) or polynomially (in the worst cases) with the number of
      processors
   The lock
     The most used synchronization primitive!
     Alternatives: lock-free data structures
Breaking the Despotic Rule of the Lock
 Lock Free Data Structures
   Used to implement non-blocking and/or wait-free algorithms
   Prevent deadlocks, livelocks and priority inversions
   Potential problems: the ABA problem
     A comparison tells us that no one is working on this now, but not whether someone has
      changed it before
 Transactional Memory
   Based on transactions (an atomic bundle of operations)
   If two transactions conflict then one is bound to fail
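The ABA problem can be reproduced with a simulated compare-and-swap. The sketch below is single-threaded and hypothetical throughout (the class Cell, its methods and the version counter are illustration only). The version counter is one common mitigation and is an assumption of this example, not something the slides prescribe.

```python
class Cell:
    """A shared location updated with a simulated compare-and-swap."""

    def __init__(self, value):
        self.value = value
        self.version = 0  # bumped on every successful write

    def cas(self, expected, new):
        """Plain CAS: succeeds whenever the current value equals `expected`."""
        if self.value != expected:
            return False
        self.value, self.version = new, self.version + 1
        return True

    def cas_versioned(self, expected, expected_version, new):
        """CAS that also requires that no write happened since the read."""
        if (self.value, self.version) != (expected, expected_version):
            return False
        self.value, self.version = new, self.version + 1
        return True

cell = Cell("A")
seen_value, seen_version = cell.value, cell.version  # thread 1 reads A

cell.cas("A", "B")  # thread 2 changes A to B ...
cell.cas("B", "A")  # ... and back to A

# Thread 1 resumes. The versioned check notices the intervening writes.
versioned_ok = cell.cas_versioned(seen_value, seen_version, "C")
# The plain check is fooled, because the value is A again.
plain_ok = cell.cas(seen_value, "C")
```

The plain compare succeeds even though the location was modified twice in between, which is exactly the history that a value comparison cannot see.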
     Side Note
     A Review of LL and SC
      Load-linked (LL) and store-conditional (SC): instructions found in PowerPC and many other architectures
      Provide a way to optimistically execute a piece of code
      If a “violation” has taken place, discard your results
      Many implementations
        PowerPC: lwarx and stwcx.
     Side Note
     The LL and SC behavior
      The lwarx instruction
        Loads a word from a word-aligned location
        Side effects:
          A reservation is created
          The storage coherence mechanism is notified that a reservation exists
      The stwcx. instruction
        Conditionally stores a word to a given memory location
        Conditionally → depends on the reservation
        On success, all changes are committed to memory
        Otherwise, the changes are discarded
     Side Note
     Reservations
      At most one per processor
      A reservation is lost when
        The processor holding the reservation executes
            A lwarx or ldarx
            A stwcx. or stdcx. (no matter whether the reservation matches or not)
        Another processor executes
            A store or a dcbz to the granule
        Some other mechanism modifies a storage location in the same reservation
          granule
      Interrupts do not clear reservations
         But interrupt handlers might
      Granularity
        The length of the memory block kept under surveillance
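The reservation rules above can be captured in a toy model. Everything here is a simplifying assumption: the class Memory and its method names are hypothetical, the reservation granule is a single address, and a store-conditional to a non-matching address simply fails.

```python
class Memory:
    """Toy model of lwarx / stwcx.-style reservations, at most one per processor."""

    def __init__(self):
        self.words = {}
        self.reservation = {}  # processor id -> reserved address

    def load_reserved(self, cpu, addr):
        # A new load-reserved replaces any reservation the processor already held.
        self.reservation[cpu] = addr
        return self.words.get(addr, 0)

    def store(self, cpu, addr, value):
        # A store by one processor clears other processors' reservations on that address.
        self.words[addr] = value
        for other, reserved in list(self.reservation.items()):
            if other != cpu and reserved == addr:
                del self.reservation[other]

    def store_conditional(self, cpu, addr, value):
        # The reservation is consumed whether or not it matches.
        held = self.reservation.pop(cpu, None)
        if held != addr:
            return False
        self.store(cpu, addr, value)
        return True

mem = Memory()
mem.store("init", "a", 1)

old = mem.load_reserved("P0", "a")
mem.store("P1", "a", 5)                               # P1 interferes
first = mem.store_conditional("P0", "a", old + 1)     # fails: reservation was lost

old = mem.load_reserved("P0", "a")                    # retry with the fresh value
second = mem.store_conditional("P0", "a", old + 1)    # succeeds
third = mem.store_conditional("P0", "a", 99)          # fails: reservation already used
```

The third attempt shows the rule that a store-conditional always clears the processor's own reservation, so every attempt needs its own load-reserved.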


     Side Note
     Examples
      Two processors update the shared word a, each with an LL / SC retry loop
        P1: LL a; a *= 100; SC a; brnz (branch back and retry if the SC failed)
        P2: LL a; a += 100; SC a; brnz (branch back and retry if the SC failed)
      Step by step (reservations are tracked by the storage mechanism)
        Step 1: P1 executes LL a and gets a reservation on a; memory holds a = ?
        Step 2: P2 executes LL a and gets its own reservation on a
        Step 3: a third party stores a = 100; both reservations are lost
        Step 4: both SCs fail, so both processors branch back and retry
        Step 5: P2 executes LL a (reads 100) and holds a reservation again
        Step 6: P1 executes LL a (reads 100); both processors hold reservations
        Step 7: P2’s SC succeeds; memory holds a = 200 and P1’s reservation is lost
        Step 8: P1’s SC fails; P2 is done
        Step 9: P1 retries; LL a reads 200 and re-creates the reservation
        Step 10: P1’s SC succeeds; memory holds a = 20000
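The example can be replayed in a few lines. This is a sketch, not PowerPC code: the class Word and its methods are hypothetical, the starting value 7 is an arbitrary stand-in for the unknown initial value shown as "?", and the order of events is fixed by hand rather than produced by real processors.

```python
class Word:
    """A shared word plus the set of processors holding a reservation on it."""

    def __init__(self, value):
        self.value = value
        self.reserved = set()

    def ll(self, cpu):
        self.reserved.add(cpu)
        return self.value

    def store(self, value):
        # Any store to the word clears every reservation on it.
        self.value = value
        self.reserved.clear()

    def sc(self, cpu, value):
        if cpu not in self.reserved:
            return False
        self.store(value)
        return True

a = Word(7)            # arbitrary stand-in for the unknown initial value
outcomes = []

p1 = a.ll("P1")                          # P1 loads and reserves
p2 = a.ll("P2")                          # P2 loads and reserves
a.store(100)                             # third-party store: both reservations lost
outcomes.append(a.sc("P1", p1 * 100))    # fails
outcomes.append(a.sc("P2", p2 + 100))    # fails
p2 = a.ll("P2")                          # P2 retries and reads 100
p1 = a.ll("P1")                          # P1 retries and reads 100
outcomes.append(a.sc("P2", p2 + 100))    # succeeds: a = 200
outcomes.append(a.sc("P1", p1 * 100))    # fails: P2's store cleared the reservation
p1 = a.ll("P1")                          # P1 retries and reads 200
outcomes.append(a.sc("P1", p1 * 100))    # succeeds: a = 20000
```

Each failed store-conditional discards a result computed from a stale value, which is why the final value reflects both updates applied one after the other.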
    Breaking the Despotic Rule of the Lock
 Sun Rock Processor
    Execute Ahead
    Scouting Threads
    Simultaneous Multithreading
    Transactional Memory
    Checkpoint
    Cache memory with extra bits for
     tracking speculative execution
    32 logical threads and 16 physical cores
    Pictures courtesy of “Rock: A SPARC CMT Processor”
Breaking the Despotic Rule of the Lock
 Take a “RISC”-y Approach
   Small transactions → HW
   Best effort
     Use the checkpoint mechanism!
   Transactions == software construct
     Checkpoint in case of failure
     Commit on successful transaction
     Executed speculatively by a strand
     Uses the cache store buffers and locks cache lines until commit (tracking
      lines with the “s-bits”)
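The commit-or-discard behavior of a best-effort transaction can be sketched in software. This is not Rock's hardware mechanism: the names Store, Aborted and run_transaction are hypothetical, conflicts are detected with per-location version numbers, and the write buffer plays the role of the checkpoint because memory stays untouched until commit.

```python
class Aborted(Exception):
    """Raised when a transaction fails validation and its writes are discarded."""

class Store:
    """Shared memory with a version per location, used to detect conflicts."""

    def __init__(self, **values):
        self.values = dict(values)
        self.versions = {key: 0 for key in values}

    def write(self, key, value):
        self.values[key] = value
        self.versions[key] += 1

def run_transaction(store, body):
    """Run body(read, write) speculatively; commit only if nothing it read has changed."""
    read_versions, write_buffer = {}, {}

    def read(key):
        if key in write_buffer:
            return write_buffer[key]
        read_versions.setdefault(key, store.versions[key])
        return store.values[key]

    def write(key, value):
        write_buffer[key] = value  # buffered, invisible to others until commit

    body(read, write)
    if any(store.versions[k] != v for k, v in read_versions.items()):
        raise Aborted()            # drop the buffer; shared memory was never touched
    for key, value in write_buffer.items():
        store.write(key, value)    # commit

store = Store(x=10, y=0)

def transfer(read, write):
    amount = read("x")
    write("x", 0)
    write("y", read("y") + amount)

run_transaction(store, transfer)
after_commit = dict(store.values)

def conflicted(read, write):
    value = read("y")
    store.write("y", 99)           # simulates another strand committing in between
    write("y", value + 1)

try:
    run_transaction(store, conflicted)
    aborted = False
except Aborted:
    aborted = True
```

A caller would normally retry an aborted transaction or fall back to a lock. That policy is left out of the sketch.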
Multi-core Trends in this Decade
 [Figure: timeline from 2001 to 2011 with a vendor legend (IBM, Intel, AMD, SUN); the chips are listed below in roughly the order they appear on the timeline]
 PowerPC based (IBM and partners)
   Power 4: 64 bit PowerPC, 2 cores
   Power5: 64 bit PowerPC, 2 cores with SMT
   Xenon: 64 bit PowerPC, 3 core chip
   Power 6: 64 bit PowerPC, 2 cores with SMT
   CBE: PowerPC, 9 core chip
   Power7
 Intel
   Pentium D: IA32 x86 2 core chip
   Xeon Dual Core: IA32 x86 2 core chip
   Intel Core Duo: IA32 x86 dual core chip
   Intel Core 2 Duo: IA32 x86 2 core chip
   Intel Core 2 (codenames: Penryn, Wolfdale): IA32 x86 dual and quad core chip
   Codename Nehalem: 1 to 8 core chip
   Codename Sandy Bridge
 AMD
   AMD Opteron (code name: Denmark): IA32 x86 2 core chip
   AMD Turion64 X2: IA32 x86 dual core chip
   AMD (code name: Barcelona): IA32 x86 native 4 core chip
 SUN
   UltraSparc T1 (codename: Niagara): 8 core processor, 32 logical threads
   UltraSparc T2 (codename: Niagara 2): 8 core processor, 64 logical threads
   Codename Rock: 16 core processor, 32 logical threads
Sources
 The Powerful and the Fallen
    Sinharoy, B. et al., “Power5 System Microarchitecture,” IBM Journal of Research and
     Development, Vol 49, July/September 2005
    Marr, D. et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel
     Technology Journal, Vol 6, Issue 1, 2002
 The Mutualists
 The Just Passing
    Andrews, Jeff and Baker, Nick, “Xbox 360 System Architecture,” IEEE Micro, Vol 26,
      Issue 2, March 2006
 The Olympic Sprinters
    Le, H.Q. et al., “IBM Power6 Microarchitecture,” IBM Journal of Research and
      Development, Vol 51, November 2007
 The Threads’ Commune
    Konecny, P., “Introducing the Cray XMT,” May 5th, 2007
    Feo, J., “Can programmers and machines ever be friends?”
 Breaking the Despotic Rule of the Lock
    Chaudhry, S., “Rock: A SPARC CMT Processor,” August 26, 2008

				