
Taking Multicore Chip
Multithreading (CMT) to the
next Level of throughput
performance with SMP:

The Victoria Falls
aka UltraSPARC T2 Plus

    Denis Sheahan
    Distinguished Engineer
    Sun Microsystems Inc.

Agenda
     Chip Multi-threaded concepts
     Aim of UltraSPARC T2 Plus
     Hardware design decisions and implementation
     Multi-core SMP OS scaling
     Virtualization and Consolidation
     Scaling Applications

                                                    Page 2
Memory Bottleneck

[Chart, 1980-2005: CPU frequency doubles roughly every 2 years while
DRAM speeds double only every 6 years, producing an ever-growing gap.]
          Source: Sun World Wide Analyst Conference Feb. 25, 2003
CMT Implementation

Four threads share a single pipeline.
Every cpu cycle an instruction from
a different thread is executed.

[Diagram: threads 1-4 each alternate Compute (C) phases with
Memory-latency (M) stalls; interleaving them keeps the shared
Niagara pipeline up to 85% utilized.]

                                                                                        Page 4
Aims of UltraSPARC T2 Plus
     Create an SMP version of CMT to extend the highly
     threaded Niagara design
     Use T2 as the basis for these systems
     Minimal modifications to T2 for shorter time to market
     Create two-way and four-way systems without the need
     for a traditional SMP backplane
     Avoid any hardware bottlenecks to scaling
     High-throughput, low-latency interconnect
     Scale memory bandwidth with nodes
     Scale I/O bandwidth with nodes
     Include hardware features to enable software scaling

                                                         Page 5
UltraSPARC T2:
Basis for T2 Plus

• Up to 8 cores @1.4GHz
• Up to 64 threads per CPU
• Memory
  > Up to 64GB memory with 4GB DIMMs
  > Up to 16 fully buffered DIMMs
  > Memory BW = 60+GB/s
• 8x FPUs, 1 fully pipelined floating point unit/core
• 4MB L2$ (8 banks), 16-way set associative
• Security co-processor per core
  > DES, 3DES, AES, RC4, SHA1, SHA256, MD5, RSA to 2048-bit key
• 2x 10GE Ethernet; x8 PCI-E @2.5GHz
• Power 80W

[Block diagram: 16 FB-DIMMs feed 4 MCUs; 8 L2$ banks connect through
a full crossbar to cores C0-C7, each with its own FPU; a system
interface / buffer switch core provides I/O.]
Victoria Falls aka
Niagara T2 Plus

• Up to 4 sockets with 8 cores @1.4GHz
• N2 core
• Threads
  > Up to 256 threads
  > 64 threads per socket
• Memory
  > Up to 128GB memory
  > Up to 32 fully buffered DIMMs
• Coherence: 4 planes, each 6.4GB/s per direction (12.8GB/s total),
  delivering 51.2GB/s of snoop bandwidth

[Block diagram: per socket, two memory controllers and their coherence
units sit above 8 L2$ banks, a full crossbar and cores C0-C7; the
coherence links and a PCI-E switch connect to the 2nd T2 Plus.]
T2 Plus Hardware design decisions - Silicon layout
      Majority of the T2 Plus layout is the same as T2.
      This includes packaging and pin count, i.e. the processor
      could not be any bigger or have more pins.
      Coherency units require space on silicon, so the space
      used by the 2 x 10Gig Ethernet on T2 was reallocated to
      the coherency units on T2 Plus.
      Also regained some space in the move from 4 to 2
      memory controllers.
      All I/O on T2 Plus is via the x8 PCI-E link per chip.
      Can scale the I/O with processors:
      two-way systems have 2x the I/O, four-way systems
      have 4x the I/O.

                                                               Page 8
T2 Plus Hardware design decisions - Interconnect
       Reduce the Memory Control Units (MCU) from four (T2)
       to two (T2 Plus).
       Each MCU then serves four L2$ banks on the T2 Plus.
       Use half the FBDIMM channels for memory and the
       other half to create 4 bidirectional links for interconnect.
       Physical layer of the interconnect is the same as FBDIMM.
       Address space partitioned into 4 coherence planes
       using physical address bits.
       Reuse the FBDIMM pins for the interconnect.
       Result is a high speed SERDES link.
       Provides high bandwidth and a low latency interconnect.
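The plane partitioning above can be sketched in a few lines. This is a minimal illustration only: the slide does not say which physical address bits select the plane, so the 2 plane-select bits just above a 64-byte cache line are an assumption.

```python
# Sketch: partitioning the physical address space into 4 coherence
# planes. Which address bits the real T2 Plus uses, and the 64B
# cache-line granule, are assumptions for illustration.

PLANE_BITS = 2        # 2 bits -> 4 coherence planes
LINE_SHIFT = 6        # assume selection just above a 64B cache line

def coherence_plane(paddr: int) -> int:
    """Plane (0-3) that carries snoop traffic for this address."""
    return (paddr >> LINE_SHIFT) & ((1 << PLANE_BITS) - 1)

# Consecutive cache lines rotate through the planes, so the four
# links share the snoop load instead of serializing on one link.
lines = [coherence_plane(n << LINE_SHIFT) for n in range(8)]
assert lines == [0, 1, 2, 3, 0, 1, 2, 3]
```

Spreading addresses across planes is what lets the four links sum to the quoted 51.2GB/s of snoop bandwidth rather than bottlenecking on a single ordering point.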

                                                                Page 9
UltraSPARC T2 Plus 2-Socket System

      Dual Channel            Dual Channel                    Dual Channel           Dual Channel
        FBDIMM                  FBDIMM                          FBDIMM                 FBDIMM

       Memory Controller      Memory Controller                 Memory Controller     Memory Controller

  Coherence    Coherence   Coherence   Coherence             Coherence Coherence    Coherence   Coherence
    Unit         Unit        Unit        Unit                  Unit      Unit         Unit        Unit

     Niagara2 Cores, Crossbar, L2$                            Niagara2 Cores, Crossbar, L2$
          (8 cores, 64 threads, 4MB L2$)                          (8 cores, 64 threads, 4MB L2$)

         PCI-Express          NCU, DMU       NCX                 PCI-Express           NCU, DMU     NCX

                           System IO (Network, Disk, etc.)
T2 Plus Hardware design decisions - Memory
     Each T2 Plus processor has its own local memory
     connected to its FBDIMM channels:
        21GB/sec (theoretical peak) read
        10GB/sec (theoretical peak) write
     Remote memory access carries a 76ns penalty,
     about 1.5x the latency of local memory. This makes T2
     Plus systems NUMA, though not highly so.
     Two memory interleaving modes implemented in hardware:
     512-byte interleaving spreads memory accesses evenly
     across all nodes.
     1-Gigabyte interleaving, where the OS can become involved
     in optimal memory placement.
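The trade-off between the two interleave modes can be seen from which socket "homes" a given physical address. A minimal sketch, assuming a 2-socket system; the exact address decoding in the real memory controllers is not described on the slide.

```python
# Sketch: which socket homes a physical address under the two
# T2 Plus interleave modes. Node count and the simple block-modulo
# decoding are illustrative assumptions.

def home_node(paddr: int, granule: int, nodes: int = 2) -> int:
    """Node owning `paddr` when memory is interleaved in blocks of
    `granule` bytes across `nodes` sockets."""
    return (paddr // granule) % nodes

# 512-byte interleave: consecutive 512B blocks alternate nodes, so a
# large sequential access pattern is spread evenly across sockets.
assert [home_node(a, 512) for a in (0, 512, 1024, 1536)] == [0, 1, 0, 1]

# 1GB interleave: a whole gigabyte stays on one node, so the OS can
# place a process's pages next to the cpus that touch them.
GB = 1 << 30
assert home_node(0, GB) == 0 and home_node(GB, GB) == 1
```

Fine interleave averages out the 76ns remote penalty; coarse interleave lets MPO avoid it entirely for well-placed processes.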

                                                        Page 11
T2 Plus Hardware design decisions - I/O
     Each T2 Plus processor has its own x8 PCI-E link.
     Two-way systems have 2 x8 links, four-way systems
     have 4 x8 links.
     T2 implemented PCI-E strict ordering, which guaranteed
     data integrity but can unnecessarily block unrelated
     DMA requests.
     Enhancement for T2 Plus introduced relaxed DMA
     ordering, allowing request reordering and higher
     throughput. Relaxed ordering helps scaling.
     A device interrupt is delivered to a local cpu via a PCI-E
     memory write (MSI or MSI-X) but if necessary can be
     forwarded to a remote cpu for completion via a cross
     call. This adds very little overhead.

                                                              Page 12
T5140 1U
UltraSPARC T2 Platform

• Optimized for maximum throughput per Rack Unit
• Up to 128 threads / 64GB memory in a redundant 1U chassis

Extreme Rackmount Density
• 1RU chassis, 26.5” depth
• Two sockets
  > 6/8 cores @1.4GHz each
  > Up to 128 threads
• 16 memory slots
  > Up to 64GB of memory, FB-DIMM

High Reliability
• Up to 8 hot plug SATA/SAS 2.5” disk drives (RAID 0,1)
• Redundant, hot-swappable PSUs and fans
• 3 PCI-E expansion slots
• 4 10/100/1000 Mbps Ethernet as standard
• 4 USB ports

                                                                      Page 13
T5240 2U
UltraSPARC T2 Platform

• Aim is Data center simplification

Rackmount Density
• 2RU chassis, 26.5” depth
• Two sockets
  > 6/8 cores @1.4GHz
  > Up to 128 threads
• 32 memory slots
  > Up to 128GB of memory, FB-DIMM

High Reliability
• Up to 16 hot plug SATA/SAS 2.5” disk drives (RAID 0,1)
• Redundant, hot-swappable PSUs and fans
• 6 PCI-E expansion slots
• 4 10/100/1000 Mbps Ethernet
• 4 USB ports

                                                                        Page 14
Scaling the OS
     There are a set of common issues when scaling the OS
     Single threaded kernel services
     Scaling contended mutexes
     Optimising memory placement
     Optimal scheduling across cores
     Scaling I/O
     Scaling the Network
     Local vs remote DMA
     Delivering interrupts

                                                        Page 15
Single Threaded kernel services - Tick accounting

       Solaris performs accounting and book-keeping activities
       every clock tick.
       To do this, a cyclic timer is created to fire every clock
       tick.
       Code then goes around all the active CPUs in the
       system to determine if any user thread is running on a
       CPU, and charges it with one tick.
       Measures the number of ticks a user thread is using and
       adjusts the time quantum used by the thread.
       Scheduler dispatching decisions are made using the
       number of ticks.
       LWP interval timers are processed every tick, if they
       have been set.
                                                              Page 16
Tick Accounting (cont.)
      As the number of CPUs increases, the tick accounting
      loop gets larger.
      Tick accounting is a single threaded activity and on a
      busy system with many CPUs the loop can often take
      more than a tick (10ms) to process if the locks it needs
      to acquire are busy.
      This causes the clock to fall behind and timeout
      processing to be delayed. The system becomes less
      responsive.
      Solution is to involve multiple cpus in the tick
      scheduling. Cpus are collected in groups and one is
      chosen to schedule all their ticks.
      An algorithm is used to spread the tick accounting evenly
      across active cpus, so each has to take less than 1% of
      the load.
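The grouping idea above can be sketched as follows. The group size and the rotation of the accounting duty are illustrative assumptions; the slide only says cpus are collected into groups with one member doing the group's tick work.

```python
# Sketch of distributed tick accounting: instead of one cpu walking
# all N cpus every tick, cpus are partitioned into groups and one
# member of each group accounts for its group.

def make_groups(ncpus: int, group_size: int):
    """Partition cpu ids into groups of at most `group_size`."""
    cpus = list(range(ncpus))
    return [cpus[i:i + group_size] for i in range(0, ncpus, group_size)]

def accounting_cpus(groups, tick: int):
    """Pick one cpu per group to do that group's tick work, rotating
    the choice so no single cpu carries the load permanently."""
    return [g[tick % len(g)] for g in groups]

groups = make_groups(ncpus=256, group_size=16)
assert len(groups) == 16                      # 16 groups of 16 cpus
workers = accounting_cpus(groups, tick=0)
assert workers == [g[0] for g in groups]      # one worker per group
# Each worker walks only 16 cpus instead of 256, so the per-tick loop
# finishes well inside the 10ms tick even when locks are briefly busy.
```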
                                                                 Page 17
Scaling contended mutexes
     Mutexes in Solaris are optimised for the non-contended
     case. Contended mutexes are considered rare.
     With up to 256 active threads in a single OS instance,
     calls on contended mutexes can become hotter.
     A side effect can be the thundering herd, where multiple
     callers are woken at the same time and attempt to
     acquire the lock.
     This is quite common in the network stack.
     The Solaris solution is an exponential backoff algorithm.
     This has the advantage of little overhead in the non-
     contended case and performance gains in the
     contended case.

                                                               Page 18
T2 Plus NUMA aware OS
      The OS has been NUMA aware since Solaris 9. Works on
      the concept of a logical group, lgroup, which is a portion
      of lower-latency local memory and its associated cpus.
      This feature is called Memory Placement Optimisation (MPO).
      Processes are assigned to a local lgroup where hot
      areas of their address space such as heap and stack
      are allocated and where they will be scheduled.
      On T2 Plus systems an lgroup spans a processor and
      its local memory. lgrpinfo shows the groups and their
      free memory.
      To implement NUMA aware memory placement all the
      memory is 1GB interleaved.
      MPO is greatly complicated by Virtualization, as the
      Hypervisor needs to be MPO aware when allocating
      memory to a Logical Domain.
                                                              Page 19
Scaling the Network Stack
      The T2 Plus systems have a 10Gig NIC integrated on
      the motherboard.
      We implemented Large Segment Offload (LSO) to
      increase transmit throughput and reduce CPU utilisation.
      Large buffers (up to 64k) are sent to the lower layers,
      where the driver fragments them into packets.
      This avoids fragmenting at the IP layer and copying
      many smaller packets.
      Up to 30% better single thread throughput
      Up to 20% better multi-thread throughput
      10% less cpu consumed by the network stack
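Where the saving comes from can be sketched with the segmentation arithmetic. The 1460-byte MSS is the typical TCP payload on Ethernet, not a T2 Plus specific.

```python
# Sketch of what LSO avoids: without it, the IP layer cuts each send
# into MSS-sized packets, one pass through the stack per packet; with
# LSO a single large buffer goes down once and the driver/NIC segments.

MSS = 1460           # typical TCP payload per Ethernet frame
LSO_MAX = 64 * 1024  # largest buffer handed down in one call

def segment(buf_len: int, mss: int = MSS):
    """Payload sizes the driver would emit for one large buffer."""
    sizes = []
    while buf_len > 0:
        take = min(buf_len, mss)
        sizes.append(take)
        buf_len -= take
    return sizes

pkts = segment(LSO_MAX)
assert sum(pkts) == LSO_MAX
assert len(pkts) == 45                 # 45 wire packets per 64KB send
assert pkts[-1] == LSO_MAX - 44 * MSS  # short final segment
# Without LSO the stack is traversed 45 times for this send; with LSO
# it is traversed once, which is where the cpu saving comes from.
```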
                                                             Page 20
T2 Plus Scheduling Optimizations
 • Solaris Processor Groups
   > Abstraction introduced to capture a group of CPUs with some
     (hardware) sharing relationship
      > int/FP pipelines, caches, chips, MMUs, crypto units, etc.
   > PGs used by dispatcher to implement multi-level CMT load balancing
     and affinity policies
 • T2 Plus
   > Groupings created for int/FP pipelines
   > Balances running threads across both levels
       > 16 - 32 threads => 1 per core i.e. 1 per floating point pipeline
       > 64 – 128 threads => 2 per core i.e. 1 per integer pipeline
 • Aim is to spread the load evenly across all available cpus
 • Logical domains can also use this information for scheduling
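The two-level balancing can be sketched as a breadth-first placement across the hierarchy. The 8-core x 2-integer-pipeline x 4-strand layout matches the T2 Plus core described earlier; the exact placement order used by the Solaris dispatcher is an assumption here.

```python
# Sketch of multi-level CMT load balancing: spread runnable threads
# one-per-core (one per FP pipeline) first, then one per integer
# pipeline, before any pipeline is given a second thread.

CORES, INT_PIPES, STRANDS = 8, 2, 4   # one T2 Plus socket

def place(nthreads: int):
    """Return (core, pipe) slots for nthreads, filled breadth-first."""
    slots = []
    for strand in range(STRANDS):        # innermost sharing level last
        for pipe in range(INT_PIPES):    # then integer pipelines
            for core in range(CORES):    # cores/FP pipelines first
                slots.append((core, pipe))
    return slots[:nthreads]

eight = place(8)
assert len({c for c, _ in eight}) == 8   # 8 threads -> 8 distinct cores
sixteen = place(16)
assert all(sixteen.count((c, p)) == 1
           for c in range(CORES) for p in range(INT_PIPES))
# 16 threads land one per integer pipeline before any pipeline gets two,
# matching the "1 per core" then "2 per core" progression above.
```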

                                                                            Page 21
The Hypervisor: Virtualization for T2 Plus

• “sun4v” architecture

[Diagram: in the old model each Solaris update carried sun4u code tied
to a specific UltraSPARC CPU; under sun4v, Solaris (genunix) targets a
stable hypervisor interface and the SPARC hypervisor abstracts the
underlying SPARC CPU.]
                                                                  Page 22
The Sun4v/HV/LDOMs Model
• Replace HW domains with Logical Domains
  > Highly flexible
• Each Domain runs an independent OS
  > Capability for specialized domains

[Diagram: a Service Domain and Logical Domains 1-3, each running its
own OS with apps or Containers, share the underlying hardware: CPUs,
memory and I/O.]
                                                                          Page 23
Virtualized I/O

[Diagram: an application in Logical Domain A issues I/O through a
virtual device driver; the hypervisor carries the request over a
Domain Channel to the Virtual Device Service in a privileged Service
Domain, whose real device driver (/pci@B/qlc@6) reaches the hardware
via the /pci@B nexus driver, the virtual nexus interface, the I/O MMU
and the PCI root bridge.]
                                                                                                   Page 24
T2 Plus - Scaling Applications
      Scaling and throughput are the most important goals for
      performance on highly threaded systems.
      High utilisation of the pipelines is key to performance.
      Need many software threads or processes to utilise the
      hardware threads.
      Threads have to be actually working.
      Multiple instances of an application may be necessary
      to achieve scaling.
      A single thread can become the bottleneck for the whole
      application.
      Spinning in a tight loop waiting for a hot lock is not good
      for CMT.
      Critical sections can also reduce scaling.

                                                               Page 25
      Solaris maintains a count of ticks in a kstat which is
      updated on entry and exit of the idle, user and kernel
      states. On traditional processors these kstats indicate
      the percentage of time spent in each state.
      T2 Plus utilises the pipeline completely differently: 4
      threads share each of the two integer pipelines per core
      and all 8 share the FP pipeline.
      If a thread stalls it is parked by the hardware and its
      cycles given to the other 3 threads until the stall
      completes. A thread is also parked on entering the idle
      loop.
      From a kstat perspective Solaris believes it is running
      on 4 full processors. Depending on the amount of stall in
      the threads, the actual distribution of cpu cycles can be
      radically different from the Solaris view.

                                                                 Page 26
Utilization and Corestat
• In order to determine the real pipeline utilization we use low
  level hardware counters to count the number of instructions per
  second per pipeline – both integer and floating point
• We have written a tool called corestat which collects data from
  the low level performance counters and aggregates it to
  present utilization percentages
• Available from http://cooltools.sunsource.net/corestat/
• Corestat reports:
             Individual integer pipeline utilization (default mode)
             Floating Point Unit utilization (-g flag)
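The aggregation corestat performs reduces to a simple ratio: each integer pipeline can complete at most one instruction per cycle, so utilization is instructions retired over cycles elapsed. A sketch under that assumption; actually reading the counters on Solaris (via cpustat/libcpc) is outside this snippet, and the 1.4GHz clock matches the parts discussed above.

```python
# Sketch of a corestat-style utilization figure derived from a
# per-pipeline instruction counter sampled over an interval.

CLOCK_HZ = 1_400_000_000   # 1.4GHz core clock

def pipeline_util(instr_count: int, interval_s: float) -> float:
    """% of issue slots used by one integer pipeline in the interval."""
    max_instr = CLOCK_HZ * interval_s   # 1 instruction/cycle ceiling
    return 100.0 * instr_count / max_instr

# A pipeline retiring 66.4M instructions in a 1-second sample is ~4.7%
# utilized - the order of magnitude of the mostly idle rows below.
assert round(pipeline_util(66_400_000, 1.0), 1) == 4.7
```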

                                                             Page 27
Corestat: Example output

# corestat
  Core Utilization for Integer pipeline

Core,Int-pipe    %Usr    %Sys   %Usr+Sys
-------------    -----   -----  --------
    0,0           1.74    3.00      4.74
    0,1           1.56    2.70      4.25
    1,0           1.69    2.90      4.59
    1,1           1.55    2.69      4.25
    2,0           1.69    2.90      4.59
    2,1           1.56    2.69      4.24
    3,0          28.92    2.10     31.03
    3,1           1.56    2.68      4.24
    4,0           1.68    2.95      4.63
    4,1           1.56    2.69      4.24
    5,0           1.69    2.91      4.60
    5,1           1.56    2.69      4.24
    6,0           1.69    2.92      4.61
    6,1          31.77    1.89     33.66
-------------    -----   -----  --------
   Avg            5.73    2.69      8.42

                                                Page 28
Application Locking issues
      As in the OS the most common reason Applications fail to Scale on
      T2 Plus is hot locks
      We have found this especially true when migrating from 2-4 way
      systems. Systems with 2-4 cpus/cores tend to serialize access to
      hot locks and structures
      On T2 Plus locks tend to reside in the L2 cache. Cache to cache
      transfers are much lower latency and lock acquire time can be a lot
      less. More threads can be waiting on the same lock
      When developing locking code, remember that the CMT
      architecture is different:
      Use a low-impact, long-latency opcode to add delay in the
      busy-wait loop. The low-impact opcode frees up cycles so
      that other threads sharing the core get more useful work done.
      Use randomized exponential backoff between attempts to
      acquire the lock. Without backoff, multiple threads concurrently
      attempt atomic operations on the same address.
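The randomized exponential backoff described above can be sketched as follows. In real T2 Plus code the delay would be the low-impact long-latency opcode so co-resident strands keep the pipeline; `time.sleep` stands in for it here, and the cap and base-delay values are illustrative.

```python
# Sketch: acquire a lock by spinning with randomized exponential
# backoff instead of hammering the same address with atomics.
import random
import time

def acquire(try_lock, max_exp: int = 10, base_s: float = 1e-6):
    """Spin on try_lock() with randomized exponential backoff.
    Returns the number of backoffs taken before success."""
    attempt = 0
    while not try_lock():
        # Random delay in [0, 2^attempt) base units: randomization
        # keeps a thundering herd from retrying in lockstep on the
        # same cache line; the growth backs threads off a hot lock.
        exp = min(attempt, max_exp)
        time.sleep(base_s * random.randrange(1 << exp))
        attempt += 1
    return attempt

# Uncontended case: first try succeeds, no delay is ever paid.
assert acquire(lambda: True) == 0

# Contended case: lock frees on the 4th attempt; 3 backoffs occurred.
state = iter([False, False, False, True])
assert acquire(lambda: next(state), base_s=0.0) == 3
```

The uncontended fast path costs nothing extra, matching the Solaris design point that contended mutexes are the rare case.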
                                                                           Page 29
Memory Bandwidth - T2 Plus STREAM Results

      vcpus   Copy    Scale    Add     Triad     (STREAM results, MB/s)
       128    27950   23212   28901   29248
       112    30057   24989   31480   31108
        96    28073   23546   29797   29766
        80    28218   23371   29511   29548
        64    28088   23523   29650   29555
        48    28033   23410   29495   29246
        32    27975   23041   26444   25751
        16    25000   17500   14447   14113

                                               Page 30
SPECint_rate2006: UltraSPARC T2 Plus
• 2x scaling from T2 to 2 x T2 Plus


 System       Peak   Base   Configuration
 Sun T5140    157    141    2 x UltraSPARC T2 Plus       1.4GHz
 Sun T5120    78     72     1 x UltraSPARC T2            1.4GHz
 Sun T5120    68     63     1 x UltraSPARC T2            1.2GHz
 IBM p570     122    108    2 x Power 6                  4.7 GHz
 SuperMicro   147    121    2 x Xeon X5482               3.2GHz

                                      *See Slide “Benchmark Disclosure”
                                                                          Page 31
SPECfp_rate2006: UltraSPARC T2 Plus
 • 1.9x scaling from T2 to 2x T2 Plus


 System       Peak   Base   Configuration
 Sun T5140    119    111    2 x UltraSPARC T2 Plus        1.4GHz
 Sun T5120    62     57     1 x UltraSPARC T2             1.4GHz
 IBM p570     116    98     2 x Power 6                   4.7 GHz
 Dell T7400   81     76     2 x Xeon X5482                3.2GHz

                                      *See Slide “Benchmark Disclosure”
                                                                          Page 32
SpecJBB performance 16 JVM
1.94x scaling from T2 to 2xT2 Plus
T5140 run with 16 JVMs

System             Metric   Way/GHz/cpu           #cores
Sun Fire T5140     373k     2 x 1.4 Sun T2 Plus   16
Sun Fire T5120     192k     1 x 1.4 Sun T2        8
IBM System x3650   323k     2 x 3.16GHz Xeon      8
IBM p6 570         346k     4 x 4.7GHz Power 6    8

                                         *See Slide “Benchmark Disclosure”
                                                                             Page 33
Java on UltraSPARC T2 Plus
SPECjbb2005 Multi-JVM Results

[Chart: SPECjbb2005 results for IBM p 570, IBM System 3650,
HP DL360 G5, Dell PE2950 III and Sun SE T5240.]

                                          *See Slide “Benchmark Disclosure”        Page 34
SPECjAppServer2004: UltraSPARC T2 Plus
• 1.7x scaling from T2 to 2 x T2 Plus

System           Metric   Appserver   Way/GHz/cpu            #cores
Sun Fire T5140   3331     Oracle      2 x 1.4 Sun T2 Plus    16
Sun Fire T5120   2000     Oracle      1 x 1.4 Sun T2         8
IBM p570         1197     WS 6.1      2 x 4.7GHz Power 6     4
Inspur NF280D    1538     BEA 10      2 x 2.6GHz Xeon 5355   8

                                     *See Slide “Benchmark Disclosure”
                                                                          Page 35
• Run in house on a T5240
• Discover software for law firms
• Pure 64-bit Java application
• Ingesting a huge amount of data and running a
  compute intensive learning algorithm
• Traditional solution: x86 clusters
• Went from 3GB of data an hour on 2-way dual core
  x86 servers to 50GB an hour on the T5240

                                                     Page 36
Conclusions
     The implementation of T2 Plus enables highly threaded
     two-way and four-way servers.
     The high-bandwidth, low-latency interconnect is
     essential to achieve good scalability.
     Scalable memory bandwidth and physical I/O are also
     required.
     An OS running on a highly threaded processor must
     also be scalable and aware of the underlying hardware
     implementation. It must also be virtualizable.
     High throughput and utilisation of the per-core pipelines
     is key to application scaling.
     Requires scalable locking.
     May require multiple instances.

                                                             Page 37
More Info
• Start with our Engineering blogs
• http://blogs.sun.
• All benchmarks are posted here:
• http://www.sun.

                                                 Page 38
Benchmark Disclosure
SPEC, SPECjAppServer are registered trademarks of Standard Performance Evaluation Corporation. All results from www.spec.org as of 04/01/08.
SPECjAppServer2004 Sun SPARC Enterprise T5240 (16 cores, 2 chip) 3,331 JOPS@Standard. SPECjAppServer2004; IBM p570 (4 cores, 2 chips)
1,197.51 JOPS@Standard. SPECjAppServer2004; Inspur NF280D (8 cores, 2 chips) 1,538.65 JOPS@Standard.SPECjAppServer2004.
SPECjAppServer2004 Sun SPARC Enterprise T5220 (8 cores, 1 chip) 2000 JOPS@Standard. SPECjAppServer2004

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Sun SPARC Enterprise T5120 results submitted to
SPEC. Other results as of 09/28/07 on www.spec.org. T5240 results submitted to SPEC for review.
Other results as of 04/09/2008 on www.spec.org. Sun SPARC Enterprise T5240 (2 chip, 16 cores, 128 threads) 373,405
SPECjbb2005 bops, 23,338 SPECjbb2005 bops/JVM. IBM p570 (4 chips, 8 cores, 16 threads) SPECjbb2005 bops = 346,742,
SPECjbb2005 bops/JVM = 86,686. IBM System x3650 (2 chips, 8 cores, 8 threads) SPECjbb2005 bops = 323,172, SPECjbb2005
bops/JVM = 80,793. HP ProLiant DL360 G5 (2 chips, 8 cores, 8 threads) SPECjbb2005 bops = 301,321, SPECjbb2005 bops/JVM
= 75330. Dell PowerEdge 2950 III (2 chips, 8 cores, 8 threads) SPECjbb2005 bops = 305,411, SPECjbb2005 bops/JVM = 76,353

SPEC, SPECint reg tm of Standard Performance Evaluation Corporation. Sun result submitted to SPEC, other results from
www.spec.org as of 4/7/08. Sun SPARC Enterprise T5240 (UltraSPARC T2 Plus, 2 chips, 16 cores), 157 SPECint_rate2006, 142
SPECint_rate_base2006; IBM p 570 (POWER6, 2 chips, 4 cores), 122 SPECint_rate2006; Sun SPARC Enterprise T5220
(UltraSPARC T2, 1 chip, 8 cores), 83.2 SPECint_rate2006. Supermicro (Xeon X548, 2 chips, 8 cores), 147 SPECint_rate2006

SPEC, SPECfp reg tm of Standard Performance Evaluation Corporation. Sun result submitted to SPEC, other results from
www.spec.org as of 4/7/08. Sun SPARC Enterprise T5240 (UltraSPARC T2 Plus, 2 chips, 16 cores), 119 SPECfp_rate2006, 111
SPECfp_rate_base2006; IBM p 570 (POWER6, 2 chips, 4 cores), 116 SPECfp_rate2006; Sun SPARC Enterprise T5220
(UltraSPARC T2, 1 chip, 8 cores), 62.3 SPECfp_rate2006; Dell T7400 (Xeon X5482, 2 chips, 8 cores), 81 SPECfp_rate2006
SPEC, SPECjAppServer, SPECint, SPECfp are registered trademarks of Standard Performance Evaluation Corporation.

                                                                                                                                      Page 39
