Victim Replication: Maximizing Capacity
while Hiding Wire Delay in Tiled Chip Multiprocessors

Michael Zhang & Krste Asanovic
Computer Architecture Group
MIT CSAIL
Chip Multiprocessors (CMPs) are Here

[Die photos: IBM Power5 with 1.9MB L2, AMD Opteron with 2MB L2, Intel Montecito with 24MB L3]

- Easily utilizes on-chip transistors
- Naturally exploits thread-level parallelism
- Dramatically reduces design complexity

- Future CMPs will have more processor cores
- Future CMPs will have more cache
Current Chip Multiprocessors

[Figure: a 4-node CMP with a large L2 cache; four cores with private L1s connect through an intra-chip switch to an L2 cache divided into slices]

Layout: "Dance-Hall"
- Core + L1 cache
- L2 cache

- Small L1 cache: very low access latency
- Large L2 cache: divided into slices to minimize access latency and power usage
Increasing CMP Cache Capacities Lead to Non-Uniform Cache Access Latency (NUCA)

[Figure: a 4-node CMP with a large L2 cache divided into many slices]

Current: Caches are designed with (long) uniform access latency for the worst case:
- Best Latency == Worst Latency

Future: Must design with non-uniform access latencies depending on the on-die location of the data:
- Best Latency << Worst Latency

Challenge: How to minimize average cache access latency:
- Average Latency → Best Latency
Current Research on NUCAs

[Figure: dance-hall CMP with L2 slices]

Targeting uniprocessor machines

Data Migration: Intelligently place data such that the active working set resides in the cache slices closest to the processor
- D-NUCA [ASPLOS-X, 2002]
- NuRAPID [MICRO-37, 2004]
Data Migration Does Not Work Well with CMPs

[Figure: dance-hall CMP with cores on opposite sides of the shared L2 slices]

Problem: The unique copy of the data cannot be close to all of its sharers

Behavior: Over time, shared data migrates to a location equidistant from all sharers
- Beckmann & Wood [MICRO-37, 2004]
This Talk: Tiled CMPs with Directory-Based Cache Coherence Protocol

[Figure: a tiled CMP; each tile contains a core, L1 cache, switch, and an L2 slice holding data and tags with directory state]

Tiled CMPs for scalability
- Minimal redesign effort
- Use directory-based protocol for scalability

Managing the L2s to minimize the effective access latency
- Keep data close to the requestors
- Keep data on-chip

Two baseline L2 cache designs
- Each tile has its own private L2
- All tiles share a single distributed L2
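The slides assume a directory-based protocol in which each tile tracks coherence state for the blocks it homes, but they do not spell out the directory format. The sketch below is only a rough illustration of what per-block directory state might contain; the field names and the full sharer set are assumptions, not details from this design.

```python
# Illustrative per-block directory state for a tiled, directory-based CMP.
# This is an assumption-driven sketch, not the structure used in the paper.
from dataclasses import dataclass, field
from typing import Optional

NUM_TILES = 8  # e.g., the 4x2 CMP evaluated later in the talk

@dataclass
class DirectoryEntry:
    state: str = "Invalid"            # e.g., Invalid / Shared / Modified (assumed states)
    owner: Optional[int] = None       # tile holding an exclusive copy, if any
    sharers: set = field(default_factory=set)  # tiles holding read copies

    def add_sharer(self, tile_id: int) -> None:
        self.sharers.add(tile_id)

    def has_sharers(self) -> bool:
        return bool(self.sharers) or self.owner is not None
```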
Private L2 Design Provides Low Hit Latency

[Figure: requestor tile and owner/sharer tile, each with a private L2 slice and directory; the home node, statically determined by address, handles misses and off-chip accesses]

The local L2 slice is used as a private L2 cache for the tile
- Shared data is duplicated in the L2 of each sharer
- Coherence must be kept among all sharers at the L2 level

On an L2 miss:
- Data not on-chip: off-chip access through the home node
- Data available in the private L2 cache of another tile: cache-to-cache reply-forwarding
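The miss path on this slide boils down to a two-way decision at the home node's directory. The sketch below is a simplified rendering of that decision; the dictionary-based directory and the way it picks a forwarding tile are illustrative assumptions, not the real protocol machinery.

```python
# Hedged sketch of the private-design L2 miss path: the home node either
# forwards the request to a tile whose private L2 holds the block
# (cache-to-cache reply-forwarding) or issues an off-chip access.
def handle_private_l2_miss(block_addr: int, sharers_of: dict) -> str:
    """sharers_of maps a block address to the set of tiles whose private L2 holds it."""
    holders = sharers_of.get(block_addr, set())
    if holders:
        # Data is available in the private L2 cache of another tile.
        return f"cache-to-cache forward from tile {min(holders)}"
    # Data is not on-chip: off-chip access through the home node.
    return "off-chip access via the home node"

# Example: block 0x40 is held by tile 3's private L2; block 0x80 is not on-chip.
print(handle_private_l2_miss(0x40, {0x40: {3}}))
print(handle_private_l2_miss(0x80, {0x40: {3}}))
```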
Private L2 Design Provides Low Hit Latency

[Figure: 16-tile CMP, each tile with a private L2 slice and directory]

Characteristics:
- Low hit latency to resident L2 data
- Duplication reduces on-chip capacity

Works well for benchmarks with working sets that fit into the local L2 capacity
Shared L2 Design Provides Maximum Capacity

[Figure: requestor tile and owner/sharer tile; the home node, statically determined by address, holds the unique L2 copy and handles off-chip accesses]

All L2 slices on-chip form a distributed shared L2, backing up all L1s
- No duplication, data kept in a unique L2 location
- Coherence must be kept among all sharers at the L1 level

On an L2 miss:
- Data not in L2: off-chip access
- Coherence miss: cache-to-cache reply-forwarding
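In the shared design every block has exactly one on-chip L2 location, in the home tile's slice. The slides only say the home node is statically determined by the address; the low-order block-address interleaving below is one common choice, shown purely as an assumption.

```python
# Hedged sketch of a static address-to-home-tile mapping for the shared design.
BLOCK_BYTES = 64   # assumed cache-line size (not stated on the slides)
NUM_TILES = 8      # the 4x2 CMP used in the evaluation

def home_tile(addr: int) -> int:
    block_addr = addr // BLOCK_BYTES
    return block_addr % NUM_TILES

# An L1 miss from any tile is sent to home_tile(addr); only when the requester
# happens to be the home tile is the L2 hit local, which is why L2 hit latency
# is non-uniform in this design.
print(home_tile(0x1000), home_tile(0x1040))  # consecutive blocks map to different tiles
```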
Shared L2 Design Provides Maximum Capacity

[Figure: 16-tile CMP, each tile holding a slice of the shared L2 and a directory]

Characteristics:
- Maximizes on-chip capacity
- Long/non-uniform latency to L2 data

Works well for benchmarks with larger working sets to minimize expensive off-chip accesses
Victim Replication: A Hybrid Combining the Advantages of Private and Shared Designs

Private design characteristics:
- Low L2 hit latency to resident L2 data
- Reduced L2 capacity

Shared design characteristics:
- Long/non-uniform L2 hit latency
- Maximum L2 capacity

Victim Replication: Provides low hit latency while keeping the working set on-chip
Victim Replication: A Variant of the Shared Design

[Figure: sharer tiles and home node, as in the shared design]

Implementation: Based on the shared design

L1 cache: Replicates shared data locally for fastest access latency

L2 cache: Replicates the L1 capacity victims → Victim Replication
Victim Replication: The Local Tile Replicates the L1 Victim During Eviction

[Figure: sharer tiles and home node]

Replicas: L1 capacity victims stored in the local L2 slice

Why? Reused in the near future with fast access latency

Which way in the target set should hold the replica?
The Replica Should NOT Evict More Useful Cache Blocks from the L2 Cache

[Figure: sharer tiles and home node]

A replica is NOT always made. Blocks are chosen for eviction in this order (see the sketch after this list):
1. Invalid blocks
2. Home blocks w/o sharers
3. Existing replicas
4. Home blocks w/ sharers: never evicted; an actively shared home block is never displaced in favor of a replica
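The sketch below renders this priority order as code: when an L1 victim arrives at the local L2 slice, scan the target set for a block to evict, and skip replication entirely if only actively shared home blocks remain. The per-way representation (dicts with 'valid', 'is_replica', 'sharers') and the choice of the first matching way within a class are illustrative assumptions.

```python
# Hedged sketch of the replica-placement policy described on this slide.
from typing import Optional

def pick_replica_way(target_set: list) -> Optional[int]:
    """target_set is a list of ways; each way is a dict with keys
    'valid', 'is_replica', and 'sharers' (L1 sharer count for home blocks).
    Returns the way index to hold the replica, or None to skip replication."""
    # 1. Invalid blocks
    for i, way in enumerate(target_set):
        if not way["valid"]:
            return i
    # 2. Home blocks without sharers
    for i, way in enumerate(target_set):
        if not way["is_replica"] and way["sharers"] == 0:
            return i
    # 3. Existing replicas
    for i, way in enumerate(target_set):
        if way["is_replica"]:
            return i
    # 4. Only home blocks with sharers remain: do not make a replica.
    return None

# Example: a 4-way set with one existing replica and three shared home blocks.
ways = [{"valid": True, "is_replica": False, "sharers": 2},
        {"valid": True, "is_replica": True,  "sharers": 0},
        {"valid": True, "is_replica": False, "sharers": 1},
        {"valid": True, "is_replica": False, "sharers": 3}]
print(pick_replica_way(ways))  # -> 1 (the existing replica is evicted)
```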
Victim Replication Dynamically Divides the Local L2 Slice into Private & Shared Partitions

[Figure: private design, shared design, and victim replication; in victim replication the local L2 slice holds a shared partition plus a private partition filled with L1 victims]

Victim Replication dynamically creates a large, local private victim cache for the local L1 cache
Experimental Setup

[Figure: 4x2 tiled CMP running applications on Linux 2.4.24, connected to off-chip DRAM]

Processor Model: Bochs
- Full-system x86 emulator running Linux 2.4.24
- 8-way SMP with single in-order issue cores

All latencies normalized to one 24-FO4 clock cycle
- Primary caches reachable in one cycle

Cache/Memory Model
- 4x2 mesh with 3-cycle near-neighbor latency
- L1I$ & L1D$: 16KB each, 16-way, 1-cycle, pseudo-LRU
- L2$: 1MB, 16-way, 6-cycle, random replacement
- Off-chip memory: 256 cycles

Worst-case cross-chip contention-free latency is 30 cycles
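One plausible breakdown of the 30-cycle figure, assuming the worst case is a request/reply round trip between diagonally opposite corner tiles plus one remote L2 access; the slide states only the total, so this decomposition is an assumption.

```python
# Plausible (assumed) breakdown of the 30-cycle worst-case contention-free latency.
hops_per_direction = 3 + 1   # Manhattan distance across a 4x2 mesh, corner to corner
hop_latency = 3              # cycles per near-neighbor link
l2_access = 6                # cycles for an L2 slice lookup

worst_case = 2 * hops_per_direction * hop_latency + l2_access
assert worst_case == 30      # matches the figure quoted on the slide
```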
The Plan for Results

Three configurations evaluated:
1. Private L2 design → L2P
2. Shared L2 design → L2S
3. Victim replication → L2VR

Three suites of workloads used:
1. Multi-threaded workloads
2. Single-threaded workloads
3. Multi-programmed workloads

Results show Victim Replication's performance robustness
Multithreaded Workloads

8 NASA Advanced Parallel Benchmarks:
- Scientific (computational fluid dynamics)
- OpenMP (loop iterations in parallel)
- Fortran: ifort -v8 -O2 -openmp

2 OS benchmarks:
- dbench: (Samba) several clients making file-centric system calls
- apache: web server with several clients (via loopback interface)
- C: gcc 2.96

1 AI benchmark: Cilk checkers
- spawn/sync primitives: dynamic thread creation/scheduling
- Cilk: gcc 2.96, Cilk 5.3.2
Average Access Latency

[Bar chart: average access latency (cycles, 0-5) for the L2P and L2S designs across BT, CG, EP, FT, IS, LU, MG, SP, apache, dbench, and checkers]
Average Access Latency, with Victim Replication

[Bar chart: average access latency (cycles, 0-5) for L2P, L2S, and L2VR across the multithreaded benchmarks]

Ranking of the three designs for each benchmark:

        BT      CG      EP      FT      IS     LU      MG      SP      apache  dbench  checkers
1st     L2VR    L2P     L2VR    L2P     Tied   L2P     L2VR    L2P     L2P     L2P     L2VR
2nd     L2P     L2VR    L2S     L2VR    Tied   L2VR    L2S     L2VR    L2VR    L2VR    L2S
        0.1%    32.0%   18.5%   3.5%           4.5%    17.5%   2.5%    3.6%    2.1%    14.4%
3rd     L2S     L2S     L2P     L2S     Tied   L2S     L2P     L2S     L2S     L2S     L2P
        12.2%   111%    51.6%   21.5%          40.3%   35.0%   22.4%   23.0%   11.5%   29.7%
FT: Private Design is the Best When Working Set Fits in Local L2 Slice

[Charts: average data access latency and access breakdown for L2P, L2S, and L2VR; breakdown categories are hits in L1 (best), hits in local L2 (very good), hits in non-local L2 (O.K.), and off-chip misses (not good)]

- The large capacity of the shared design is not utilized, as the shared and private designs have similar off-chip miss rates
- The short access latency of the private design yields better performance
- Victim replication mimics the private design by creating replicas, with performance within 5%
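These per-benchmark results pair an access breakdown with an average data access latency. The sketch below shows how such an average relates to the breakdown as a weighted sum, using the latencies from the setup slide (L1 = 1 cycle, local L2 = 6, worst-case cross-chip = 30, off-chip = 256). The exact accounting used in the study may differ, and using the worst-case cross-chip figure for every remote L2 hit is an approximation; this is illustrative only.

```python
# Hedged sketch: average data access latency as a weighted sum over where
# accesses are served. Latency values come from the experimental-setup slide;
# the weighting itself is an assumed, simplified model.
def average_access_latency(frac_l1, frac_local_l2, frac_remote_l2, frac_offchip):
    """Fractions must sum to 1; returns the average latency in cycles."""
    latency = {"l1": 1, "local_l2": 6, "remote_l2": 30, "offchip": 256}
    return (frac_l1 * latency["l1"]
            + frac_local_l2 * latency["local_l2"]
            + frac_remote_l2 * latency["remote_l2"]
            + frac_offchip * latency["offchip"])

# Example: even a tiny off-chip fraction dominates the average.
print(average_access_latency(0.97, 0.02, 0.005, 0.005))  # ~2.5 cycles
```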
CG: Large Number of L2 Hits Magnifies Latency Advantage of Private Design

[Charts: average data access latency and access breakdown for L2P, L2S, and L2VR]

- The latency advantage of the private design is magnified by the large number of L1 misses that hit in L2 (>9%)
- Victim replication edges out the shared design by creating replicas, but falls short of the private design
MG: Victim Replication is the Best When Working Set Does Not Fit in Local L2

[Charts: average data access latency and access breakdown for L2P, L2S, and L2VR]

- The capacity advantage of the shared design yields many fewer off-chip misses
- The latency advantage of the private design is offset by costly off-chip accesses
- Victim replication is even better than the shared design, creating replicas to reduce access latency
Checkers: Dynamic Thread Migration Creates Many Cache-to-Cache Transfers

[Charts: average data access latency and access breakdown for L2P, L2S, and L2VR]

- Virtually no off-chip accesses
- Most of the hits in the private design come from more expensive cache-to-cache transfers
- Victim replication is even better than the shared design, creating replicas to reduce access latency
Victim Replication Adapts to the Phases of the Execution

[Plots: percentage of replicas in the L2 caches over time for CG (5.0 billion instructions) and FT (6.6 billion instructions), ranging roughly 0-40%]

Each graph shows the percentage of replicas in the L2 caches, averaged across all 8 caches
Single-Threaded Benchmarks

[Figure: 16-tile shared-L2 CMP hosting a single active thread; the local L2 slice of the hosting tile fills mostly with replicas]

SpecINT2000 benchmarks are used as single-threaded benchmarks
- Intel C compiler version 8.0.055

Victim replication automatically turns the cache hierarchy into three levels with respect to the node hosting the active thread
- Level 1: L1 cache
- "Level 1.5": The local L2 slice acts as a large private victim cache which holds data used by the active thread
- Level 2: All remote L2 slices
Three-Level Caching

[Plots: percentage of replicas in each L2 cache (0-100%) over time for bzip (3.8 billion instructions, thread running on one tile) and mcf (1.7 billion instructions, thread moving between two tiles)]

Each graph shows the percentage of replicas in the L2 caches for each of the 8 caches
Single-Threaded Benchmarks

[Bar chart: average data access latency (cycles, 0-10) for L2P, L2S, and L2VR across the 12 SpecINT2000 benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr]

Victim replication is the best policy in 11 out of 12 benchmarks, with an average saving of 23% over the shared design and 6% over the private design
Multi-Programmed Workloads

[Bar chart: average data access latency (cycles, 0-3) for L2P, L2S, and L2VR across six multi-programmed mixes, MP0-MP5]

Created using SpecINT benchmarks, each mix with 8 different programs chosen at random

1st: Private design, always the best
2nd: Victim replication, performance within 7% of the private design
3rd: Shared design, performance within 27% of the private design
Concluding Remarks

Victim Replication is
- Simple: Requires little modification from a shared L2 design
- Scalable: Scales well to CMPs with a large number of nodes by using a directory-based cache coherence protocol
- Robust: Works well for a wide range of workloads
  1. Single-threaded
  2. Multi-threaded
  3. Multi-programmed

Thank You!
