
ECE8833 Polymorphous and Many-Core Computer Architecture

Lecture 6: Fair Caching Mechanisms for CMP

                                       Prof. Hsien-Hsin S. Lee
              School of Electrical and Computer Engineering
     Cache Sharing in CMP [Kim, Chandra, Solihin, PACT’04]

                    [Figure: two processor cores, each with a private L1 cache,
                    sharing a unified L2 cache]
                    [Kim, Chandra, Solihin PACT2004]
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                              2
     Cache Sharing in CMP
                    [Figure: same CMP; thread t1 runs on Processor Core 1]
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                              3
     Cache Sharing in CMP
                    [Figure: same CMP; thread t2 runs on Processor Core 2]

   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                  4
     Cache Sharing in CMP
                    [Figure: same CMP; t1 on Core 1 and t2 on Core 2 now contend
                    for the shared L2]
      t2’s throughput is significantly reduced due to unfair cache sharing.

   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                  5
     Shared L2 Cache Space Contention
       [Two bar charts, x-axis: gzip(alone), gzip+applu, gzip+apsi, gzip+art,
       gzip+swim. Top: gzip's normalized cache misses per instruction (scale
       0 to 10). Bottom: gzip's normalized IPC (scale 0 to 1.2). Co-scheduled
       runs inflate gzip's misses and depress its IPC.]
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                         6
     Impact of Unfair Cache Sharing
           • Uniprocessor scheduling (one thread per time slice):
                     t1 | t2 | t3 | t1 | t4 | ...

           • 2-core CMP scheduling (co-scheduled per time slice):
                     P1:  t1 | t1 | t1 | t1 | t1
                     P2:  t2 | t3 | t2 | t3 | t4

           • gzip will get more time slices than the others if gzip is set to
             run at a higher priority, and yet it could run slower than the
             others (priority inversion)
           • It could further slow down the other processes (starvation)
           • Thus the overall throughput is reduced (instead of the intended
             uniform slowdown)

ECE8833 H.-H. S. Lee 2009                                                              7
     Stack Distance Profiling Algorithm
                            [Figure: a HIT counter (CTR Pos 0..3) attached to
                            each way of the cache, ordered MRU (Pos 0) to LRU
                            (Pos 3)]

                            HIT Counter     Value
                            CTR Pos 0        30
                            CTR Pos 1        20
                            CTR Pos 2        15
                            CTR Pos 3        10
                            Misses           25




  [Qureshi+, MICRO-39]
ECE8833 H.-H. S. Lee 2009                                               8
     Stack Distance Profiling




       [Figure: stack distance profiles (per-way counter values) for the
       benchmarks under discussion, including gzip and art]

       • A counter C_i per cache way, plus a counter C_{>A} for misses
       • Shows the reuse frequency of each stack position in a cache
       • Can be used to predict the misses for any associativity smaller than A
              – Misses for a 2-way cache for gzip = C_{>A} + Σ C_i for i = 3 to 8
       • art does not need all the space, likely due to poor temporal locality
       • If art's share of the space is halved and given to gzip, what happens?
ECE8833 H.-H. S. Lee 2009                                                       9
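
     To make the counting concrete, here is a minimal Python sketch (mine, not
     from the lecture; the trace and names are illustrative) that builds the
     per-position hit counters for one set and predicts the miss count of a
     smaller associativity:

        def stack_distance_profile(trace, assoc):
            stack = []                  # index 0 = MRU, last = LRU
            hit_ctr = [0] * assoc       # hits per recency position
            misses = 0                  # the C_{>A} counter
            for tag in trace:
                if tag in stack:
                    pos = stack.index(tag)
                    hit_ctr[pos] += 1
                    stack.pop(pos)
                else:
                    misses += 1
                    if len(stack) == assoc:
                        stack.pop()     # evict the LRU block
                stack.insert(0, tag)    # (re)insert at MRU
            return hit_ctr, misses

        def predict_misses(hit_ctr, misses, smaller_assoc):
            # Hits beyond position smaller_assoc - 1 would have missed.
            return misses + sum(hit_ctr[smaller_assoc:])

        ctr, miss = stack_distance_profile(['a','b','c','a','d','b','a','c'], 4)
        print(predict_misses(ctr, miss, 2))   # predicted misses with 2 ways

     Because LRU is a stack algorithm, the 2-way prediction matches what an
     actual 2-way LRU set would do on the same trace.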
     Fairness Metrics [Kim et al. PACT’04]
            • Uniform slowdown:

                     T_shared_i / T_alone_i = T_shared_j / T_alone_j

              where T_alone_i is the execution time of thread t_i when it runs
              alone.




   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                      10
     Fairness Metrics [Kim et al. PACT’04]
            • Uniform slowdown:

                     T_shared_i / T_alone_i = T_shared_j / T_alone_j

              where T_shared_i is the execution time of t_i when it shares the
              cache with others.




   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                      11
     Fairness Metrics [Kim et al. PACT’04]
            • Uniform slowdown:

                     T_shared_i / T_alone_i = T_shared_j / T_alone_j

            • We want to minimize:
                   – Ideally:  M0 = Σ_ij |X_i − X_j|,
                     where X_i = T_shared_i / T_alone_i

            Try to equalize the ratio of miss increase of each thread.




   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                              12
     Fairness Metrics [Kim et al. PACT’04]
            • Uniform slowdown:

                     T_shared_i / T_alone_i = T_shared_j / T_alone_j

            • We want to minimize:
                   – Ideally:  M0 = Σ_ij |X_i − X_j|,
                     where X_i = T_shared_i / T_alone_i

                   – M1 = Σ_ij |X_i − X_j|,
                     where X_i = Miss_shared_i / Miss_alone_i

                   – M3 = Σ_ij |X_i − X_j|,
                     where X_i = MissRate_shared_i / MissRate_alone_i

                   – M5 = Σ_ij |X_i − X_j|,
                     where X_i = MissRate_shared_i − MissRate_alone_i



   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                              13
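
     As a concrete illustration (my own sketch with made-up numbers, not data
     from the paper), the metrics differ only in how X_i is defined:

        from itertools import combinations

        def fairness(threads, x):
            # Sum |X_i - X_j| over all thread pairs; smaller is fairer.
            return sum(abs(x(threads[i]) - x(threads[j]))
                       for i, j in combinations(threads, 2))

        stats = {
            't1': dict(t_shared=1.2, t_alone=1.0,       # execution times
                       m_shared=30,  m_alone=25,        # miss counts
                       mr_shared=0.20, mr_alone=0.20),  # miss rates
            't2': dict(t_shared=2.0, t_alone=1.0,
                       m_shared=45,  m_alone=15,
                       mr_shared=0.15, mr_alone=0.05),
        }

        M0 = fairness(stats, lambda s: s['t_shared'] / s['t_alone'])
        M1 = fairness(stats, lambda s: s['m_shared'] / s['m_alone'])
        M3 = fairness(stats, lambda s: s['mr_shared'] / s['mr_alone'])
        M5 = fairness(stats, lambda s: s['mr_shared'] - s['mr_alone'])

     M0 needs T_alone, which is not observable at run time; that is why the
     miss-based proxies M1, M3, and M5 exist.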
     Partitionable Cache Hardware
           • Modified LRU cache replacement policy
                   – G. E. Suh, et al., HPCA 2002
                   – A per-thread counter tracks each thread's current partition
                     (cache space in use), alongside its target partition
                   – Example: current partition P1: 448B / P2: 576B; target
                     partition P1: 384B / P2: 640B. On a P2 miss, P2 is below
                     its target, so P1's LRU line is selected as the victim.




   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                           14
     Partitionable Cache Hardware
           • Modified LRU cache replacement policy
                   – G. E. Suh, et al., HPCA 2002
                   – Continuing the example: after the replacement, the current
                     partition matches the target (P1: 384B / P2: 640B)
                   – Partition granularity could be as coarse as one entire
                     cache way
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                                   15
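
     A minimal Python sketch of this victim-selection idea (my rendering of
     the slides' example, not Suh et al.'s actual hardware; the 64B block size
     and the counter bookkeeping are assumptions):

        def replace_on_miss(stack, current, target, who, blocksize=64):
            # stack: one set's blocks, MRU-first; each block knows its owner.
            # current/target: per-thread partition sizes in bytes.
            if current[who] < target[who]:
                # Under target: evict another thread's LRU block so that
                # 'who' grows toward its target.
                victim = next(b for b in reversed(stack) if b.owner != who)
                current[victim.owner] -= blocksize
                current[who] += blocksize
            else:
                # At or over target: replace this thread's own LRU block.
                victim = next(b for b in reversed(stack) if b.owner == who)
            return victim

     Run on the slides' example (P2 at 576B with a 640B target), a P2 miss
     evicts P1's LRU block and the current partition converges to the target.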
     Dynamic Fair Caching Algorithm
       Example: optimizing the M3 metric. Three sets of counters are kept:

       • MissRate_alone (P1, P2): miss rate of each process when it runs
         alone, obtained in advance from stack distance profiling
       • MissRate_shared (P1, P2): dynamic miss rates measured while running
         with the shared cache
       • Target Partition (P1, P2): the target partition sizes

       Repartitioning interval: 10K accesses was found to be the best.
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                                       16
     Dynamic Fair Caching Algorithm
       1st interval (measure):
       • MissRate_alone   — P1: 20%, P2: 5%
       • MissRate_shared  — measured this interval: P1: 20%, P2: 15%
       • Target Partition — P1: 256KB, P2: 256KB
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                             17
     Dynamic Fair Caching Algorithm
       Repartition! Evaluate M3:
       • X_P1 = 20% / 20% = 1.0
       • X_P2 = 15% / 5%  = 3.0
       P2 suffers the larger relative slowdown, so shift capacity to it:
       • Target Partition — P1: 256KB → 192KB, P2: 256KB → 320KB
         (partition granularity: 64KB)
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                               18
     Dynamic Fair Caching Algorithm
       2nd interval: run with the new targets (P1: 192KB, P2: 320KB) and
       measure MissRate_shared again (previous interval: P1: 20%, P2: 15%).
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                             19
     Dynamic Fair Caching Algorithm
       Repartition! New measurements: P1: 20%, P2: 10%. Evaluate M3:
       • X_P1 = 20% / 20% = 1.0
       • X_P2 = 10% / 5%  = 2.0
       Still unfair to P2, so shift another granule:
       • Target Partition — P1: 192KB → 128KB, P2: 320KB → 384KB
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                               20
     Dynamic Fair Caching Algorithm
       3rd interval: run with P1: 128KB, P2: 384KB. New measurements:
       • MissRate_shared — P1: 20% → 25%, P2: 10% → 9%
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                              21
     Dynamic Fair Caching Algorithm
       Repartition with a rollback check: for the thread whose partition grew
       (P2), compute Δ = MR_old − MR_new = 10% − 9% = 1%. If Δ < T_rollback,
       the extra space did not pay off, so roll the targets back:
       • Target Partition — P1: 128KB → 192KB, P2: 384KB → 320KB

       The best T_rollback threshold was found to be 20%.
   Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009                                                                      22
     Generic Repartitioning Algorithm



       • Pick the process with the largest X_i and the one with the smallest
         as a pair for repartitioning
       • Repeat for all candidate processes

       (A sketch of this loop, with the rollback check, follows.)




ECE8833 H.-H. S. Lee 2009                                                23
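
     A Python sketch of one repartitioning step under M3, with the rollback
     check (a loose rendering of the slides' walkthrough, not the paper's
     exact pseudocode; the slides leave open whether Δ is compared absolutely
     or relative to the old miss rate, so relative is used here):

        GRAN = 64            # partition granularity in KB (from the slides)
        T_ROLLBACK = 0.20    # best rollback threshold found (20%)

        def repartition(target, mr_shared, mr_alone):
            # X_i = MissRate_shared_i / MissRate_alone_i; move one granule
            # from the least-slowed thread to the most-slowed one.
            x = {t: mr_shared[t] / mr_alone[t] for t in target}
            worst, best = max(x, key=x.get), min(x, key=x.get)
            if worst != best and target[best] > GRAN:
                target[worst] += GRAN
                target[best] -= GRAN
            return target, worst          # 'worst' is the thread that grew

        def rollback_check(target, prev_target, mr_old, mr_new, grown):
            # If the thread that received space barely improved, revert.
            delta = mr_old[grown] - mr_new[grown]
            return prev_target if delta < T_ROLLBACK * mr_old[grown] else target

     On the slides' numbers, P2's miss rate falls only from 10% to 9%
     (Δ = 1% < 20% of 10% = 2%), so the last repartitioning step is rolled
     back.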
Utility-Based Cache Partitioning (UCP)
     Running Processes on Dual-Core [Qureshi & Patt, MICRO-39]




                    [Figure: misses vs. number of ways given (1 to 16), one
                    curve each for equake and vpr]
      • LRU: in real runs on avg., 7 ways were allocated to equake and 9 to vpr
      • UTIL
             – How much you use (in a set) is how much you will get
             – Ideally, 3 ways to equake and 13 to vpr
ECE8833 H.-H. S. Lee 2009                                                                25
     Defining Utility

          Utility U(a→b) = Misses with a ways − Misses with b ways

          [Figure: misses per 1000 instructions vs. number of ways from a
          16-way 1MB L2, with curves illustrating low utility, high utility,
          and saturating utility]

   Slide courtesy: Moin Qureshi, MICRO-39
ECE8833 H.-H. S. Lee 2009                                                                          26
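
     In code form (my sketch; the numbers are illustrative, not measured
     data):

        # misses[w] = misses observed with w ways, e.g. MPKI from a UMON curve
        misses = {1: 50, 2: 30, 4: 25, 8: 24, 16: 24}

        def utility(a, b):
            # U(a->b): how many misses are avoided by b ways relative to a
            return misses[a] - misses[b]

        utility(1, 2)    # 20 -> high utility
        utility(8, 16)   # 0  -> saturating utility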
     Framework for UCP

         [Figure: dual-core system; Core1 and Core2 each have private I$ and
         D$, sharing an L2 cache backed by main memory; a UMON per core
         (UMON1, UMON2) feeds the partitioning algorithm (PA)]
        Three components:
         Utility Monitors (UMON) per core
         Partitioning Algorithm (PA)
         Replacement support to enforce partitions
   Slide courtesy: Moin Qureshi, MICRO-39
ECE8833 H.-H. S. Lee 2009                                                  27
     Utility Monitors (UMON)
    For each core, simulate LRU policy using Auxiliary Tag Dir (ATD)
      UMON-global (one way-counter for all sets)
    Hit counters in ATD to count hits per recency position
     LRU is a stack algorithm: hit counts → utility
     E.g., hits(2 ways) = H0 + H1

         [Figure: ATD covering sets A-H, with one hit counter per recency
         position, H0 (MRU) through H15 (LRU)]
ECE8833 H.-H. S. Lee 2009                                               28
     Utility Monitors (UMON)
    Extra tags incur hardware and power overhead
    DSS reduces overhead [Qureshi et al. ISCA’06]




         [Figure: the full ATD mirrors every set (A-H), each feeding the hit
         counters H0 (MRU) through H15 (LRU)]
ECE8833 H.-H. S. Lee 2009                                                     29
     Utility Monitors (UMON)
    Extra tags incur hardware and power overhead
    DSS reduces overhead [Qureshi et al. ISCA’06]
     32 sets are sufficient, based on Chebyshev's inequality
            Sampling every 32nd set ("simple static") is used in the paper
     Storage < 2KB per UMON (or 0.17% of the L2)

         [Figure: with DSS, the UMON keeps ATD entries and hit counters for
         only the sampled sets (e.g., B, E, F) rather than for every set]
ECE8833 H.-H. S. Lee 2009                                                         30
     Partitioning Algorithm (PA)
         Evaluate all possible partitions and select the best

         With a ways to core1 and (16-a) ways to core2:
              Hits_core1 = H0 + H1 + … + H(a−1)      ← from UMON1
              Hits_core2 = H0 + H1 + … + H(16−a−1)   ← from UMON2

         Select the a that maximizes (Hits_core1 + Hits_core2)

         Partitioning done once every 5 million cycles

         After each partitioning interval
            Hit counters in all UMONs are halved
            To retain some past information
ECE8833 H.-H. S. Lee 2009                                           31
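
     A Python sketch of this exhaustive search (assuming hits1 and hits2 hold
     the sixteen per-position counters H0..H15 read from the two UMONs):

        NWAYS = 16

        def best_partition(hits1, hits2):
            # By the stack property, hits with a ways = H0 + ... + H(a-1).
            best_a, best_score = 1, -1
            for a in range(1, NWAYS):         # a ways to core1, rest to core2
                score = sum(hits1[:a]) + sum(hits2[:NWAYS - a])
                if score > best_score:
                    best_a, best_score = a, score
            return best_a, NWAYS - best_a

        def age_counters(hits):
            # After each 5M-cycle interval: halve to retain some history.
            return [h // 2 for h in hits]

     With two cores the search covers only 15 candidate splits, which is why
     evaluating all partitions is affordable here; more cores call for a
     greedy search or the paper's Lookahead variant.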
     Replacement Policy to Reach Desired Partition
      Use way partitioning [Suh+ HPCA’02, Iyer ICS’04]
       • Each line contains core-id bits
       • On a miss, count the ways occupied in the set by the miss-causing app
       • Binary decision for dual-core (in this paper):


                     if (ways_occupied < ways_given):
                         victim is the LRU line from the other app
                     else:
                         victim is the LRU line from the miss-causing app


ECE8833 H.-H. S. Lee 2009                                           32
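
     In Python form (my sketch; each line is assumed to carry its owner's
     core-id, as the slide describes):

        def pick_victim(lines, app, ways_given):
            # lines: one set, ordered MRU-first; app: the miss-causing core.
            ways_occupied = sum(1 for ln in lines if ln.owner == app)
            if ways_occupied < ways_given:
                # Under its partition: evict another core's LRU line.
                candidates = [ln for ln in reversed(lines) if ln.owner != app]
            else:
                # At (or over) its partition: recycle its own LRU line.
                candidates = [ln for ln in reversed(lines) if ln.owner == app]
            return candidates[0] if candidates else lines[-1]  # fallback: LRU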
     UCP Performance (Weighted Speedup)




          UCP improves average weighted speedup by 11% (Dual Core)
ECE8833 H.-H. S. Lee 2009                                            33
     UCP Performance (Throughput)




                        UCP improves average throughput by 17%

ECE8833 H.-H. S. Lee 2009                                        34
Dynamic Insertion Policy
     Conventional LRU

                     [Figure: in conventional LRU, the incoming block is
                     inserted at the MRU position of the recency stack]

   Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                          36
     Conventional LRU

                     [Figure: the block inserted at MRU drifts down the stack,
                     occupying one cache block for a long time with no
                     benefit!]
  Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                              37
     LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]

                     [Figure: under LIP, the incoming block is inserted at the
                     LRU position instead of MRU]

  Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                    38
     LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]

                     [Figure: LRU stack under LIP]

                               Useless block: evicted at the next eviction
                               Useful block:  moved to the MRU position




  Adapted Slide from Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                          39
     LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]

                     [Figure: same as the previous slide]

                                  LIP is not entirely new; Intel tried this in
                                  1998 when designing "Timna" (which integrated
                                  the CPU and a graphics accelerator sharing
                                  the L2)
   Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                                     40
     BIP: Bimodal Insertion Policy [Qureshi et al. ISCA’07]
              LIP by itself may never age out older lines
              BIP: infrequently insert lines in the MRU position
              Let ε = bimodal throttle parameter

              if ( rand() < ε )
                      insert at MRU position;   // as in LRU insertion
              else
                      insert at LRU position;   // as in LIP
              // either way, promote to MRU if reused




ECE8833 H.-H. S. Lee 2009                                                41
     DIP: Dynamic Insertion Policy [Qureshi et al. ISCA’07]

           Two types of workloads: LRU-friendly or BIP-friendly

           [Figure: DIP chooses between BIP and LRU; BIP itself inserts at the
           LRU position (as LIP) with probability 1−ε and at the MRU position
           (as LRU) with probability ε]

           DIP can be implemented by:
           1. Monitor both policies (LRU and BIP)
           2. Choose the best-performing policy
           3. Apply the best policy to the cache

           Need a cost-effective implementation → "Set Dueling"



ECE8833 H.-H. S. Lee 2009                                                        42
     Set Dueling for DIP [Qureshi et al. ISCA’07]
   Divide the cache in three:
         •       Dedicated LRU sets
         •       Dedicated BIP sets
         •       Follower sets (use the winner of LRU vs. BIP)

   n-bit saturating counter:
         •       miss in a dedicated LRU set: counter++
         •       miss in a dedicated BIP set: counter--

   The counter decides the policy for the follower sets:
         •       MSB = 0: use LRU
         •       MSB = 1: use BIP

   monitor → choose → apply (using a single counter)

  Slide Source: Moin Qureshi
ECE8833 H.-H. S. Lee 2009                                                                43
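
   A Python sketch of the mechanism (the 10-bit counter and ε = 1/32 are
   typical values in this line of work, and the set-to-policy mapping below is
   a simplified stand-in for the paper's scheme):

        import random

        PSEL_MAX = (1 << 10) - 1     # 10-bit saturating counter
        psel = PSEL_MAX // 2
        EPS = 1 / 32                 # BIP bimodal throttle

        def set_type(set_idx):
            if set_idx % 32 == 0: return 'LRU'    # dedicated LRU set
            if set_idx % 32 == 1: return 'BIP'    # dedicated BIP set
            return 'FOLLOWER'

        def policy_for(set_idx):
            kind = set_type(set_idx)
            if kind == 'FOLLOWER':
                # MSB = 0 -> LRU wins; MSB = 1 -> BIP wins
                kind = 'LRU' if psel < (PSEL_MAX + 1) // 2 else 'BIP'
            return kind

        def insert_position(set_idx):
            global psel
            kind = set_type(set_idx)
            if kind == 'LRU':                     # miss in an LRU sample set
                psel = min(psel + 1, PSEL_MAX)
            elif kind == 'BIP':                   # miss in a BIP sample set
                psel = max(psel - 1, 0)
            if policy_for(set_idx) == 'LRU':
                return 'MRU'
            return 'MRU' if random.random() < EPS else 'LRU'

   Only the dedicated sets ever update the counter; the follower sets simply
   obey whichever policy is currently winning.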
Promotion/Insertion Pseudo Partitioning
     PIPP [Xie & Loh ISCA’09]
       • What's PIPP?
              – Promotion/Insertion Pseudo Partitioning
              – Achieves both capacity management (as in UCP) and dead-time
                management (as in DIP)
       • Eviction
              – The LRU block is the victim
       • Insertion
              – Insert the core's quota worth of positions away from LRU
       • Promotion
              – On a hit, move toward MRU by only one position

         [Figure: a new block is inserted at position 3 from the LRU end (its
         core's target allocation); a hit promotes a block one position toward
         MRU; the LRU block is the eviction candidate]
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                              45
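
     Pulling the rules together, a Python sketch of PIPP's per-set behavior
     (single-step, always-taken promotion; the actual design also applies
     promotion probabilistically and special-cases streaming applications,
     both omitted here):

        def access(stack, assoc, tag, core, quota):
            # stack: list of (tag, core), index 0 = MRU, last = LRU.
            tags = [t for t, c in stack]
            if tag in tags:
                i = tags.index(tag)
                if i > 0:               # hit: promote by exactly one position
                    stack[i - 1], stack[i] = stack[i], stack[i - 1]
                return 'hit'
            if len(stack) == assoc:
                stack.pop()             # miss: evict the LRU block
            # Insert the block quota[core] positions up from the LRU end.
            pos = max(0, len(stack) - (quota[core] - 1))
            stack.insert(pos, (tag, core))
            return 'miss'

     Replaying the walkthrough on the next slides (an 8-way set with quotas
     {Core0: 5, Core1: 3}) reproduces the states shown there.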
     PIPP Example
           Core0 quota: 5 blocks; Core1 quota: 3 blocks
           (digits = Core0's blocks, letters = Core1's blocks)

           Request: D (Core1). Set content (MRU → LRU):

                  1  A  2  3  4  B  5  C

           Miss: evict the LRU block C and insert D three positions from the
           LRU end (Core1's quota = 3).
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                              46
     PIPP Example
           Core0 quota: 5 blocks; Core1 quota: 3 blocks

           Request: 6 (Core0). Set content (MRU → LRU):

                  1  A  2  3  4  D  B  5

           Miss: evict the LRU block 5 and insert 6 five positions from the
           LRU end (Core0's quota = 5).
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                           47
     PIPP Example
           Core0 quota: 5 blocks; Core1 quota: 3 blocks

           Request: 7 (Core0). Set content (MRU → LRU):

                  1  A  2  6  3  4  D  B

           Miss: evict the LRU block B and insert 7 five positions from the
           LRU end (Core0's quota = 5).
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                           48
     PIPP Example
           Core0 quota: 5 blocks; Core1 quota: 3 blocks

           Request: D (Core1). Set content (MRU → LRU):

                  1  A  2  7  6  3  4  D

           Hit: D is promoted by one position toward MRU (swapping with 4).
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                      49
     How PIPP Does Both Management

                                Core0   Core1   Core2   Core3
                Quota             6       4       4       2

             [Figure: one shared recency stack; cores with smaller quotas
             insert closer to the LRU position, getting both less capacity
             and faster eviction of dead blocks]
    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                     50
     Pseudo Partitioning Benefits
           Core0 quota: 5 blocks; Core1 quota: 3 blocks

           [Figure: under a strict partition, each core has its own isolated
           recency stack (MRU0→LRU0, MRU1→LRU1); a new Core1 block can only
           replace one of Core1's own lines]

    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                       51
     Pseudo Partitioning Benefits
           Core0 quota: 5 blocks; Core1 quota: 3 blocks

           [Figure: under pseudo partitioning the cores share one recency
           stack, so on this insertion Core1 "stole" a line from Core0]

    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                    52
     Single Reuse Block
       Directly to MRU (TADIP):

           [Figure: a block with a single reuse is inserted at LRU and, on its
           one hit, jumps directly to the MRU position]

       Promote by one (PIPP):

           [Figure: on its single reuse, the same block moves up only one
           position, staying near LRU]

    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009               54
     Algorithm Comparison

            Algorithm     Capacity      Dead-time     Note
                          management    management
            -----------------------------------------------------------------
            LRU           no            no            Baseline, no explicit
                                                      management
            UCP           yes           no            Strict partitioning
            DIP / TADIP   no            yes           Insert at LRU and promote
                                                      to MRU on hit
            PIPP          yes           yes           Pseudo-partitioning and
                                                      incremental promotion

    Slide Source: Yuejian Xie
ECE8833 H.-H. S. Lee 2009                                                                55

								