                         ICCD 2010
                Amsterdam, the Netherlands


   Improving Cache Performance by
 Combining Cost-Sensitivity and Locality
   Principles in Cache Replacement
               Algorithms

 Rami Sheikh, North Carolina State University
 Mazen Kharbutli, Jordan Univ. of Science and Technology




                                                                1
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          2
 Motivation

 The processor-memory performance gap.
    L2 cache performance is crucial.

 Traditionally, L2 cache replacement algorithms focus on
  improving the hit rate.
    But cache misses have different costs.
    Better to take the cost of a miss into consideration.

 Processor’s ability to (partially) hide the L2 cache miss latency
  differs between misses.
    Depends on: the dependency chain, miss bursts, etc.



                                                                      3
 Motivation

 Issued Instructions per Miss Histogram.




                                            4
 Contributions


 A novel, effective, yet simple cost estimation method.
    Based on the number of instructions a processor manages
     to issue during the miss latency.
    A reflection of the processor’s ability to hide the miss latency.

          Number of issued instructions during the miss:
             Small → high-cost miss/block
             Large → low-cost miss/block




                                                                     5
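The classification above can be sketched as a one-line heuristic. This is an illustrative sketch, not the paper's hardware: the threshold value and function name are assumptions.

```python
# Sketch of the slide's cost heuristic: the number of instructions the
# processor manages to issue while a miss is outstanding is a proxy for
# how well it hides the miss latency. ISSUE_THRESHOLD is an assumed
# cutoff; the slides do not give a concrete value.
ISSUE_THRESHOLD = 64

def miss_cost_class(issued_during_miss: int) -> str:
    """Few issued instructions -> the processor stalled -> high-cost miss."""
    return "low" if issued_during_miss > ISSUE_THRESHOLD else "high"

print(miss_cost_class(5))    # stalled pipeline -> "high"
print(miss_cost_class(200))  # latency mostly hidden -> "low"
```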
 Contributions

 LACS: Locality-Aware Cost-Sensitive Cache Replacement
         Algorithm.
    Integrates our novel cost estimation method with a locality
     algorithm (e.g. LRU).
    Attempts to reserve high-cost blocks in the cache while their
     locality is still high.
    On a cache miss, a low-cost block is chosen for eviction.

 Excellent performance improvement at feasible cost.
    Performance improvement: 15% average and up to 85%.
    Effective in uniprocessors and CMPs.
    Effective for different cache configurations.

                                                                     6
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          7
 Related Work


 Cache replacement algorithms traditionally attempt to reduce
  the cache miss rate.
     Belady’s OPT algorithm [Belady 1966].
     Dead block predictors [Kharbutli 2008, etc.].
     OPT emulators [Rajan 2007].

 Cache misses are not uniform and have different costs
  [Srinivasan 1998, Puzak 2008].
    A new class of replacement algorithms.
    Miss cost can be latency, power consumption, penalty, etc.



                                                                   8
 Related Work

 Jeong and Dubois [1999, 2003, 2006]:
    In the context of CC-NUMA multiprocessors.
    The cost of a miss mapping to remote memory is higher than
     that of a miss mapping to local memory.
    LACS estimates cost based on the processor's ability to tolerate
     the miss latency, not on the miss latency value itself.

 Jeong et al. [2008]:
    In the context of uniprocessors.
    Next access predicted: Load (high cost); Store (low cost).
    All load misses treated equally.
    LACS does not treat load misses equally (different costs).
    A store miss may have a high cost.

                                                                    9
 Related Work


 Srinivasan et al. [2001]:
    Critical blocks preserved in special critical cache.
    Criticality estimated from load’s dependence chain.
    No significant improvement under realistic configurations.
    LACS does not track the dependence chain. Uses a simpler
     cost heuristic.
    LACS achieves considerable performance improvement
     under realistic configurations.




                                                                  10
 Related Work


 Qureshi et al. [2006]:
    Based on Memory-level Parallelism (MLP).
    Cache misses occur in isolation (high cost) or concurrently
     (low cost).
    Suffers from pathological cases. Integrated with a
     tournament predictor to choose between it and LRU (SBAR).
    LACS does not slow down any of the 20 benchmarks in our
     study.
    LACS outperforms MLP-SBAR in our study.




                                                               11
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          12
LACS Storage Organization

 Processor: one 32-bit IIC (counts issued instructions).
 MSHR: one 32-bit IIR per entry.
 Prediction table:
    Each entry: 6-bit hashed tag, 5-bit cost, 1-bit confidence.
    8K sets x 4 ways x 1.5 bytes/entry = 48 KB.
 Total storage overhead ≈ 48 KB:
    9.4% of a 512 KB cache.
    4.7% of a 1 MB cache.

                                                                  13
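The storage figures above can be checked with a few lines of arithmetic; entry layout and table geometry are taken directly from the slide.

```python
# Check of the slide's storage arithmetic. Entry layout:
# 6-bit hashed tag + 5-bit cost + 1-bit confidence = 12 bits = 1.5 bytes.
sets, ways = 8 * 1024, 4
entry_bits = 6 + 5 + 1
table_bytes = sets * ways * entry_bits // 8

print(table_bytes // 1024)                    # 48 (KB)
print(round(table_bytes / (512 * 1024), 3))   # 0.094 -> 9.4% of 512 KB
print(round(table_bytes / (1024 * 1024), 3))  # 0.047 -> 4.7% of 1 MB
```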
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          14
 LACS Implementation

 On an L2 cache miss on block B in set S:
    (1) Copy IIC into IIR.
    (2) Find a victim.
    (3) When the miss returns, update B's info.

 Step (1): record the current IIC in the block's MSHR entry:
           MSHR[B].IIR = IIC

                                                                15
 LACS Implementation

 On an L2 cache miss on block B in set S:
    (1) Copy IIC into IIR.
    (2) Find a victim.
    (3) When the miss returns, update B's info.

 Step (2): find a victim:
        Identify all low-cost blocks in set S.
           If there is at least one, choose a victim
            randomly from among them.
           Otherwise, the LRU block is the victim.
        Block X is a low-cost block if:
           X.cost > threshold, and
           X.conf == 1
          (The cost field stores the issued-instruction count, so a
           large value means a well-hidden, low-cost miss.)

                                                                  16
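Step (2) can be sketched roughly as follows. This is an illustrative model, not the paper's hardware: the `Block` fields, the `COST_THRESHOLD` value, and the `lru_age` bookkeeping are all assumptions.

```python
from dataclasses import dataclass
import random

COST_THRESHOLD = 64   # assumed cutoff; the slides do not give a value

@dataclass
class Block:
    cost: int      # issued-instruction count from the block's last miss
    conf: int      # 1-bit confidence
    lru_age: int   # larger value = less recently used

def find_victim(cache_set):
    """Victim selection as on the slide: evict a random low-cost block
    if one exists, otherwise fall back to the LRU block."""
    low_cost = [b for b in cache_set
                if b.cost > COST_THRESHOLD and b.conf == 1]
    if low_cost:
        return random.choice(low_cost)
    return max(cache_set, key=lambda b: b.lru_age)   # LRU fallback
```

Note the direction of the test: because `cost` stores how many instructions were issued during the miss, a value *above* the threshold marks a well-hidden, low-cost miss.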
 LACS Implementation

 On an L2 cache miss on block B in set S:
    (1) Copy IIC into IIR.
    (2) Find a victim.
    (3) When the miss returns, update B's info.

 Step (3): when the miss returns, calculate B's new cost:
           newCost = IIC - MSHR[B].IIR
        Update B's table info:
           if (newCost ≈ B.cost) B.conf = 1; else B.conf = 0
           B.cost = newCost

                                                                17
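Step (3) can be sketched the same way. The `tolerance` parameter implementing the slide's "≈" comparison is an assumed value, and `SimpleNamespace` merely stands in for a prediction-table entry.

```python
from types import SimpleNamespace

def on_miss_return(block, iic, mshr_iir, tolerance=8):
    """Recompute a block's cost when its miss returns; set the confidence
    bit only if the new cost is close to the previously recorded one."""
    new_cost = iic - mshr_iir   # instructions issued during the miss
    block.conf = 1 if abs(new_cost - block.cost) <= tolerance else 0
    block.cost = new_cost

b = SimpleNamespace(cost=100, conf=0)
on_miss_return(b, iic=1105, mshr_iir=1000)   # new cost 105, within tolerance
print(b.cost, b.conf)                        # 105 1
```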
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          18
 Evaluation Environment

 Evaluation using SESC: a detailed, cycle-accurate, execution-
  driven simulator.

 20 of the 26 SPEC2000 benchmarks are used.
    Reference input sets.
    2 billion instructions simulated after skipping the first 2 billion
     instructions.
 Benchmarks divided into two groups (GrpA, GrpB).
    GrpA: L2 cache performance-constrained - ammp, applu,
     art, equake, gcc, mcf, mgrid, swim, twolf, and vpr.
    GrpB: the remaining ten benchmarks.

 L2 cache: 512 KB, 8-way, WB, LRU.

                                                                      19
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          20
 Evaluation
 Performance Improvement:




 L2 Cache Miss Rates:




                             21
 Evaluation

 Fraction of LRU blocks reserved by LACS that get re-used:
  ammp applu    art   equake   gcc   mcf   mgrid   swim   twolf   vpr
   94%   22%   51%     15%     89%   1%    33%     11%    21%     22%

                Low-cost blocks in the cache: <20%
        OPT evicted blocks that were low-cost: 40% to 98%
  Strong correlation between blocks evicted by OPT and their cost.

 L2 Cache Miss Rates:




                                                                        22
 Evaluation

 Performance improvement in a CMP architecture:




                                                   23
 Evaluation

 Sensitivity to cache parameters:
  Configuration      Minimum         Average   Maximum
  256 KB, 8-way         0%             3%        9%
  512 KB, 8-way         0%            15%       85%
  1 MB, 8-way           -3%            8%       47%
  2 MB, 8-way           -3%           19%       195%
  512 KB, 4-way         0%            12%       69%
  512 KB, 16-way        -1%           17%       101%




                                                         24
Outline

          • Motivation and Contribution
          • Related Work
          • LACS Storage Organization
          • LACS Implementation
          • Evaluation Environment
          • Evaluation
          • Conclusion

                                          25
    Conclusion

   LACS's Key Features:
       Novelty
            New metric for measuring cost-sensitivity.

       Combines Two Principles
            Locality and cost-sensitivity.

       Performance Improvements at Feasible Cost
            15% average speedup in L2 cache performance-constrained
             benchmarks.
            Effective in uniprocessor and CMP architectures.
            Effective for different cache configurations.



                                                                       26
Thank You!

Questions?



              27
