Cache Memory Principles

Week 9.  7 Large and Fast: Exploiting Memory Hierarchy
  7.1 Introduction
  7.2 The Basics of Caches
  7.3 Measuring and Improving Cache Performance
  7.4 Virtual Memory
  7.5 A Common Framework for Memory Hierarchies

Week 10. 7 Large and Fast: Exploiting Memory Hierarchy
  7.6 Real Stuff: The Pentium Pro and PowerPC 604 Memory Hierarchies
  7.7 Fallacies and Pitfalls
  7.8 Concluding Remarks
  7.9 Historical Perspective and Further Reading
  7.10 Key Terms

Principle -- Locality

Consider the loop:

        R1 <- looplength
        R2 <- 0
        R4 <- addr(b)        % load
        R5 <- addr(c)        % address
        R6 <- addr(a)        % pointers
  loop: R3 <- mem(R4+R2)     % load b(i)
        R7 <- mem(R5+R2)     % load c(i)
        R7 <- R7 + R3        % b(i) + c(i)
        R1 <- R1 - 1         % decrement counter
        mem(R6+R2) <- R7     % store a(i)
        R2 <- R2 + 8         % bump index
        BNEZ R1, loop        % close loop

After a word is accessed, the following word is accessed soon => Spatial Locality
After an instruction is accessed, the same instruction is accessed soon => Temporal Locality

4/29/02          CP2005 Week 9 Cache plus revision   1
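The pseudo-assembly above computes a(i) = b(i) + c(i) over an array. A minimal high-level sketch of the same loop (the function name is mine, not from the slides):

```python
def vector_add(b, c):
    """a(i) = b(i) + c(i) -- the computation in the pseudo-assembly loop."""
    a = [0] * len(b)
    for i in range(len(b)):    # R1 counts iterations down; R2 steps the byte index
        a[i] = b[i] + c[i]     # loads of b(i) and c(i), then a store of a(i)
    return a
```

The sequential element accesses are what give the loop its spatial locality; re-executing the same loop body each iteration gives temporal locality.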




Cache Memory Principles

• Combine memory size/performance with the locality principle
   – Place a small fast cache memory in the processor
   – Cache recently accessed data (temporal locality)
   – Cache data close to recently accessed data (spatial locality)

Cache Memory Principles

• Net effect:
   – speed of the fast (cache) memory combined with
   – size of the large (main) memory
• The L1 cache is designed as an integral part of the processor
• Caches use a form of prediction




Exploit locality via a memory hierarchy

Terminology

• Block, Line = minimum unit that may be present, usually fixed length
• Hit = block is found in cache
• Miss = block not found in cache
• Victim = line replaced on a miss
• Miss rate = fraction of accesses that miss
• Hit rate = 1 - miss rate
• Hit time = time to access the cache
• Miss penalty = time to deliver the new block to the processor




Cache Contents: What are we storing?

• Separate Instruction & Data caches ("Split", "Harvard") -- common today
   – 2 times the bandwidth
   – Place each closer to the Instruction and Data ports
   – Can customize for each application
   – Imposed associativity
   – No interlocks on simultaneous requests
   – Self-modifying code can cause problems
• Both Instructions and Data in one cache ("Unified")
   – Less costly
   – Handles writes into the instruction stream

Data Address is used to organise cache storage strategy

• Word is organised by byte bits
• Block is organised by bits denoting the word
• Location in cache is indexed by row
• Tag is the identification of a block in a cache row

Word address bit fields:  Tag | Index | Block | Byte
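A minimal sketch of carving an address into the tag/index/offset fields described above. The function name and the widths used in the example are assumptions for illustration, not values from the slides:

```python
def split_address(addr, index_bits, offset_bits):
    """Return (tag, index, byte_offset) for a byte address in a direct mapped cache."""
    offset = addr & ((1 << offset_bits) - 1)                 # lowest bits: byte in block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits: cache row
    tag = addr >> (offset_bits + index_bits)                 # remaining bits: tag
    return tag, index, offset
```

For example, with a 10-bit index and 2-bit byte offset, `split_address(0x1234, 10, 2)` yields tag 1, index 141, offset 0.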




Miss Rate Performance Metric

• Fraction of memory accesses that miss in the cache
  [Miss rate generally preferred over Hit rate]
• Misses affect performance more than hits
   – HIT = 1 cycle
   – MISS = 10 to 100 or more cycles
   – HR = .96 --> .94 looks like a slight difference
   – but MR = .04 --> .06 is 50% worse
• Miss rate by itself is only an indirect indicator of performance
• Time is the direct measure
   – the effect of miss rate on performance can be very complicated,
     e.g. superscalar out-of-order execution, overlapped misses

Performance: Bandwidth

• Memory traffic (bandwidth) is also important
   – bandwidth -- words per cycle
   – on a miss, a full line is moved
• Memory traffic is especially important in multiprocessor systems
   – e.g. with a shared bus
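The slide's hit-rate vs miss-rate point, worked out in numbers: a drop from HR = 0.96 to 0.94 is about a 2% change in hit rate, but a 50% growth in the miss rate, which is what actually costs cycles.

```python
hr_before, hr_after = 0.96, 0.94
mr_before, mr_after = 1 - hr_before, 1 - hr_after

hit_rate_drop = (hr_before - hr_after) / hr_before       # ~0.02, a "slight" change
miss_rate_growth = (mr_after - mr_before) / mr_before    # 0.50, i.e. 50% worse
```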




Cache Organizations: Fully Associative

• Line goes in any line frame
   – Tag compares typically done via content addressable memory (CAM)

Block Replacement: On a miss, which line (victim) should be replaced?

• Least-recently used (LRU)
   – Replace the block unused for the longest time
   – Optimizes based on temporal locality
   – Relatively complicated LRU state for large sets
• Random
   – Select the victim at random
   – Pseudo-random for hardware testing
• Not most recently used (NMRU)
   – Keep track of the MRU line
   – Randomly select from among the others
   – A good compromise




Block Replacement (continued)

• First-in First-out (FIFO)
   – Replace the block loaded first
   – Simple, but not as good as LRU
• Optimal (Belady's algorithm)
   – Replace the block used furthest in the future
   – Requires knowledge of the future
   – Useful only in performance studies, to establish limits

Cache Organizations: Direct Mapped

• Line goes in exactly one line frame
• Both tags and data in SRAM
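A minimal sketch comparing LRU and FIFO replacement on a single fully associative set of 3 lines. The capacity and the reference stream are made-up illustration values; on this trace LRU beats FIFO by one miss.

```python
from collections import OrderedDict, deque

def misses_lru(refs, capacity):
    cache = OrderedDict()
    misses = 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)           # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)      # evict the least recently used line
            cache[block] = True
    return misses

def misses_fifo(refs, capacity):
    cache, order, misses = set(), deque(), 0
    for block in refs:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.remove(order.popleft())  # evict the line loaded first
            cache.add(block)
            order.append(block)
    return misses

refs = [1, 2, 3, 1, 4, 1, 2]                   # block 1 is reused: LRU keeps it, FIFO evicts it
```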




Cache Organizations: Set Associative

• Cache organized as sets of lines
• Line goes in exactly one set
• Implemented with SRAM
• Example: 2-way set associative
• This is like many copies of a direct mapped cache

Direct mapped

Example: 24-bit address with 8-byte blocks and 2048 blocks in a cache of 16384 bytes
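Working out the field widths for the slide's example (24-bit address, 8-byte blocks, 2048 blocks, 16384-byte cache):

```python
from math import log2

addr_bits = 24
block_bytes = 8
num_blocks = 2048

offset_bits = int(log2(block_bytes))             # 3 bits select a byte in the block
index_bits = int(log2(num_blocks))               # 11 bits select a cache row
tag_bits = addr_bits - index_bits - offset_bits  # 10 bits identify the block

assert block_bytes * num_blocks == 16384         # cache size checks out
```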




Bit fields for a 4-byte word in a 32-bit address with 2^b words per block

  Field         Address Bits   Usage
  Word field    0 : 1          address bits within the word being accessed
  Block field   2 : 2+b-1      identifies the word within the block;
                               the field may be empty (b = 0)
  Set field     no bits
  Tag field     2+b : 31       tag field (unique identifier for the
                               block on its row)

Example of direct mapped cache

• Example shows address entries that map to the same location in cache,
  for one byte per word, one word per block, one block per row.

[Figure: an 8-entry cache indexed 000-111; memory addresses 00001, 00101,
01001, 01101, 10001, 10101, 11001, 11101 are mapped to a cache row by
address modulo 8. Word address bit fields: Tag | Index | Block | Byte.]




Contents of a direct mapped cache

• Data  == the cached block
• TAG   == the most significant bits of the cached block's address; they
  identify the block in that cache row, distinguishing it from the other
  blocks that map to the same row
• VALID == a flag bit to indicate that the cache content is valid

Separate the address into fields:
• Byte offset in the word
• Index for the row of the cache
• Tag identifier of the block

A cache of 2^n words, a block being a 4-byte word, has 2^n * (63-n) bits
for a 32-bit address:
  #rows     = 2^n
  #bits/row = 32 + (32 - 2 - n) + 1 = 63 - n

[Figure: direct mapped cache with 1024 entries. The 32-bit address is
split into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2) and a
2-bit Byte offset (bits 1-0). Each row holds a Valid bit, a 20-bit Tag
and 32 bits of Data; Hit is asserted when the indexed row is valid and
its stored tag matches the address tag.]
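The 2^n * (63-n) bit count above, checked directly (the function name is mine):

```python
def cache_bits(n):
    """Total storage bits for a direct mapped cache of 2^n one-word (4-byte)
    blocks with a 32-bit address: data + tag + valid bit per row."""
    rows = 2 ** n
    tag_bits = 32 - 2 - n                 # address minus 2 offset bits, n index bits
    bits_per_row = 32 + tag_bits + 1      # 32 data bits + tag + valid = 63 - n
    return rows * bits_per_row

# e.g. n = 10 (a 4 KB data cache) needs 2^10 * 53 = 54272 bits
```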




Reading: Hits and Misses

• A hit requires no special handling: the data is available
• Instruction fetch cache miss:
   – Stall the pipeline, apply the PC to memory and fetch the block.
     Re-fetch the instruction when the miss has been serviced
   – Same for a data fetch

Multi-word Blocks

[Figure: direct mapped cache with 4K entries and four-word (128-bit)
blocks. The 32-bit address is split into a 16-bit Tag (bits 31-16), a
12-bit Index (bits 15-4), a 2-bit Block offset (bits 3-2) and a 2-bit
Byte offset (bits 1-0). Each row holds a Valid bit, a 16-bit Tag and
128 bits of Data; a multiplexer uses the block offset to select one
32-bit word from the block.]




Block Size

• Block (line) size is the data size that is both
   – associated with an address tag, and
   – transferred to/from memory
• Small blocks
   – poor spatial locality
   – higher address tag overhead
• Large blocks
   – unused data may be transferred
   – useful data may be prematurely replaced: "cache pollution"

Miss Rates vs Block Size

[Figure (Ref: Fig 7.12): miss rate (0-40%) versus block size (4 to 256
bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB. Miss
rate falls as block size grows, but rises again for large blocks in the
smaller caches.]




Block Size Tradeoff

• In general, larger block sizes take advantage of spatial locality, BUT:
   – Larger block size means larger miss penalty:
      • it takes longer to fill the block
   – If the block size is too big relative to the cache size, the miss
     rate will go up:
      • too few cache blocks
• In general, Average Access Time
     = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three sketches against block size. Miss penalty rises with
block size; miss rate first falls (exploits spatial locality) then
rises (fewer blocks compromises temporal locality); average access time
is therefore lowest at an intermediate block size, beyond which the
increased miss penalty and miss rate dominate.]

Example: 1 KB Direct Mapped Cache with 32 Byte Blocks

• For a 2^N byte cache:
   – the uppermost (32 - N) bits are always the Cache Tag
   – the lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: the 32-bit address is split into a Cache Tag (bits 31-10,
e.g. 0x50), a Cache Index (bits 9-5, e.g. 0x01) and a Byte Select
(bits 4-0, e.g. 0x00). The tag is stored as part of the cache "state".
The cache has 32 rows, each holding a Valid bit, a Cache Tag and 32
bytes of Cache Data (Byte 0 ... Byte 1023 across the whole cache).]
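The slide's Average Access Time formula, evaluated for assumed illustration values (1-cycle hit, 5% miss rate, 40-cycle miss penalty; only the formula itself comes from the slide):

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

amat = average_access_time(1, 0.05, 40)   # 0.95 + 2.0 = 2.95 cycles
```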




Extreme Example: single big line

• Cache Size = 4 bytes, Block Size = 4 bytes
   – only ONE entry in the cache
• If an item is accessed, it is likely to be accessed again soon
   – but it is unlikely to be accessed again immediately!
   – the next access will likely be a miss again
      • continually loading data into the cache but discarding it
        (forcing it out) before it is used again
      • the worst nightmare of a cache designer: the Ping Pong Effect
• Conflict Misses are misses caused by:
   – different memory locations mapped to the same cache index
      • Solution 1: make the cache bigger
      • Solution 2: multiple entries for the same Cache Index

[Figure: a single-entry cache holding a Valid bit, a Cache Tag and
4 bytes of Cache Data (Byte 3 .. Byte 0).]

Another Extreme Example: Fully Associative

• Fully Associative Cache, N blocks of 32 bytes each
   – Forget about the Cache Index
   – Compare the Cache Tags of all cache entries in parallel
   – Example: with 32-byte blocks, we need N 27-bit comparators
• By definition: Conflict Misses = 0 for a fully associative cache

[Figure: the 32-bit address is split into a 27-bit Cache Tag (bits
31-5) and a 5-bit Byte Select (bits 4-0, e.g. 0x01). Every entry's
stored Cache Tag is compared in parallel against the address tag.]




        A Two-way Set Associative Cache
      • N-way set associative: N entries for each Cache Index
          – N direct-mapped caches operating in parallel
      • Example: two-way set associative cache
          – Cache Index selects a “set” from the cache
          – The two tags in the set are compared in parallel
          – Data is selected based on the tag comparison result

      [Figure: two-way set associative cache -- the Cache Index selects one
      set; two Adr Tag comparators check the set's tags, a 2:1 Mux
      (Sel1/Sel0) selects the Cache Block, and the OR of the compare
      outputs drives the Hit signal.]

        Disadvantage of Set Associative Cache
      • N-way Set Associative Cache versus Direct Mapped Cache:
          – N comparators vs. 1
          – Extra MUX delay for the data
          – Data comes AFTER the Hit/Miss decision and set selection
      • In a direct-mapped cache, the Cache Block is available BEFORE Hit/Miss:
          – Possible to assume a hit and continue; recover later on a miss.




       Example: Fully Associative Cache
 • Contains a number of lines identified by tags
   – Loads/stores check every tag:
        • if hit, use the data from the cache
        • if miss, access memory (or a higher-level cache) and copy the
          line into the cache (evicting some other block)

       Three Cs of Caches
 1. Compulsory misses: cache misses caused by the first access to a block
    that has never been in the cache (also known as cold-start misses)
 2. Capacity misses: cache misses caused when the cache cannot contain all
    the blocks needed during execution of a program. Capacity misses occur
    because of blocks being replaced and later retrieved when accessed.
 3. Conflict misses: cache misses that occur in set-associative or
    direct-mapped caches when multiple blocks compete for the same set.
    Conflict misses are those misses in a direct-mapped or set-associative
    cache that are eliminated in a fully associative cache of the same
    size. These are also called collision misses.




 A Summary on Sources of Cache Misses
 • Compulsory (cold start or process migration, first reference): first
   access to a block
      – “Cold” fact of life: not a whole lot you can do about it
      – Note: if you are going to run “billions” of instructions,
        compulsory misses are insignificant
 • Conflict (collision):
      – Multiple memory locations mapped to the same cache location
      – Solution 1: increase cache size
      – Solution 2: increase associativity
 • Capacity:
      – Cache cannot contain all blocks accessed by the program
      – Solution: increase cache size
 • Invalidation: other process (e.g., I/O) updates memory

 Summary #1
 • The Principle of Locality:
     – A program is likely to access a relatively small portion of the
       address space at any instant of time.
         • Temporal Locality: locality in time
         • Spatial Locality: locality in space
 • Three Major Categories of Cache Misses:
     – Compulsory misses: sad facts of life. Example: cold-start misses.
     – Conflict misses: increase cache size and/or associativity.
       Nightmare scenario: the ping-pong effect!
     – Capacity misses: increase cache size
 • Cache Design Space
     – total size, block size, associativity
     – replacement policy
     – write-hit policy (write-through, write-back)
     – write-miss policy




             Cache design parameters

   Design change   | Effect on miss rate          | Possible negative performance effect
   ----------------|------------------------------|-------------------------------------
   Increase block  | decreases miss rate due to   | may increase miss penalty
   size            | compulsory misses            |
   Increase size   | decreases capacity misses    | may increase access time
   Increase        | decreases miss rate due to   | may increase access time
   associativity   | conflict misses              |

        Program to measure cache size
 • Select arrays of size 1K words to about 4M words
 • Initialise the arrays to some small values
 • Measure the time to do a memory read and write such as
   array[index] = array[index] + 1, using many iterations, with index
   values taking “power of two” strides through the arrays
 • Cache misses will occur just when the array size exceeds the cache size
 • The effective block size is the stride size at which large arrays
   reach their peak miss rate
 • The effective associativity is obtained from the stride steps back
   down from the peak miss rate when the stride is large




 [Figure: measured access time versus stride for the array sizes above;
 the plateaus reveal a 32-byte block and a 256K-byte cache.]

          Cache Size
 • Cache size is the total capacity of the cache
      – Bigger caches exploit temporal locality better than smaller caches
      – Bigger caches are also slower
 • Example tradeoff:
      – 256K cache with 0.02 miss rate and 3-cycle access?
   or
      – 64K cache with 0.05 miss rate and 2-cycle access?




                     Associativity
 •   Typical values for associativity
     – 1 -- direct-mapped
     – n = 2, 4, 8, 16 -- n-way set-associative
     – All blocks -- fully-associative
 •   Larger associativity
     – Lower miss rates
     – Performance is stable, i.e. not sensitive to the placement of data
 •   Smaller associativity
     – Simpler design in general
     – Expect faster access (hit) time: direct-mapped doesn't need a select
       MUX in the critical path

               Cache Write Policies
 • Write-back
   – Update memory only on replacement, by writing back dirty victims
   – Dirty bits are held with the tags, so clean blocks can be replaced
     without updating memory
 • Write-through
   – Update memory on each write
   – Keeps memory up-to-date (almost)
   – Write traffic is independent of cache parameters




                      Memory Traffic
     • Traffic with write-through:
         #load misses * line size + #stores
     • Traffic with write-back:
         (#load misses + #store misses + #dirty misses) * line size
     • Write-back tends to have less traffic (useful for microprocessors)

                Cache Write Policies
 • On a hit, is main memory written to?
      – yes -- write-through (store-through): write both memory and the cache
      – no -- write-back (store-in, copy-back): write only to the cache
 • On a miss, is a line brought into the cache?
      – yes -- write-allocate (often used with write-back)
      – no -- no-write-allocate (often used with write-through)



