
Cache Basics
Memory Hierarchy

• As you go further from the processor, capacity and latency increase

      Level                          Capacity   Latency
      Registers                      1 KB       1 cycle
      L1 data or instruction cache   32 KB      2 cycles
      L2 cache                       2 MB       15 cycles
      Memory                         1 GB       300 cycles
      Disk                           80 GB      10M cycles
Accessing the Cache

• [Figure: byte address 101000 indexing a direct-mapped data
  array of 8-byte words – the low bits of the address are the
  offset within a word, and the next bits select a set]

• 8 words: 3 index bits

• Direct-mapped cache: each address maps to a unique
  location in the cache
The Tag Array

• [Figure: byte address 101000 split into tag, index, and
  offset – the index selects one entry in the tag array and
  one 8-byte word in the data array; the stored tag is compared
  against the address's tag bits to detect a hit (a small
  sketch of this lookup follows)]

• Direct-mapped cache: each address maps to a unique
  location in the cache
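To make the split concrete, here is a minimal sketch in C of a
direct-mapped lookup using the slide's parameters (8-byte words,
8 sets); the tag_array/valid names and the 32-bit address width
are illustrative assumptions, not from the slides.

      #include <stdint.h>
      #include <stdbool.h>

      #define OFFSET_BITS 3                    /* log2(8-byte words) */
      #define INDEX_BITS  3                    /* log2(8 sets)       */
      #define NUM_SETS    (1 << INDEX_BITS)

      static uint32_t tag_array[NUM_SETS];     /* stored tags        */
      static bool     valid[NUM_SETS];         /* valid bits         */

      /* Split a 32-bit byte address into index and tag, then
         compare the stored tag against the address's tag bits.      */
      bool lookup(uint32_t addr)
      {
          uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
          uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
          return valid[index] && (tag_array[index] == tag);
      }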
Increasing Line Size

• A large cache line size → smaller tag array, fewer misses
  because of spatial locality

• [Figure: byte address 10100000 split into tag and offset,
  with a 32-byte cache line size, or block size – a 32-byte
  line means 5 offset bits]
Associativity

• Set associativity → fewer conflicts; wasted power because
  multiple data and tags are read

• [Figure: byte address 10100000 accessing a 2-way cache – the
  tag array and data array each have Way-1 and Way-2; both tags
  of the indexed set are read and compared in parallel]
Example

• 32 KB 4-way set-associative data cache array with 32-byte
  line size

• How many sets?

• How many index bits, offset bits, tag bits?

• How large is the tag array?
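A worked solution, assuming 32-bit byte addresses (the slide does
not state an address width):

      sets      = 32 KB / (4 ways × 32 B) = 256
      offset    = log2(32 B) = 5 bits
      index     = log2(256)  = 8 bits
      tag       = 32 − 8 − 5 = 19 bits
      tag array = 256 sets × 4 ways × 19 bits = 19,456 bits ≈ 2.4 KB
                  (plus valid/dirty bits per line)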
Cache Misses

• On a write miss, you may either choose to bring the block
  into the cache (write-allocate) or not (write-no-allocate)

• On a read miss, you always bring the block in (spatial and
  temporal locality) – but which block do you replace?
    - no choice for a direct-mapped cache
    - randomly pick one of the ways to replace
    - replace the way that was least-recently used (LRU) –
      see the sketch after this list
    - FIFO replacement (round-robin)
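A minimal sketch of true LRU for one 4-way set, using age
counters (0 = most recently used); the set_t layout is an
illustrative assumption, not a description of real hardware.

      #include <stdint.h>

      #define WAYS 4

      typedef struct {
          uint32_t tag[WAYS];
          uint8_t  age[WAYS];    /* initialize to 0..WAYS-1; 0 = MRU */
      } set_t;

      /* On a hit in way w, age every way that was younger than w,
         then make w the most recently used.                         */
      void touch(set_t *s, int w)
      {
          for (int i = 0; i < WAYS; i++)
              if (s->age[i] < s->age[w])
                  s->age[i]++;
          s->age[w] = 0;
      }

      /* On a miss, the victim is the oldest (least recently used). */
      int pick_victim(const set_t *s)
      {
          int victim = 0;
          for (int i = 1; i < WAYS; i++)
              if (s->age[i] > s->age[victim])
                  victim = i;
          return victim;
      }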
Writes

• When you write into a block, do you also update the
  copy in L2?
    - write-through: every write to L1 → a write to L2
    - write-back: mark the block as dirty; when the block
      gets replaced from L1, write it back to L2

• Write-back coalesces multiple writes to an L1 block into one
  L2 write

• Write-through simplifies coherence protocols in a
  multiprocessor system, as the L2 always has a current
  copy of the data
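A sketch of the two policies on a store hit; line_t and write_l2
are illustrative stand-ins for the L1 line state and the L2
interface, not names from the slides.

      #include <stdint.h>
      #include <stdbool.h>

      typedef struct {
          uint32_t tag;
          bool     valid, dirty;
          uint8_t  data[32];               /* one 32-byte line       */
      } line_t;

      void write_l2(uint32_t addr, const uint8_t *line);  /* assumed */

      /* Write-through: every L1 write is propagated to L2 at once. */
      void store_write_through(line_t *l, uint32_t addr,
                               int off, uint8_t v)
      {
          l->data[off] = v;
          write_l2(addr, l->data);
      }

      /* Write-back: mark the line dirty and defer the L2 write.    */
      void store_write_back(line_t *l, int off, uint8_t v)
      {
          l->data[off] = v;
          l->dirty = true;
      }

      /* The deferred write happens only when a dirty line is
         evicted, coalescing multiple stores into one L2 write.      */
      void evict(line_t *l, uint32_t addr)
      {
          if (l->valid && l->dirty)
              write_l2(addr, l->data);
          l->valid = l->dirty = false;
      }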
Lecture 15: Cache Performance

• Topics: improving cache performance (Sections 5.4-5.7)




Reducing Cache Miss Penalty

• Multi-level caches

• Critical word first

• Priority for reads

• Victim caches




Multi-Level Caches

• The L2 and L3 have properties that are different from L1
    - access time is not as critical for L2 as it is for L1 (every
      load/store/instruction accesses the L1)
    - the L2 is much larger and can consume more power
      per access

• Hence, they can adopt alternative design choices
    - serial tag and data access
    - high associativity
Read/Write Priority

• For write-back/write-through caches, writes to lower levels
  are placed in write buffers

• When we have a read miss, we must look up the write
  buffer before checking the lower level (see the sketch below)

• When we have a write miss, the write can merge with
  another entry in the write buffer or it creates a new entry

• Reads are more urgent than writes (the probability of an
  instruction waiting for the result of a read is 100%, while the
  probability of an instruction waiting for the result of a write
  is much smaller) – hence, reads get priority unless the write
  buffer is full
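A sketch of the read-miss path consulting the write buffer first;
the 8-entry buffer and the function names are illustrative
assumptions.

      #include <stdint.h>
      #include <string.h>
      #include <stdbool.h>

      #define WB_ENTRIES 8
      #define LINE_BYTES 32

      typedef struct {
          bool     valid;
          uint32_t addr;                   /* line-aligned address   */
          uint8_t  data[LINE_BYTES];
      } wb_entry_t;

      static wb_entry_t write_buffer[WB_ENTRIES];

      void read_from_l2(uint32_t addr, uint8_t *line);    /* assumed */

      /* A read miss must search the write buffer before going to
         the lower level, or it could return stale data.             */
      void read_miss(uint32_t addr, uint8_t *line)
      {
          for (int i = 0; i < WB_ENTRIES; i++) {
              if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                  memcpy(line, write_buffer[i].data, LINE_BYTES);
                  return;                  /* serviced from buffer   */
              }
          }
          read_from_l2(addr, line);        /* the read jumps ahead of
                                              buffered writes        */
      }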
Victim Caches

• A direct-mapped cache suffers from misses because
  multiple pieces of data map to the same location

• The processor often tries to access data that it recently
  discarded – all discards are placed in a small victim cache
  (4 or 8 entries) – the victim cache is checked before going
  to L2 (a sketch of this miss path follows)

• Can be viewed as additional associativity for a few sets
  that tend to have the most conflicts
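A sketch of the miss path with a victim cache in it; the FIFO
placement and the L1 helper names are illustrative assumptions.

      #include <stdint.h>
      #include <stdbool.h>

      #define VC_ENTRIES 8             /* victim caches hold 4-8 lines */

      typedef struct { bool valid; uint32_t addr; } vc_entry_t;
      static vc_entry_t victim_cache[VC_ENTRIES];

      bool l1_lookup(uint32_t addr);   /* assumed L1 interface     */
      void l1_insert(uint32_t addr);
      void go_to_l2(uint32_t addr);

      /* Lines evicted from L1 land here instead of being dropped. */
      void vc_insert(uint32_t addr)
      {
          static int next;             /* simple FIFO placement    */
          victim_cache[next].valid = true;
          victim_cache[next].addr  = addr;
          next = (next + 1) % VC_ENTRIES;
      }

      /* Access path: L1 first, then the victim cache, then L2.   */
      void cache_access(uint32_t addr)
      {
          if (l1_lookup(addr))
              return;                  /* L1 hit                   */
          for (int i = 0; i < VC_ENTRIES; i++) {
              if (victim_cache[i].valid && victim_cache[i].addr == addr) {
                  victim_cache[i].valid = false;
                  l1_insert(addr);     /* swap the line back in    */
                  return;              /* avoided a trip to L2     */
              }
          }
          go_to_l2(addr);              /* true miss                */
      }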
Types of Cache Misses

• Compulsory misses: happen the first time a memory
  word is accessed – the misses that remain even for an
  infinite cache

• Capacity misses: happen because the program touched
  many other words before re-touching the same word – the
  misses for a fully-associative cache

• Conflict misses: happen because two words map to the
  same location in the cache – the additional misses generated
  while moving from a fully-associative to a direct-mapped cache

• Sidenote: can a fully-associative cache have more misses
  than a direct-mapped cache of the same size?
Reducing Miss Rate

• Large block size – reduces compulsory misses, and amortizes
  the miss penalty when spatial locality is present – increases
  traffic between levels, space wastage, and conflict misses

• Large caches – reduce capacity/conflict misses – access
  time penalty

• High associativity – reduces conflict misses – rule of thumb:
  a 2-way cache of capacity N/2 has about the same miss rate
  as a 1-way (direct-mapped) cache of capacity N – access time
  penalty

• Way prediction – by predicting the way, the access time
  is effectively that of a direct-mapped cache – can also reduce
  power consumption
Compiler Optimizations

• Loop interchange: loops can be re-ordered to exploit
  spatial locality – since C stores arrays in row-major order,
  making j the inner loop walks x with unit stride

      for (j=0; j<100; j++)
         for (i=0; i<5000; i++)
            x[i][j] = 2 * x[i][j];   /* stride-100 accesses to x */

       is converted to…

      for (i=0; i<5000; i++)
         for (j=0; j<100; j++)
            x[i][j] = 2 * x[i][j];   /* stride-1 accesses to x   */
Blocking

• Re-organize data accesses so that a piece of data is
  used a number of times before moving on… in other
  words, artificially create temporal locality

  Original (matrix multiply x = y * z):

      for (i=0; i<N; i++)
        for (j=0; j<N; j++) {
           r = 0;
           for (k=0; k<N; k++)
             r = r + y[i][k] * z[k][j];
           x[i][j] = r;
        }

  Blocked version (x must start zeroed, since it now accumulates
  partial sums across block iterations):

      for (jj=0; jj<N; jj+=B)
       for (kk=0; kk<N; kk+=B)
        for (i=0; i<N; i++)
          for (j=jj; j<min(jj+B,N); j++) {
             r = 0;
             for (k=kk; k<min(kk+B,N); k++)
               r = r + y[i][k] * z[k][j];
             x[i][j] = x[i][j] + r;
          }

• The block size B is chosen so that the sub-matrices being
  reused fit in the cache
Exercise

• Original code could have 2N³ + N² memory accesses (if nothing
  is retained in the cache, each of the N³ reads of y, the N³
  reads of z, and the N² writes of x goes to memory), while the
  blocked version has 2N³/B + N² (see the code above)

Tolerating Miss Penalty

• Out-of-order execution: can do other useful work while
  waiting for the miss – can have multiple cache misses
  in flight – the cache controller has to keep track of multiple
  outstanding misses (non-blocking cache)

• Hardware and software prefetching into prefetch buffers
  – aggressive prefetching can increase contention for buses
  (a small sketch follows)
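A minimal sketch of software prefetching using GCC/Clang's
__builtin_prefetch; the prefetch distance of 16 elements is an
illustrative assumption to be tuned per machine.

      /* Sum an array, prefetching a fixed distance ahead of use. */
      #define AHEAD 16

      long sum(const long *a, int n)
      {
          long s = 0;
          for (int i = 0; i < n; i++) {
              if (i + AHEAD < n)       /* stay within bounds       */
                  __builtin_prefetch(&a[i + AHEAD], 0, 1);
                                       /* read, low temporal locality */
              s += a[i];
          }
          return s;
      }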