CSCE 430/830 Computer Architecture

   Introduction to Memory Hierarchy

                 Adapted from
           Professor David Patterson
    Electrical Engineering and Computer Sciences
           University of California, Berkeley
Since 1980, CPU has outpaced DRAM ...

   Q. How do architects address this gap?
   A. Put smaller, faster "cache" memories between CPU and DRAM.
      Create a "memory hierarchy".

[Figure: performance (1/latency) vs. year. CPU performance grows ~60%
per year (2X in 1.5 yrs); DRAM grows ~9% per year (2X in 10 yrs); the
gap between them grew ~50% per year.]
1977: DRAM faster than microprocessors

   Apple ][ (1977)
     CPU:  1000 ns
     DRAM:  400 ns

[Photo: Steve Jobs and Steve Wozniak with the Apple ][.]
Levels of the Memory Hierarchy

Level         Capacity         Access Time        Cost                      Transfer Unit          Managed by
Registers     100s Bytes       <10s ns            --                        Instr. operands,       prog./compiler
                                                                            1-8 bytes
Cache         K Bytes (MB?)    10-100 ns          1-0.1 cents/bit           Blocks, 8-128 bytes    cache controller
Main Memory   M Bytes (GB?)    200-500 ns         10^-4 - 10^-5 cents/bit   Pages, 512-4K bytes    OS
Disk          G Bytes (TB?)    10 ms (10^7 ns)    10^-5 - 10^-6 cents/bit   Files, Mbytes          user/operator
Tape          infinite         sec-min            10^-8 cents/bit           --                     --

Upper levels are smaller and faster; lower levels are larger and slower.
Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                    Reg       L1 Inst   L1 Data   L2        DRAM      Disk
Size                1K        64K       32K       512K      256M      80G
Latency (cycles)    1         3         3         11        88        10^7
Latency (time)      0.6 ns    1.9 ns    1.9 ns    6.9 ns    55 ns     12 ms
Managed by          compiler  hardware  hardware  hardware  OS, hardware, application

Goal: illusion of large, fast, cheap memory.
Let programs address a memory space that scales to the disk size, at a
speed that is usually as fast as register access.
iMac's PowerPC 970: All caches on-chip

[Die photo: the Registers (1K), the L1 (64K Instruction) and
L1 (32K Data) caches, and the 512K L2, all on one die.]
  The Principle of Locality
• The Principle of Locality:
   – Programs access a relatively small portion of the address space at
     any instant of time.
• Two Different Types of Locality:
   – Temporal Locality (Locality in Time): If an item is referenced, it will
     tend to be referenced again soon (e.g., loops, reuse)
   – Spatial Locality (Locality in Space): If an item is referenced, items
     whose addresses are close by tend to be referenced soon
     (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed

       Locality is a property of programs which is exploited in machine design.
Programs with locality cache well ...

[Figure: memory address (one dot per access) plotted against time.
Bands of temporal locality and spatial locality stand out against
regions of bad locality behavior.]

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual
Memory. IBM Systems Journal 10(3): 168-192 (1971)
 Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level
  (example: Block X)
   – Hit Rate: the fraction of memory accesses found in the upper level
   – Hit Time: time to access the upper level, which consists of
       RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the
  lower level (example: Block Y)
   – Miss Rate = 1 - (Hit Rate)
   – Miss Penalty: time to replace a block in the upper level +
       time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)

[Figure: on a hit, Blk X moves between the upper-level memory and the
processor; on a miss, Blk Y is fetched from the lower-level memory.]
Cache Measures

• Hit rate: fraction found in that level
   – Usually so high that we talk about miss rate instead
   – Miss rate fallacy: miss rate is to average memory access time
     what MIPS is to CPU performance: misleading in isolation
• Average memory-access time
      = Hit time + Miss rate x Miss penalty
              (ns or clocks)
• Miss penalty: time to replace a block from the
  lower level, including time to deliver it to the CPU
   – access time: time to reach the lower level
     = f(latency to lower level)
   – transfer time: time to transfer the block
     = f(BW between upper & lower levels)
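
A quick way to check AMAT numbers is to code the formula directly. A
minimal C sketch (the function name and the sample numbers are ours,
not from the slides):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty (ns or clocks). */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));  /* 6.0 */
        return 0;
    }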
 4 Questions for Memory Hierarchy


• Q1: Where can a block be placed in the upper level?
      (Block placement)
• Q2: How is a block found if it is in the upper level?
      (Block identification)
• Q3: Which block should be replaced on a miss?
      (Block replacement)
• Q4: What happens on a write?
      (Write strategy)




Q1: Where can a block be placed in the upper level?

   • Block 12 placed in 8 block cache:
      – Fully associative, direct mapped, 2-way set associative
      – S.A. Mapping = (Block Number) Modulo (Number of Sets)

[Figure: memory blocks 0-31 mapping into an 8-block cache.
Fully mapped: block 12 may go anywhere. Direct mapped:
(12 mod 8) = 4, so only slot 4. 2-way set associative:
(12 mod 4) = 0, so either slot of set 0.]
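
The block-12 example above reduces to one modulo operation per policy.
A small C sketch (an 8-block cache, as in the figure; variable names
are ours):

    #include <stdio.h>

    int main(void) {
        int block = 12, cache_blocks = 8;

        /* Direct mapped: one block per set, so 8 sets. */
        printf("direct mapped: slot %d\n", block % cache_blocks);        /* 4 */
        /* 2-way set associative: 8 blocks / 2 ways = 4 sets. */
        printf("2-way set assoc: set %d\n", block % (cache_blocks / 2)); /* 0 */
        /* Fully associative: a single set; block 12 may go anywhere. */
        printf("fully associative: any of the %d slots\n", cache_blocks);
        return 0;
    }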
Direct Mapped Block Placement

  address maps to block:
  location = (block address) MOD (# blocks in cache)

[Figure: a 4-slot cache labeled *0 *4 *8 *C; each memory address
(00, 04, 08, ..., 4C) maps to exactly one fixed slot.]
Fully Associative Block Placement

  arbitrary block mapping
  location = any

[Figure: any memory address (00, 04, 08, ..., 4C) may occupy any
cache slot.]
Set-Associative Block Placement

  address maps to set:
  location = (block address) MOD (# sets in cache)
  (arbitrary location within the set)

[Figure: a 2-way cache with sets 0-3 (slots *0 *0 *4 *4 *8 *8 *C *C);
each memory address (00, 04, 08, ..., 4C) maps to one set, then to
either slot within it.]
Q2: How is a block found if it is in the upper level?

  • Tag on each block
     – No need to check index or block offset
  • Increasing associativity shrinks index, expands tag

              |        Block Address        | Block  |
              |    Tag        |    Index    | Offset |
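
In code, identifying a block is just peeling the three fields off the
address. A C sketch (the field widths here, 6 offset bits and 8 index
bits, are illustrative assumptions, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 6   /* 64-byte blocks */
    #define INDEX_BITS  8   /* 256 sets       */

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset =  addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);

        /* The cache compares 'tag' against the tag stored in set 'index'. */
        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;
    }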
   Direct-Mapped Cache Design

[Figure: the incoming address (e.g., 0x0000000) splits into tag,
index, and byte offset. The index selects one row of the cache SRAM;
each row holds a valid bit (DATA[59]), a tag (DATA[58:32]), and a
data word (DATA[31:0]). A comparator checks the stored tag against
the address tag; HIT=1 when they match and the valid bit is set.]
  Set Associative Cache Design

• Key idea:
   – Divide cache into sets
   – Allow block anywhere in a set
• Advantages:
   – Better hit rate
• Disadvantage:
   – More tag bits
   – More hardware
   – Higher access time

[Figure: a four-way set-associative cache. Address bits 31-10 form
the 22-bit tag and bits 9-2 the 8-bit index; the index selects one of
256 sets, four tag comparators check the ways in parallel, and a
4-to-1 multiplexor steers the matching way's data out on a hit.]
Fully Associative Cache Design
• Key idea: set size of one block
   – 1 comparator required for each block
   – No address decoding
   – Practical only for small caches due to
     hardware demands

[Figure: the incoming tag (11110111) is compared against every stored
tag in parallel; the matching entry drives its data
(1111000011110000101011) out.]
In-Class Exercise
 • Given the following requirements for a cache design on
   a 32-bit-address computer: (1) the cache contains 16KB
   of data, (2) each cache block contains 16 words, and
   (3) the placement policy is 4-way set-associative.
    – What are the lengths (in bits) of the block offset
      field and the index field in the address?

    – What are the lengths (in bits) of the index field and
      the tag field in the address if the placement is 1-
      way set-associative?
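
One way to sanity-check your answers afterwards: derive the field
widths mechanically from the given parameters. A C sketch (it assumes
4-byte words, which the exercise leaves implicit):

    #include <stdio.h>

    /* log2 for exact powers of two. */
    static int log2i(unsigned n) {
        int bits = 0;
        while (n > 1) { n >>= 1; bits++; }
        return bits;
    }

    int main(void) {
        const int addr_bits   = 32;
        const int cache_bytes = 16 * 1024;  /* 16 KB of data           */
        const int block_bytes = 16 * 4;     /* 16 words x 4 bytes/word */

        for (int ways = 4; ways >= 1; ways -= 3) {   /* 4-way, then 1-way */
            int sets   = (cache_bytes / block_bytes) / ways;
            int offset = log2i(block_bytes);
            int index  = log2i(sets);
            printf("%d-way: offset=%d, index=%d, tag=%d bits\n",
                   ways, offset, index, addr_bits - index - offset);
        }
        return 0;
    }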
Q3: Which block should be replaced on a miss?

 • Easy for Direct Mapped
 • Set Associative or Fully Associative:
    – Random
    – LRU (Least Recently Used)

 Assoc:      2-way             4-way                   8-way
 Size      LRU Ran           LRU Ran                 LRU     Ran
 16 KB     5.2% 5.7%          4.7% 5.3%             4.4%    5.0%
 64 KB     1.9% 2.0%          1.5% 1.7%             1.4%    1.5%
 256 KB   1.15% 1.17%        1.13% 1.13%            1.12% 1.12%



Q3: After a cache read miss, if there are no empty
cache blocks, which block should be removed from
the cache?

The Least Recently Used (LRU) block? Appealing,
but hard to implement for high associativity.

A randomly chosen block? Easy to implement, but
how well does it work?

 Miss Rate for 2-way Set Associative Cache
         Size              Random               LRU
        16 KB               5.7%                5.2%
        64 KB               2.0%                1.9%
        256 KB              1.17%               1.15%

Also try other LRU approximations.
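
Why is LRU hard at high associativity? Even the straightforward
bookkeeping below (one timestamp per way, scanned on every
replacement) grows with the number of ways, which is why hardware
approximates it. A minimal C sketch of exact LRU for one set
(structure and names are ours):

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    typedef struct {
        uint32_t tag[WAYS];
        uint64_t last_used[WAYS];  /* stamped on every access */
    } Set;

    /* Evict the way whose stamp is oldest. */
    static int lru_victim(const Set *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->last_used[w] < s->last_used[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        Set s = { {10, 11, 12, 13}, {4, 1, 9, 2} };
        printf("evict way %d\n", lru_victim(&s));  /* way 1: oldest stamp */
        return 0;
    }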
   Q4: What happens on a write?

                       Write-Through                Write-Back
 Policy                Data written to the cache    Write data only to the
                       block is also written to     cache; update the lower
                       lower-level memory           level when a block falls
                                                    out of the cache
 Debug                 Easy                         Hard
 Do read misses
 produce writes?       No                           Yes
 Do repeated writes
 make it to the
 lower level?          Yes                          No

Additional option (on a write miss): allocate a new cache line for a
write to an un-cached address ("write-allocate").
Write Buffers for Write-Through Caches

[Figure: the write buffer sits between the cache and lower-level
memory, holding data awaiting write-through.]

 Q. Why a write buffer?              A. So the CPU doesn't stall.
 Q. Why a buffer, why not            A. Bursts of writes are
 just one register?                  common.
 Q. Are Read After Write             A. Yes! Drain the buffer before the
 (RAW) hazards an issue              next read, or send the read first
 for the write buffer?               after checking the write buffer.
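
The RAW answer above amounts to an associative search of the buffer
before every read miss. A C sketch of that check (a simplified buffer
whose entries are assumed to sit in arrival order; names are ours):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4

    typedef struct { uint32_t addr, data; bool valid; } WBEntry;
    static WBEntry wbuf[WB_ENTRIES];

    /* Before sending a read miss to lower-level memory, scan the write
       buffer; forward the newest matching entry instead of stalling
       until the buffer drains. */
    static bool read_hits_write_buffer(uint32_t addr, uint32_t *data) {
        for (int i = WB_ENTRIES - 1; i >= 0; i--)
            if (wbuf[i].valid && wbuf[i].addr == addr) {
                *data = wbuf[i].data;  /* RAW hazard resolved by forwarding */
                return true;
            }
        return false;                  /* safe to read from memory */
    }

    int main(void) {
        wbuf[0] = (WBEntry){0x1000, 42, true};
        uint32_t v;
        if (read_hits_write_buffer(0x1000, &v))
            printf("forwarded %u from write buffer\n", v);
        return 0;
    }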
5 Basic Cache Optimizations
•    Reducing Miss Rate: 3 Cs of misses
1.   Larger Block size (compulsory misses)
2.   Larger Cache size (capacity misses)
3.   Higher Associativity (conflict misses)

• Reducing Miss Penalty
4. Multilevel Caches

• Reducing hit time
5. Giving Reads Priority over Writes
     •   E.g., let a read complete before earlier writes in the write buffer
In-Class Exercises
• In systems with a write-through L1 cache backed by a
  write-back L2 cache instead of main memory, a merging
  write buffer can be simplified. Explain how this can be
  done. Are there situations where having a full write buffer
  (instead of the simple version you've just proposed) could
  be helpful?
    – The merging buffer links the CPU to the L2 cache. Two CPU
      writes cannot merge if they are to different sets in L2. So, for
      each new entry into the buffer, a quick check on only those
      address bits that determine the L2 set number need be
      performed at first. If there is no match, then the new entry is
      not merged. Otherwise, all address bits can be checked for
      a definitive result.
    – As the associativity of L2 increases, the rate of false-positive
      matches from the simplified check will increase, reducing
      performance.
In-Class Exercises
• As caches increase in size, blocks often increase
  in size as well.
1. If a large instruction cache has larger data blocks, is there still a
   need for prefetching? Explain the interaction between
   prefetching and increased block size in instruction caches.
    –   Program basic blocks are often short (<10 instr.), so program
        execution does not continue to follow sequential locations for
        very long. As blocks get larger, the program is more likely not
        to execute all instructions in the block but to branch out early,
        making instruction prefetching less attractive.
2. Is there a need for data prefetch instructions when data blocks
   get larger?
    –   Data structures often comprise lengthy sequences of memory
        addresses, and program access of a data structure often takes the
        form of a sequential sweep. Large data blocks work well with such
        access patterns, and prefetching is likely still of value for
        highly sequential access patterns.
Outline
•   Review
•   Memory hierarchy
•   Locality
•   Cache design
•   Virtual address spaces
•   Page table layout
•   TLB design options
•   Conclusion




The Limits of Physical Addressing
         "Physical addresses" of memory locations

[Figure: the CPU's address (A0-A31) and data (D0-D31) pins connect
directly to memory.]

     • All programs share one address space:
       the physical address space
     • Machine language programs must be
       aware of the machine organization
     • No way to prevent a program from
       accessing any machine resource
Solution: Add a Layer of Indirection
    "Virtual Addresses"                 "Physical Addresses"

[Figure: an Address Translation unit sits between the CPU's virtual
addresses (A0-A31) and physical memory.]

   • User programs run in a standardized
     virtual address space
   • Address Translation hardware, managed
     by the operating system (OS), maps virtual
     addresses to physical memory
   • Hardware supports "modern" OS features:
     Protection, Translation, Sharing
    Three Advantages of Virtual Memory
• Translation:
   – A program can be given a consistent view of memory, even though physical
     memory is scrambled
   – Makes multithreading reasonable (now used a lot!)
   – Only the most important part of program (“Working Set”) must be in
     physical memory.
   – Contiguous structures (like stacks) use only as much physical memory
     as necessary yet still grow later.
• Protection:
   – Different threads (or processes) protected from each other.
   – Different pages can be given special behavior
       » (Read Only, Invisible to user programs, etc).
   – Kernel data protected from User programs
   – Very important for protection from malicious programs
• Sharing:
   – Can map same physical page to multiple users
     (“Shared memory”)

     Page tables encode virtual address spaces

A virtual address space is divided into blocks
of memory called pages. A machine usually
supports pages of a few sizes (MIPS R4000).

A page table is indexed by a virtual address.
A valid page table entry codes the physical
memory "frame" address for the page.
The OS manages the page table for each ASID.

[Figure: page table entries point virtual addresses to frames in the
physical memory space.]
                        Details of Page Table

[Figure: the virtual address = (V page no. | 12-bit offset). The Page
Table Base Register locates the page table in physical memory; the
virtual page number indexes it, and a valid entry supplies access
rights plus the physical page number (P page no.), which is joined
with the offset to form the physical address.]

   • Page table maps virtual page numbers to physical
     frames ("PTE" = Page Table Entry)
   • Virtual memory => treat main memory as a cache for disk
   • 4 fundamental questions: placement, identification,
     replacement, and write policy

   Page tables may not fit in memory!
   A table for 4KB pages for a 32-bit address
   space has 1M entries.
   Each process needs its own address space!
  Two-level Page Tables

          32-bit virtual address
    31          22 21         12 11          0
    |  P1 index   |  P2 index   | Page Offset |

Top-level table wired in main memory

Subset of 1024 second-level tables in
  main memory; rest are on disk or
  unallocated
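
A C sketch of the 10/10/12 walk this slide implies (structure names
are ours; real PTEs also carry protection bits beyond the single
valid flag shown):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t frame; int valid; } PTE;
    typedef struct { PTE *second_level[1024]; } TopTable; /* wired in memory */

    /* 32-bit VA = 10-bit P1 index | 10-bit P2 index | 12-bit offset. */
    static uint32_t translate(const TopTable *top, uint32_t va, int *fault) {
        uint32_t p1 = (va >> 22) & 0x3ff;
        uint32_t p2 = (va >> 12) & 0x3ff;
        PTE *l2 = top->second_level[p1];

        if (!l2 || !l2[p2].valid) { *fault = 1; return 0; }  /* page fault */
        *fault = 0;
        return (l2[p2].frame << 12) | (va & 0xfff);
    }

    int main(void) {
        static PTE l2[1024];
        static TopTable top;
        l2[5] = (PTE){ .frame = 0x123, .valid = 1 };
        top.second_level[0] = l2;

        int fault;
        uint32_t va = (0u << 22) | (5u << 12) | 0x1a4;
        printf("PA = 0x%x\n", translate(&top, va, &fault));  /* 0x1231a4 */
        return 0;
    }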
    VM and Disk: Page replacement policy

Dirty bit: set when the page is written.
Used bit: set to 1 on any reference.

[Figure: the "clock": the set of all pages in memory arranged in a
ring, each page-table entry carrying dirty and used bits. The tail
pointer clears the used bit in the page table as it sweeps; the head
pointer places pages on the free list if their used bit is still
clear, and schedules pages with the dirty bit set to be written to
disk. Freed frames go on the freelist.]

Architect's role: support setting the dirty and used bits.
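
A C sketch of that clock sweep (a single-hand version, with the
writeback of dirty pages reduced to a message; names are ours):

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGES 8

    static bool used[PAGES], dirty[PAGES];
    static int  hand;   /* the clock hand sweeping the page frames */

    /* A page with used==0 is reclaimable; used==1 gets its bit cleared
       and a second chance. Dirty victims must be written to disk first. */
    static int find_victim(void) {
        for (;;) {
            int p = hand;
            hand = (hand + 1) % PAGES;
            if (!used[p]) {
                if (dirty[p])
                    printf("schedule page %d for disk writeback\n", p);
                return p;
            }
            used[p] = false;
        }
    }

    int main(void) {
        used[0] = used[1] = true;
        dirty[2] = true;
        printf("reclaim page %d\n", find_victim());  /* page 2 */
        return 0;
    }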
TLB Design Concepts




MIPS Address Translation: How does it work?
      "Virtual Addresses"                     "Physical Addresses"

[Figure: a Translation Look-Aside Buffer (TLB) sits between the CPU's
virtual addresses and physical memory.]

    Translation Look-Aside Buffer (TLB):
    a small fully-associative cache of mappings
    from virtual to physical addresses. The TLB
    also contains protection bits for the virtual address.

    Fast common case: the virtual address is in the TLB,
    and the process has permission to read/write it.

    What is the table of mappings that it caches?
    The TLB caches page table entries

[Figure: the TLB caches page-table entries for the current ASID. A
virtual address (page | off) either hits in the TLB, yielding a
physical frame number directly, or indexes the page table to find the
frame; the physical address is (frame page | off). Physical and
virtual pages must be the same size!]

MIPS handles TLB misses in software (random
replacement). Other machines use hardware.

V=0 pages either reside on disk or have not yet been
allocated; the OS handles a V=0 reference as a "page fault".
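
A C sketch of the fast/slow path: probe a small fully associative TLB
first, and on a miss walk the page table and refill an entry, as the
MIPS software handler would (a toy single-level table stands in for
the real page table; names are ours):

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 4

    typedef struct { uint32_t vpn, pfn; int valid; } TLBEntry;
    static TLBEntry tlb[TLB_ENTRIES];
    static uint32_t page_table[16];       /* toy: VPN -> PFN, all valid */

    static uint32_t translate(uint32_t va) {
        uint32_t vpn = va >> 12, off = va & 0xfff;

        for (int i = 0; i < TLB_ENTRIES; i++)   /* fast path: TLB hit */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << 12) | off;

        uint32_t pfn = page_table[vpn];         /* slow path: table walk */
        tlb[0] = (TLBEntry){vpn, pfn, 1};       /* refill one slot; MIPS
                                                   picks the slot randomly */
        return (pfn << 12) | off;
    }

    int main(void) {
        page_table[3] = 7;
        printf("PA = 0x%x\n", translate(0x3abc));  /* 0x7abc */
        return 0;
    }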
 Can TLB and caching be overlapped?

[Figure: the virtual page number goes to the TLB while the page
offset's index and byte-select bits go straight to the cache; the
TLB's physical page number is then compared against the cache tag to
declare a hit and select the cache block.]

  This works, but ...
Q. What is the downside?
   A. Inflexibility. The size of the cache
   is limited by the page size.
Problems With Overlapped TLB Access

 Overlapped access only works as long as the address bits used to
      index into the cache do not change as the result of VA translation.

 This usually limits things to small caches, large page sizes, or high
      n-way set-associative caches if you want a large cache.

 Example: suppose everything is the same except that the cache is
      increased to 8 KB instead of 4 KB. With a 20-bit virtual page
      number and a 12-bit displacement, the cache lookup now needs an
      11-bit index plus 2-bit byte offset, so one index bit falls
      outside the page offset: that bit is changed by VA translation
      but is needed for the cache lookup.

    Solutions:
         go to 8 KB page sizes;
         go to a 2-way set-associative cache (1K x 2-way); or
         SW guarantees VA[13] = PA[13]
Use virtual addresses for cache?
    "Virtual Addresses"                          "Physical Addresses"

[Figure: a Virtual Cache sits directly on the CPU's virtual address
bus; the TLB is consulted only on the path to main memory.]

          Only use the TLB on a cache miss!

  Downside: a subtle, fatal problem. What is it?

  A. The synonym problem. If two address spaces
  share a physical frame, data may be in the cache
  twice. Maintaining consistency is a nightmare.
In-Class Exercise
• Some memory systems handle TLB misses in software
   (as an exception), while others use hardware.
1. What are the trade-offs between these two methods?
  –   Hardware is faster but less flexible.
2. Will software handling of TLB misses always be
   slower? Explain.
  –   Factors other than whether miss handling is done in hardware
      or software can quickly dominate handling time. E.g., is the
      page table itself paged? Can software implement a more efficient
      page-table search algorithm than hardware? Hardware prefetching?
3. Are there page table structures that would be
   difficult to handle in hardware but possible in software?
  –   Page table structures that change dynamically.
Summary #1/3:
The Cache Design Space
  • Several interacting dimensions
     –   cache size
     –   block size
     –   associativity
     –   replacement policy
     –   write-through vs write-back
     –   write allocation
  • The optimal choice is a compromise
     – depends on access characteristics
         » workload
         » use (I-cache, D-cache, TLB)
     – depends on technology / cost
  • Simplicity often wins

[Figure: the design space sketched along axes of Cache Size,
Associativity, and Block Size, with a "Good"-to-"Bad" trade-off curve
between two factors (less vs. more).]
 Summary #2/3: Caches
• The Principle of Locality:
   – Programs access a relatively small portion of the address space at any
     instant of time.
       » Temporal Locality: Locality in Time
       » Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
   – Compulsory Misses: sad facts of life. Example: cold start misses.
   – Capacity Misses: increase cache size
   – Conflict Misses: increase cache size and/or associativity.
                Nightmare Scenario: ping pong effect!
• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses),
  not just f(ops): this affects Compilers, Data structures, and
  Algorithms
Summary #3/3: TLB, Virtual Memory
• Page tables map virtual addresses to physical addresses
• TLBs are important for fast translation
• TLB misses are significant in processor performance
   – funny times, as most systems can't access all of the 2nd-level cache
     without TLB misses!
• Caches, TLBs, and Virtual Memory are all understood by examining how
  they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) Which block is replaced on a miss?
  4) How are writes handled?
• Today VM allows many processes to share a single memory
  without having to swap all processes to disk; today VM
  protection is more important than its memory hierarchy benefits,
  but computers remain insecure
• Prepare for debate + quiz on Wednesday
Classifying Misses: 3C
– Compulsory—The first access to a block is not in the cache, so
  the block must be brought into the cache. Also called cold start
  misses or first reference misses.
  (Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed
  during execution of a program, capacity misses will occur due
  to blocks being discarded and later retrieved.
  (Misses in Fully Associative Size X Cache)
– Conflict—If block-placement strategy is set associative or direct
  mapped, conflict misses (in addition to compulsory & capacity
  misses) will occur because a block can be discarded and later
  retrieved if too many blocks map to its set. Also called collision
  misses or interference misses.
  (Misses in N-way Associative, Size X Cache)




        Classifying Misses: 3C

[Figure: 3Cs absolute miss rate (SPEC92). Miss rate per type vs.
cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way
associativity. The area between the associativity curves is conflict
misses; below that lie capacity misses; the compulsory component at
the bottom is vanishingly small.]
2:1 Cache Rule

    miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

[Figure: the same miss-rate-per-type vs. cache size (1 KB to 128 KB)
plot, illustrating the rule across the conflict, capacity, and
compulsory components.]
                        3C Relative Miss Rate

[Figure: the same data normalized to 100%: the relative shares of
conflict (1-way through 8-way), capacity, and compulsory misses as
cache size grows from 1 KB to 128 KB.]

Flaws: for fixed block size
Good: insight => invention
            Improve Cache Performance
     improve cache and memory access times:

         Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

         CPUtime = IC * (CPI_Execution + Memory Accesses per Instruction
                   * Miss Rate * Miss Penalty) * Clock Cycle Time

         (Hit time: Section 5.5; Miss rate: Section 5.3;
          Miss penalty: Section 5.4)

     •    Improve performance by:
                1. Reducing the miss rate,
                2. Reducing the miss penalty, or
                3. Reducing the time to hit in the cache.
    Reducing Cache Misses: 1. Larger Block Size

Uses the principle of locality: the larger the block, the greater the
chance parts of it will be used again.

[Figure: miss rate (0-25%) vs. block size (16 to 256 bytes) for cache
sizes of 1K, 4K, 16K, 64K, and 256K. Miss rate falls with block size
at first; for the smallest caches it rises again at the largest
block sizes.]
Increasing Block Size
• One way to reduce the miss rate is to increase
  the block size
   – Take advantage of spatial locality
   – Decreases compulsory misses
• However, larger blocks have disadvantages
   – May increase the miss penalty (need to
     get more data)
   – May increase hit time (need to read
     more data from cache and larger mux)
   – May increase the miss rate, since fewer blocks
     means more conflict misses
• Increasing the block size can help, but don't
  overdo it.
   Block Size vs. Cache Measures
   • Increasing block size generally increases
     miss penalty and decreases miss rate
   • As the block size increases, the AMAT starts
     to decrease, but eventually increases

[Figure: Miss Penalty (rising with block size) x Miss Rate (falling,
then rising) = Avg. Memory Access Time (U-shaped in block size).]
Reducing Cache Misses: 2. Higher Associativity

• Increasing associativity helps reduce conflict
  misses
• 2:1 Cache Rule:
    – The miss rate of a direct mapped cache of size
      N is about equal to the miss rate of a 2-way set
      associative cache of size N/2
    – For example, the miss rate of a 32 Kbyte direct
      mapped cache is about equal to the miss rate
      of a 16 Kbyte 2-way set associative cache
• Disadvantages of higher associativity
    – Need to do a large number of comparisons
    – Need n-to-1 multiplexor for n-way set
      associative
    – Could increase hit time
AMAT vs. Associativity


         Cache Size         Associativity
         (KB)      1-way    2-way      4-way    8-way
         1         7.65     6.60       6.22     5.44
         2         5.90     4.90       4.62     4.09
         4         4.60     3.95       3.57     3.19
         8         3.30     3.00       2.87     2.59
         16        2.45     2.20       2.12     2.04
         32        2.00     1.80       1.77     1.79
         64        1.70     1.60       1.57     1.59
         128       1.50     1.45       1.42     1.44
Red means A.M.A.T. not improved by more associativity
Does not take into account effect of slower clock on rest of program




      Reducing Cache Misses: 3. Victim Cache
• Data discarded from the cache is placed in an extra small buffer (the victim cache).
• On a cache miss, check the victim cache for the data before going to main memory.
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB
  direct-mapped data cache.
• Used in Alpha, HP PA-RISC CPUs.

[Figure: the victim cache sits beside the data cache; the address is
checked against both tag arrays in parallel, and the write buffer
drains to lower-level memory.]
Reducing Cache Misses:
4. Way Prediction and Pseudoassociative Caches

• Way prediction helps select one block among those in a
  set, thus requiring only one tag comparison (if hit).
   – Preserves the advantages of direct mapping (why?);
   – In case of a miss, the other block(s) are checked.
• Pseudoassociative (also called column associative) caches
   – Operate exactly as direct-mapped caches on a hit,
     thus again preserving the advantages of direct mapping;
   – In case of a miss, another block is checked (as if in a
     set-associative cache), by simply inverting the most
     significant bit of the index field to find the other
     block in the "pseudoset".
   – real hit time < pseudo-hit time
   – too many pseudo hits would defeat the purpose
    Reducing Cache Misses:
    5. Compiler Optimizations
•    Blocking: improve temporal and spatial locality
     a) multiple arrays are accessed in both ways (i.e., row-major and column-major),
        namely, orthogonal accesses that cannot be helped by the earlier methods
     b) concentrate on submatrices, or blocks
     c) all N*N elements of Y and Z are accessed N times and each element of X is
        accessed once. Thus, there are N^3 operations and 2N^3 + N^2 memory reads!
        Capacity misses are a function of N and cache size in this case.
    Reducing Cache Misses:
    5. Compiler Optimizations (cont'd)
•    Blocking: improve temporal and spatial locality
     a) To ensure that the elements being accessed can fit in the cache, the original
        code is changed to compute on a submatrix of size B*B, where B is called the
        blocking factor.
     b) The total number of memory words accessed is 2N^3/B + N^2.
     c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality,
        as sketched below.
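
A C sketch of the blocked loop nest the slide describes, for
X = X + Y*Z with blocking factor B (the standard form of the
transformation; N and B are illustrative, and B is assumed to
divide N):

    #define N 512
    #define B 64    /* B*B submatrices of Y and Z fit in the cache */

    double X[N][N], Y[N][N], Z[N][N];

    /* Compute one B-wide strip of X at a time so that the touched
       piece of Y (spatial locality) and the B*B block of Z (temporal
       locality) stay resident while they are reused. */
    void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B; k++)
                            r += Y[i][k] * Z[k][j];
                        X[i][j] += r;
                    }
    }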
Reducing Cache Miss Penalty:
1. Multi-level Cache
 a) To keep up with the widening gap between CPU and main
    memory, try to:
   i. make the cache faster, and
   ii. make the cache larger,
   by adding another, larger but slower cache between the cache and main
   memory.
      Adding an L2 Cache

• If a direct mapped cache has a hit rate of 95%, a hit time of
  4 ns, and a miss penalty of 100 ns, what is the AMAT?


• If an L2 cache is added with a hit time of 20 ns and a hit rate
  of 50%, what is the new AMAT?




        Adding an L2 Cache

• If a direct mapped cache has a hit rate of 95%, a hit time of
  4 ns, and a miss penalty of 100 ns, what is the AMAT?
   AMAT = Hit time + Miss rate x Miss penalty = 4 + 0.05 x 100 = 9 ns


• If an L2 cache is added with a hit time of 20 ns and a hit rate
  of 50%, what is the new AMAT?
   AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
        = 4 + 0.05 x (20 + 0.5 x 100) = 7.5 ns




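
The same two-level calculation in code form (a sketch; miss rates are
local, as in the slide):

    #include <stdio.h>

    /* With an L2, the L1 miss penalty is itself an AMAT expression:
       AMAT = Hit_L1 + MR_L1 * (Hit_L2 + MR_L2 * MissPenalty_L2). */
    static double amat2(double hit1, double mr1,
                        double hit2, double mr2, double penalty2) {
        return hit1 + mr1 * (hit2 + mr2 * penalty2);
    }

    int main(void) {
        printf("L1 only: %.1f ns\n", 4 + 0.05 * 100.0);              /* 9.0 */
        printf("L1 + L2: %.1f ns\n", amat2(4, 0.05, 20, 0.5, 100));  /* 7.5 */
        return 0;
    }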
    Reducing Cache Miss Penalty:
    2. Handling Misses Judiciously

• Critical Word First and Early Restart
  – The CPU needs just one word of the block at a time:
     » critical word first: fetch the required word first, and
     » early restart: as soon as the required word arrives, send it to the CPU.
• Giving Priority to Read Misses over Write Misses
  – Serve reads before outstanding writes have completed:
     » while write buffers improve write-through performance, they
       complicate memory accesses by potentially delaying updates to
       memory;
     » instead of waiting for the write buffer to become empty before
       processing a read miss, the write buffer is checked for content that
       might satisfy the missing read;
     » in a write-back scheme, the dirty copy being replaced is first written
       to the write buffer instead of to memory, thus improving performance.
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching
• The compiler inserts prefetch instructions
• An Example
    for(i:=0; i<3; i:=i+1)
       for(j:=0; j<100; j:=j+1)
           a[i][j] := b[j][0] * b[j+1][0]
    16-byte blocks, 8KB cache, 1-way, write back, 8-byte
      elements; what kind of locality, if any, exists for a and b?
      a. 3 rows of 100 elements are visited; spatial locality:
         even-indexed elements miss and odd-indexed elements hit,
         leading to 3*100/2 = 150 misses
      b. 101 rows and 3 columns are visited; no spatial locality, but
         there is temporal locality: the same element is used in the ith
         and (i+1)st iterations, and the same element is accessed in each i
         iteration (outer loop). 100 misses for i = 0 and 1 miss for j = 0,
         for a total of 101 misses
    Assuming a large penalty (50 cycles), at least 7 iterations
     must be prefetched ahead. Splitting the loop into two, we have:
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching
• An Example (continued)
   for(j:=0; j<100; j:=j+1){
            prefetch(b[j+7][0]);
            prefetch(a[0][j+7]);
            a[0][j] := b[j][0] * b[j+1][0];};
   for(i:=1; i<3; i:=i+1)
            for(j:=0; j<100; j:=j+1){
                     prefetch(a[i][j+7]);
                     a[i][j] := b[j][0] * b[j+1][0];}
• Assuming that each iteration of the pre-split loop consumes 7
  cycles and there are no conflict or capacity misses, the pre-split
  loop consumes a total of 7*300 + 251*50 = 14650 cycles
  (total iteration cycles plus total cache miss cycles);
 Reducing Cache Miss Penalty:
 3. Compiler-Controlled Prefetching (cont'd)
• An Example (continued)
  – the first loop consumes 9 cycles per iteration (due to the
    two prefetch instructions)
  – the second loop consumes 8 cycles per iteration (due to
    the single prefetch instruction)
  – during the first 7 iterations of the first loop, array a incurs
    4 cache misses
  – array b incurs 7 cache misses
  – during the first 7 iterations of the second loop, for i = 1
    and i = 2, array a incurs 4 cache misses each
  – array b does not incur any cache miss in the second split!
  – the split loops consume a total of
    (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3450 cycles
  – Prefetching improves performance by a factor of
    14650/3450 = 4.25
  Reducing Cache Hit Time:

• Small and simple caches
  – smaller is faster:
     » small index, less address translation time
     » a small cache can fit on the same chip as the CPU
     » low associativity: in addition to a simpler/shorter tag
       check, a 1-way cache allows overlapping the tag check with
       transmission of the data, which is not possible with any
       higher associativity!
• Avoid address translation during indexing
  – Make the common case fast:
     » use virtual addresses for the cache, because most memory
       accesses (more than 90%) take place in the cache, resulting
       in a virtual cache
 Reducing Cache Hit Time:
• Make the common case fast (continued):
  – there are at least three important performance aspects that
    directly relate to virtual-to-physical translation:
      1) improperly organized or insufficiently sized TLBs may create
         excess not-in-TLB faults, adding to program execution time
      2) for a physical cache, the TLB access must occur before the
         cache access, extending the cache access time
      3) two line addresses (e.g., an I-line and a D-line address) may be
         independent of each other in virtual address space yet collide in
         the real address space, when they draw pages whose lower page
         address bits (and upper cache address bits) are identical
  – problems with a virtual cache:
      1) page-level protection must be enforced no matter what during
         address translation (solution: copy protection info from the TLB on
         a miss and hold it in a field for future virtual indexing/tagging)
      2) when a process is switched in/out, the entire cache has to be
         flushed, because the physical addresses will be different each
         time, i.e., the problem of context switching (solution: a
         process-identifier tag -- PID)
    Reducing Cache Hit Time:
• Avoid address translation during indexing (continued)
  – problems with a virtual cache:
      3) different virtual addresses may refer to the same physical
         address, i.e., the problem of synonyms/aliases
           » HW solution: guarantee every cache block a unique physical
             address
           » SW solution: force aliases to share some address bits (e.g.,
             page coloring)
           » Virtually indexed and physically tagged
• Pipelined cache writes
  – the solution is to reduce CCT and increase the # of stages, which
    increases instr. throughput
• Trace caches
  – Find a dynamic sequence of instructions, including taken branches,
    to load into a cache block:
      » Put traces of the executed instructions into cache blocks as
        determined by the CPU
      » Branch prediction is folded into the cache and must be validated
        along with the addresses to have a valid fetch.
      » Disadvantage: stores the same instructions multiple times

				