Introduction to Memory Hierarchy by fjhuangjun


									ECE6130: Computer Architecture:
   Memory Hierarchy design

             Dr. Xubin He
   Tel: 931-3723462, Brown Hall 319

        What you have learned so far…
• Fundamentals of Computer Design
   – Cost and Technology Trends, Amdahl’s law, Principles of locality, CPU
     performance Equations
   – Classifications, Addressing Modes, Operands and Operations, MIPS
• Pipelining
   – Hazards, MIPS 5-stage pipeline
   – Dynamic scheduling, Dynamic Branch Prediction

  Next: Memory Hierarchy Design
  Cache Performance, Techniques to Improve Memory
  Performance, Memory Organization Technology, case study
   Since 1980, CPU has outpaced DRAM
   Q. How do architects address this gap?
   A. Put smaller, faster “cache” memories between CPU and DRAM.
      Create a “memory hierarchy”.
   [Figure: performance (1/latency) by year. CPU: 60% per yr (2X in
   1.5 yrs). DRAM: 9% per yr (2X in 10 yrs). Gap grew 50% per year.]
                           What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
    –   Registers “a cache” on variables – software managed
    –   First-level cache a cache on second-level cache
    –   Second-level cache a cache on memory
    –   Memory a cache on disk (virtual memory)
    –   TLB a cache on page table
    –   Branch-prediction buffer a cache on prediction information?


[Figure: the memory hierarchy, bigger toward one end and faster toward the
other; surviving level labels: L2-Cache; Disk, Tape, etc.]
 1977: DRAM faster than the CPU
                    Apple ][ (1977)
                    CPU: 1000 ns
                    DRAM: 400 ns

Steve   Wozniak

           Levels of the Memory Hierarchy
 Level            Capacity     Access time             Cost
 (Upper Level)
 CPU Registers    100s Bytes   <1 ns
 Cache            K Bytes      1 ns                    1-0.1 cents/bit
 Main Memory      M-G Bytes    100 ns                  $.0001-.00001 cents/bit
 Disk             G-T Bytes    10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit
 Tape                          sec-min                 10^-8
 (Lower Level: Larger)
 Unit moved between the registers and the next level: Instr. Operands
    Memory Hierarchy: Apple iMac G5 (1.6 GHz)
 Level              Reg      L1 Inst   L1 Data   L2       DRAM     Disk
 Size               1K       64K       32K       512K     256M     80G
 Latency (cycles)   1        3         3         11       88       10^7
 Latency (time)     0.6 ns   1.9 ns    1.9 ns    6.9 ns   55 ns    12 ms
 Managed by         compiler (Reg); hardware (L1, L2); OS, hardware (DRAM, Disk)
Goal: Illusion of large, fast, cheap memory
Let programs address a memory space that
 scales to the disk size, at a speed that is
     usually as fast as register access
            iMac’s PowerPC 970: All caches on-chip
[Figure labels: L1 (64K Instruction), L1 (32K Data)]
           The Principle of Locality
• The Principle of Locality:
   – Programs access a relatively small portion of the address
     space at any instant of time.
• Two Different Types of Locality:
   – Temporal Locality (Locality in Time): If an item is
     referenced, it will tend to be referenced again soon (e.g.,
     loops, reuse)
   – Spatial Locality (Locality in Space): If an item is
     referenced, items whose addresses are close by tend to be
     referenced soon
     (e.g., straightline code, array access)
      It is a property of programs which is exploited in machine design.
• For the last 15 years, hardware has relied on locality for speed
[Figure: plot of memory address (one dot per access). Programs with
locality cache well; one region is labeled “Bad locality behavior”.]
Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for
Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
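
To make the two kinds of locality concrete, here is a minimal sketch that measures the hit rate of three access patterns. The direct-mapped cache model, its size (64 frames of 64 bytes), and the name `hit_rate` are assumptions chosen for illustration; they do not come from the slides.

```python
def hit_rate(addresses, num_frames=64, block_size=64):
    """Fraction of accesses that hit in a toy direct-mapped cache."""
    frames = [None] * num_frames        # block address currently held by each frame
    hits = 0
    for addr in addresses:
        block = addr // block_size      # block address of this access
        frame = block % num_frames      # direct-mapped placement
        if frames[frame] == block:
            hits += 1
        else:
            frames[frame] = block       # miss: bring the block in
    return hits / len(addresses)

sequential = [4 * i for i in range(1024)]     # spatial locality: consecutive words
looped = [4 * (i % 16) for i in range(1024)]  # temporal locality: same 16 words reused
strided = [64 * i for i in range(1024)]       # one access per block, never reused

print(hit_rate(sequential))  # 0.9375
print(hit_rate(looped))      # 0.9990234375
print(hit_rate(strided))     # 0.0
```

The sequential and looped patterns hit almost every time, while the strided pattern touches a new block on every access and never hits.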
    Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
    – Hit Rate: the fraction of memory accesses found in the upper level
    – Hit Time: time to access the upper level, which consists of
        RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
    – Miss Rate = 1 - (Hit Rate)
    – Miss Penalty: time to replace a block in the upper level +
       time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)

[Figure: the processor exchanges data with the upper-level memory (holding
Blk X), which exchanges blocks with the lower-level memory (holding Blk Y).]

                           More terms
• Block: a fixed-size collection of data, which is retrieved from the main
  memory and placed into cache. (cache unit)
• Temporal locality: recently accessed data items are likely to be accessed in
  the near future.
• Spatial locality: items whose addresses are near one another tend to be
  referenced close together in time.
• The time required for a cache miss depends on both the latency and the
  bandwidth of the memory. Latency determines the time to retrieve the first
  word of the block, and bandwidth determines the time to retrieve the rest
  of this block.
• Virtual memory: the address space is usually broken into fixed-size
  blocks (pages). At any time, each page resides either in main memory or on
  disk. When the CPU references an item within a page that is not in the
  cache or main memory, a page fault occurs, and the entire page is then
  moved from the disk to main memory.
• The cache and main memory have the same relationship as the main
  memory and disk.

                  Cache performance
• Memory stall cycles: the number of cycles during
  which CPU is stalled waiting for memory access.
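• $\text{CPU time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time}$
• $\text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty}$
  $= IC \times \dfrac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$
  $= IC \times \dfrac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}$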

 Miss Rate: the fraction of cache accesses that result in a miss.

                Impact on Performance
•   Example: assume we have a computer where the CPI is 1.0 when all
    memory accesses hit in the cache. The only data accesses are loads and
    stores, and these total 50% of the instructions. If the miss penalty is
    25 clock cycles and the miss rate is 2%, how much faster would the
    computer be if all instructions were cache hits?
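
One way to work through this example is sketched below. It assumes the 2% miss rate applies to instruction fetches and data accesses alike, so each instruction makes 1 + 0.5 memory accesses; the function name `cpi_with_cache` is hypothetical.

```python
def cpi_with_cache(base_cpi, accesses_per_inst, miss_rate, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    stall_cycles_per_inst = accesses_per_inst * miss_rate * miss_penalty
    return base_cpi + stall_cycles_per_inst

# Assumption: one instruction fetch plus 0.5 data accesses per instruction.
accesses_per_inst = 1.0 + 0.5

cpi_real = cpi_with_cache(1.0, accesses_per_inst, miss_rate=0.02, miss_penalty=25)
cpi_ideal = 1.0  # every access hits

# Same instruction count and clock cycle, so the speedup is the CPI ratio.
speedup = cpi_real / cpi_ideal
print(f"CPI with misses: {cpi_real:.2f}, speedup if all hits: {speedup:.2f}x")
```

Under these assumptions the stalls add 0.75 cycles per instruction, so the all-hits machine would be about 1.75 times faster.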

Traditional Four Questions for Memory
         Hierarchy Designers

• Q1: Where can a block be placed in the upper level?
  (Block placement)
   – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
  (Block identification)
   – Tag/Block
• Q3: Which block should be replaced on a miss?
  (Block replacement)
   – Random, LRU
• Q4: What happens on a write?
  (Write strategy)
   – Write Back or Write Through (with Write Buffer)

    Q1: Where can a block be placed in the upper
             level? (Block placement)
(a) fully associative: any block in the main memory can be
    placed in any block frame.
          It is flexible but expensive due to associativity
(b) direct mapping: each block in memory is placed in a fixed block frame with the
   following mapping function:
   block frame # = (block addr in mem.) MOD (#of block frames in cache)
          It is inflexible but simple and economical.
(c) set associative: a compromise between fully associative and direct mapping; The cache
    is divided into sets of block frames, and each block from the memory is first mapped to
    a fixed set wherein the block can be placed in any block frame. Mapping to a set follows
    the function, called a bit selection:
    set # = (block addr in mem.)MOD(# of sets in cache)
    - n-way set associative: there are n blocks in a set;
    - fully associative is an m-way set associative if there are m block frames in the cache;
    whereas, direct mapping is one-way set associative
    - one-way, two-way, and four-way are the most frequently used methods.
      Q1: Where can a block be placed in the upper level?

• Block 12 placed in 8 block cache:
   – Fully associative, direct mapped, 2-way set associative
   – S.A. Mapping = Block Number Modulo Number Sets

   Placement            Mapping            Where block 12 can go (frames 0-7)
   Fully associative    (any frame)        any of 0 1 2 3 4 5 6 7
   Direct mapped        (12 mod 8) = 4     frame 4
   2-way set assoc.     (12 mod 4) = 0     either frame of set 0
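
The three placement policies differ only in the number of frames per set, which the following sketch makes explicit. The helper name `candidate_frames` and the convention that the frames of a set are numbered contiguously are assumptions for illustration.

```python
def candidate_frames(block_addr, num_frames, ways):
    """Frames in which a memory block may be placed.

    ways == 1          -> direct mapped
    ways == num_frames -> fully associative
    """
    num_sets = num_frames // ways
    set_index = block_addr % num_sets       # bit selection
    first = set_index * ways                # assumed: a set's frames are contiguous
    return list(range(first, first + ways))

print(candidate_frames(12, 8, ways=1))  # [4]
print(candidate_frames(12, 8, ways=2))  # [0, 1]
print(candidate_frames(12, 8, ways=8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```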




   Q2: How is a block found if it is in the upper level?(Block identification)

• each block frame in the cache has an address tag indicating the block's address
  in the memory
• all possible tags are searched in parallel
• a valid bit is attached to the tag to indicate whether the block
  contains valid information or not
• an address for a datum from CPU, A, is divided into block
  address field and block offset field:
         block address = A / block size
         block offset = (A) MOD (block size)
• block address is further divided into tag and index
• index indicates the set in which the block may reside, while tag is compared to
  indicate a hit or a miss
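
The address breakdown above can be sketched as follows. The function name `split_address` and the example parameters are illustrative assumptions.

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, block offset)."""
    block_addr = addr // block_size    # block address = A / block size
    offset = addr % block_size         # block offset = A MOD block size
    index = block_addr % num_sets      # selects the set
    tag = block_addr // num_sets       # compared against the stored tags
    return tag, index, offset

tag, index, offset = split_address(0x12345, block_size=64, num_sets=512)
print(tag, index, offset)  # 2 141 5
```

When the block size and the number of sets are powers of two, the divisions and remainders reduce to selecting bit fields of the address.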

      Example: Alpha 21264 Data Cache

[Figure 5.7: Alpha 21264 data cache. Annotation: for 2-way set
associative, use round-robin (FIFO) to choose where to go. Other surviving
labels: 16 bytes; Cache miss.]
  Q3: Which block should be replaced on a miss?
              (Block replacement)

• the more choices for replacement, the more expensive
  for hardware -- direct mapping is the simplest
• random vs. least-recently used (LRU): the former has
  uniform allocation and is simple to build while the latter
  can take advantage of temporal locality but can be
  expensive to implement (why?)

• Random, LRU, FIFO

    Q3: After a cache read miss, if there are no empty cache blocks,
    which block should be removed from the cache?
• The Least Recently Used (LRU) block? Appealing, but hard to implement
  for high associativity.
• A randomly chosen block? Easy to implement; how well does it work?

      Miss Rate for 2-way Set Associative Cache
               Size         Random           LRU
              16 KB         5.7%            5.2%
              64 KB         2.0%            1.9%
              256 KB        1.17%           1.15%
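
As a rough illustration of why LRU costs more than random replacement, the sketch below models one cache set that must keep its blocks in recency order on every access. The class name `LRUSet` and the use of `OrderedDict` are choices made for this example, not how hardware implements LRU.

```python
from collections import OrderedDict

class LRUSet:
    """One set of an n-way cache with LRU replacement (software model)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tags ordered from least to most recently used

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was replaced."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # every hit must update the ordering
            return True, None
        evicted = None
        if len(self.blocks) == self.ways:     # set full: drop the least recently used
            evicted, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return False, evicted

s = LRUSet(ways=2)
for tag in ["A", "B", "A", "C"]:
    print(tag, s.access(tag))
```

In this run the access to C evicts B, because A was touched more recently. Random replacement needs none of this bookkeeping.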

                          Q4: What happens on a write?
                                (Write strategy)
•   most cache accesses are reads: all instruction accesses are reads, and most instructions don’t write
    to memory.
•   optimize reads to make the common case fast. The CPU must wait for reads but does not have to
    wait for writes. Fortunately, reads are easy: reading the block and comparing the tag can be done
    in parallel. Writes are hard:
             (a) cannot overlap tag reading and block writing (destructive)
             (b) CPU specifies write size: only 1 - 8 bytes
    Thus write strategies often distinguish cache designs:
    (a) write through (or store through): write info to blocks in both levels
    - ensuring consistency at the cost of memory and bus bandwidth
    - write stalls may be alleviated by using write buffers by overlapping processor execution with
    memory updating
    (b) write back (store in): write info to blocks only in cache level
    - minimizing memory and bus traffic at the cost of weak consistency
    - use dirty bit to indicate modification, reduce frequency of write-back on replacement
    - read misses may result in writes (why?)
•   On a write miss: the data are not needed
    (a) write allocate (The block is allocated on a write miss):      normally used in write-back
    (b) no-write allocate (write miss does not affect cache, modified in
    lower-level cache):      normally used in write-through

    Q4: What happens on a write?
                          Write-Through              Write-Back
 Policy                   Data written to cache      Write data only to the
                          also written to            cache. Update lower level
                          lower-level memory         when a block falls out
                                                     of the cache
 Debug                    Easy                       Hard
 Do read misses           No                         Yes
 produce writes?
 Do repeated writes       Yes                        No
 make it to lower level?

   Additional option -- let writes to an un-cached address
        allocate a new cache line (“write-allocate”).
   Write Buffers for Write-Through Caches

[Figure: a Write Buffer sits between the Cache (next to the Processor) and
the Lower Level Memory.]

  Holds data awaiting write-through to lower level memory
Q. Why a write buffer?
A. So CPU doesn’t stall.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffers
   first and then send the read.
• Example
  – Assume a fully associative write-back cache with many
    cache entries that start empty. Below is a sequence of five
    memory operations (the address is in square brackets):

  For no-write allocate, ? Misses and ? hits

  For write allocate, ? Misses and ? hits?
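
The effect of the two write-miss policies can be checked with a small simulation. The operation sequence below is an assumed one, chosen only for illustration (it is not taken from the slide), and `count_misses_hits` is a hypothetical helper.

```python
def count_misses_hits(ops, write_allocate):
    """Count (misses, hits) for a fully associative cache that never evicts."""
    cache = set()            # addresses of blocks currently cached; starts empty
    misses = hits = 0
    for kind, addr in ops:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            # Read misses always allocate; write misses only with write allocate.
            if kind == "read" or write_allocate:
                cache.add(addr)
    return misses, hits

# Assumed sequence for illustration.
ops = [("write", 100), ("write", 100), ("read", 200), ("write", 200), ("write", 100)]

print(count_misses_hits(ops, write_allocate=False))  # (4, 1)
print(count_misses_hits(ops, write_allocate=True))   # (2, 3)
```

With no-write allocate, address 100 is never brought into the cache, so every write to it misses; with write allocate, only the first touch of each address misses.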

Example: Fig. 5.7 Alpha 21264 data cache (64-bit machine).
Cache size = 65,536 bytes (64K), block size = 64 bytes, two-way
set-associative placement, write back, write allocate on a write
miss, 44-bit physical address. What is the index size?
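
A short calculation for this example, assuming all sizes are powers of two; the helper `field_widths` is a name introduced here for illustration.

```python
def field_widths(cache_bytes, block_bytes, ways, addr_bits):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

tag_bits, index_bits, offset_bits = field_widths(65536, 64, ways=2, addr_bits=44)
print(tag_bits, index_bits, offset_bits)  # 29 9 6
```

There are 65,536 / (64 × 2) = 512 sets, so the index is 9 bits; the 64-byte block needs a 6-bit offset, leaving 29 tag bits of the 44-bit physical address.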

               Unified vs Split Caches
• Unified vs Separate I&D
[Figure: left, a processor with a single Unified Cache-1; right, a
processor with separate I-Cache-1 and D-Cache-1 above a unified cache.]

• Example:
  – 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
  – 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
  – Assume 33% data ops => 75% accesses from instructions (1.0/1.33)
  – hit time=1, miss time=50
  – Note that data hit has 1 stall for unified cache (only one port)
Discussion: a single (unified) cache and separate caches see different miss
rates, because instruction and data references behave differently (about
74% instruction vs. 26% data references); see Figure 5.8.
(A single/unified cache may have structural hazards from loads/stores.)
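
The comparison can be sketched with the usual average memory access time formula (hit time + miss rate × miss penalty). This sketch reads "miss time=50" as a 50-cycle miss penalty, which is an assumption; the function name `amat` is also introduced here.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in clock cycles."""
    return hit_time + miss_rate * miss_penalty

inst_frac, data_frac = 0.75, 0.25   # share of accesses from instructions and data
penalty = 50                        # assumed reading of "miss time=50"

# Split 16KB instruction + 16KB data caches.
split = (inst_frac * amat(1, 0.0064, penalty)
         + data_frac * amat(1, 0.0647, penalty))

# 32KB unified cache: a data access waits one extra cycle for the single port.
unified = (inst_frac * amat(1, 0.0199, penalty)
           + data_frac * amat(1 + 1, 0.0199, penalty))

print(f"split: {split:.5f} cycles, unified: {unified:.5f} cycles")
```

Under these assumptions the split caches average about 2.05 cycles per access and the unified cache about 2.245, so the split organization comes out ahead despite its higher data miss rate.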

                   Size       Instruction cache   Data cache   Unified cache
                   8 KB             8.16             44.0          63.0
                   16 KB            3.82             40.9          51.0
                   32 KB            1.36             38.4          43.3
                   64 KB            0.61             36.9          39.4
                   128 KB           0.30             35.3          36.2
                   256 KB           0.02             32.6          32.9

           Figure 5.8 Misses per 1000 instructions for instruction,
              data, and unified caches of different sizes
         The Limits of Physical Addressing
            “Physical addresses” of memory locations
[Figure: the CPU is connected directly to Memory by address lines A0-A31
and data lines D0-D31.]

         All programs share one address space:
               The physical address space
          Machine language programs must be
           aware of the machine organization
           No way to prevent a program from
           accessing any machine resource
          Solution: Add a Layer of Indirection
[Figure: the CPU issues “Virtual Addresses” (A0-A31); address translation
maps virtual to physical; Memory receives “Physical Addresses” (A0-A31).
Data lines D0-D31 connect CPU and Memory.]

         User programs run in a standardized
                 virtual address space
             Address Translation hardware
          managed by the operating system (OS)
         maps virtual address to physical memory
    Hardware supports “modern” OS features:
        Protection, Translation, Sharing
           Three Advantages of Virtual Memory
• Translation:
   – Program can be given consistent view of memory, even though physical
     memory is scrambled
   – Makes multithreading reasonable (now used a lot!)
   – Only the most important part of program (“Working Set”) must be in
     physical memory.
   – Contiguous structures (like stacks) use only as much physical memory as
     necessary yet still grow later.
• Protection:
   – Different threads (or processes) protected from each other.
   – Different pages can be given special behavior
       » (Read Only, Invisible to user programs, etc).
   – Kernel data protected from User programs
   – Very important for protection from malicious programs
• Sharing:
   – Can map same physical page to multiple users
     (“Shared memory”)
     Page tables encode virtual address spaces
[Figure: a virtual address indexes the Page Table, whose entries point to
frames in the Physical Memory Space.]
• A virtual address space is divided into blocks of memory called pages
• A machine usually supports pages of a few sizes (MIPS R4000)
• A page table is indexed by a virtual address
• A valid page table entry codes physical memory “frame” address for the page
• The OS manages the page table for each ASID
                            Details of Page Table
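[Figure: the Virtual Address is split into a virtual page number (V page
no.) and a 12-bit offset. The Page Table Base Reg plus the virtual page
number index into the page table, which is located in physical memory.
Each entry holds V (valid), Access Rights, and PA. The Physical Address is
the physical page number (P page no.) followed by the same 12-bit offset.]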
   • Page table maps virtual page numbers to physical
     frames (“PTE” = Page Table Entry)
   • Virtual memory => treat memory as a cache for disk
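
A minimal sketch of the translation step, assuming 12-bit page offsets and a page table modeled as a Python dict; access rights are left out, and `translate` and `PageFault` are names introduced for this example.

```python
PAGE_OFFSET_BITS = 12   # 4KB pages

class PageFault(Exception):
    """Raised when the referenced page has no valid page table entry."""

def translate(vaddr, page_table):
    """Map a virtual address to a physical address via the page table."""
    vpn = vaddr >> PAGE_OFFSET_BITS                   # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)    # unchanged by translation
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)
    return (entry["frame"] << PAGE_OFFSET_BITS) | offset

page_table = {0x5: {"valid": True, "frame": 0x2},
              0x6: {"valid": False, "frame": 0x0}}
print(hex(translate(0x5ABC, page_table)))  # 0x2abc
```

Only the page number is translated; the offset within the page passes through unchanged.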
     Page tables may not fit in memory!
               A table for 4KB pages for a 32-bit address
                          space has 1M entries
         Each process needs its own address space!

  Two-level Page Tables

          32 bit virtual address
    bits 31-22     bits 21-12     bits 11-0
    P1 index       P2 index       Page Offset

Top-level table wired in main memory

Subset of 1024 second-level tables in
  main memory; rest are on disk or unallocated
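
The sizes and the address split can be checked with a few lines. The function name `split_virtual_address` is an assumption; the bit positions follow the 10/10/12 layout of the 32-bit address.

```python
# A flat table for 4KB pages in a 32-bit address space:
flat_entries = 2 ** (32 - 12)
print(flat_entries)  # 1048576, i.e. 1M entries

def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (P1 index, P2 index, page offset)."""
    p1 = (vaddr >> 22) & 0x3FF    # bits 31-22 select the second-level table
    p2 = (vaddr >> 12) & 0x3FF    # bits 21-12 select the entry in that table
    offset = vaddr & 0xFFF        # bits 11-0
    return p1, p2, offset

print(split_virtual_address(0x00403004))  # (1, 3, 4)
```

Each 10-bit index selects one of 1024 entries, so only the second-level tables that are actually in use need to stay in main memory.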

     VM and Disk: Page replacement policy
[Figure: the set of all pages in memory, swept by a head pointer and a
tail pointer; the page table holds a dirty bit and a used bit per page.]
• Dirty bit: page written.
• Used bit: set to 1 on any reference.
• Tail pointer: clear the used bit in the page table.
• Head pointer: place pages on free list if used bit is still clear.
  Schedule pages with dirty bit set to be written to disk.
• Freelist: free pages.
• Architect’s role: support setting dirty and used bits
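
The policy above can be sketched in a simplified form. The real scheme runs two pointers continuously around the set of pages; the sketch below collapses this into one pass (clear, then observe references, then collect), which is an assumption made to keep the example short. The name `sweep` is hypothetical.

```python
def sweep(pages, referenced_meanwhile):
    """One simplified pass of the used-bit policy.

    pages: dict mapping page number -> {"used": bool, "dirty": bool}
    referenced_meanwhile: page numbers touched between the two pointers
    Returns (free_list, write_queue).
    """
    for page in pages.values():            # tail pointer: clear the used bit
        page["used"] = False
    for number in referenced_meanwhile:    # hardware sets the bit on any reference
        pages[number]["used"] = True
    free_list, write_queue = [], []
    for number, page in pages.items():     # head pointer: is the bit still clear?
        if not page["used"]:
            free_list.append(number)
            if page["dirty"]:
                write_queue.append(number)  # schedule the write to disk
    return free_list, write_queue

pages = {0: {"used": True, "dirty": False},
         1: {"used": True, "dirty": True},
         2: {"used": False, "dirty": False}}
print(sweep(pages, referenced_meanwhile=[2]))  # ([0, 1], [1])
```

Pages 0 and 1 were not touched after their used bits were cleared, so they become candidates for reuse; page 1 is dirty and is also queued for writing.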
TLB Design Concepts

MIPS Address Translation: How does it work?
[Figure: the CPU issues “Virtual Addresses” (A0-A31); a Translation
Look-Aside Buffer maps virtual to physical; Memory receives “Physical
Addresses” (A0-A31). Data lines D0-D31 connect CPU and Memory.]
     Translation Look-Aside Buffer (TLB):
       A small fully-associative cache of
       mappings from virtual to physical addresses
     Q. What is the table of mappings that it caches?
     TLB also contains protection bits for virtual address
  Fast common case: Virtual address is in TLB,
    process has permission to read/write it.
    The TLB caches page table entries
[Figure: the virtual address (page, off) is looked up in the TLB, which
caches page table entries for the ASID; the physical address is (page,
off). Example TLB contents: page 2 -> frame 2, page 5 -> frame 0.]
• Physical and virtual pages must be the same size!
• V=0 pages either reside on disk or have not yet been allocated.
  OS handles V=0: “Page fault”
• MIPS handles TLB misses in software (random replacement). Other
  machines use hardware.
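
A minimal model of a TLB in front of a page table, with random replacement on a TLB miss. The class name `TLB`, the dict-based page table, the fixed random seed, and the use of `KeyError` to stand in for a page fault are all assumptions for illustration.

```python
import random

class TLB:
    """Small fully associative cache of page-number -> frame mappings."""

    def __init__(self, capacity, page_table, seed=0):
        self.capacity = capacity
        self.page_table = page_table       # valid pages only: page -> frame
        self.entries = {}                  # cached translations
        self.rng = random.Random(seed)     # seeded so runs are repeatable
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:           # fast common case: TLB hit
            return self.entries[page]
        self.misses += 1                   # TLB miss: consult the page table
        if page not in self.page_table:
            raise KeyError(f"page fault on virtual page {page}")
        if len(self.entries) == self.capacity:
            victim = self.rng.choice(list(self.entries))   # random replacement
            del self.entries[victim]
        self.entries[page] = self.page_table[page]
        return self.entries[page]

tlb = TLB(capacity=2, page_table={2: 2, 5: 0, 7: 3})
print(tlb.lookup(2), tlb.lookup(5), tlb.lookup(2), tlb.misses)  # 2 0 2 2
```

The second lookup of page 2 is served from the TLB, so only the first touch of each page counts as a miss.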
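          Can TLB and caching be overlapped?
[Figure: the Virtual Page Number goes to the Translation Look-Aside Buffer
(TLB) while the Page Offset supplies the cache Index and Byte Select. The
indexed Cache Tag (with its Valid bit) is compared (=) with the TLB output
to select the Cache Block for Data out.]

  This works, but ...
Q. What is the downside?
   A. Inflexibility. Size of cache
   limited by page size.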
Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to
     index into the cache do not change as the result of VA translation

This usually limits things to small caches, large page sizes, or high
     n-way set associative caches if you want a large cache

Example: suppose everything the same except that the cache is
    increased to 8 K bytes instead of 4 K:

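[Figure: the cache address now has an 11-bit index and a 2-bit byte select
(00), while the virtual address has a 20-bit virt page # and a 12-bit
disp. The top index bit is changed by VA translation, but is needed for
cache lookup.]
Solutions:
        go to 8K byte page sizes;
        go to 2 way set associative cache; or
        SW guarantee VA[13]=PA[13]
[Figure: 2 way set assoc cache, 1K entries per way, 4 bytes each.]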
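
The constraint can be stated as a one-line check: the index and block-offset bits together must fit inside the page offset, which holds when the cache size divided by the associativity does not exceed the page size. The function name `can_overlap` is an assumption.

```python
def can_overlap(cache_bytes, ways, page_bytes):
    """True if the cache can be indexed using page-offset bits only."""
    bytes_per_way = cache_bytes // ways    # address range covered by index + offset
    return bytes_per_way <= page_bytes

print(can_overlap(4096, ways=1, page_bytes=4096))  # True:  4K cache, 4K pages
print(can_overlap(8192, ways=1, page_bytes=4096))  # False: one index bit is translated
print(can_overlap(8192, ways=1, page_bytes=8192))  # True:  go to 8K pages
print(can_overlap(8192, ways=2, page_bytes=4096))  # True:  go to 2-way set associative
```

The third option, having software guarantee that the extra bit is the same in the virtual and physical address, is not captured by this check.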
         Use virtual addresses for cache?
[Figure: the CPU sends “Virtual Addresses” (A0-A31) to a virtually
addressed Cache; the Translation Look-Aside Buffer sits between the Cache
and Main Memory, which receives “Physical Addresses” (A0-A31).]

         Only use TLB on a cache miss !

 Downside: a subtle, fatal problem. What is it?

 A. Synonym problem. If two address spaces
 share a physical frame, data may be in cache
twice. Maintaining consistency is a nightmare.
                 Summary #1/3:
             The Cache Design Space
• Several interacting dimensions
   –   cache size
   –   block size
   –   associativity
   –   replacement policy
   –   write-through vs write-back
   –   write allocation
• The optimal choice is a compromise
   – depends on access characteristics
       » workload
       » use (I-cache, D-cache, TLB)
   – depends on technology / cost
• Simplicity often wins
[Figures: the design space with axes Cache Size, Associativity, and Block
Size; a trade-off sketch for Factor A and Factor B, from Less to More.]

      Summary #2/3: Caches
• The Principle of Locality:
   – Programs access a relatively small portion of the address space at any
     instant of time.
       » Temporal Locality: Locality in Time
       » Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
   – Compulsory Misses: sad facts of life. Example: cold start misses.
   – Capacity Misses: increase cache size
   – Conflict Misses: increase cache size and/or associativity.
               Nightmare Scenario: ping pong effect!
• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses) vs. just
  f(ops): affects Compilers, Data structures, and Algorithms

 Summary #3/3: TLB, Virtual Memory
• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance
    – funny times, as most systems can’t access all of 2nd level cache without TLB misses
• Caches, TLBs, Virtual Memory all understood by examining how they
  deal with 4 questions:
  1) Where can block be placed?
  2) How is block found?
  3) What block is replaced on miss?
  4) How are writes handled?
• Today VM allows many processes to share a single memory without
  having to swap all processes to disk; today VM protection is more
  important than memory hierarchy benefits, but computers remain insecure

