                 Computer Architecture
                 Module 6: Cache

           Asst. Prof. Dr. Wanida Kanarkard
           Department of Computer Engineering
           Khon Kaen University
  Levels of the Memory Hierarchy

  Level         Capacity     Access Time            Cost                      Staging Xfer Unit          Managed by
  -----------   ----------   --------------------   -----------------------   ------------------------   --------------
  Registers     100s Bytes   <10s ns                --                        Instr. Operands, 1-8 B     prog./compiler
  Cache         K Bytes      10-100 ns              1-0.1 cents/bit           Blocks, 8-128 bytes        cache cntl
  Main Memory   M Bytes      200ns-500ns            $.0001-.00001 cents/bit   Pages, 512-4K bytes        OS
  Disk          G Bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   Files, Mbytes              user/operator
  Tape          infinite     sec-min                10^-8 cents/bit           --                         --

  Upper levels are faster; lower levels are larger.

                                                                                2
      The Principle of Locality
• The Principle of Locality:
   – Programs access a relatively small portion of the
     address space at any instant of time.
• Two Different Types of Locality:
   – Temporal Locality (Locality in Time): If an item is
     referenced, it will tend to be referenced again soon
     (e.g., loops, reuse)
   – Spatial Locality (Locality in Space): If an item is
     referenced, items whose addresses are close by tend
     to be referenced soon
     (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed

                                                            3
     Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block
  X)
   – Hit Rate: the fraction of memory access found in the upper
      level
   – Hit Time: Time to access the upper level which consists of
        RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level
  (Block Y)
   – Miss Rate = 1 - (Hit Rate)
   – Miss Penalty: Time to replace a block in the upper level +
        Time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)

          [Diagram: the processor exchanges data with the Upper Level
          Memory, which holds Blk X; on a miss, Blk Y is brought in
          from the Lower Level Memory.]

                                                                       4
              Cache Measures
• Hit rate: fraction of accesses found in that level
   – Usually so high that we talk about the miss rate instead
   – Miss rate fallacy: miss rate is as misleading a proxy for
     average memory access time as MIPS is for CPU performance
• Average memory-access time
      = Hit time + Miss rate x Miss penalty
              (ns or clocks)
• Miss penalty: time to replace a block with one from the lower
  level, including time to deliver it to the CPU
   – access time: time to reach the lower level
     = f(latency to lower level)
   – transfer time: time to transfer the block
     = f(BW between upper & lower levels)               5
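
A quick way to internalize the formula is to evaluate it for a
hypothetical cache; the numbers below are illustrative assumptions,
not figures from the slides:

    # Average memory access time = hit time + miss rate * miss penalty
    hit_time = 1.0        # ns to access the cache (assumed)
    miss_rate = 0.05      # 5% of accesses miss (assumed)
    miss_penalty = 100.0  # ns to fetch the block from the lower level (assumed)

    amat = hit_time + miss_rate * miss_penalty
    print(f"AMAT = {amat} ns")  # 1 + 0.05 * 100 = 6.0 ns

Even a small miss rate dominates the average when the miss penalty is
two orders of magnitude larger than the hit time.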
  Computer Memory System Overview
• Characteristics of memory system
   – Location
   – Capacity
   – Unit of transfer
   – Method of accessing
   – Performance
   – Physical type
   – Physical characteristics
   – Organisation
• The memory hierarchy


                                     6
                    Location
• Registers of CPU and Control Unit
• Internal (cache, main memory)
• External (accessible through I/O)




                                      7
                    Capacity
• Internal memory
   – Number of bytes or words
   – Common word length: 8, 16, 32 bits
• External memory
   – Number of bytes




                                          8
              Unit of Transfer

• Internal memory
   – Number of bits read out of or written into memory
     at a time, usually governed by the width of the data bus
• External memory
   – Usually a block which is much larger than a word




                                                         9
             Access Methods

• Sequential
   – e.g. tape
• Direct
   – e.g. disk
• Random
   – e.g. main memory
• Associative
   – e.g. cache




                              10
                Performance

• Access time (latency)
   – Time between presenting the address and getting
     the valid data
• Memory Cycle time (to RAM)
   – Time may be required for the memory to “recover”
     before next access
   – Cycle time is access + recovery
• Transfer Rate
   – Rate at which data can be moved



                                                    11
              Physical Types
• Semiconductor
   – RAM
• Magnetic
   – Disk & Tape
• Optical
   – CD & DVD




                               12
       Physical Characteristics
• Volatile vs. non-volatile
• Erasable vs. non-erasable




                                  13
                 Organisation
• Physical arrangement of bits into words




                                            14
            Memory Hierarchy
• Balance of cost, capacity, and access time
   – Capacity up => cost per bit down, access time up
   – Access time down => cost per bit up, capacity down




                                               15
      Memory Hierarchy - Diagram
Going down the hierarchy:
• Decreasing cost per bit
• Increasing capacity
• Increasing access time
• Decreasing frequency of access by the processor




                                   16
     Cache Memory Principles
• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module




                                            17
                   What is a cache?
• Small, fast storage used to improve average access time to slow
  memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
   – Registers: a cache on variables
   – First-level cache: a cache on the second-level cache
   – Second-level cache: a cache on memory
   – Memory: a cache on disk (virtual memory)
   – TLB: a cache on the page table
   – Branch prediction: a cache on prediction information?

   [Diagram: Proc/Regs -> L1-Cache -> L2-Cache -> Memory -> Disk,
   Tape, etc.; levels get Bigger going down and Faster going up.]
                                                                    18
       Cache Operation Overview
1)   CPU generates the address (RA) of a word to be read
2)   Check if the block containing RA is in the cache
3)   If yes, get the word from the cache (fast) and return it
4)   If no, access main memory for the required block
5)   Allocate a cache line for this newly found block
6)   Load the block into the cache and deliver the word to the CPU




                                                       19
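
The six steps above can be sketched in a few lines of Python; the
block size and the flat main-memory model are assumptions chosen for
illustration, and capacity limits and replacement are ignored here
(they are covered under Q3 later):

    BLOCK_SIZE = 4                       # bytes per block (assumed)
    cache = {}                           # block address -> block data

    def read_word(ra, main_memory):
        """Return the item at address RA, loading its block on a miss."""
        block_addr = ra // BLOCK_SIZE    # step 2: which block contains RA?
        if block_addr in cache:          # step 3: hit - serve from cache
            block = cache[block_addr]
        else:                            # steps 4-6: miss - fetch the block
            start = block_addr * BLOCK_SIZE
            block = main_memory[start:start + BLOCK_SIZE]
            cache[block_addr] = block    # step 5: allocate a line for it
        return block[ra % BLOCK_SIZE]    # step 6: deliver the word to the CPU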
Typical Cache Organisation




                             20
       Elements of Cache Design
•   Cache size
•   Mapping Function
•   Replacement Algorithms
•   Write Policy
•   Line Size
•   Number of Caches




                                  21
                   Cache Size
• Cost
   – More cache is more expensive
• Speed
   – More cache is faster (less block swapping)
   – But checking a larger cache for data takes longer




                                                  22
   Simplest Cache: Direct Mapped

[Diagram: 16 memory locations (addresses 0-F) mapping into a 4 Byte
Direct Mapped Cache with Cache Index 0-3]

                          • Location 0 can be occupied by data from:
                             – Memory location 0, 4, 8, ... etc.
                             – In general: any memory location
                               whose 2 LSBs of the address are 0s
                             – Address<1:0> => cache index
                          • Which one should we place in the cache?
                          • How can we tell which one is in the
                            cache?
                                                                 23
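
The index/tag arithmetic for the 4-entry cache above can be checked
directly (a sketch of the slide's example, not a full cache model):

    NUM_LINES = 4   # the 4 Byte Direct Mapped Cache above

    for addr in [0x0, 0x4, 0x8, 0xC]:
        index = addr & (NUM_LINES - 1)   # Address<1:0> => cache index
        tag = addr >> 2                  # remaining bits identify the block
        print(f"address {addr:#x}: index {index}, tag {tag}")
    # All four addresses land on index 0; only the stored tag tells us
    # which one currently occupies the line.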
  Q1: Where can a block be
  placed in the upper level?
• direct mapped - 1 place



• n-way set associative - n places

• fully-associative - any place




                                     24
Q2: How is a block found if it
    is in the upper level?
• Tag on each block
   – No need to check index or block offset
• Increasing associativity shrinks index, expands tag




                 |<------- Block Address ------->|
                 |     Tag      |      Index     |  Block offset  |

                                                                25
 Q3: Which block should be
    replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
   – Random
   – LRU (Least Recently Used)
Associativity:     2-way            4-way            8-way
Size           LRU     Random   LRU     Random   LRU     Random
16 KB          5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB          1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB         1.15%   1.17%    1.13%   1.13%    1.12%   1.12%



                                                   28
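
LRU is easy to model for one set with Python's OrderedDict (a sketch;
real hardware tracks use ordering with a few LRU bits per set):

    from collections import OrderedDict

    class LRUSet:
        """One k-way set with least-recently-used replacement (sketch)."""
        def __init__(self, ways):
            self.ways = ways
            self.lines = OrderedDict()           # tag -> block data

        def access(self, tag):
            if tag in self.lines:                # hit: mark most recently used
                self.lines.move_to_end(tag)
                return True
            if len(self.lines) >= self.ways:     # full set on a miss:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[tag] = None               # install the new block
            return False

    s = LRUSet(ways=2)
    for t in [1, 2, 1, 3, 2]:                    # tag 2 is LRU when 3 arrives
        print(t, "hit" if s.access(t) else "miss")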
 Q4: What happens on a write?
• Write through—The information is written to both the
  block in the cache and to the block in the lower-level
  memory.
• Write back—The information is written only to the
  block in the cache. The modified cache block is written
  to main memory only when it is replaced.
   – is block clean or dirty?
• Pros and Cons of each?
   – WT: read misses cannot result in writes
   – WB: no repeated writes to same location
• WT is always combined with write buffers so that the processor
  doesn't wait for the lower level memory

                                                      29
     Write Buffer for Write Through
   [Diagram: the Processor writes into the Cache and into a Write
   Buffer; the Write Buffer drains to DRAM.]

• A Write Buffer is needed between the Cache and Memory
   – Processor: writes data into the cache and the write buffer
   – Memory controller: write contents of the buffer to memory
• Write buffer is just a FIFO:
   – Typical number of entries: 4
   – Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write
     cycle
• Memory system designer’s nightmare:
   – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
   – Write buffer saturation

                                                                      30
            Mapping Function
• Direct mapping
• Associative mapping
• Set associative mapping

Assumptions for examples
• Cache size 64KBytes
• Block size 4 bytes
• Cache lines 16K (64K/4 = 16K)
• Main memory size 16MBytes
• Main memory address 24 bit
• Number of blocks in main memory 4M


                                       31
                Direct Mapping
• Each block of main memory maps to only one cache
  line
   – i.e. if a block is in cache, it must be in one specific
     place
• Blocks of memory are assigned to lines of cache
• Line number can be calculated from a given address




                                                           32
                         Direct Mapping
                        Address Structure

 | Tag (8 bits) |       Line (14 bits)       | Word (2 bits) |

• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
   – 14 bit line
   – 8 bit tag (=22-14)
• No two blocks in the same line have the same Tag field
• Check contents of cache by finding line and checking Tag




                                                                   33
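
The 8/14/2 field split is plain bit manipulation; the sketch below
applies it to one of this module's 24-bit example addresses:

    def split_direct(addr):
        """Split a 24-bit address into tag/line/word per the 8+14+2 layout."""
        word = addr & 0x3             # 2-bit word-in-block offset
        line = (addr >> 2) & 0x3FFF   # 14-bit cache line number
        tag = addr >> 16              # 8-bit tag
        return tag, line, word

    tag, line, word = split_direct(0x16339C)
    print(f"tag={tag:02X} line={line:04X} word={word}")  # tag=16 line=0CE7 word=0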
                Direct Mapping
               Cache Line Table
• Cache line      Main Memory blocks held
• 0               0, m, 2m, 3m, ..., 2^s - m
• 1               1, m+1, 2m+1, ..., 2^s - m + 1
• ...
• m-1             m-1, 2m-1, 3m-1, ..., 2^s - 1

m = 2^14 (16K)




                                            34
Direct Mapping Example




                         35
        Direct Mapping Summary
•   Address length = (s + w) bits
•   Number of addressable units = 2^(s+w) words or bytes
•   Block size = line size = 2^w words or bytes
•   Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
•   Number of lines in cache = m = 2^r
•   Size of tag = (s – r) bits




                                                        36
    Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
   – If a program accesses 2 blocks that map to the same
     line repeatedly, cache misses are very high




                                                      37
            Associative Mapping
•   A main memory block can load into any line of cache
•   Memory address is interpreted as tag and word
•   Tag uniquely identifies block of memory
•   Every line’s tag is examined for a match
•   Cache searching gets expensive




                                                          38
Associative Mapping Example




                              39
              Associative Mapping
               Address Structure
                                                     Word
                  Tag 22 bit                         2 bit
• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for
  hit
• Least significant 2 bits of address identify which 8 bit
  byte is required from the 32 bit data block
• e.g.
   – Address        Tag           Data          Cache line
   – 16339C         058CE7        FEDCBA98      0001
   – FFFFFC         3FFFFF        24682468      3FFF


                                                         40
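
The slide's tag values fall out of shifting the 2-bit word offset off
the 24-bit address (a sketch):

    for addr in [0x16339C, 0xFFFFFC]:
        tag = addr >> 2     # 22-bit tag: everything above the word bits
        word = addr & 0x3   # 2-bit word offset within the block
        print(f"address {addr:06X}: tag {tag:06X}, word {word}")
    # address 16339C: tag 058CE7, word 0
    # address FFFFFC: tag 3FFFFF, word 0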
    Associative Mapping Summary
•   Address length = (s + w) bits
•   Number of addressable units = 2^(s+w) words or bytes
•   Block size = line size = 2^w words or bytes
•   Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
•   Number of lines in cache = undetermined
•   Size of tag = s bits




                                                        41
       Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
   – e.g. Block B can be in any line of set i
• e.g. 2 lines per set
   – 2 way associative mapping
   – A given block can be in one of 2 lines in only one set




                                                         42
             Set Associative Mapping
                Address Structure
                                                       Word
Tag 9 bit                  Set 13 bit                  2 bit


  • Use set field to determine cache set to look in
  • Compare tag field to see if we have a hit
   • e.g.
      – Address        Tag     Data          Set number
      – 1FF 7FFC       1FF     12345678      1FFF
      – 001 7FFC       001     11223344      1FFF



                                                           43
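
That both example addresses share set 1FFF can be verified with the
9/13/2 split (a sketch):

    def split_set_assoc(addr):
        """Split a 24-bit address into tag/set/word per the 9+13+2 layout."""
        word = addr & 0x3               # 2-bit word offset
        set_no = (addr >> 2) & 0x1FFF   # 13-bit set number
        tag = addr >> 15                # 9-bit tag
        return tag, set_no, word

    for addr in [0xFFFFFC, 0x00FFFC]:   # tags 1FF and 001, same set
        tag, set_no, word = split_set_assoc(addr)
        print(f"{addr:06X}: tag={tag:03X} set={set_no:04X} word={word}")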
Two Way Set Associative Mapping
           Example




                                  44
          Set Associative Mapping
                 Summary
•   Address length = (s + w) bits
•   Number of addressable units = 2^(s+w) words or bytes
•   Block size = line size = 2^w words or bytes
•   Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
•   Number of lines in set = k
•   Number of sets = v = 2^d
•   Number of lines in cache = kv = k * 2^d
•   Size of tag = (s - d) bits




                                                        45
        Replacement Algorithms (1)
              Direct mapping
• No choice
• Each block only maps to one line
• Replace that line




                                     46
          Replacement Algorithms (2)
          Associative & Set Associative
• Implemented in hardware for speed
• Least Recently used (LRU)
  e.g. in 2 way set associative
   – Which of the 2 blocks is LRU?
• First in first out (FIFO)
   – replace block that has been in cache longest
• Least frequently used
   – replace block which has had fewest hits
• Random




                                                    47
                  Write Policy
• Data in cache and data in main memory must be up to
  date
• Multiple devices may have access to main memory (e.g.
  I/O, and CPU)
• Multiple CPUs may have individual caches
• If a word is altered at any one place, all others need to
  be updated




                                                        48
               Write through
• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory traffic to keep
  local (to CPU) cache up to date
• Pros: simple
• Cons:
   – Lots of traffic
   – Slows down writes




                                                    49
                   Write back
• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if
  update bit is set
• Pros: minimal memory writes
• Cons:
   – Other caches get out of sync
   – Portions of main memory are invalid, hence I/O must
      access main memory through cache
   – Complex circuitry and a potential bottleneck



                                                       50
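
The update-bit logic reads naturally as code (a minimal sketch; which
line gets evicted is chosen by the replacement policy):

    class Line:
        """One cache line with an update ('dirty') bit (sketch)."""
        def __init__(self, tag, data):
            self.tag, self.data, self.dirty = tag, data, False

    def write(line, data):
        line.data = data
        line.dirty = True             # update made in cache only

    def evict(line, main_memory):
        if line.dirty:                # write back only if the bit is set
            main_memory[line.tag] = line.data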
                    Line Size
• Increasing block size raises the hit ratio at first (spatial
  locality), then lowers it
• Increasing block size reduces the number of blocks in the cache
• In a large block, words far from the requested word are less
  likely to be needed soon




                                              51
            Number of Caches
• Multilevel caches
   – L1: on-chip cache; L2: external cache
   – No system bus access needed between processor and L1, or
     between L1 and L2
• Unified vs. split caches
   – Unified: higher hit rate, easy to implement
   – Split: one cache for instructions, one for data




                                                    52
   A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
   – Present the user with as much memory as is available in the
      cheapest technology.
   – Provide access at the speed offered by the fastest technology.


   [Diagram: the Processor (Control + Datapath with Registers) has an
   On-Chip Cache, backed by a Second Level Cache (SRAM), Main Memory
   (DRAM), Secondary Storage (Disk), and Tertiary Storage (Disk/Tape).]

        Level                 Speed (ns)                   Size (bytes)
        Registers             1s                           100s
        Caches                10s                          Ks
        Main Memory           100s                         Ms
        Secondary Storage     10,000,000s (10s ms)         Gs
        Tertiary Storage      10,000,000,000s (10s sec)    Ts
                                                                                            53
      Basic Issues in VM System Design
 size of information blocks that are transferred from
     secondary to main storage (M)

 if a block of information is brought into M and M is full, then some
    region of M must be released to make room for the new block -->
    replacement policy

 which region of M is to hold the new block --> placement policy

 missing item fetched from secondary memory only on the occurrence
    of a fault --> demand load policy

 [Diagram: reg <-> cache <-> mem <-> disk; main memory is divided
 into frames, disk into pages]

Paging Organization

virtual and physical address space partitioned into blocks of equal
size: page frames (physical) and pages (virtual)
                                                                           54
        Address Map
 V = {0, 1, . . . , n - 1} virtual address space      n > m
 M = {0, 1, . . . , m - 1} physical address space

 MAP: V --> M U {0}  address mapping function
     MAP(a) = a'  if data at virtual address a is present at physical
                  address a' and a' is in M
            = 0   if data at virtual address a is not present in M

 [Diagram: the processor presents virtual address a from name space V
 to the address translation mechanism; on a hit it produces physical
 address a' into Main Memory. A missing item raises a fault, the
 fault handler runs, and the OS performs the transfer from Secondary
 Memory into Main Memory.]

                                                                          55
        Paging Organization

 Physical memory (P.A.): frames 0-7, 1K each, at addresses 0, 1024,
 ..., 7168. Virtual memory (V.A.): pages 0-31, 1K each, at addresses
 0, 1024, ..., 31744. The page is the unit of mapping and also the
 unit of transfer from virtual to physical memory.

       Address Mapping
 VA = | page no. | disp (10 bits) |

 The Page Table Base Reg plus the page number index into the page
 table, which is located in physical memory. Each entry holds a valid
 bit (V), Access Rights, and a PA; the PA combined with disp gives
 the physical memory address (actually, concatenation is more likely
 than addition).
                                                                                         56
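
With 1K pages the displacement is 10 bits and translation is a table
lookup plus concatenation; a dictionary stands in for the page table
here (a sketch with assumed mappings; a real table lives in physical
memory and carries the valid and access-rights bits):

    PAGE_SIZE = 1024                     # 1K pages => 10-bit displacement

    page_table = {0: 7, 1: 0, 31: 1}     # virtual page -> frame (assumed)

    def translate(va):
        page_no = va // PAGE_SIZE        # high-order bits: virtual page no.
        disp = va % PAGE_SIZE            # low 10 bits pass through unchanged
        frame = page_table[page_no]      # index into the page table
        return frame * PAGE_SIZE + disp  # concatenate frame no. and disp

    print(translate(1024 + 5))           # page 1 maps to frame 0, so PA = 5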
   Virtual Address and a Cache
 [Diagram: the CPU issues a VA -> Translation -> PA -> Cache; a hit
 returns data, a miss goes to Main Memory.]
It takes an extra memory access to translate VA to PA

This makes cache access very expensive, and this is the
"innermost loop" that you want to go as fast as possible

ASIDE: Why access cache with PA at all? VA caches have a problem!
   synonym / alias problem: two different virtual addresses map to
   same physical address => two different cache entries holding data for
   the same physical address!

   for update: must update all cache entries with same
   physical address or memory becomes inconsistent

   determining this requires significant hardware, essentially an
   associative lookup on the physical address tags to see if you
   have multiple hits


                                                                    57
                     TLBs
  A way to speed up translation is to use a special cache of recently
     used page table entries -- this has many names, but the most
     frequently used is Translation Lookaside Buffer or TLB

        TLB entry: | Virtual Address | Physical Address | Dirty | Ref | Valid | Access |




Really just a cache on the page table mappings

TLB access time comparable to cache access time
   (much less than main memory access time)




                                                                        58
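
Functionally the TLB is a small cache consulted before the page table
walk (a sketch; hardware entries would also carry the dirty, ref,
valid, and access fields shown above):

    tlb = {}                          # virtual page no. -> frame (sketch)
    page_table = {3: 42}              # backing page table (assumed contents)

    def tlb_translate(page_no):
        if page_no in tlb:            # TLB hit: fast, ~cache access time
            return tlb[page_no]
        frame = page_table[page_no]   # TLB miss: walk the table in memory
        tlb[page_no] = frame          # cache the mapping for next time
        return frame

    print(tlb_translate(3))           # miss: fills the TLB
    print(tlb_translate(3))           # hit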
       Translation Look-Aside Buffers
   Just like any other cache, the TLB can be organized as fully associative,
      set associative, or direct mapped

   TLBs are usually small, typically not more than 128 - 256 entries even on
      high end machines. This permits fully associative
      lookup on these machines. Most mid-range machines use small
      n-way set associative organizations.


 [Diagram: the CPU issues a VA to the TLB Lookup; a TLB hit yields
 the PA directly, a TLB miss goes through Translation. The PA then
 probes the Cache, and a cache miss goes to Main Memory. Typical
 times: TLB 1/2 t, cache t, main memory 20 t.]
                                                                           59
     Reducing Translation Time

Machines with TLBs go one step further to reduce #
  cycles/cache access

They overlap the cache access with the TLB access:

  high order bits of the VA are used to look in the TLB
  while low order bits are used as index into cache




                                                          60
      Overlapped Cache & TLB Access
 [Diagram: the 32-bit VA splits into a 20-bit page # and a 12-bit
 disp. The page # drives the associative TLB lookup while, in
 parallel, the 10-bit index and 2-bit 00 byte offset from the disp
 index the 1K-line, 4-bytes-per-line cache. The 20-bit PA from a TLB
 hit is compared with the cache tag for Hit/Miss.]

      IF cache hit AND (cache tag = PA) then deliver data to CPU
      ELSE IF [cache miss OR (cache tag != PA)] and TLB hit THEN
               access memory with the PA from the TLB
      ELSE do standard VA translation
                                                                             61
Problems With Overlapped TLB Access
 Overlapped access only works as long as the address bits used to
    index into the cache do not change as the result of VA translation

 This usually limits things to small caches, large page sizes, or high
    n-way set associative caches if you want a large cache

 Example: suppose everything the same except that the cache is
    increased to 8 K bytes instead of 4 K:

 [Diagram: the cache index grows to 11 bits plus the 2-bit 00 offset,
 so it overlaps bit 12 of the address; that bit is changed by VA
 translation (20-bit virt page # / 12-bit disp) but is needed for the
 cache lookup.]

 Solutions:
    go to 8K byte page sizes; or
    go to 2 way set associative cache

 [Diagram: a 1K-set, 2 way set assoc cache needs only a 10-bit index
 (two 4-byte lines per set), so overlapped lookup works again.]
                                                                              62
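
The constraint is just arithmetic: the cache index and block-offset
bits must fit inside the page displacement. A sketch using this
module's numbers:

    import math

    def overlap_ok(cache_bytes, ways, block_bytes, page_bytes):
        """True if VA translation cannot change the cache index bits."""
        index_bits = int(math.log2(cache_bytes // (ways * block_bytes)))
        offset_bits = int(math.log2(block_bytes))
        disp_bits = int(math.log2(page_bytes))
        return index_bits + offset_bits <= disp_bits

    print(overlap_ok(4096, 1, 4, 4096))   # True:  4K direct mapped, 4K pages
    print(overlap_ok(8192, 1, 4, 4096))   # False: the 8K cache needs bit 12
    print(overlap_ok(8192, 2, 4, 4096))   # True:  2-way shrinks the index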
    Summary #1/4:

• The Principle of Locality:
   – Programs access a relatively small portion of the
     address space at any instant of time.
      • Temporal Locality: Locality in Time
      • Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
   – Compulsory Misses: sad facts of life. Example: cold
     start misses.
   – Capacity Misses: increase cache size
   – Conflict Misses: increase cache size and/or
     associativity.
             Nightmare Scenario: ping pong effect!
• Write Policy: write through vs. write back                63
              Summary #2 / 4:
           The Cache Design Space
• Several interacting dimensions
   – cache size
   – block size
   – associativity
   – replacement policy
   – write-through vs write-back
   – write allocation
• The optimal choice is a compromise
   – depends on access characteristics
      • workload
      • use (I-cache, D-cache, TLB)
   – depends on technology / cost
• Simplicity often wins

[Diagram: the design space spans Cache Size, Associativity, and
Block Size; each factor trades off from Good to Bad as it moves from
Less to More.]
                                                                        64
    Summary #3/4: TLB, Virtual Memory
• Caches, TLBs, Virtual Memory all understood by
  examining how they deal with 4 questions: 1) Where
  can block be placed? 2) How is block found? 3) What
  block is replaced on miss? 4) How are writes handled?
• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance
   – funny times, as most systems can’t access all of 2nd
     level cache without TLB misses!




                                                            65
      Summary #4/4: Memory Hierarchy
• Virtual memory was controversial at the time:
  can SW automatically manage 64KB across many
  programs?
   – 1000X DRAM growth removed the controversy
• Today VM allows many processes to share single
  memory without having to swap all processes to disk;
  today VM protection is more important than memory
  hierarchy
• Today CPU time is a function of (ops, cache misses) vs.
  just f(ops):
  What does this mean to Compilers, Data structures,
  Algorithms?



                                                            66

				