Computer Architecture
Chapter 7

Fall 2004
Department of Computer Science
Kent State University
                Memory

• A random-access memory (RAM) is an
  array of words each with a unique address
• Addresses are binary numbers
• If we use n bits for addresses then the
  memory can contain at most 2^n words
• The depth of a memory is the number of
  words and the width is the number of bits in
  each word
     Memory Technologies

• Dynamic RAM (DRAM)
  – Denser (larger capacity)
  – Slower
  – Less expensive per bit
• Static RAM (SRAM)
  – Less dense (smaller capacity)
  – Faster
  – More expensive per bit
                 Locality

• Programs tend to access only a small part of
  memory at a time
• Temporal locality: If a value in memory is
  accessed, it is likely to be accessed again
  soon
• Spatial locality: If a value in memory is
  accessed, it is likely that nearby values will
  also be accessed
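
A short C sketch (illustrative only, not from the slides) shows both kinds of
locality in one loop: the running sum is reused on every iteration (temporal),
and the array elements are accessed at adjacent addresses (spatial).

    #include <stdio.h>

    int main(void) {
        int a[1000];
        int sum = 0;                     /* reused on every iteration: temporal locality */

        for (int i = 0; i < 1000; i++)
            a[i] = i;

        for (int i = 0; i < 1000; i++)
            sum += a[i];                 /* a[i], a[i+1], ... are adjacent: spatial locality */

        printf("%d\n", sum);
        return 0;
    }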
        Memory Hierarchy

• A memory hierarchy uses multiple types of
  memory to exploit locality
• Faster and smaller memory, placed near the
  processor, holds data that is likely to be
  accessed soon
• Slower and larger memory, placed further
  from the processor, holds data that is not
  likely to be accessed soon
          Memory Hierarchy

[Figure: The memory hierarchy. Levels of memory below the CPU, ordered from
fastest, smallest, and highest cost per bit at the top to slowest, biggest, and
lowest cost per bit at the bottom.]
               Hit or Miss

• The processor first looks for data in the
  highest level of memory (fast and small); if
  it finds the data there, the access is a hit
• If the processor does not find the data, it is a
  miss and the processor must search the next
  level of memory
• Once the data is found in the lower level it
  is copied into the upper level
Hit or Miss

[Figure: The processor accesses the upper level of memory first; on a miss, data
are transferred between the levels of the hierarchy.]
                  Blocks

• Data is transferred between levels of
  memory in blocks or lines
• The block size can differ between levels of
  the hierarchy
• Typically block sizes increase as we move
  further from the processor
             Access Time

• Hit time is the time to access a block in the
  upper level, including the time to determine
  whether the access is a hit or a miss
• Miss penalty is the time to access a block in
  the lower level and transfer it back
• The average access time depends on the hit
  rate, the fraction of memory accesses that
  hit in the upper level
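
As a worked sketch, the average access time is the hit time plus the miss rate
times the miss penalty; all the values below are assumptions for illustration.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles to access the upper level (assumed) */
        double miss_penalty = 100.0;  /* cycles to reach the lower level (assumed)  */
        double hit_rate     = 0.95;   /* fraction of accesses that hit (assumed)    */

        /* Average access time = hit time + miss rate * miss penalty */
        double avg = hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("average access time = %.1f cycles\n", avg);  /* 1 + 0.05*100 = 6 */
        return 0;
    }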
Levels in Memory Hierarchy
[Figure: Levels in the memory hierarchy, from Level 1 nearest the CPU down to
Level n. Access time increases with distance from the CPU, and the size of the
memory increases at each level.]
                     Cache

• A cache is a small memory that stores a subset of
  blocks from a larger memory
• In order to access a block in a cache we have to
  determine if that block is in the cache
• If it is, we must determine where in the cache it is
  located
• For now we will assume that a block contains just
  a single word
                        Cache

[Figure: Cache contents (a) before and (b) after a reference to a word Xn that is
not initially in the cache; the miss causes Xn to be copied into the cache.]
       Direct-Mapped Cache

• In a direct-mapped cache each block in memory
  maps to exactly one location in the cache
• Obviously, more than one memory block will map
  to the same cache location
• Typically the mapping is:
   – Block address mod Number of blocks in cache
• When the number of blocks in the cache is a power
  of two, this means that the lower bits of the block
  address are used as an index into the cache
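
A small C sketch of the mapping, assuming an 8-block cache: taking the block
address mod the number of blocks gives the same result as keeping its low-order
bits.

    #include <stdio.h>

    #define NUM_BLOCKS 8   /* assumed: 8 one-word blocks in the cache */

    int main(void) {
        unsigned block_addr = 29;                            /* binary 11101 */
        unsigned index_mod  = block_addr % NUM_BLOCKS;       /* mod mapping  */
        unsigned index_bits = block_addr & (NUM_BLOCKS - 1); /* low 3 bits   */

        printf("index = %u (mod) = %u (low bits)\n", index_mod, index_bits);
        return 0;                                            /* both print 5 */
    }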
Direct-Mapped Cache
[Figure: A direct-mapped cache with eight entries, indexed 000 through 111. The
memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all
share the low-order bits 001 and therefore all map to cache entry 001.]
                   Tags

• Each cache entry has a tag that identifies
  which of the words that can map to that entry
  is actually stored there
• For a direct-mapped cache, the tag contains
  the upper address bits
• We must also be able to detect when no
  word is stored in a cache entry, so each
  cache entry also has a valid bit
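
A direct-mapped lookup then checks the valid bit and compares the stored tag with
the upper address bits. A minimal C sketch; the field widths follow the 1024-entry,
one-word-per-block cache on the next slide, and the names are illustrative.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ENTRIES 1024                   /* 10-bit index, as on the next slide */

    struct entry {
        bool     valid;                        /* is a word actually stored here? */
        uint32_t tag;                          /* upper address bits of that word */
        uint32_t data;
    };

    static struct entry cache[NUM_ENTRIES];

    /* Returns true on a hit and copies the word into *data. */
    bool lookup(uint32_t addr, uint32_t *data) {
        uint32_t index = (addr >> 2) & (NUM_ENTRIES - 1);  /* skip 2-bit byte offset   */
        uint32_t tag   = addr >> 12;                       /* upper 20 bits of address */

        if (cache[index].valid && cache[index].tag == tag) {
            *data = cache[index].data;
            return true;                       /* hit */
        }
        return false;                          /* miss: go to the next level */
    }

    int main(void) {
        uint32_t word;
        printf("hit = %d\n", lookup(0x1234, &word));   /* cache starts empty: miss */
        return 0;
    }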
Cache Organization
[Figure: Organization of a 1024-entry direct-mapped cache with one word per block.
The 32-bit address is split into a 20-bit tag, a 10-bit index, and a 2-bit byte
offset; each entry holds a valid bit, a 20-bit tag, and 32 bits of data, and a hit
is signaled when the selected entry is valid and its tag matches the address tag.]
              Larger Blocks

• The previous cache configuration exploits
  temporal locality but not spatial locality
• In order to exploit spatial locality we typically use
  larger blocks
• However, if the block size is too large then there
  will be fewer blocks in the cache and the miss rate
  will increase
• Large block size can also increase the miss penalty
  since more data must be transferred on a miss
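
With multiword blocks, the address also carries a block offset that selects a word
within the block. A sketch of the decomposition, using the 4-word blocks and 4K
entries of the cache on the next slide; the example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    #define WORDS_PER_BLOCK 4        /* 16-byte blocks, as on the next slide */
    #define NUM_BLOCKS      4096     /* 4K entries, as on the next slide     */

    int main(void) {
        uint32_t addr = 0x12345678;  /* example byte address (illustrative)  */

        unsigned byte_offset  = addr & 0x3;                          /* 2 bits: byte in word    */
        unsigned block_offset = (addr >> 2) & (WORDS_PER_BLOCK - 1); /* 2 bits: word in block   */
        unsigned index        = (addr >> 4) & (NUM_BLOCKS - 1);      /* 12 bits: block in cache */
        unsigned tag          = addr >> 16;                          /* remaining 16 bits       */

        printf("tag=%x index=%x block_offset=%u byte_offset=%u\n",
               tag, index, block_offset, byte_offset);
        return 0;
    }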
Cache with 4 Words/Block
[Figure: A 4K-entry direct-mapped cache with four words (128 bits) per block. The
32-bit address is split into a 16-bit tag, a 12-bit index, a 2-bit block offset,
and a 2-bit byte offset; a multiplexor uses the block offset to select one of the
four 32-bit words of the block on a hit.]
Block Size vs. Miss Rate
[Figure: Miss rate versus block size (4 to 256 bytes) for total cache sizes of
1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls as the block size grows, but
rises again for small caches when the block size becomes too large a fraction of
the cache.]
     Handling Cache Misses

• If a memory access (instruction or data)
  causes a miss we must stall the pipeline
• While the pipeline is stalled the memory
  controller fills the cache with the block
  from memory
• A pipeline stall in this case is simpler to
  implement since we can freeze the entire
  pipeline
                    Writes

• Write-through: Writes store the new value both in
  the cache and in memory; cache and memory
  remain consistent at all times
• In order to avoid waiting for memory the data can
  be placed into a write buffer so the processor can
  continue working while the write occurs
• Write-back: Writes store the new value in cache
  only; when the cache block is replaced its contents
  are written to memory
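
The difference between the two policies can be sketched in C; the structure and
helper names below are illustrative, and write-back needs a dirty bit so the block
is written to memory only when the line is replaced.

    #include <stdio.h>
    #include <stdbool.h>

    struct line {
        bool     valid;
        bool     dirty;                /* used only by the write-back policy */
        unsigned tag;
        unsigned data;
    };

    /* Stand-ins for the lower memory level and the write buffer (assumed helpers). */
    static void memory_write(unsigned addr, unsigned value) {
        printf("memory[%08x] <- %u\n", addr, value);
    }
    static void buffer_write(unsigned addr, unsigned value) {
        printf("write buffer: memory[%08x] <- %u\n", addr, value);  /* CPU keeps going */
    }

    /* Write-through: store into the cache and into memory (via the write buffer). */
    static void write_through(struct line *l, unsigned addr, unsigned value) {
        l->data = value;
        buffer_write(addr, value);     /* cache and memory stay consistent */
    }

    /* Write-back: store into the cache only and mark the line dirty. */
    static void write_back(struct line *l, unsigned addr, unsigned value) {
        (void)addr;                    /* address matters only when the line is evicted */
        l->data = value;
        l->dirty = true;               /* memory is now out of date */
    }

    /* On replacement, a dirty write-back line must first be written to memory. */
    static void evict(struct line *l, unsigned addr) {
        if (l->dirty)
            memory_write(addr, l->data);
        l->valid = false;
        l->dirty = false;
    }

    int main(void) {
        struct line a = {true, false, 0, 0};
        write_through(&a, 0x100, 7);
        write_back(&a, 0x100, 8);
        evict(&a, 0x100);              /* triggers the delayed memory write */
        return 0;
    }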
        Cache Performance

• Processor execution time can be divided into time
  spent executing code and time spent waiting on
  memory accesses
• The cost of cache hits is included in the
  normal execution time
• Cycles = Execution cycles + Memory-stall cycles
• This also means that: CPI = Execution CPI +
  Memory-stall CPI
  When Memory Stalls Occur

• An instruction may access memory twice: during
  instruction fetch and during data memory access
• Typically there are separate caches for instructions
  and data in order to prevent structural hazards
• Thus, we can consider stalls during instruction
  fetch (which affect all instructions) separately
  from stalls during data memory access (which
  affect loads and stores)
• Memory-stall cycles = Instruction miss cycles +
  Data miss cycles
        Memory-Stall Cycles

• The number of memory-stall cycles depends on
  the frequency of memory accesses, the cache miss
  rate, and the miss penalty
• Instruction miss cycles = Instruction count × Miss
  rate × Miss penalty
• Data miss cycles = Percentage of loads/stores ×
  Instruction count × Miss rate × Miss penalty
• Divide by the instruction count to get the
  memory-stall cycles per instruction
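
Combining these formulas with the execution CPI from the previous slide gives the
overall CPI. A worked sketch; every rate and penalty below is an assumed value for
illustration only.

    #include <stdio.h>

    int main(void) {
        /* All values below are assumptions for illustration only. */
        double exec_cpi     = 1.0;   /* CPI ignoring memory stalls                     */
        double i_miss_rate  = 0.02;  /* instruction cache miss rate                    */
        double d_miss_rate  = 0.04;  /* data cache miss rate                           */
        double loads_stores = 0.36;  /* fraction of instructions that are loads/stores */
        double miss_penalty = 100.0; /* cycles to fetch a block from memory            */

        /* Per instruction: miss cycles = accesses per instruction * miss rate * penalty */
        double i_stall = i_miss_rate * miss_penalty;                 /* 2.0 cycles  */
        double d_stall = loads_stores * d_miss_rate * miss_penalty;  /* 1.44 cycles */

        double cpi = exec_cpi + i_stall + d_stall;
        printf("CPI = %.2f (execution %.2f + memory stalls %.2f)\n",
               cpi, exec_cpi, i_stall + d_stall);
        return 0;
    }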
             Miss Penalty

• The miss penalty, measured in absolute time,
  is set by the memory system and is largely
  independent of the clock speed
• This means that higher clock speeds increase
  the miss penalty measured in clock cycles
• Increasing the clock speed without also
  improving the memory system will yield
  diminishing returns (Amdahl's law)
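
A quick arithmetic sketch of why faster clocks make misses relatively more
expensive; the 100 ns memory latency and the clock rates are assumed figures.

    #include <stdio.h>

    int main(void) {
        double mem_latency_ns = 100.0;               /* assumed DRAM access time, fixed */
        double clocks_ghz[]   = {1.0, 2.0, 4.0};     /* assumed processor clock rates   */

        for (int i = 0; i < 3; i++) {
            double cycle_ns = 1.0 / clocks_ghz[i];   /* length of one clock cycle       */
            printf("%.0f GHz clock: miss penalty = %.0f cycles\n",
                   clocks_ghz[i], mem_latency_ns / cycle_ns);
        }
        return 0;                                    /* 100, 200, and 400 cycles        */
    }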
     Fully Associative Cache

• In a direct-mapped cache each block can go in
  exactly one place
• On the other hand, a fully associative cache allows
  blocks to be placed anywhere in the cache
• Fully associative caches have a lower miss rate,
  but require more hardware
• We must compare the block address against all the
  tags in the cache simultaneously
       Set Associative Cache

• Between direct-mapped and fully associative are
  set associative caches
• The cache entries are divided into sets
• A given block always goes to a specific set, but
  can be placed anywhere within that set
• If there are n entries in a set then it is called an n-
  way set associative cache
• We must compare the block address against all the
  tags in the set simultaneously
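
The mapping uses the same modulo idea as before, but over sets rather than
individual blocks. A small sketch, assuming an 8-block cache organized as 2-way
set associative (4 sets).

    #include <stdio.h>

    #define NUM_BLOCKS 8                       /* assumed total cache blocks */
    #define WAYS       2                       /* 2-way set associative      */
    #define NUM_SETS   (NUM_BLOCKS / WAYS)     /* = 4 sets                   */

    int main(void) {
        unsigned block_addr = 12;              /* example memory block address  */

        unsigned set = block_addr % NUM_SETS;  /* block 12 maps to set 0        */
        printf("block %u maps to set %u (any of its %d ways)\n",
               block_addr, set, WAYS);
        return 0;
    }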
                          Cache Types

[Figure: Locating a memory block in a direct-mapped cache (eight blocks, numbered
0 to 7), a set-associative cache (four sets, numbered 0 to 3), and a fully
associative cache. The tag is compared against one entry, against every entry in
one set, or against every entry in the cache, respectively.]
Associativity vs. Cache Size
[Figure: An eight-block cache configured four ways: one-way set associative
(direct mapped, eight blocks), two-way set associative (four sets), four-way set
associative (two sets), and eight-way set associative (fully associative).]
          Searching a Cache

• As before the block address is decomposed into a
  tag, an index, and a block offset
• Now the index selects a set instead of a specific
  block
• All tags in the set are compared to the block
  address tag in parallel
• As the associativity increases, the tags get larger
  and the indexes smaller
• In a fully associative cache there is no index; the
  entire block address (minus block offset) is the tag
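
A software sketch of the lookup: hardware compares the tags of all ways in
parallel, but a loop over the ways expresses the same check. The field widths
follow the 4-way, 256-set cache on the next slide; the names are illustrative.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS     4
    #define NUM_SETS 256                        /* 8-bit index, as on the next slide */

    struct way {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    };

    static struct way cache[NUM_SETS][WAYS];

    /* Returns true on a hit; hardware would compare all four tags in parallel. */
    bool lookup(uint32_t addr, uint32_t *data) {
        uint32_t index = (addr >> 2) & (NUM_SETS - 1);  /* skip the 2-bit byte offset    */
        uint32_t tag   = addr >> 10;                    /* remaining 22 bits are the tag */

        for (int w = 0; w < WAYS; w++) {
            if (cache[index][w].valid && cache[index][w].tag == tag) {
                *data = cache[index][w].data;
                return true;                            /* hit */
            }
        }
        return false;                                   /* miss */
    }

    int main(void) {
        uint32_t word;
        printf("hit = %d\n", lookup(0xabc, &word));     /* empty cache, so this misses */
        return 0;
    }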
4-Way Set Associative Cache
[Figure: A four-way set-associative cache with 256 sets. The address supplies a
22-bit tag and an 8-bit index; the four tags of the selected set are compared in
parallel, and a 4-to-1 multiplexor selects the data from whichever way hits.]
           Block Replacement

• On a cache miss, we need to insert a new block into the
  cache by replacing an old one
• In a direct-mapped cache a block can go in only one
  place, so we must replace the block that is already there
• In a set/fully associative cache we have a choice of which
  block to replace
• A common scheme for choosing the block to replace is
  least-recently used (LRU) which replaces the block that
  has been unused for the longest time
• Another option is random replacement
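
A small sketch of LRU for one set of a 4-way cache: each way records when it was
last used, and the way with the oldest record is the victim. The counter scheme
below is one common software approximation, not the only hardware implementation.

    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 4

    static uint64_t last_used[WAYS];   /* time of the most recent access to each way */
    static uint64_t now = 0;           /* simple global access counter               */

    /* Record an access to a way of the set. */
    void touch(int way) {
        last_used[way] = ++now;
    }

    /* Pick the least-recently used way as the victim on a miss. */
    int lru_victim(void) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (last_used[w] < last_used[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        touch(0); touch(1); touch(2); touch(3);   /* fill the set                   */
        touch(1);                                 /* reuse way 1                    */
        printf("replace way %d\n", lru_victim()); /* way 0 has been unused longest  */
        return 0;
    }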
          Multilevel Caches

• We can reduce the miss penalty by adding a
  second level of cache to the hierarchy
• If a memory access misses in the primary cache,
  we check the secondary cache
• Only when we miss in both caches do we have to
  wait for main memory
• Typically the primary (L1) cache is on-chip while
  the secondary (L2) cache is off-chip
• We can optimize the L1 cache to reduce hit time
  and optimize the L2 cache to reduce the miss rate
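
The effect on the average access time can be sketched by applying the
hit-time-plus-miss-rate-times-penalty formula at each level; all numbers below are
illustrative assumptions.

    #include <stdio.h>

    int main(void) {
        /* Illustrative, assumed values. */
        double l1_hit_time  = 1.0;    /* cycles */
        double l1_miss_rate = 0.05;
        double l2_hit_time  = 10.0;   /* cycles */
        double l2_miss_rate = 0.20;   /* of the accesses that reach L2 */
        double mem_penalty  = 200.0;  /* cycles */

        /* Apply "hit time + miss rate * miss penalty" at each level. */
        double l2_penalty = l2_hit_time + l2_miss_rate * mem_penalty;  /* 50 cycles  */
        double amat       = l1_hit_time + l1_miss_rate * l2_penalty;   /* 3.5 cycles */

        printf("average memory access time = %.1f cycles\n", amat);
        return 0;
    }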
    Causes of Cache Misses

• Compulsory misses occur the first time a
  block is accessed
• Capacity misses occur when there is not
  enough space in the cache to store all the
  blocks that are being used
• Conflict (or collision) misses occur when
  multiple blocks contend for the same set

				