Computer Architecture
Chapter 7

Fall 2004
Department of Computer Science
Kent State University

Random-Access Memory

• A random-access memory (RAM) is an
  array of words, each with a unique address
• Addresses are binary numbers
• If we use n bits for addresses then the
  memory can contain at most 2^n words
• The depth of a memory is the number of
  words and the width is the number of bits in
  each word
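The depth and capacity arithmetic can be sketched in plain C (function names are illustrative):

```c
#include <stdio.h>

/* Depth of a memory addressed by n bits: 2^n words. */
unsigned long depth_for_address_bits(unsigned n) {
    return 1UL << n;
}

/* Total capacity in bits = depth (words) * width (bits per word). */
unsigned long capacity_bits(unsigned n, unsigned width) {
    return depth_for_address_bits(n) * width;
}
```

For example, 10 address bits give a depth of 1024 words; with a 32-bit width the memory holds 32768 bits.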
     Memory Technologies

• Dynamic RAM (DRAM)
  – More dense (larger)
  – Slower
  – Less expensive
• Static RAM (SRAM)
  – Less dense (smaller)
  – Faster
  – More expensive

Locality

• Programs tend to access only a small part of
  memory at a time
• Temporal locality: If a value in memory is
  accessed once, it is likely to be accessed again
• Spatial locality: If a value in memory is
  accessed, it is likely that values nearby will
  also be accessed
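Spatial locality shows up directly in array traversal order: row-major order in C touches consecutive addresses, while column-major order jumps a whole row between accesses. A small illustrative sketch (both functions compute the same sum; only the access pattern, and hence the cache behavior, differs):

```c
#define N 256

/* Row-major traversal: consecutive elements are adjacent in memory,
   so each cache block fetched is fully used (good spatial locality). */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];           /* neighbors in memory */
    return s;
}

/* Column-major traversal: successive accesses are N ints apart,
   so most of each fetched block goes unused (poor spatial locality). */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];           /* N * sizeof(int) bytes apart */
    return s;
}

static int a[N][N];                 /* zero-initialized demo array */
```

On a real machine the row-major version typically runs noticeably faster for large N, purely because of cache behavior.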
        Memory Hierarchy

• A memory hierarchy uses multiple types of
  memory to exploit locality
• Faster and smaller memory, placed near the
  processor, holds data that is likely to be
  accessed soon
• Slower and larger memory, placed further
  from the processor, holds data that is not
  likely to be accessed soon
Memory Hierarchy

Speed      Level               Size       Cost ($/bit)
Fastest    Closest to the CPU  Smallest   Highest
Slowest    Farthest away       Biggest    Lowest
               Hit or Miss

• The processor first looks for data in the
  highest level of memory (fast and small); if
  it finds the data there, it is called a hit
• If the processor does not find the data it is a
  miss and the processor must search the next
  level of memory
• Once the data is found in the lower level it
  is copied into the upper level
Blocks
• Data is transferred between levels of
  memory in blocks or lines
• The size of a block can be different between
  different levels
• Typically block sizes increase as we move
  further from the processor
             Access Time

• Hit time is the time to access a block in the
  upper level including the time to determine
  whether the access is a hit or a miss
• Miss penalty is the time to access a block in
  the lower level and transfer it back
• The average access time depends on the hit
  rate, the fraction of memory accesses that
  hit in the upper level
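The average access time in the last bullet is commonly written as AMAT = hit time + miss rate × miss penalty. A minimal C sketch (function name and the sample numbers are illustrative):

```c
/* Average memory access time, in cycles:
   AMAT = hit time + miss rate * miss penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Example: a 1-cycle hit time, 5% miss rate, and 100-cycle miss
   penalty give 1 + 0.05 * 100 = 6 cycles on average. */
```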
Levels in Memory Hierarchy

[Figure: a pyramid of memory levels, Level 1 at the top through Level n at
the bottom; distance from the CPU and access time increase moving down the
hierarchy, and the size of the memory at each level grows toward the bottom.]

Caches

• A cache is a small memory that stores a subset of
  blocks from a larger memory
• In order to access a block in a cache we have to
  determine if that block is in the cache
• If it is, we must determine where in the cache it is
• For now we will assume that a block contains just
  a single word

[Figure: cache contents before and after a reference to a word Xn that is
not initially present. a. Before the reference, the cache holds X1 through
Xn – 1; b. after the reference, Xn has been copied into the cache as well.]
       Direct-Mapped Cache

• In a direct-mapped cache each block in memory
  maps to exactly one location in the cache
• Obviously, more than one memory block will map
  to the same cache location
• Typically the mapping is:
   – Block address mod Number of blocks in cache
• This means that the lower bits of the address are
  used as an index into the cache
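For power-of-two cache sizes, "mod" is just a mask of the low-order address bits, which is why the lower bits serve as the index. A small C sketch (names are illustrative):

```c
/* Direct-mapped placement: block address mod number of cache blocks. */
unsigned dm_index(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;     /* num_blocks must be 2^k */
}

/* Equivalent form: keep only the low-order index bits. */
unsigned dm_index_bits(unsigned block_addr, unsigned index_bits) {
    return block_addr & ((1u << index_bits) - 1);
}

/* Example: in an 8-block cache, block 21 (binary 10101) maps to
   index 5 (binary 101), by either computation. */
```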
Direct-Mapped Cache

[Figure: an eight-entry direct-mapped cache; memory addresses 00001, 00101,
01001, 01101, 10001, 10101, 11001, and 11101 map to the cache entry given
by their low-order bits.]

Tags and Valid Bits
• Each cache entry has a tag that identifies
  which of the words that can be stored there
  actually is there
• For a direct-mapped cache, the tag contains
  the upper address bits
• We must also be able to detect when no
  word is stored in a cache entry, so each
  cache entry also has a valid bit
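A direct-mapped lookup with a valid bit and tag check can be sketched as below (one word per block as the slides assume; the sizes and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8                 /* illustrative: 3 index bits */

/* One entry of a direct-mapped cache, one word per block. */
struct entry {
    bool     valid;
    uint32_t tag;
    uint32_t data;
};

static struct entry cache[NUM_BLOCKS];

/* Hit only if the indexed entry is valid AND its tag matches. */
bool lookup(uint32_t block_addr, uint32_t *data_out) {
    unsigned index = block_addr % NUM_BLOCKS;
    uint32_t tag   = block_addr / NUM_BLOCKS;   /* upper address bits */
    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;
        return true;                            /* hit */
    }
    return false;                               /* miss */
}

/* On a miss, the fetched block is installed, evicting the old entry. */
void fill(uint32_t block_addr, uint32_t data) {
    unsigned index = block_addr % NUM_BLOCKS;
    cache[index].valid = true;
    cache[index].tag   = block_addr / NUM_BLOCKS;
    cache[index].data  = data;
}
```

Note that blocks 21 and 29 share index 5 here but differ in tag, so caching one does not make the other a hit.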
Cache Organization

[Figure: a direct-mapped cache for 32-bit addresses, with a 20-bit tag
(address bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset.
Each of the 1024 entries holds a valid bit, a 20-bit tag, and 32 bits of
data; the stored tag is compared with the address tag to produce the hit
signal.]
              Larger Blocks

• The previous cache configuration exploits
  temporal locality but not spatial locality
• In order to exploit spatial locality we typically use
  larger blocks
• However, if the block size is too large then there
  will be fewer blocks in the cache and the miss rate
  will increase
• Large block size can also increase the miss penalty
  since more data must be transferred on a miss
Cache with 4 Words/Block

[Figure: a direct-mapped cache with four-word blocks, using a 16-bit tag
(address bits 31–16), a 12-bit index (bits 15–4), a 2-bit block offset
selecting the word, and a 2-bit byte offset. Each entry holds a valid bit,
a 16-bit tag, and a 128-bit block; a 4-to-1 multiplexor selects the
requested word.]
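The address split for the 4-words-per-block cache above (16-bit tag, 12-bit index, 2-bit word offset, 2-bit byte offset) can be sketched in C; the struct and function names are illustrative:

```c
#include <stdint.h>

/* Fields of a 32-bit byte address for a cache with four-word blocks:
   tag (bits 31:16) | index (bits 15:4) | word (bits 3:2) | byte (bits 1:0) */
struct parts { uint32_t tag, index, word, byte; };

struct parts split_addr(uint32_t addr) {
    struct parts p;
    p.byte  =  addr        & 0x3;     /* byte within the word   */
    p.word  = (addr >> 2)  & 0x3;     /* word within the block  */
    p.index = (addr >> 4)  & 0xFFF;   /* which cache entry      */
    p.tag   =  addr >> 16;            /* upper address bits     */
    return p;
}
```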

Block Size vs. Miss Rate

[Figure: miss rate versus block size (4 to 256 bytes) for cache sizes of
1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; larger blocks reduce the miss rate
at first, but in the smaller caches the miss rate rises again at the
largest block sizes.]
     Handling Cache Misses

• If a memory access (instruction or data)
  causes a miss we must stall the pipeline
• While the pipeline is stalled the memory
  controller fills the cache with the block
  from memory
• A pipeline stall in this case is simpler to
  implement since we can freeze the entire
  pipeline while the block is fetched

Handling Writes

• Write-through: Writes store the new value both in
  the cache and in memory; cache and memory
  remain consistent at all times
• In order to avoid waiting for memory the data can
  be placed into a write buffer so the processor can
  continue working while the write occurs
• Write-back: Writes store the new value in cache
  only; when the cache block is replaced its contents
  are written to memory
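The two policies can be contrasted with a toy one-block model (all names are illustrative; the write buffer is omitted for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy single-block cache backed by a single memory word. */
static uint32_t memory_word;   /* backing memory              */
static uint32_t cache_word;    /* cached copy                 */
static bool     dirty;         /* write-back bookkeeping only */

/* Write-through: update cache and memory together,
   so the two always agree. */
void write_through(uint32_t v) {
    cache_word  = v;
    memory_word = v;
}

/* Write-back: update the cache only and mark the block dirty;
   memory is temporarily stale. */
void write_back(uint32_t v) {
    cache_word = v;
    dirty = true;
}

/* On replacement, a dirty block must first be written to memory. */
void evict(void) {
    if (dirty) {
        memory_word = cache_word;
        dirty = false;
    }
}
```

The trade-off: write-through keeps memory consistent but pays for every write; write-back pays only when a dirty block is evicted.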
        Cache Performance

• Processor execution time can be divided into time
  spent executing code and time spent waiting on
  memory accesses
• The cost of cache hits is included in the normal
  execution time
• Cycles = Execution cycles + Memory-stall cycles
• This also means that: CPI = Execution CPI +
  Memory-stall CPI
  When Memory Stalls Occur

• An instruction may access memory twice: during
  instruction fetch and during data memory access
• Typically there are separate caches for instructions
  and data in order to prevent structural hazards
• Thus, we can consider stalls during instruction
  fetch (which affect all instructions) separately
  from stalls during data memory access (which
  affect loads and stores)
• Memory-stall cycles = Instruction miss cycles +
  Data miss cycles
        Memory-Stall Cycles

• The number of memory-stall cycles depends on
  the frequency of memory accesses, the cache miss
  rate, and the miss penalty
• Instruction miss cycles = Instruction count × Miss
  rate × Miss penalty
• Data miss cycles = Percentage of loads/stores ×
  Instruction count × Miss rate × Miss penalty
• Divide by the instruction count to get CPI
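Dividing the slide's formulas by the instruction count gives per-instruction stall terms, so the effective CPI can be sketched as below (parameter names and the example rates are illustrative):

```c
/* Effective CPI = base CPI + instruction-miss stalls per instruction
                 + data-miss stalls per instruction. */
double effective_cpi(double base_cpi,
                     double imiss_rate,        /* instruction cache miss rate */
                     double dmiss_rate,        /* data cache miss rate        */
                     double loads_stores_frac, /* fraction of loads/stores    */
                     double miss_penalty)      /* cycles per miss             */
{
    double imiss_stalls = imiss_rate * miss_penalty;  /* every instr fetches */
    double dmiss_stalls = loads_stores_frac * dmiss_rate * miss_penalty;
    return base_cpi + imiss_stalls + dmiss_stalls;
}

/* Example: base CPI 1.0, 2% instruction misses, 4% data misses,
   36% loads/stores, 100-cycle penalty gives about 4.44 CPI, so the
   machine spends most of its cycles stalled on memory. */
```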
             Miss Penalty

• The miss penalty in absolute time is typically
  independent of the clock speed
• This means that higher clock speeds increase
  the miss penalty measured in cycles
• Increasing the clock speed without also
  improving the memory system will yield
  diminishing returns (Amdahl's law)
     Fully Associative Cache

• In a direct-mapped cache each block can go in
  exactly one place
• On the other hand, a fully associative cache allows
  blocks to be placed anywhere in the cache
• Fully associative caches have a lower miss rate,
  but require more hardware
• We must compare the block address against all the
  tags in the cache simultaneously
       Set Associative Cache

• Between direct-mapped and fully associative are
  set associative caches
• The cache entries are divided into sets
• A given block always goes to a specific set, but
  can be placed anywhere within that set
• If there are n entries in a set then it is called
  an n-way set associative cache
• We must compare the block address against all the
  tags in the set simultaneously
Cache Types

[Figure: locating a given block in an eight-block cache of each type. Direct
mapped (blocks 0–7): the block number selects the single entry to search.
Set associative (sets 0–3): the set number selects one set, and all tags in
that set are searched. Fully associative: every tag in the cache is
searched.]
Associativity vs. Cache Size

[Figure: the same eight-block cache organized four ways: one-way set
associative (direct mapped) with eight blocks; two-way set associative with
four sets; four-way set associative with two sets; and eight-way set
associative (fully associative) with a single set.]
          Searching a Cache

• As before the block address is decomposed into a
  tag, an index, and a block offset
• Now the index selects a set instead of a
  specific entry
• All tags in the set are compared to the block
  address tag in parallel
• As the associativity increases, the tags get larger
  and the indexes smaller
• In a fully associative cache there is no index; the
  entire block address (minus block offset) is the tag
4-Way Set Associative Cache

[Figure: a four-way set associative cache for 32-bit addresses, with a
22-bit tag (address bits 31–10) and an 8-bit index (bits 9–2) selecting
one of 256 sets; the four tags in the set are compared in parallel and a
4-to-1 multiplexor selects the data from the way that hits.]
           Block Replacement

• On a cache miss, we need to insert a new block into the
  cache by replacing an old one
• In the direct-mapped cache, a block can only go in one
  place, so we must replace the block that's already there
• In a set/fully associative cache we have a choice of which
  block to replace
• A common scheme for choosing the block to replace is
  least-recently used (LRU) which replaces the block that
  has been unused for the longest time
• Another option is random replacement
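Set selection, parallel tag search, and LRU replacement fit together as in this minimal two-way model (sizes and names are illustrative; with only two ways, LRU reduces to "the way not touched last"):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 4
#define WAYS     2                      /* 2-way set associative */

struct way { bool valid; uint32_t tag; };

static struct way sets[NUM_SETS][WAYS];
static int lru_way[NUM_SETS];           /* way to evict next */

/* Access a block address: returns true on a hit; on a miss, installs
   the block by replacing the least-recently used way in its set. */
bool access_block(uint32_t block_addr) {
    unsigned s   = block_addr % NUM_SETS;   /* set index */
    uint32_t tag = block_addr / NUM_SETS;   /* remaining address bits */

    /* Compare against all tags in the set (hardware does this in parallel). */
    for (int w = 0; w < WAYS; w++) {
        if (sets[s][w].valid && sets[s][w].tag == tag) {
            lru_way[s] = 1 - w;             /* the other way is now LRU */
            return true;                    /* hit */
        }
    }

    int victim = lru_way[s];                /* replace the LRU way */
    sets[s][victim].valid = true;
    sets[s][victim].tag   = tag;
    lru_way[s] = 1 - victim;
    return false;                           /* miss */
}
```

Blocks 0, 4, and 8 all map to set 0 here, so the model also shows how an extra way defers, but does not eliminate, conflict evictions.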
          Multilevel Caches

• We can reduce the miss penalty by adding a
  second level of cache to the hierarchy
• If a memory access misses in the primary cache,
  we check the secondary cache
• Only when we miss in both caches do we have to
  wait for main memory
• Typically the primary (L1) cache is on-chip while
  the secondary (L2) cache is off-chip
• We can optimize the L1 cache to reduce hit time
  and optimize the L2 cache to reduce the miss rate
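With two levels, the effective L1 miss penalty becomes the L2 access time plus the fraction of L1 misses that also miss in L2 times the memory penalty. A sketch with illustrative numbers:

```c
/* Two-level average memory access time:
   AMAT = L1 hit time
        + L1 miss rate * (L2 hit time + L2 local miss rate * memory penalty) */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_local_miss_rate,
                      double mem_penalty)
{
    double l1_penalty = l2_hit + l2_local_miss_rate * mem_penalty;
    return l1_hit + l1_miss_rate * l1_penalty;
}

/* Example: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% of L1
   misses also missing in L2, 100-cycle memory penalty:
   1 + 0.05 * (10 + 0.2 * 100) = about 2.5 cycles on average,
   versus 6 cycles with no L2 at all. */
```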
    Causes of Cache Misses

• Compulsory misses occur the first time a
  block is accessed
• Capacity misses occur when there is not
  enough space in the cache to store all the
  blocks that are being used
• Conflict (or collision) misses occur when
  multiple blocks contend for the same set
