The Memory Hierarchy

[Cartoon: two students reading. "It says here the choices are 'large and slow', or 'small and fast'." / "Sounds like something that a little $ could fix."]
6.004 – Fall 2002                     11/7/02                   L18 – Memory Hierarchy 1
What we want in a memory

[Diagram: Beta CPU beside a MEMORY block. PC drives ADDR and INST returns on DOUT; MADDR drives ADDR and MDATA connects to DIN/DOUT.]

                 Capacity          Latency   Cost
Register         100's of bits     20 ps     $$$$
SRAM             100's of Kbytes    1 ns     $$$
DRAM             100's of Mbytes   40 ns     $
Hard disk*       10's of Gbytes    10 ms     ¢
Want             100 Mbytes         1 ns     cheap
* non-volatile
SRAM Memory Cell

[Circuit: 6-T SRAM cell, a static bistable storage element (cross-coupled inverters) joined through access FETs to word lines N and N+1 and to the bit / complement bit lines. Annotations: "Good, but slow 0", "Slow and almost 1", "Strong 1", "Strong 0".]

There are two bit-lines per column; one supplies the bit, the other its complement.

On a Read Cycle:
  A single word line is activated (driven to "1"), and the access transistors enable the selected cells, and their complements, onto the bit lines.

Writes are similar to reads, except the bit-lines are driven with the desired value of the cell. The write has to "overpower" the original contents of the memory cell. (Doesn't this violate our discipline?)
Tricks to make SRAMs fast

[Circuit: SRAM cell with bit lines precharged to VDD through clocked pull-ups, and a clocked Differential Sense Amp spanning bit and its complement.]

Forget that it is a digital circuit:
1) Precharge the bit lines prior to the read (for instance, while the address is being decoded), because the access FETs are good pull-downs but poor pull-ups.
2) Use a differential amplifier to "sense" the difference between the two bit-lines long before they reach valid logic levels.
Multiport SRAMs (a.k.a. Register Files)

[Circuit: SRAM cell with one write port (wd) and two read ports (rd0, rd1). Transistor sizes: PU = 2/1, PD = 4/1 on the write side; PU = 2/2, PD = 2/3 on the storage inverter. An extra series transistor isolates the storage node so that it won't flip unintentionally.]

One can increase the number of SRAM ports by adding access transistors. By carefully sizing the inverter pair, so that one is strong and the other weak, we can assure that our WRITE bus will only fight with the weaker one, while the READs are driven by the stronger one, thus minimizing both access and write times.

What is the cost per cell of adding a new read or write port?

1-T Dynamic RAM

Six transistors per cell may not sound like much, but they can add up quickly. What is the fewest number of transistors that can be used to store a bit?

[Cross-section: 1-T DRAM cell. The word line gates an access FET connecting the bit line to an explicit storage capacitor, built from a TiN top electrode (at VREF), a Ta2O5 dielectric, and a W bottom electrode over poly.]
C in the storage capacitor is determined by C = εA/d: a better dielectric raises ε, more area raises A, and a thinner film reduces d.
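As a rough illustration of the parallel-plate formula above, a sketch in Python. The device numbers here are assumptions chosen for the example, not figures from the lecture:

```python
# Parallel-plate estimate of DRAM storage capacitance: C = eps * A / d.
# All device numbers below are illustrative assumptions.
EPS0 = 8.854e-12   # F/m, permittivity of free space
eps_r = 25         # relative permittivity, roughly that of Ta2O5
A = 1.0e-12        # m^2, effective (3-D) plate area: 1.0 um^2, assumed
d = 10e-9          # m, dielectric film thickness: 10 nm, assumed

C = eps_r * EPS0 * A / d
print(f"C = {C*1e15:.1f} fF")   # prints: C = 22.1 fF
```

Raising eps_r, raising A, or shrinking d all increase C, which is exactly the knob list on the slide.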

Tricks for increasing throughput (but, alas, not latency)

Multiplexed Address (row first, then column): the first thing that should pop into your mind when asked to speed up a digital design…

[Diagram: array of 2^N word lines (rows) by 2^M bit lines (columns). N row-address bits drive a Row Address Decoder that activates one word line; M column-address bits drive a Column Multiplexer/Shifter that selects one bit of the open row onto Data out. Timing t1…t4: successive column accesses within a single open row.]

Once a row is open, successive columns can be clocked out without re-decoding the row:
  Synchronous DRAM (SDRAM)
  Double-data-rate synchronous DRAM (DDRAM)
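The row-first, column-second split can be sketched as follows; the field widths are assumptions for illustration, not the lecture's:

```python
# Split a flat word address into DRAM row and column fields.
# Widths are illustrative: N = 12 row bits, M = 10 column bits.
N_ROW_BITS = 12
M_COL_BITS = 10

def split_address(addr):
    """Row address is sent first, then the column address."""
    row = addr >> M_COL_BITS              # high-order bits pick the row
    col = addr & ((1 << M_COL_BITS) - 1)  # low-order bits pick the column
    return row, col

# Sequential addresses share a row, so a burst can stream the open row:
addrs = [0x3FF00 + i for i in range(4)]
rows = {split_address(a)[0] for a in addrs}
print(rows)   # a single row: one activation serves the whole burst
```

This is why throughput improves while latency does not: the first access still pays the full row-activation time.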

Hard Disk Drives

Typical high-end drive:
• Average latency = 4 ms
• Average seek time = 9 ms
• Transfer rate = 20 Mbytes/sec
• Capacity = 60 Gbytes
• Cost = $99 (down from $180)
Quantity vs Quality…

Your memory system can be
  • BIG and SLOW... or
  • SMALL and FAST.

We've explored a range of circuit-design trade-offs.

[Plot: cost (log scale) vs. access time (log scale, 10^-8 s to 10 s). SRAM is fast and expensive; DISK and then TAPE are slow and cheap.]

Is there an ARCHITECTURAL solution to this DILEMMA?
                      Best of Both Worlds

           What we WANT: A BIG, FAST memory!

           We’d like to have a memory system that
              • PERFORMS like 32 MBytes of SRAM; but
              • COSTS like 32 MBytes of slow memory.

           SURPRISE: We can (nearly) get our wish!

           KEY: Use a hierarchy of memory technologies:

             [Diagram: a small SRAM in front of a larger MEM]

                                     Key IDEA
                    • Keep the most often-used data in a small, fast
                      SRAM (often local to CPU chip)
                    • Refer to Main Memory only rarely, for
                      remaining data.
                    • The reason this strategy works: LOCALITY

                        Locality of Reference:
                          Reference to location X at time t implies that
                            reference to location X+∆X at time t+∆t
                            becomes more probable as ∆X and ∆t
                            approach zero.

Memory Reference Patterns

[Plot: memory address vs. time for a running program, showing clustered bands of data and stack references; the set S is highlighted over an interval ∆t.]

S is the set of locations accessed during ∆t.
Working set: a set S which changes slowly with respect to access time.
Working set size: |S|
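The working-set idea can be made concrete with a small sketch; the reference trace and window size below are made up for illustration:

```python
# Working set: the set of distinct locations referenced in a sliding
# window of the last `dt` references. Trace values are illustrative.
def working_set(trace, t, dt):
    """Set S of locations touched during the window (t - dt, t]."""
    return set(trace[max(0, t - dt):t])

# A tight loop touching a few locations has a small, slowly changing
# working set, which is exactly what makes a small fast memory pay off:
trace = [100, 101, 102, 37, 100, 101, 102, 37, 100, 101, 102, 37]
S = working_set(trace, t=12, dt=8)
print(sorted(S), len(S))   # [37, 100, 101, 102] 4
```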

Exploiting the Memory Hierarchy

Approach 1 (Cray, others): Expose Hierarchy
  • Registers, Main Memory, Disk each available as storage alternatives;
  • Tell programmers: "Use them cleverly"
  [Diagram: CPU with an explicit on-chip SRAM]

Approach 2: Hide Hierarchy
  • Programming model: SINGLE kind of memory, single address space.
  • Machine AUTOMATICALLY assigns locations to fast or slow memory, depending on usage patterns.
  [Diagram: CPU, a Static RAM marked "X?", Dynamic RAM ("MAIN MEMORY"), and HARD DISK ("SWAP SPACE")]
The Cache Idea:
Program-Transparent Memory Hierarchy

[Diagram: the CPU sends all (1.0) references to a small "CACHE"; the fraction (1.0 − α) that miss go on to "MAIN MEMORY" (dynamic RAM).]

The cache contains TEMPORARY COPIES of selected main memory locations... e.g. Mem[100] = 37

1) Improve the average access time
     α       HIT RATIO: fraction of references found in the CACHE.
   (1 − α)   MISS RATIO: the remaining references.

   t_ave = α·t_c + (1 − α)(t_c + t_m) = t_c + (1 − α)·t_m

   Goal: make the hit ratio as high as possible.
2) Transparency (compatibility, programming ease)
How High of a Hit Ratio?

Suppose we can easily build an on-chip static memory with a 4 ns access time, but the fastest dynamic memories that we can buy for main memory have an average access time of 40 ns. How high a hit rate do we need to sustain an average access time of 5 ns?

   α = 1 − (t_ave − t_c)/t_m = 1 − (5 − 4)/40 = 97.5%
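A quick check of the arithmetic, using the t_ave formula from the previous slide:

```python
# Average access time of a cache in front of main memory:
#   t_ave = t_c + (1 - alpha) * t_m
def t_ave(alpha, t_c, t_m):
    return t_c + (1 - alpha) * t_m

def hit_ratio_needed(t_target, t_c, t_m):
    """Solve t_ave = t_target for alpha."""
    return 1 - (t_target - t_c) / t_m

alpha = hit_ratio_needed(t_target=5, t_c=4, t_m=40)   # times in ns
print(f"alpha = {alpha:.1%}")           # alpha = 97.5%
print(t_ave(alpha, t_c=4, t_m=40))      # ~5.0 ns, as required
```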

The Cache Principle

Find "Bitdiddle, Ben"

[Cartoon: a nearby file with a 5-second access time, in front of a wall of cabinets with a 5-minute access time.]

ALGORITHM: Look nearby for the requested information first; if it's not there, check secondary storage.
Basic Cache Algorithm

[Diagram: CPU beside a cache of (Tag, Data) lines, e.g. (A, Mem[A]) and (B, Mem[B]), backed by MAIN MEMORY.]

ON REFERENCE TO Mem[X]: Look for X among the cache tags...

HIT: X = TAG(i), for some cache line i
  • READ:  return DATA(i)
  • WRITE: change DATA(i); start Write to Mem[X]

MISS: X not found in TAG of any cache line
  • REPLACEMENT SELECTION: select some line k to hold Mem[X] (Allocation)
  • READ:  read Mem[X]; set TAG(k) = X, DATA(k) = Mem[X]
  • WRITE: start Write to Mem[X]; set TAG(k) = X, DATA(k) = new Mem[X]

QUESTION: How do we "search" the cache?
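The HIT/MISS cases above can be sketched as a toy write-through cache model. This is a fully associative toy with random replacement, purely for illustration; the names and sizes are assumptions:

```python
import random

# Toy fully associative, write-through cache implementing the slide's
# HIT/MISS cases. `mem` (a dict) stands in for main memory.
class Cache:
    def __init__(self, nlines, mem):
        self.lines = {}       # TAG -> DATA, at most nlines entries
        self.nlines = nlines
        self.mem = mem

    def _allocate(self):
        """REPLACEMENT SELECTION: make room for a new line."""
        if len(self.lines) >= self.nlines:
            del self.lines[random.choice(list(self.lines))]

    def read(self, x):
        if x in self.lines:           # HIT: return DATA(i)
            return self.lines[x]
        self._allocate()              # MISS: read Mem[X] into line k
        self.lines[x] = self.mem[x]
        return self.lines[x]

    def write(self, x, value):
        if x not in self.lines:
            self._allocate()
        self.lines[x] = value         # change DATA
        self.mem[x] = value           # start Write to Mem[X] (write-through)

mem = {100: 37, 101: 42}
c = Cache(nlines=2, mem=mem)
print(c.read(100))    # 37  (miss, then cached)
c.write(100, 99)
print(mem[100])       # 99  (write-through keeps memory up to date)
```

The dictionary lookup `x in self.lines` hides exactly the question the slide asks: how does hardware "search" the cache?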
                    Associativity: Parallel Lookup
       Find “Bitdiddle, Ben”

                               Nope, “Smith”
                                 Nope, “Jones”

                                          HERE IT IS!

                                         Nope, “Bitwit”

Fully-Associative Cache

[Diagram: the incoming Address is compared (=?) against the TAG of every cache line in parallel; a match asserts HIT and gates that line's Data to Out.]

The extreme in associativity:
  All comparisons made in parallel
  Any data item could be located in any cache location
Direct-Mapped Cache

Find "Bitdiddle, Ben"

NO Parallelism:
  Look in JUST ONE place, determined by parameters of the incoming request (address bits)
  ... can use ordinary RAM for the cache tags & data
The Problem with Collisions

Find "Bitwit"... Find "Bituminous"... Find "Bitdiddle": "Nope, I've got 'BITWIT' under 'B'."

Contention among B's... each competes for the same cache line!
  • CAN'T cache both "Bitdiddle" & "Bitwit"
  ... Suppose B's tend to come at once?

BETTER IDEA: File by LAST letter!
Optimizing for Locality:
  selecting on statistically independent bits

[Cartoon: the cabinets are now filed by last letter. Find "Bitdiddle": here it is, under "E". Find "Bitwit": here it is, under "T".]

LESSON: Choose the CACHE LINE from independent parts of the request to MINIMIZE CONFLICT given locality patterns...

IN CACHE: Select the line using LOW-ORDER address bits!

Does this ELIMINATE contention?
Direct-Mapped Cache

Low-cost extreme:
  Single comparator
  Use ordinary (fast) static RAM for cache tags & data:

[Diagram: the Incoming Address splits into T upper-address (tag) bits and a K-bit Cache Index. The index addresses a K x (T + D)-bit static RAM holding a Tag and a D-bit data word per line; the stored tag is compared (=?) with the upper address bits to produce HIT, and the stored word drives Data Out.]

QUESTION: Why not use HIGH-order bits as the Cache Index?
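The tag/index split can be sketched as follows; the widths are illustrative, not prescribed by the slide:

```python
# Direct-mapped lookup: low-order K bits index the line, the remaining
# upper bits are the tag. K = 10 here (1024 lines), purely illustrative.
K = 10
NLINES = 1 << K

def split(addr):
    index = addr & (NLINES - 1)   # low-order bits -> cache line
    tag = addr >> K               # upper bits, stored and compared as tag
    return tag, index

tags = [None] * NLINES            # one tag per line (ordinary RAM)
data = [None] * NLINES            # one data word per line

def lookup(addr):
    tag, index = split(addr)
    hit = tags[index] == tag      # the single comparator
    return hit, data[index] if hit else None

# Addresses 1024 apart share an index and so collide on the same line:
print(split(37), split(1024 + 37))   # (0, 37) (1, 37)
```

Using low-order bits as the index means nearby addresses land on different lines, which is the point of the previous slide.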
Contention, Death, and Taxes...

Find "Bitdiddle": "Nope, I've got 'BITTWIDDLE' under 'E'; I'll replace it."
Find "Bittwiddle": "Nope, I've got 'BITDIDDLE' under 'E'; I'll replace it."

LESSON: In a non-associative cache, SOME pairs of addresses must compete for cache lines...
... if the working set includes such pairs, we get THRASHING and poor performance.
Direct-Mapped Cache Contention

Assume a 1024-line direct-mapped cache, 1 word/line. Consider a tight loop, at steady state (assume WORD, not BYTE, addressing):

                      Memory    Cache    Hit/
                      Address    Line    Miss
Loop A:                1024       0      HIT       GREAT here…
 Pgm at 1024,            37      37      HIT
 data at 37:           1025       1      HIT
                         38      38      HIT
                       1026       2      HIT
                         39      39      HIT
                       1024       0      HIT
                        ...

Loop B:                1024       0      MISS      … but not here!
 Pgm at 1024,          2048       0      MISS
 data at 2048:         1025       1      MISS
                       2049       1      MISS
                       1026       2      MISS
                       2050       2      MISS
                       1024       0      MISS
                        ...

We need some associativity, but not full associativity…
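The two traces can be replayed against a minimal direct-mapped model of the slide's scenario:

```python
# Steady-state hit/miss behavior of a 1024-line direct-mapped cache
# (1 word/line, word addressing) on the slide's two loops.
NLINES = 1024

def simulate(trace, passes=4):
    tags = {}                  # line index -> stored tag
    hits = misses = 0
    for _ in range(passes):
        for addr in trace:
            line, tag = addr % NLINES, addr // NLINES
            if tags.get(line) == tag:
                hits += 1
            else:
                misses += 1
                tags[line] = tag   # evict whatever was there
    return hits, misses

loop_a = [1024, 37, 1025, 38, 1026, 39]        # pgm at 1024, data at 37
loop_b = [1024, 2048, 1025, 2049, 1026, 2050]  # pgm at 1024, data at 2048

print(simulate(loop_a))   # after warm-up, every reference hits
print(simulate(loop_b))   # 1024 and 2048 both map to line 0: all misses
```

Loop B thrashes because program and data words are exactly 1024 apart, so each pair fights over the same line forever.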
Set Associative Approach...

Find "Byte"... Find "Bidittle"... Find "Bitdiddle": modest parallelism.

[Cartoon: several cabinets searched at once. Two report "Nope, I've got ... under 'E'"; one reports "HIT! Here's ..."]
N-way Set-Associative Cache

Can store N colliding entries at once!

[Diagram: the address splits into a k-bit set index and a t-bit tag. The index selects one line from each of the N ways; N comparators (=?) check the stored tags against t in parallel, and the matching way's data is gated to DATA OUT.]
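A minimal model of the N-way lookup; the sizes and the FIFO replacement choice are illustrative assumptions:

```python
# N-way set-associative cache: each set holds up to N (tag, data) ways,
# all checked "in parallel" on a lookup. Sizes here are illustrative.
NWAYS = 2
NSETS = 512          # k = 9 index bits

def split(addr):
    return addr // NSETS, addr % NSETS   # (tag, set index)

sets = [[] for _ in range(NSETS)]        # each set: list of (tag, data)

def access(addr, value=None):
    """Return (hit, data); on a miss, install with FIFO replacement."""
    tag, idx = split(addr)
    for t, d in sets[idx]:               # stands in for the N comparators
        if t == tag:
            return True, d
    if len(sets[idx]) >= NWAYS:          # set full: evict the oldest way
        sets[idx].pop(0)
    sets[idx].append((tag, value))
    return False, value

# Two addresses that collide in a direct-mapped cache coexist here:
access(1024, "pgm")                  # miss: installed in one way
access(2048, "dat")                  # miss: same set, second way
print(access(1024), access(2048))    # both now hit
```

With two ways per set, the Loop B thrashing from the previous slide disappears: program and data words share a set but occupy different ways.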
                                  Things to Cache
           •    What we’ve got: basic speed/cost tradeoffs.
           •    Need to exploit a hierarchy of technologies
           •    Key: Locality. Look for “working set”, keep in fast memory.
           •    Transparency as a goal
           •    Transparent caches: hits, misses, hit/miss ratios
           •    Associativity: performance at a cost. Data points:
                    • Fully associative caches: no contention, prohibitive cost.
                    • Direct-mapped caches: mostly just fast RAM. Cheap, but has
                      contention problems.
                    • Compromise: set-associative cache. Modest parallelism handles
                      contention between a few overlapping “hot spots”, at modest cost.

