Memory Hierarchy of Computers & Hardware
Description
A level-by-level description of the memory hierarchy in a computer, covering each type of memory with examples and pictures.
Shared by: seldeepak
Posted: 2/15/2010. Language: English. Pages: 28.
The Memory Hierarchy
“It says here the choices are ‘large and slow’, or ‘small and fast’.”
“Sounds like something that a little $ could fix.”
6.004 – Fall 2002 11/7/02 L18 – Memory Hierarchy 1
What we want in a memory
[Figure: BETA connected to MEMORY; PC → ADDR, INST ← DOUT, MADDR → ADDR, MDATA ↔ DIN/DOUT]
            Capacity          Latency   Cost
Register    100’s of bits     20 ps     $$$$
SRAM        100’s of Kbytes   1 ns      $$$
DRAM        100’s of Mbytes   40 ns     $
Hard disk*  10’s of Gbytes    10 ms     ¢
Want        100 Mbytes        1 ns      cheap
* non-volatile
SRAM Memory Cell
[Figure: 6-T SRAM cell: a static bistable storage element holding 0 and 1, connected through access FETs to the two bit lines (bit and its complement), with word lines N and N+1. Annotations on the access FETs: “Good, but slow 0” and “Slow and almost 1”.]
There are two bit-lines per column: one supplies the bit, the other its complement.
On a Read Cycle: A single word line is activated (driven to “1”), and the access transistors enable the selected cells, and their complements, onto the bit lines.
Writes are similar to reads, except the bit-lines are driven with the desired value of the cell. The writing has to “overpower” the original contents of the memory cell.
Doesn’t this violate our static discipline?
Tricks to make SRAMs fast
Forget that it is a digital circuit.
1) Precharge the bit lines prior to the read (for instance, while the address is being decoded), because the access FETs are good pull-downs and poor pull-ups.
2) Use a differential amplifier to “sense” the difference in the two bit-lines long before they reach valid logic levels.
[Figure: SRAM cell whose two bit lines are precharged (or tied to VDD) and feed a clocked, cross-coupled differential sense amp (clk) that produces rdata; write data is driven onto the bit lines.]
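Multiport SRAMs (a.k.a. Register Files)
[Figure: multiport SRAM cell with data lines wd, rd0, rd1 and word lines write, read0, read1. Inverter sizing labels: PU = 2/1, PD = 4/1 and PU = 2/2, PD = 2/3; other transistor ratios shown: 2/1, 4/1, 2/1, 5/1. Annotation: “This transistor isolates the storage node so that it won’t flip unintentionally.”]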
One can increase the number of SRAM ports by adding access transistors. By carefully sizing the inverter pair, so that one is strong and the other weak, we can ensure that our WRITE bus will only fight with the weaker one, and the READs are driven by the stronger one, thus minimizing both access and write times.
What is the cost per cell of adding a new read or write port?
1-T Dynamic RAM
Six transistors/cell may not sound like much, but they can
add up quickly. What is the fewest number of transistors
that can be used to store a bit?
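[Figure: 1-T DRAM cell: a poly access FET, gated by the word line, connects the bit line to an explicit storage capacitor referenced to VREF. Cross-section labels: TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode.]
C in storage capacitor determined by:
\[ C = \frac{\varepsilon A}{d} \]
better dielectric (larger ε), more area (larger A), thinner film (smaller d)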
Tricks for increasing throughput
but, alas, not latency
The first thing that should pop into your mind when asked to speed up a digital design… PIPELINING
• Multiplexed Address (row first, then column)
• Synchronous DRAM (SDRAM)
• Double-clocked Synchronous DRAM (DDRAM)
[Figure: DRAM array of 2^N rows by 2^M columns of one-bit memory cells. An N-bit row address feeds the Row Address Decoder, which drives the word lines; the bit lines feed an M-bit Column Multiplexer/Shifter. Timing diagram: Clock and Data out at t1, t2, t3, t4.]
Hard Disk Drives
Typical high-end drive:
• Average latency = 4 ms
• Average seek time = 9 ms
• Transfer rate = 20 Mbytes/sec
• Capacity = 60 Gbytes
• Cost = $180 $99
figures from www.pctechguide.com
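To get a feel for what these figures imply, the following sketch estimates the time to read one block from a random location as seek time plus latency plus transfer time. The function name, the 4096-byte block size, and the reading of "20M bytes" as 20 × 10^6 bytes are assumptions for illustration, and "average latency" is taken to mean rotational latency.

```python
def disk_read_time_ms(nbytes, seek_ms=9.0, latency_ms=4.0, rate_bytes_per_s=20e6):
    """Estimate the time in ms to read nbytes from a random disk location.

    Defaults are the 'typical high-end drive' figures from the slide.
    """
    transfer_ms = nbytes / rate_bytes_per_s * 1000.0
    return seek_ms + latency_ms + transfer_ms


if __name__ == "__main__":
    # Hypothetical 4 KB block: positioning time dominates the transfer time.
    print(f"4 KB read: {disk_read_time_ms(4096):.2f} ms")
```

Under these assumptions almost all of the time goes to positioning the head rather than moving data, which is why the disk sits at the "big and slow" end of the hierarchy.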
Quantity vs Quality…
Your memory system can be
• BIG and SLOW... or
• SMALL and FAST.
We’ve explored a range of
circuit-design trade-offs.
[Figure: log-log plot of cost ($/MB, from .001 to 100) against Access Time (from 10^-8 to 100); SRAM, DRAM, DISK, and TAPE appear in order of decreasing cost.]
Is there an ARCHITECTURAL solution to this DILEMMA?
Best of Both Worlds
What we WANT: A BIG, FAST memory!
We’d like to have a memory system that
• PERFORMS like 32 MBytes of SRAM; but
• COSTS like 32 MBytes of slow memory.
SURPRISE: We can (nearly) get our wish!
KEY: Use a hierarchy of memory technologies:
[Figure: CPU ↔ SRAM ↔ MAIN MEM]
Key IDEA
• Keep the most often-used data in a small, fast
SRAM (often local to CPU chip)
• Refer to Main Memory only rarely, for
remaining data.
• The reason this strategy works: LOCALITY
Locality of Reference:
Reference to location X at time t implies that
reference to location X+∆X at time t+∆t
becomes more probable as ∆X and ∆t
approach zero.
Memory Reference Patterns
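S is the set of locations accessed during ∆t.
Working set: a set S which changes slowly wrt access time.
Working set size, |S|
[Figure: plot of address against time showing bands of references for stack, data, and program, with an interval ∆t marked; a second plot shows |S| against ∆t.]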
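The definition of S can be made concrete with a short sketch that measures |S| over a sliding window of the most recent references. The trace below is a made-up example (a three-instruction loop walking through an array), and measuring ∆t as a count of references rather than wall-clock time is an assumption.

```python
def working_set_sizes(trace, window):
    """For each position t in the trace, return |S|: the number of distinct
    addresses among the last `window` references (the interval delta-t)."""
    sizes = []
    for t in range(len(trace)):
        start = max(0, t - window + 1)
        sizes.append(len(set(trace[start:t + 1])))
    return sizes


# Hypothetical trace: instructions at 100..102, data walking upward from 500.
trace = []
for i in range(4):
    trace += [100, 101, 500 + i, 102]

if __name__ == "__main__":
    print(working_set_sizes(trace, window=4))
    print(working_set_sizes(trace, window=8))
```

In this trace the working set stays small compared with the number of references made, which is the property a small, fast memory can exploit.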
Exploiting the Memory Hierarchy
Approach 1 (Cray, others): Expose Hierarchy
• Registers, Main Memory, Disk each available as storage alternatives;
• Tell programmers: “Use them cleverly”
[Figure: CPU connected to SRAM and MAIN MEM]
Approach 2: Hide Hierarchy
• Programming model: SINGLE kind of memory, single address space.
• Machine AUTOMATICALLY assigns locations to fast or slow memory, depending on usage patterns.
[Figure: CPU issues request X? to a Small Static “CACHE”, backed by Dynamic RAM “MAIN MEMORY”, backed by HARD DISK “SWAP SPACE”]
The Cache Idea:
Program-Transparent Memory Hierarchy
[Figure: CPU ↔ “CACHE” ↔ DYNAMIC RAM “MAIN MEMORY”; a fraction 1.0 of references reach the cache and (1.0-α) go on to main memory; the cache holds the entry 100 → 37.]
Cache contains TEMPORARY COPIES of selected main memory locations... e.g. Mem[100] = 37
GOALS:
1) Improve the average access time
α HIT RATIO: Fraction of refs found in CACHE.
(1-α) MISS RATIO: Remaining references.
\[ t_{ave} = \alpha t_c + (1 - \alpha)(t_c + t_m) = t_c + (1 - \alpha) t_m \]
Challenge: make the hit ratio as high as possible.
2) Transparency (compatibility, programming ease)
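The average access time formula can be tried out numerically with a few lines of Python. This is only a sketch: the function name is made up, t_c and t_m are read as the cache and main memory access times, and the 4 ns and 40 ns values are illustrative.

```python
def average_access_time(hit_ratio, t_cache, t_main):
    """t_ave = alpha*t_c + (1 - alpha)*(t_c + t_m) = t_c + (1 - alpha)*t_m.

    Every reference pays t_cache; only misses also pay t_main.
    """
    return t_cache + (1.0 - hit_ratio) * t_main


if __name__ == "__main__":
    # Illustrative values: 4 ns cache, 40 ns main memory.
    for alpha in (0.50, 0.90, 0.99):
        t = average_access_time(alpha, t_cache=4.0, t_main=40.0)
        print(f"hit ratio {alpha:.2f}: {t:.1f} ns")
```

The miss penalty is weighted by (1 − α), so a small drop in hit ratio costs a lot when t_m is much larger than t_c.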
How High of a Hit Ratio?
Suppose we can easily build an on-chip static memory with a 4 ns access time, but the fastest dynamic memories that we can buy for main memory have an average access time of 40 ns. How high of a hit rate do we need to sustain an average access time of 5 ns?
\[ \alpha = 1 - \frac{t_{ave} - t_c}{t_m} = 1 - \frac{5 - 4}{40} = 97.5\% \]
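The same calculation can be checked in code by solving t_ave = t_c + (1 − α) t_m for α. The function name is an assumption; the numbers are the ones from the question above.

```python
def required_hit_ratio(t_ave, t_cache, t_main):
    """Solve t_ave = t_c + (1 - alpha) * t_m for the hit ratio alpha."""
    if t_ave < t_cache:
        raise ValueError("average access time cannot be below the cache access time")
    return 1.0 - (t_ave - t_cache) / t_main


if __name__ == "__main__":
    alpha = required_hit_ratio(t_ave=5.0, t_cache=4.0, t_main=40.0)
    print(f"required hit ratio: {alpha:.1%}")
```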
The Cache Principle
Find “Bitdiddle, Ben”
[Figure: two places to look, one captioned “5-Minute Access Time”, the other “5-Second Access Time”]
ALGORITHM: Look nearby for the requested information first; if it’s not there, check secondary storage.
Basic Cache Algorithm
ON REFERENCE TO Mem[X]: Look for X among cache tags...
HIT: X = TAG(i), for some cache line i
• READ: return DATA(i)
• WRITE: change DATA(i); Start Write to Mem(X)
MISS: X not found in TAG of any cache line
• REPLACEMENT SELECTION: Select some line k to hold Mem[X] (Allocation)
• READ: Read Mem[X]; Set TAG(k)=X, DATA(k)=Mem[X]
• WRITE: Start Write to Mem(X); Set TAG(k)=X, DATA(k)= new Mem[X]
[Figure: CPU ↔ cache table with Tag and Data columns (A → Mem[A], B → Mem[B]) ↔ MAIN MEMORY, which sees the (1−α) fraction of references]
QUESTION: How do we “search” the cache?
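The algorithm above can be sketched in Python as a small write-through cache in front of a dictionary standing in for main memory. The class and method names are hypothetical, the sequential tag search is a software stand-in for whatever lookup hardware is used, and round-robin replacement is an assumption (the slide only says to "select some line k").

```python
class SimpleCache:
    """Tag/data cache in front of a dict-based main memory (write-through)."""

    def __init__(self, num_lines, memory):
        self.tags = [None] * num_lines   # TAG(i): address held by line i
        self.data = [None] * num_lines   # DATA(i): cached copy of Mem[TAG(i)]
        self.memory = memory
        self.next_victim = 0             # round-robin replacement (assumption)
        self.hits = 0
        self.misses = 0

    def _find(self, x):
        """Look for X among the cache tags; return the line index or None."""
        for i, tag in enumerate(self.tags):
            if tag == x:
                return i
        return None

    def _allocate(self):
        """REPLACEMENT SELECTION: pick some line k to hold Mem[X]."""
        k = self.next_victim
        self.next_victim = (k + 1) % len(self.tags)
        return k

    def read(self, x):
        i = self._find(x)
        if i is not None:                # HIT: return DATA(i)
            self.hits += 1
            return self.data[i]
        self.misses += 1                 # MISS: read Mem[X], fill line k
        k = self._allocate()
        self.tags[k] = x
        self.data[k] = self.memory[x]
        return self.data[k]

    def write(self, x, value):
        i = self._find(x)
        if i is None:                    # MISS: allocate a line for X
            self.misses += 1
            i = self._allocate()
            self.tags[i] = x
        else:
            self.hits += 1
        self.data[i] = value             # change DATA(i)
        self.memory[x] = value           # "Start Write to Mem(X)"


if __name__ == "__main__":
    memory = {100: 37, 200: 5}
    cache = SimpleCache(num_lines=2, memory=memory)
    print(cache.read(100), cache.read(100), cache.hits, cache.misses)
```

Because every write is also sent to memory, the cache and main memory never disagree in this sketch, which keeps the cache transparent to the program.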
Associativity: Parallel Lookup
Find “Bitdiddle, Ben”
Nope, “Smith”
Nope, “Jones”
HERE IT IS!
Nope, “Bitwit”
Fully-Associative Cache
The extreme in associativity:
All comparisons made in parallel
Any data item could be located in any cache location
[Figure: the Incoming Address is compared (=?) against the TAG of every cache line at once; a match asserts HIT and selects that line’s Data for Data Out.]
Direct-Mapped Cache
(non-associative)
Find “Bitdiddle, Ben”
NO Parallelism: Look in JUST ONE place, determined by parameters of incoming request (address bits)
... can use ordinary RAM as table
[Figure: file drawers labeled A, B, ..., Y, Z]
The Problem with Collisions
Find “Bitwit”
Find “Bituminous”
Find “Bitdiddle”
Nope, I’ve got “BITWIT” under “B”
PROBLEM: Contention among B’s.... each competes for same cache line!
- CAN’T cache both “Bitdiddle” & “Bitwit”
... Suppose B’s tend to come at once?
BETTER IDEA: File by LAST letter!
[Figure: file drawers labeled A, B, ..., Y, Z]
Optimizing for Locality:
selecting on statistically independent bits
Find “Bitdiddle”: Here’s BITDIDDLE, under E
Find “Bitwit”: Here’s BITWIT, under T
LESSON: Choose CACHE LINE from independent parts of request to MINIMIZE CONFLICT given locality patterns...
IN CACHE: Select line by LOW ORDER address bits!
Does this ELIMINATE contention?
[Figure: file drawers labeled A, ..., Y, Z]
Direct Mapped Cache
Low-cost extreme: Single comparator
Use ordinary (fast) static RAM for cache tags & data: K x (T + D)-bit static RAM
[Figure: the Incoming Address is split into T upper-address bits and a K-bit Cache Index. The index selects one Tag/Data entry; the stored tag is compared (=?) with the upper-address bits to produce HIT, and the D-bit data word goes to Data Out.]
DISADVANTAGE: COLLISIONS
QUESTION: Why not use HIGH-order bits as Cache Index?
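A minimal sketch of the direct-mapped lookup follows: the low-order address bits pick the one line to check and the remaining upper bits are stored and compared as the tag. The names are hypothetical, word addressing is assumed, and only tags are modeled (no data array).

```python
def split_address(addr, index_bits):
    """Split a word address into (tag, index) for a direct-mapped cache."""
    index = addr & ((1 << index_bits) - 1)   # low-order bits select the line
    tag = addr >> index_bits                  # upper bits are kept as the tag
    return tag, index


class DirectMappedCache:
    """Tags only: one line per index value, a single comparison per access."""

    def __init__(self, index_bits):
        self.index_bits = index_bits
        self.tags = [None] * (1 << index_bits)

    def access(self, addr):
        """Return True on a hit; on a miss, the line is replaced."""
        tag, index = split_address(addr, self.index_bits)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit


if __name__ == "__main__":
    print(split_address(1024, 10), split_address(37, 10), split_address(2048, 10))
```

With a 10-bit index, addresses 1024 and 2048 share index 0 and differ only in their tags, so they collide. Using the high-order bits as the index instead would send every address in a contiguous region to the same line.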
Contention, Death, and Taxes...
Find “Bitdiddle”: Nope, I’ve got “BITTWIDDLE” under “E”; I’ll replace it.
Find “Bittwiddle”: Nope, I’ve got “BITDIDDLE” under “E”; I’ll replace it.
LESSON: In a non-associative cache, SOME pairs of addresses must compete for cache lines...
... if working set includes such pairs, we get THRASHING and poor performance.
[Figure: file drawers labeled A, ..., Y, Z]
Direct-Mapped Cache Contention
Assume 1024-line direct-mapped cache, 1 word/line. Consider tight loop, at steady state: (assume WORD, not BYTE, addressing)

Loop A: Pgm at 1024, data at 37:
Memory Address   Cache Line   Hit/Miss
1024             0            HIT
37               37           HIT
1025             1            HIT
38               38           HIT
1026             2            HIT
39               39           HIT
1024             0            HIT
...
Works GREAT here…

Loop B: Pgm at 1024, data at 2048:
Memory Address   Cache Line   Hit/Miss
1024             0            MISS
2048             0            MISS
1025             1            MISS
2049             1            MISS
1026             2            MISS
2050             2            MISS
1024             0            MISS
...
… but not here!

We need some associativity, but not full associativity…
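Both loops can be replayed against a simulated 1024-line direct-mapped cache. This is a sketch: the function names are made up, and the three-instruction loop body is an assumption based on the three addresses listed before the "...".

```python
def direct_mapped_results(trace, num_lines=1024):
    """Return 'HIT' or 'MISS' for each word address in the trace."""
    tags = [None] * num_lines
    results = []
    for addr in trace:
        line = addr % num_lines      # cache line = low-order address bits
        tag = addr // num_lines      # remaining upper bits
        results.append("HIT" if tags[line] == tag else "MISS")
        tags[line] = tag             # on a miss this replaces the old entry
    return results


def loop_trace(pgm, data, length=3, iterations=3):
    """Interleave instruction fetches (pgm, pgm+1, ...) with data references."""
    one_pass = []
    for i in range(length):
        one_pass += [pgm + i, data + i]
    return one_pass * iterations


if __name__ == "__main__":
    # The last iteration approximates steady state.
    print("Loop A:", direct_mapped_results(loop_trace(1024, 37))[-6:])
    print("Loop B:", direct_mapped_results(loop_trace(1024, 2048))[-6:])
```

In Loop B each instruction and its data word map to the same line, so each access evicts the entry the next access needs.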
Set Associative Approach...
... modest parallelism
Find “Byte”
Find “Bidittle”
Find “Bitdiddle”
Nope, I’ve got “Bidittle” under “E”
HIT! Here’s BITDIDDLE!
Nope, I’ve got “Byte” under “E”
[Figure: three sets of file drawers, each labeled A, ..., Y, Z, searched in parallel]
N-way Set-Associative Cache
Can store N colliding entries at once!
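[Figure: the INCOMING ADDRESS is split into a t-bit tag and a k-bit index; the index selects one line in each way, the comparators (=?) check all of the selected tags in parallel, and a match asserts HIT and steers that way’s data to DATA OUT.]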
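As a rough sketch of why this helps, the code below models an N-way set-associative cache (tags only) and replays the thrashing pattern of program words at 1024+ interleaved with data words at 2048+. The class name is hypothetical, and least-recently-used replacement within a set is an assumption, since the slide does not name a policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tags only: `ways` entries per set, LRU replacement within a set."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (allocating the line)."""
        index = addr % self.num_sets     # low-order bits choose the set
        tag = addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used tag
        lines[tag] = True
        return False


if __name__ == "__main__":
    trace = [1024, 2048, 1025, 2049, 1026, 2050] * 3
    cache = SetAssociativeCache(num_sets=1024, ways=2)
    results = [cache.access(a) for a in trace]
    print(results[-6:])
```

With two ways, both colliding addresses fit in the same set, so after the first pass every reference hits. Note that 1024 sets of 2 ways hold twice as many lines as a 1024-line direct-mapped cache.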
Things to Cache
• What we’ve got: basic speed/cost tradeoffs.
• Need to exploit a hierarchy of technologies
• Key: Locality. Look for “working set”, keep in fast memory.
• Transparency as a goal
• Transparent caches: hits, misses, hit/miss ratios
• Associativity: performance at a cost. Data points:
• Fully associative caches: no contention, prohibitive cost.
• Direct-mapped caches: mostly just fast RAM. Cheap, but has
contention problems.
• Compromise: set-associative cache. Modest parallelism handles
contention between a few overlapping “hot spots”, at modest cost.