Computer Systems
the impact of caches
University of Amsterdam
Arnoud Visser 1
Computer Systems – the impact of caches
Introduction
Different sorts of memory
• On-die: 0/1/10 cycles
• On-board: 100 cycles
• On-disk: 10,000 cycles
• Off-machine: 1,000,000 cycles
The CPU-Memory Gap
• The gap between disk, DRAM, SRAM, and CPU speeds
keeps increasing.
[Figure: disk seek time, DRAM access time, SRAM access time, and CPU cycle time in ns (log scale, 1 to 100,000,000) versus year, 1980 to 2000.]
Storage Trends
bigger, not faster
Disk
metric               1980    1985    1990    1995    2000    2000:1980
$/MB                 500     100     8       0.30    0.05    10,000
access (ms)          87      75      28      10      8       11
typical size (MB)    1       10      160     1,000   9,000   9,000

DRAM
metric               1980    1985    1990    1995    2000    2000:1980
$/MB                 8,000   880     100     30      1       8,000
access (ns)          375     200     100     70      60      6
typical size (MB)    0.064   0.256   4       16      64      1,000
(Culled from back issues of Byte and PC Magazine)
Processor trends
faster
SRAM
metric               1980    1985    1990    1995    2000    2000:1980
$/MB                 19,200  2,900   320     256     100     190
access (ns)          300     150     35      15      2       100
typical size (MB) 0.008 0.016 0.032
Processor
metric               1980    1985    1990    1995     2000    2000:1980
processor            8080    286     386     Pentium  P-III
clock rate (MHz)     1       6       20      150      750     750
cycle time (ns)      1,000   166     50      6        1.3     750
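The cycle-time row follows from the clock-rate row: one cycle lasts 1/f seconds, so a clock rate of f MHz gives a cycle time of 1000/f ns. A minimal sketch of that arithmetic; the function name cycle_time_ns is an assumption made for this illustration and does not come from the slides.

```c
#include <stdio.h>

/* Cycle time in nanoseconds for a clock rate given in MHz.
   One cycle lasts 1/f seconds; with f in MHz that is 1000/f ns. */
double cycle_time_ns(double clock_mhz)
{
    return 1000.0 / clock_mhz;
}

/* Print the cycle time for each clock rate listed in the table. */
void print_cycle_times(void)
{
    const double rates_mhz[] = {1, 6, 20, 150, 750};
    for (int i = 0; i < 5; i++)
        printf("%6.0f MHz -> %8.2f ns\n",
               rates_mhz[i], cycle_time_ns(rates_mhz[i]));
}
```

Calling print_cycle_times() lists the unrounded values, so the rounding used in the table becomes visible.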
Intel Processors Cache
SRAM cache sizes
Processor                 Year         L1         L2
486                       1989-1994    8K         -
Pentium                   1993         8K 8K      -
Pentium Pro               1995-1999    8K 8K      256K-1M
Pentium II                1997         16K 16K    512K ½
Celeron A                 1998         16K 16K    128K
Pentium III Coppermine    2000         16K 16K    256K
Pentium 4 Willamette      2000         12K 8K     256K
Pentium 4 Northwood       2002         12K 8K     512K
http://www.intel.com/pressroom/kits/quickreffam.htm
Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top;
larger, slower, and cheaper (per byte) storage devices sit at the bottom.
• L0: Registers
– CPU registers hold words retrieved from cache memory.
• L1: On-chip L1 cache (SRAM)
– L1 cache holds cache lines retrieved from the L2 cache.
• L2: Off-chip L2 cache (SRAM)
– L2 cache holds cache lines retrieved from memory.
• L3: Main memory (DRAM)
– Main memory holds disk blocks retrieved from local disks.
• L4: Local secondary storage (local disks)
– Local disks hold files retrieved from disks on remote network servers.
• L5: Remote secondary storage (distributed file systems, Web servers)
Pay the price
• To access large amounts of data in a
cost-effective manner, the bulk of the
data must be stored on disk
– SRAM: 4 MB for ~$500
– DRAM: 1 GB for ~$200
– Disk: 80 GB for ~$110
Locality
• Principle of Locality:
– Programs tend to reuse data and instructions near
those they have used recently, or that were recently
referenced themselves.
– Temporal locality: Recently referenced items are
likely to be referenced in the near future.
– Spatial locality: Items with nearby addresses tend
to be referenced close together in time.
Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
• Data
– Reference array elements in succession
(stride-1 reference pattern): Spatial locality
– Reference sum each iteration: Temporal locality
• Instructions
– Reference instructions in sequence: Spatial locality
– Cycle through loop repeatedly: Temporal locality
Power Programmer
• Claim: Being able to look at code and
get a qualitative sense of its locality is
a key skill for a professional
programmer.
• Good locality?
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    /* Inner loop walks along a row: consecutive addresses. */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Stride-N example
• Question: Does this function have
good locality?
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    /* Inner loop walks down a column: addresses N ints apart. */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
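To try the question out, the two traversal orders can be timed on the same array. The sketch below is an illustration under stated assumptions, not the measurement setup behind these slides: the dimensions M = N = 2048, the global array matrix, and the helpers fill_ones and time_sum are all made up for this example, and clock() only gives coarse CPU-time readings.

```c
#include <time.h>

#define M 2048   /* assumed dimensions: 2048 x 2048 ints = 16 MB,  */
#define N 2048   /* larger than the caches discussed in the slides */

static int matrix[M][N];

typedef int (*sum_fn)(int a[M][N]);

/* Row-wise traversal: stride-1 accesses. */
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: stride-N accesses. */
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Fill the array with ones so the expected sum is M * N. */
void fill_ones(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            matrix[i][j] = 1;
}

/* CPU time in seconds for one call of f on matrix; the sum is
   stored in *result so the call cannot be optimized away. */
double time_sum(sum_fn f, int *result)
{
    clock_t start = clock();
    *result = f(matrix);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

After fill_ones(), calling time_sum(sumarrayrows, &r) and time_sum(sumarraycols, &c) returns the same sum for both; on most machines the column-wise version is expected to take longer once the array no longer fits in the cache, but the exact ratio depends on the hardware and compiler flags.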
Matrix M=2,N=3
int sumarrayrows()
Address         0     4     8     12    16    20
Contents        a00   a01   a02   a10   a11   a12
Access order    1     2     3     4     5     6
int sumarraycols()
Address         0     4     8     12    16    20
Contents        a00   a01   a02   a10   a11   a12
Access order    1     3     5     2     4     6
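The two access-order rows can be reproduced mechanically from the row-major layout, where a[i][j] sits at byte offset 4 * (i * N + j). A small sketch, assuming 4-byte ints as in the table; the helper names offset_of, row_order, and col_order are invented for this illustration.

```c
#include <stddef.h>

#define M 2
#define N 3
#define WORD 4   /* assumed size of an int in bytes, as in the table */

/* Byte offset of a[i][j] in a row-major int a[M][N]. */
size_t offset_of(int i, int j)
{
    return (size_t)(i * N + j) * WORD;
}

/* order[k] receives the 1-based step at which the element stored at
   address WORD * k is visited by the row-wise loop nest. */
void row_order(int order[M * N])
{
    int step = 1;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            order[i * N + j] = step++;
}

/* Same, for the column-wise loop nest (j outer, i inner). */
void col_order(int order[M * N])
{
    int step = 1;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            order[i * N + j] = step++;
}
```

Read in address order, row_order yields 1 2 3 4 5 6 and col_order yields 1 3 5 2 4 6, matching the table.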
Expect: Stride-1 is better!
[Figure: read throughput (MB/s, 0 to 600) versus stride (1 to 16 words) for a 32-byte working set.]
– int A[2][4]
Reality:
small matrices fit in cache
[Figure: throughput (MB/s, 0 to 5000) versus stride (1 to 16 words) for a 4 KB working set.]
– int A[32][32]
Reality:
the performance drop from L1 to L2 cache
is not dramatic
[Figure: throughput (MB/s, 0 to 6000) versus stride (1 to 16 words) for a 128 KB working set.]
– int A[180][180]
Reality:
the penalty only becomes visible
when DRAM is accessed
[Figure: throughput (MB/s, 0 to 1800) versus stride (1 to 16 words) for a 1 MB working set.]
– int A[512][512]
Memory Mountain
• Pentium 4, 2.4 GHz
• 8 KB L1 d-cache, 12 KB L1 i-cache, 512 KB L2 cache
[Figure: read throughput (MB/s, 0 to 5000) as a function of stride (s1 to s15, in words) and working set size (2k to 8m bytes). Ridges of temporal locality mark the L1, L2, and Mem regions; slopes of spatial locality run along the stride axis.]
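A single point of such a throughput surface can be measured by reading a working set of a given size with a given stride and dividing the bytes read by the elapsed time. The sketch below shows the idea only; it is not the program that produced the plots in these slides. The names data, read_strided, and read_throughput_mbs, the 8 MB buffer, and the 0.05 s minimum measuring time are assumptions, and clock() is a coarse timer.

```c
#include <time.h>

#define MAXELEMS (1 << 21)          /* assumed buffer: 2M ints = 8 MB */

static int data[MAXELEMS];
static volatile int sink;           /* keeps the results alive */

/* Read the first elems ints of data with the given stride (in words).
   The volatile pointer forces every read to really happen. */
static int read_strided(int elems, int stride)
{
    const volatile int *p = data;
    int sum = 0;
    for (int i = 0; i < elems; i += stride)
        sum += p[i];
    return sum;
}

/* Read throughput in MB/s (1 MB = 10^6 bytes) for one combination of
   working set size (bytes, at most 8 MB) and stride (words). */
double read_throughput_mbs(int size_bytes, int stride)
{
    int elems = size_bytes / (int)sizeof(int);
    long reads_per_pass = (elems + stride - 1) / stride;

    sink = read_strided(elems, stride);         /* warm up the cache */

    /* Double the repetition count until the run is long enough
       for clock() to measure it. */
    for (long reps = 1; ; reps *= 2) {
        clock_t start = clock();
        for (long r = 0; r < reps; r++)
            sink = read_strided(elems, stride);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (secs >= 0.05) {
            double bytes = (double)reps * (double)reads_per_pass * sizeof(int);
            return bytes / secs / 1e6;
        }
    }
}
```

Calling read_throughput_mbs for working set sizes from 2 KB to 8 MB and strides from 1 to 16 gives the grid of values from which a surface like the one above can be drawn; the absolute numbers depend on the machine and on compiler flags.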
Summary
• As long as your data fits in the cache, and
your program shows good locality, good
performance is guaranteed.
Assignment
• Practice Problem 6.9 (p. 624):
'Order three functions with respect to the spatial locality
enjoyed by each.'
• Practice Problem 6.22 (p. 659):
'Estimate the time, in CPU cycles, to read an 8-byte
word from the L1 d-cache of an i7 processor.'
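For the second problem, a read throughput taken from a memory mountain can be converted into cycles per word: the time per word is word size divided by throughput, and multiplying by the clock rate gives cycles. A hedged sketch of that conversion; the function name cycles_per_word is an assumption, and the actual throughput and clock rate still have to be read from the figure in the book.

```c
/* Cycles needed to read one word, given a read throughput in MB/s
   (1 MB = 10^6 bytes), a clock rate in MHz, and the word size in bytes.
   time per word = word_bytes / (throughput_mbs * 1e6) seconds
   cycles        = time per word * (clock_mhz * 1e6)              */
double cycles_per_word(double throughput_mbs, double clock_mhz,
                       double word_bytes)
{
    return word_bytes * clock_mhz / throughput_mbs;
}
```

With made-up inputs of 4000 MB/s at 2000 MHz, an 8-byte word costs 8 * 2000 / 4000 = 4 cycles; these numbers are purely illustrative and are not the answer to the problem.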