
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
MICRO-39, Dec. 13, 2006

Multicore distributed L2 caches

L2 caches are typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”

Distributed L2 caches + switched NoC → NUCA

Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching

(Figure: tiled multicore; each tile couples a processor core, a local L2 cache slice, and a router)

Private caching

1. L1 miss
2. L2 access
   • Hit
   • Miss
3. Access directory
   • A copy on chip
   • Global miss

+ Short hit latency (always local)
- High on-chip miss rate
- Long miss resolution time
- Complex coherence enforcement

Shared caching

1. L1 miss
2. L2 access
   • Hit
   • Miss

+ Low on-chip miss rate
+ Straightforward data location
+ Simple coherence (no replication)
- Long average hit latency

Our work

Placing “flexibility” as the top design consideration

OS-level data to L2 cache mapping
• Simple hardware based on shared caching
• Efficient mapping maintenance at page granularity

Demonstrating the impact using different policies

Talk roadmap

Data mapping, a key property

Flexible page-level mapping
• Goals
• Architectural support
• OS design issues

Management policies

Conclusion and future work

Data mapping, the key

Data mapping = deciding data location (i.e., which cache slice)

Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control

Shared caching
• Data mapping determined by address: slice number = (block address) % (Nslice)
• Mapping is static
• Cache block installation at miss time
• Mapping granularity = block
• No explicit control (run-time can impact location within slice)
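
To make the two mappings concrete, here is a minimal C sketch of block-granularity (shared caching) versus page-granularity slice selection; the slice count, block size, and page size are illustrative values, not parameters taken from the talk.

#include <stdint.h>
#include <stdio.h>

#define NSLICE     16   /* number of L2 cache slices (illustrative) */
#define BLOCK_BITS  6   /* 64-byte cache blocks (illustrative)      */
#define PAGE_BITS  12   /* 4kB OS pages (illustrative)              */

/* Shared caching: blocks are interleaved across all slices by address. */
static unsigned slice_by_block(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_BITS) % NSLICE);
}

/* Page-granularity mapping: all blocks of a page share one slice,
 * here chosen by simple bit selection on the page number.          */
static unsigned slice_by_page(uint64_t addr)
{
    return (unsigned)((addr >> PAGE_BITS) % NSLICE);
}

int main(void)
{
    uint64_t addr = 0x12345678ULL;
    printf("block granularity -> slice %u\n", slice_by_block(addr));
    printf("page granularity  -> slice %u\n", slice_by_page(addr));
    return 0;
}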

Changing cache mapping granularity

Memory blocks → Memory pages
• Miss rate?
• Impact on existing techniques (e.g., prefetching)?
• Latency?

Observation: page-level mapping

(Figure: pages of Program 1 and Program 2 steered to different cache slices by OS page allocation)

• Mapping data to different $$ (cache slices) is feasible
• Key: OS page allocation
• Flexible, driven by OS policies

Goal 1: performance management

• Proximity-aware data mapping

Goal 2: power management

(Figure: pages steered away from unused cache slices, which show zero usage and can be turned off)

• Usage-aware cache shut-off

Goal 3: reliability management

(Figure: faulty cache slices, marked X, receive no page mappings)

• On-demand cache isolation

Goal 4: QoS management

• Contract-based cache allocation

Architectural support

On an L1 miss, the data address [other bits | page_num | page offset] determines the target slice: page_num → slice_num.

Method 1: “bit selection”
• slice_num = (page_num) % (Nslice)

Method 2: “region table”
• regionx_low ≤ page_num ≤ regionx_high → slice_numx
• Small table of (region_low, region_high, slice_num) entries

Method 3: “page table (TLB)”
• page_num ↔ slice_num
• TLB entry carries (vpage_num, ppage_num, slice_num)

• Simple hardware support is enough
• A combined scheme is feasible
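
A minimal C sketch of the three lookup mechanisms; the struct layouts and linear searches are illustrative stand-ins for the small hardware tables the slide describes.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define NSLICE 16   /* number of L2 slices (illustrative) */

/* Method 1: bit selection. */
static unsigned slice_bit_selection(uint64_t page_num)
{
    return (unsigned)(page_num % NSLICE);
}

/* Method 2: region table -- a few (low, high, slice) entries;
 * fall back to bit selection when no region matches.          */
struct region { uint64_t low, high; unsigned slice; };

static unsigned slice_region_table(uint64_t page_num,
                                   const struct region *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].low <= page_num && page_num <= tab[i].high)
            return tab[i].slice;
    return slice_bit_selection(page_num);
}

/* Method 3: page table (TLB) -- the slice number is carried with the
 * virtual-to-physical translation, giving the OS per-page control.  */
struct tlb_entry { uint64_t vpage, ppage; unsigned slice; };

static int slice_tlb(uint64_t vpage, const struct tlb_entry *tlb, size_t n,
                     unsigned *slice_out)
{
    for (size_t i = 0; i < n; i++)
        if (tlb[i].vpage == vpage) { *slice_out = tlb[i].slice; return 1; }
    return 0;   /* TLB miss: hardware would walk the page table */
}

int main(void)
{
    struct region    rt[]  = { { 0x1000, 0x1fff, 3 } };
    struct tlb_entry tlb[] = { { 0x42, 0x9042, 7 } };
    unsigned s;

    printf("bit selection: %u\n", slice_bit_selection(0x1234));
    printf("region table:  %u\n", slice_region_table(0x1234, rt, 1));
    if (slice_tlb(0x42, tlb, 1, &s))
        printf("TLB:           %u\n", s);
    return 0;
}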

Some OS design issues

Congruence group CG(i)
• Set of physical pages mapped to slice i
• A free list for each i → multiple free lists

On each page allocation, consider
• Data proximity
• Cache pressure
• (e.g.) profitability function P = f(M, L, P, Q, C)
  M: miss rates, L: network link status, P: current page allocation status, Q: QoS requirements, C: cache configuration

Impact on process scheduling

Leverage existing frameworks
• Page coloring – multiple free lists
• NUMA OS – process scheduling & page allocation
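
A minimal C sketch of the allocation step these bullets describe: one free list per congruence group, with a profitability function consulted on every page allocation. The weighting inside profitability() and the 4x4-mesh distance are illustrative assumptions; only the inputs (M, L, P, Q, C) come from the slide.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define NSLICE 16   /* cache configuration C: 16 slices on a 4x4 mesh (assumed) */

/* One free list per congruence group CG(i): physical pages that map to slice i. */
struct free_list { uint64_t *pages; size_t count; };
static struct free_list cg[NSLICE];

/* Per-slice inputs to the profitability function P = f(M, L, P, Q, C). */
struct slice_state {
    double miss_rate;    /* M: miss rate observed at slice i             */
    double link_load;    /* L: network link status toward slice i        */
    double pressure;     /* P: current page allocation status (pressure) */
    double qos_penalty;  /* Q: penalty if a QoS contract would be hurt   */
};

static int hops(int a, int b)   /* Manhattan distance on the assumed 4x4 mesh */
{
    return abs(a % 4 - b % 4) + abs(a / 4 - b / 4);
}

/* Illustrative profitability: prefer nearby, lightly loaded, low-pressure slices. */
static double profitability(int core, int slice, const struct slice_state *s)
{
    return -(hops(core, slice) + 4.0 * s->miss_rate + 2.0 * s->link_load
             + s->pressure + s->qos_penalty);
}

/* On each page allocation: pick the most profitable congruence group that
 * still has free pages, then hand out a page from its free list.         */
static int pick_slice(int core, const struct slice_state st[NSLICE])
{
    int best = -1;
    double best_p = -1e300;
    for (int i = 0; i < NSLICE; i++) {
        if (cg[i].count == 0) continue;
        double p = profitability(core, i, &st[i]);
        if (p > best_p) { best_p = p; best = i; }
    }
    return best;
}

int main(void)
{
    static uint64_t pool[NSLICE][4];
    struct slice_state st[NSLICE] = { { 0 } };
    for (int i = 0; i < NSLICE; i++) { cg[i].pages = pool[i]; cg[i].count = 4; }
    printf("core 5 allocates from CG(%d)\n", pick_slice(5, st));
    return 0;
}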

Working example

(Figure: two programs on a 16-tile chip; per-slice profitability values such as P(4) = 0.95, P(5) = 0.8, and P(6) = 0.9 guide where each program's pages are placed)

• Static vs. dynamic mapping
• Proper run-time monitoring or profile information needed

Page mapping policies


Simulating private caching

For a page requested from a program running on core i, map the page to cache slice i.

(Chart: L2 cache latency in cycles, private caching vs. OS-based, for SPEC2k INT and FP at 128kB, 256kB, and 512kB L2 cache slice sizes)

• Simulating private caching is simple
• Similar or better performance

Simulating “large” private caching

For a page requested from a program running on core i, map the page to cache slice i; also spread pages.

(Chart: relative performance (1/time) of private vs. OS-based mapping for SPEC2k INT (gcc, parser, eon, twolf) and FP (wupwise, galgel, ammp, sixtrack); 512kB cache slice)

Simulating shared caching

For a page requested from a program running on core i, map the page to any of the cache slices (round-robin, random, …).

(Chart: L2 cache latency in cycles, OS-based vs. shared caching, for SPEC2k INT and FP at 128kB, 256kB, and 512kB L2 cache slice sizes)

• Simulating shared caching is simple
• Mostly similar behavior/performance
• Pathological cases (e.g., applu)

Simulating clustered caching

For a page requested from a program running on a core of group j, map the page to any cache slice within the group (round-robin, random, …); a sketch of all three simulated policies follows this slide.

(Chart: relative performance (1/time) of private, shared, and OS-based mapping for FFT, LU, RADIX, and OCEAN; 4 cores used; 512kB cache slice)

• Simulating clustered caching is simple
• Lower miss traffic than private
• Lower on-chip traffic than shared
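
A minimal C sketch of how the three simulated policies differ only in the set of candidate slices for a requested page; the round-robin cursor and 4-slice groups are illustrative choices (the slides also allow random spreading).

#include <stdio.h>

#define NSLICE     16
#define GROUP_SIZE  4   /* 4-slice clusters (illustrative) */

static unsigned rr;     /* round-robin cursor */

/* "Private": a page requested by core i always maps to slice i. */
static int slice_private(int core) { return core; }

/* "Shared": spread pages over all slices. */
static int slice_shared(int core)
{
    (void)core;
    return (int)(rr++ % NSLICE);
}

/* "Clustered": spread pages over the slices of the requesting core's group. */
static int slice_clustered(int core)
{
    int base = (core / GROUP_SIZE) * GROUP_SIZE;
    return base + (int)(rr++ % GROUP_SIZE);
}

int main(void)
{
    for (int core = 0; core < 3; core++)
        printf("core %d: private=%d shared=%d clustered=%d\n", core,
               slice_private(core), slice_shared(core), slice_clustered(core));
    return 0;
}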

Profile-driven page mapping

Using profiling, collect
• Inter-page conflict information
• Per-page access count information

Page mapping cost function (per slice)
• Given the program location, the page to map, and the previously mapped pages:
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
• weight is a knob: a larger value puts more weight on proximity than on miss rate
• Optimizes both miss rate and data proximity

• Theoretically important to understand limits
• Can be practically important, too
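
A minimal C sketch of the per-slice cost function above; the profile data structure and the mesh-based latency estimate are illustrative assumptions.

#include <stdlib.h>
#include <stdio.h>

#define NSLICE 16

/* Profile inputs for one page that is about to be mapped. */
struct page_profile {
    long accesses;            /* per-page access count (from the profile)       */
    long conflicts[NSLICE];   /* conflicts with pages already mapped to a slice */
};

static int hops(int a, int b)   /* Manhattan distance on an assumed 4x4 mesh */
{
    return abs(a % 4 - b % 4) + abs(a / 4 - b / 4);
}

/* cost(slice) = (# conflicts x miss penalty) + weight x (# accesses x latency);
 * a larger weight puts more emphasis on proximity than on miss rate.           */
static int best_slice(int core, const struct page_profile *p,
                      double miss_penalty, double weight)
{
    int best = 0;
    double best_cost = 1e300;
    for (int s = 0; s < NSLICE; s++) {
        double latency = 10.0 + 3.0 * hops(core, s);   /* illustrative cycles */
        double cost = p->conflicts[s] * miss_penalty
                      + weight * (double)p->accesses * latency;
        if (cost < best_cost) { best_cost = cost; best = s; }
    }
    return best;
}

int main(void)
{
    struct page_profile p = { .accesses = 1000 };
    p.conflicts[0] = 50;    /* the requester's local slice is already crowded */
    printf("map the page to slice %d\n", best_slice(0, &p, 200.0, 1.0));
    return 0;
}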

Profile-driven page mapping, cont’d

(Chart: breakdown of L2 cache accesses into local hits, remote on-chip hits, and misses as the weight knob varies, for gcc, gzip, gap, mgrid, mesa, eon, equake, vortex, bzip2, wupwise, art, mcf, parser, ammp, crafty, twolf, and vpr; 256kB L2 cache slice)

Profile-driven page mapping, cont’d

(Chart: number of pages mapped to each slice for GCC, with the program location marked; 256kB L2 cache slice)

Profile-driven page mapping, cont’d

(Chart: performance improvement over shared caching for gap, gcc, gzip, mgrid, mesa, eon, vortex, wupwise, equake, bzip2, art, parser, mcf, ammp, crafty, twolf, and vpr, ranging from -1% to 108%; 256kB L2 cache slice)

• Room for performance improvement
• Best of the two or better than the two
• Dynamic mapping schemes desired

Isolating faulty caches

When there are faulty cache slices, avoid mapping pages to them.

(Chart: relative L2 cache latency of shared caching vs. OS-based mapping as the number of cache slice deletions grows from 0 to 8; 4 cores running a multiprogrammed workload; 512kB cache slice)
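
A minimal C sketch of the idea: the allocator drops deleted (faulty) slices from the candidate set before applying its normal mapping policy. The faulty[] bitmap and the round-robin fallback are illustrative.

#include <stdio.h>

#define NSLICE 16

static int faulty[NSLICE];   /* set when a slice is deleted; illustrative */
static unsigned rr;          /* round-robin cursor */

/* Shared-style round-robin page mapping that skips deleted (faulty) slices. */
static int next_healthy_slice(void)
{
    for (int tries = 0; tries < NSLICE; tries++) {
        int s = (int)(rr++ % NSLICE);
        if (!faulty[s])
            return s;
    }
    return -1;   /* every slice has been deleted */
}

int main(void)
{
    faulty[3] = faulty[7] = 1;   /* two cache slice deletions */
    for (int i = 0; i < 6; i++)
        printf("page %d -> slice %d\n", i, next_healthy_slice());
    return 0;
}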

Conclusion

“Flexibility” will become important in future multicores
• Many shared resources
• Allows us to implement high-level policies

OS-level page-granularity data-to-slice mapping
• Low hardware overhead
• Flexible

Several management policies studied
• Mimicking private/shared/clustered caching is straightforward
• Performance-improving schemes

Future work

Dynamic mapping schemes
• Performance
• Power

Performance monitoring techniques
• Hardware-based
• Software-based

Data migration and replication support

Thank you!

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
MICRO-39, Dec. 13, 2006

Multicores are here

• IBM Power5 (2004)
• AMD Opteron dual-core (2005)
• Sun Microsystems T1, 8 cores (2005)
• Intel Core2 Duo (2006)
• Quad cores (2007)
• Intel 80 cores? (2010?)

A multicore outlook

???


A processor model

(Figure: tiled chip; each tile couples a processor core, a local L2 cache slice, and a router)

Private L1 I/D-$$
• 8kB~32kB

Local unified L2 $$
• 128kB~512kB
• 8~18 cycles

Switched network
• 2~4 cycles/switch

Distributed directory
• Scatter hotspots

Many cores (e.g., 16)
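
For intuition, a rough C sketch of uncontended L2 access latency under this model; the round-trip formula is our assumption, while the per-component cycle counts are drawn from the ranges on the slide.

#include <stdlib.h>
#include <stdio.h>

/* Slide parameters: L2 slice access 8~18 cycles, switch traversal 2~4 cycles. */
#define L2_CYCLES     12   /* a value in the 8~18 cycle range */
#define SWITCH_CYCLES  3   /* a value in the 2~4 cycle range  */

static int hops(int a, int b)   /* Manhattan distance on the 16-tile 4x4 mesh */
{
    return abs(a % 4 - b % 4) + abs(a / 4 - b / 4);
}

/* Assumed uncontended round trip: the L2 access plus the network both ways. */
static int l2_latency(int core, int slice)
{
    return L2_CYCLES + 2 * SWITCH_CYCLES * hops(core, slice);
}

int main(void)
{
    printf("local slice:      %d cycles\n", l2_latency(0, 0));
    printf("far-corner slice: %d cycles\n", l2_latency(0, 15));
    return 0;
}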

Other approaches

Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA 2005]
• “Flexible CMP cache sharing” [Huh et al., ICS 2004]
• “Flexible bank mapping” [Liu et al., HPCA 2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA 2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA 2006]
• “CMP-NuRAPID” [Chishti et al., ISCA 2005]


								