Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture
June 7th, 2005
Overview of Idea
Coarse-Grain Coherence Tracking:
1. Monitors coherence status of memory at a
multi-line granularity
2. Uses the coarse-grain information to identify
requests that don’t need a coherence
broadcast
3. Sends these requests directly to memory
June 7, 2005 ISCA 2005 2
Problem
[Diagram: processors (P) with caches ($), NC, and memory controllers (MC) attached to DRAM, connected by a Broadcast Network and a Data Network]
Snoop-based systems support a limited
number of processors
– Limited broadcast bandwidth
– Increasing memory latency
Opportunity
• Some data requests don’t need a broadcast
– Requests for non-shared data
– Fetches of unmodified instructions
– Write-backs
• Some non-data requests don’t need to leave the
processor
– Requests to upgrade a copy that is not shared
– Requests to flush copies of a line that is not cached elsewhere
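Unnecessary Broadcasts
[Figure: stacked bar chart of unnecessary broadcasts as a percentage of requests (0% to 100%), broken down by request type (Read, I-Fetch, Write, DCB, Write-back), for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean. Bar labels: 93%, 65%, 67%, 62%.]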
Our Approach
• Identify requests that don’t need a broadcast
• Send data requests directly to memory
– Reduce broadcast traffic
– Reduce latency in some systems
• Avoid sending non-data requests externally
– Further reduce broadcast traffic
– Reduce latency
Coarse-Grain Coherence Tracking
• Memory is divided into coarse-grain regions
– Aligned, power-of-two multiple of cache line size
– Can range from two lines to a physical page
• A cache-like structure is added to each
processor for monitoring coherence at the
granularity of regions
– Region Coherence Array (RCA)
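Because regions are aligned and a power-of-two multiple of the line size, mapping an address to its region is a shift. A minimal sketch of that arithmetic follows; the 64-byte line size and the helper names are assumptions for illustration, and 512 bytes is one of the region sizes evaluated later in the talk.

```python
LINE_BYTES = 64      # assumed cache line size (not stated on the slides)
REGION_BYTES = 512   # one of the region sizes evaluated in the talk


def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set."""
    return n > 0 and n & (n - 1) == 0


def region_of(addr: int, region_bytes: int = REGION_BYTES) -> int:
    """Region number of a byte address: drop the region-offset bits."""
    assert is_power_of_two(region_bytes)
    return addr >> (region_bytes.bit_length() - 1)


def line_of(addr: int, line_bytes: int = LINE_BYTES) -> int:
    """Cache line number of a byte address: drop the line-offset bits."""
    assert is_power_of_two(line_bytes)
    return addr >> (line_bytes.bit_length() - 1)


def lines_per_region(region_bytes: int = REGION_BYTES,
                     line_bytes: int = LINE_BYTES) -> int:
    """Number of cache lines tracked by one region entry."""
    return region_bytes // line_bytes
```

With these assumed sizes, each region covers eight lines, and addresses 0x1000 through 0x11FF all fall in the same region.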
Coarse-Grain Coherence Tracking
• Each entry has an address tag, state, and
count of lines cached by the processor
• The state indicates if the processor and / or
other processors are sharing / modifying lines
in the region
• On cache misses, the region state is read to
determine if a broadcast is necessary
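The entry layout and the miss-time check can be sketched as below. This is a simplified model, not the protocol from the talk: it collapses the region state to three hypothetical values (INVALID, EXCLUSIVE, SHARED), and the names `RegionEntry` and `needs_broadcast` are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RegionState(Enum):
    # Simplified, hypothetical states; the talk's protocol has more.
    INVALID = "invalid"      # nothing known about the region
    EXCLUSIVE = "exclusive"  # no other processor caches lines of the region
    SHARED = "shared"        # other processors may cache lines of the region


@dataclass
class RegionEntry:
    tag: int                 # region address tag
    state: RegionState       # coarse-grain coherence state
    line_count: int = 0      # lines of this region cached by the processor


def needs_broadcast(rca: dict, region: int) -> bool:
    """On a cache miss, read the region state to decide whether to broadcast."""
    entry = rca.get(region)
    if entry is None or entry.state is RegionState.INVALID:
        return True          # no region information: must broadcast
    return entry.state is not RegionState.EXCLUSIVE
```

Only a miss to a region held exclusively skips the broadcast in this sketch; every other case falls back to conventional snooping.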
Coarse-Grain Coherence Tracking
• On snoops, the region state provides a
response for the region
– Piggy-backed onto the conventional response
– Used to update other processors’ region state
• RCA maintains inclusion over caches
– When regions are evicted, their lines are evicted
– RCA must respond correctly if region’s lines cached
– Replacement algorithm uses line count
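A rough sketch of how inclusion and the line count could interact on replacement: pick the region with the fewest cached lines, then evict whatever lines it still covers. The data layout (a dict of line counts and a set of line numbers) and the function names are assumptions for illustration only.

```python
LINES_PER_REGION = 8  # assumed: 512-byte regions of 64-byte lines


def choose_victim(rca_set: dict) -> int:
    """Pick the region with the fewest cached lines.

    rca_set maps region number -> count of lines cached by this processor.
    """
    return min(rca_set, key=rca_set.get)


def evict_region(region: int, rca_set: dict, cached_lines: set) -> set:
    """Evict a region and, to keep inclusion, every cached line inside it."""
    evicted = {line for line in cached_lines
               if line // LINES_PER_REGION == region}
    cached_lines -= evicted   # lines leave the cache together with the region
    del rca_set[region]
    return evicted
```

Preferring regions with a count of zero means the common replacement costs no cache lines at all.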
Example: Conventional Snooping
• P0 loads 10000₂: MISS in $0 (tag 0010 allocated, state Pending)
• Snoop performed: "Read: P0, 10000₂" is broadcast on the network
• Response sent: Invalid
• Data transfer: line 0010 in $0 becomes Exclusive
[Diagram: processors P0 and P1 with caches $0 and $1, memories M0 and M1, and the network]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 10000₂: MISS (line 0010 and region 001 go to Pending)
• Snoop performed: "Read: P0, 10000₂" is broadcast on the network
• Response sent: "Invalid, Region Not Shared"
• Data transfer: line 0010 in $0 becomes Exclusive and region 001 becomes DI
P0 has exclusive access to the region
[Diagram: processors P0 and P1, each with a cache ($0, $1) and an RCA, memories M0 and M1, and the network]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 11000₂: MISS, Region Hit (line 0011 goes to Pending; region 001 is DI)
Exclusive region state, broadcast unnecessary
• Direct request sent: "Read: P0, 11000₂" goes to memory
• Data transfer: line 0011 in $0 becomes Exclusive
[Diagram: processors P0 and P1, each with a cache ($0, $1) and an RCA, memories M0 and M1, and the network]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P1 stores 10000₂: MISS in $1 (line 0010 and region 001 go to Pending)
• Snoop performed: "RFO: P1, 10000₂" is broadcast and hits in P0's cache
• Response sent: "Owned, Region Owned"
• Data transfer: line 0010 becomes Invalid in $0 and Modified in $1; region 001 becomes DD
Region not exclusive anymore
[Diagram: processors P0 and P1, each with a cache ($0, $1) and an RCA, memories M0 and M1, and the network]
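The three example accesses (P0 loads 10000₂, P0 loads 11000₂, P1 stores 10000₂) can be replayed with a toy model. This is a sketch under stated assumptions, not the protocol itself: it tracks only "exclusive or not" per region, it assumes 8-byte lines so that the line and region tags match the ones on the slides, and the `Processor` class and `access` function are hypothetical names.

```python
LINE_SHIFT = 3    # assumption: 8-byte lines, so 0b10000 has line tag 0b0010
REGION_SHIFT = 4  # two lines per region: 0b10000 and 0b11000 share region 0b001


class Processor:
    def __init__(self, name):
        self.name = name
        self.lines = set()   # line tags currently cached
        self.regions = {}    # region tag -> True if no other processor uses it


def access(requester, others, addr, is_store=False):
    """Return 'direct' if the region state lets the request skip the broadcast."""
    line, region = addr >> LINE_SHIFT, addr >> REGION_SHIFT
    if requester.regions.get(region):
        requester.lines.add(line)
        return "direct"
    shared = False
    for other in others:              # broadcast: every other processor snoops
        if region in other.regions:
            shared = True             # region response: someone else uses it
            other.regions[region] = False
        if is_store:
            other.lines.discard(line)  # a store invalidates remote copies
    requester.regions[region] = not shared
    requester.lines.add(line)
    return "broadcast"


p0, p1 = Processor("P0"), Processor("P1")
trace = [
    access(p0, [p1], 0b10000),                 # first touch of the region
    access(p0, [p1], 0b11000),                 # same region, now exclusive
    access(p1, [p0], 0b10000, is_store=True),  # P1 takes the line from P0
]
print(trace)  # ['broadcast', 'direct', 'broadcast']
```

After the third access neither processor holds the region exclusively, so later misses to it by either one would be broadcast again.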
Overhead
• Storage space needed for RCA
– 3-6% storage overhead for cache
• Two bits needed in snoop response for region
response
• Path to memory needed to avoid broadcasts
– Simple with on-chip memory controllers
– May leverage data network
Simulator
PHARMsim:
• Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified L2 caches
• Separate address / data networks, similar to Fireplane
• Region Coherence Array with the same number of sets and associativity as the L2
Workloads
• Scientific
– Ocean, Raytrace, Barnes
• Multiprogrammed
– SPECint2000_rate
• Commercial
– TPC-W, TPC-B, TPC-H, SPECweb99,
SPECjbb2000
Broadcasts Avoided
[Figure: stacked bar chart of broadcasts avoided as a percentage of requests (0% to 100%), broken down by request type (Read, I-Fetch, Write, DCB, Write-back), for 256B, 512B, and 1KB regions and an oracle, across the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean. Bar labels: 67%, 56%.]
Snoop Traffic Reduction – Peak
[Figure: bar chart of peak broadcasts per 100K cycles (0 to 8000) for the Scientific, Multiprogrammed, and Commercial workloads, comparing peak traffic with peak traffic using 512B regions. Bar labels: 64%, 51%, 38%.]
Snoop Traffic Reduction – Average
[Figure: bar chart of average broadcasts per 100K cycles (0 to 4000) for the Scientific, Multiprogrammed, and Commercial workloads, comparing average traffic with average traffic using 512B regions. Bar labels: 74%, 47%, 86%.]
Execution Time
[Figure: bar chart of normalized execution time (0.0 to 1.0) for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean, comparing the baseline with 256B, 512B, and 1KB regions. Bar label: 91.2%.]
Remaining Opportunity
• With 512B regions, ~10% of requests are
broadcast unnecessarily
• A third of the 10% are region false sharing
• Half of the 10% miss in RCA
– Potential for prefetching
Inclusion Overhead
[Figure: bar chart of L2 miss ratio (0% to 100%) for the Scientific, Multiprogrammed, and Commercial workloads, comparing the baseline miss rate with the miss rate using 512B regions. Bar labels: +0.23%, +0.56%, +0.04%.]
– Regions with no lines cached are replaced first
Conclusion
Coarse-Grain Coherence Tracking:
• Reduces broadcast traffic
– Most data requests sent directly to memory
• Reduces latency
– Many requests not sent to central arbitration point
– Many non-data requests not sent externally
• Improves scalability and performance
The End
Inclusion Evictions
[Figure: stacked bar chart of region evictions (0% to 100%) broken down by the number of lines evicted with the region (0 lines, 1 line, 2 lines), for the Scientific, Multiprogrammed, and Commercial workloads.]
Ordering
• Ordering point is now the Region Coherence Array
– A direct request is ordered once it accesses the RCA
• Direct requests are serialized with respect to snoop requests
– A direct request occurs either before or after a snoop
– All must appear to access and update RCA atomically
• No two processors can have exclusive access to a
region at the same time (no races)
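The ordering argument can be illustrated with a software analogy in which a lock stands in for the atomic RCA access. This is only a sketch: the class and method names are hypothetical, and real hardware would enforce atomicity in the array's pipeline rather than with a mutex.

```python
import threading


class RegionCoherenceArray:
    """Toy RCA in which every access and update happens under one lock."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for atomic RCA access
        self._exclusive = set()         # regions this processor holds exclusively
        self.log = []                   # order in which requests were serialized

    def grant_exclusive(self, region):
        with self._lock:
            self._exclusive.add(region)

    def direct_request(self, region):
        """Ordered at the moment it reads the RCA; False means 'broadcast instead'."""
        with self._lock:
            if region in self._exclusive:
                self.log.append(("direct", region))
                return True
            return False

    def snoop(self, region):
        """An external request for the region removes exclusive access."""
        with self._lock:
            self._exclusive.discard(region)
            self.log.append(("snoop", region))
```

Because both paths take the same lock, a direct request either sees the region as exclusive (and is ordered before the snoop) or sees it downgraded (and must be broadcast).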
Comparison to RegionScout
                                       CGCT                       RegionScout
Optimization                           Latency                    Power
Avoids broadcast for non-shared data   Yes                        Yes
Avoids broadcast for clean data        Yes                        No
Avoids tag lookups on snoops           No                         Yes (like Jetty)
Region state storage                   Inclusive cache            Hash table, small cache
Region state transfer                  2 bits in snoop response   1 bit in snoop response
Region protocol                        7 states                   Effectively 4 states
Execution Time
[Figure: bar chart of normalized execution time (0.0 to 1.0) for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean, comparing 512B regions with 512B regions using half the number of sets. Bar labels: 92.2%, 91.2%.]