

Portable, mostly-concurrent, mostly-copying GC for multi-processors
Tony Hosking
Secure Software Systems Lab
Purdue University
Platform assumptions
• Symmetric multi-processor (SMP/CMP)
• Multiple mutator threads
• (Large heaps)
Desirable properties
• Maximize throughput
• Minimize collector pauses
• Scalability
Exploiting parallelism
• Avoid contention
• (Mostly-)Concurrent allocation
• (Mostly-)Concurrent collection
Concurrent allocation
• Use thread-private allocation “pages”
• Threads contend for free pages
• Each thread allocates from its own page
• multiple small objects per page, or
• multiple pages per large object
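The allocation scheme above can be sketched in C as follows. This is a minimal illustration, not the Modula-3 run-time's code: the fixed-size page pool, the single `pool_lock` mutex, and the names `claim_page` and `alloc_small` are all hypothetical. Threads synchronize only when claiming a fresh page. After that, every small allocation is an unsynchronized bump of a thread-local pointer. Large objects spanning several pages are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 64

/* Hypothetical global page pool: the only shared, lock-protected state. */
static _Alignas(16) unsigned char heap[NUM_PAGES][PAGE_SIZE];
static size_t next_free_page = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread's private allocation window into its current page. */
static _Thread_local unsigned char *alloc_ptr = NULL;
static _Thread_local unsigned char *alloc_limit = NULL;

/* Slow path: threads contend for free pages only here, once per page. */
static unsigned char *claim_page(void) {
    unsigned char *page = NULL;
    pthread_mutex_lock(&pool_lock);
    if (next_free_page < NUM_PAGES)
        page = heap[next_free_page++];
    pthread_mutex_unlock(&pool_lock);
    return page; /* NULL: pool exhausted (a real system would collect here) */
}

/* Fast path: bump-allocate a small object from the thread's own page. */
static void *alloc_small(size_t size) {
    assert(size > 0 && size <= PAGE_SIZE);
    size = (size + 15) & ~(size_t)15; /* keep 16-byte alignment */
    if (alloc_ptr == NULL || (size_t)(alloc_limit - alloc_ptr) < size) {
        unsigned char *page = claim_page();
        if (page == NULL)
            return NULL;
        alloc_ptr = page;
        alloc_limit = page + PAGE_SIZE;
    }
    void *obj = alloc_ptr;
    alloc_ptr += size;
    return obj;
}
```

Contention is paid once per page rather than once per object, so the pool lock is rarely contended when pages are large relative to typical object sizes.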
Concurrent collection:
The tricolour abstraction
• Black
• “live”
• scanned
• cannot refer to white
• Grey
• “live” wavefront
• still to be scanned
• may refer to any colour
• White
• hypothetical garbage
Garbage collection
• White = whole heap
• Shade root targets grey
• While grey nonempty
• Shade one grey object black
• Shade its white children grey
• At end, white objects are garbage
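The marking loop above can be written out as a small C sketch. It assumes a toy heap of fixed-shape objects with at most two reference fields (`struct object`, `shade`, and `collect` are illustrative names) and keeps the grey set as an explicit stack. Nothing is copied here, so it shows only the colour invariant, not how a copying collector partitions memory.

```c
#include <assert.h>
#include <stddef.h>

enum colour { WHITE, GREY, BLACK };

#define MAX_CHILDREN 2
#define MAX_OBJECTS 16

struct object {
    enum colour colour;
    struct object *child[MAX_CHILDREN]; /* NULL = empty field */
};

static struct object *grey_stack[MAX_OBJECTS]; /* the grey wavefront */
static size_t grey_top = 0;

/* Shade a white object grey and add it to the wavefront. */
static void shade(struct object *obj) {
    if (obj != NULL && obj->colour == WHITE) {
        obj->colour = GREY;
        assert(grey_top < MAX_OBJECTS);
        grey_stack[grey_top++] = obj;
    }
}

/* Returns the number of objects left white, i.e. garbage. */
static size_t collect(struct object *heap, size_t n_heap,
                      struct object **roots, size_t n_roots) {
    assert(n_heap <= MAX_OBJECTS && grey_top == 0);
    for (size_t i = 0; i < n_heap; i++)  /* white = whole heap */
        heap[i].colour = WHITE;
    for (size_t i = 0; i < n_roots; i++) /* shade root targets grey */
        shade(roots[i]);
    while (grey_top > 0) {               /* while grey nonempty */
        struct object *obj = grey_stack[--grey_top];
        for (size_t c = 0; c < MAX_CHILDREN; c++)
            shade(obj->child[c]);        /* shade its white children grey */
        obj->colour = BLACK;             /* ...and the object itself black */
    }
    size_t garbage = 0;
    for (size_t i = 0; i < n_heap; i++)  /* at end, white objects are garbage */
        if (heap[i].colour == WHITE)
            garbage++;
    return garbage;
}
```

The sketch greys an object's children before blackening the object; for a single collector thread this is equivalent to the order on the slide.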
Copying collection
• Partition white from black by copying
• Reclaim white partition wholesale
• At next GC, “flip” black to white
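A copying collector encodes the same colours in memory layout rather than in colour bits. The sketch below is a Cheney-style semispace collector, one standard way to copy (the slides do not say which traversal order is used), assuming fixed-size objects with two reference fields. Objects in to-space below the `scan` index are black, those between `scan` and `free_idx` are grey, and anything still in from-space is white.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPACE_OBJECTS 8

struct obj {
    struct obj *forward; /* non-NULL once this from-space object is copied */
    struct obj *field[2];
    int value;
};

static struct obj space_a[SPACE_OBJECTS], space_b[SPACE_OBJECTS];
static struct obj *from_space = space_a, *to_space = space_b;
static size_t free_idx; /* next free slot in to-space */

/* Copy a white object into to-space (making it grey); return its new address. */
static struct obj *evacuate(struct obj *o) {
    if (o == NULL)
        return NULL;
    if (o->forward != NULL)
        return o->forward; /* already copied */
    assert(free_idx < SPACE_OBJECTS);
    struct obj *copy = &to_space[free_idx++];
    *copy = *o;
    copy->forward = NULL;
    o->forward = copy;
    return copy;
}

/* Collect; roots are updated in place. Returns the number of live objects. */
static size_t cheney_collect(struct obj **roots, size_t n_roots) {
    free_idx = 0;
    for (size_t i = 0; i < n_roots; i++)
        roots[i] = evacuate(roots[i]);
    /* [0, scan) is black, [scan, free_idx) is grey. */
    for (size_t scan = 0; scan < free_idx; scan++)
        for (int f = 0; f < 2; f++)
            to_space[scan].field[f] = evacuate(to_space[scan].field[f]);
    /* Whatever is left in from-space is white: reclaim it wholesale. */
    memset(from_space, 0, sizeof(struct obj) * SPACE_OBJECTS);
    struct obj *tmp = from_space; /* flip: black becomes white at next GC */
    from_space = to_space;
    to_space = tmp;
    return free_idx;
}
```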
Incremental collection
[Figure: mutator threads, with GC work interleaved in increments]
Concurrent collection
[Figure: mutator threads running alongside a background GC thread]
Concurrent mutators
• Mutation changes reachability during GC
• Loss of black/grey reference is safe
• Non-white object losing its last reference will be garbage at next GC
• New reference from black to white
• New reference may make target live
• Collector may never see new reference
• Mutations may require compensation
Compensation options
• Prevent mutator from creating black-to-white references
• write barrier on black
• read barrier on grey to prevent mutator from obtaining white refs
• Prevent destruction of any path from a grey object to a white object without telling GC
• write barrier on grey
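The two write-barrier options can be sketched as store barriers in C. This assumes a single-field object and a hypothetical `shade_grey` that stands in for pushing onto the collector's grey worklist. The first barrier is the classic Dijkstra-style incremental-update barrier. The second is a Yuasa-style deletion barrier, which shades the overwritten target whatever the source's colour; that is a conservative reading of "write barrier on grey", not necessarily the exact variant intended here.

```c
#include <assert.h>
#include <stddef.h>

enum colour { WHITE, GREY, BLACK };

struct object {
    enum colour colour;
    struct object *field;
};

/* Hypothetical stand-in for pushing onto the collector's grey worklist. */
static size_t greyed = 0;
static void shade_grey(struct object *o) {
    if (o != NULL && o->colour == WHITE) {
        o->colour = GREY;
        greyed++;
    }
}

/* Option 1 (write barrier on black): never let a black object point at a
   white one, so shade the new target when the source is black. */
static void store_insertion_barrier(struct object *src, struct object *ref) {
    if (src->colour == BLACK)
        shade_grey(ref);
    src->field = ref;
}

/* Option 2 (deletion barrier): before a reference is overwritten, shade its
   old target so no path the GC has yet to traverse is silently destroyed. */
static void store_deletion_barrier(struct object *src, struct object *ref) {
    shade_grey(src->field);
    src->field = ref;
}
```

Either barrier alone is enough to stop the collector from missing an object that the mutator hides by moving a reference from a grey object into a black one.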
Mostly-copying GC
[Bartlett]
• Copying collection with ambiguous roots
• Uncooperative compilers
• Untidy references
• Explicit pinning
• Pin ambiguously-referenced objects
• Shade their page grey without copying
• Assume heap accuracy
• Copy remaining heap-referenced objects
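Pinning ambiguously-referenced objects can be sketched as below, assuming a contiguous heap of fixed-size pages and a per-page colour table (`page_index`, `pin_ambiguous_roots`, and the page layout are hypothetical). Any word that happens to fall inside the heap is treated as a possible pointer, so its page is shaded grey in place rather than copied: the word might really be an integer, so it cannot be rewritten to point at a copy. Copying the remaining, accurately referenced objects is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 16

enum page_colour { PAGE_WHITE, PAGE_GREY };

static unsigned char heap[NUM_PAGES * PAGE_SIZE];
static enum page_colour page_table[NUM_PAGES];

/* Map an arbitrary word to a heap page index, or -1 if it is not a heap address. */
static long page_index(uintptr_t word) {
    uintptr_t base = (uintptr_t)heap;
    if (word < base || word >= base + sizeof heap)
        return -1;
    return (long)((word - base) / PAGE_SIZE);
}

/* Scan ambiguous roots (e.g. stack or register contents) word by word. */
static size_t pin_ambiguous_roots(const uintptr_t *words, size_t n) {
    size_t pinned = 0;
    for (size_t i = 0; i < n; i++) {
        long p = page_index(words[i]);
        if (p >= 0 && page_table[p] == PAGE_WHITE) {
            page_table[p] = PAGE_GREY; /* pinned: shaded grey in place, never moved */
            pinned++;
        }
    }
    return pinned;
}
```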
Incremental MCGC
[DeTreville]
• Enforce grey mutator invariant
– Stop-the-world (STW) phase greys ambiguously-referenced pages
– Read barrier on grey using VM page protection
• Read barrier
– Stop mutator threads
– Unprotect page
– Copy white targets to grey
– Shade page black
– Restart threads
• Atomic system call wrappers unprotect parameter targets (otherwise a protection trap taken inside the OS makes the system call return an error)
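A page-protection read barrier can be approximated with POSIX `mmap`, `mprotect`, and a fault handler, as in this sketch. The single page, `grey_page`, and `barrier_trap` are hypothetical. The handler only counts the trap and unprotects the page; the stop-threads, copy, and restart steps appear as comments. Calling `mprotect` from a signal handler is not guaranteed async-signal-safe by POSIX, although collectors widely rely on it in practice.

```c
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS and sigaction under strict C modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *grey_page; /* hypothetical: one protected grey page */
static size_t page_size;
static volatile sig_atomic_t traps = 0;

/* Read-barrier slow path, entered when the mutator touches the grey page. */
static void barrier_trap(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)grey_page;
    if (addr < base || addr >= base + page_size) {
        signal(sig, SIG_DFL); /* not our page: genuine fault, crash on retry */
        return;
    }
    traps++;
    /* A real collector would stop mutator threads, unprotect the page, copy
       its white targets to grey, shade it black, and restart threads.
       This sketch only unprotects it. */
    mprotect(grey_page, page_size, PROT_READ | PROT_WRITE);
}

static void setup_grey_page(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    grey_page = mem;
    grey_page[0] = 7; /* stand-in for an object on the grey page */

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = barrier_trap;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL); /* some systems report the fault as SIGBUS */

    /* Grey page: any access by the mutator now traps into barrier_trap. */
    mprotect(grey_page, page_size, PROT_NONE);
}
```

After the handler returns, the faulting load is re-executed and succeeds; later accesses to the now-black page run at full speed.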
Concurrent MCGC?
• Stopping all threads at each increment is prohibitive on SMP & impedes concurrency
• BUT barriers are difficult to place on ambiguous references with uncooperative compilers
• ALSO preemptive scheduling may break wrapper atomicity
Mostly-concurrent MCGC
• Enforce black mutator invariant
• STW blackens ambiguously-referenced pages
• Read barrier on load of accurate (tidy) grey reference
• Read barrier:
• Blacken grey references as they are loaded
• No system call wrappers: arguments are always black
Read barrier on load of grey
• Object header bit marks grey objects
• Inline fast path checks grey bit in target header, calls out to slow path if set
• Out-of-line slow path:
• Lock heap meta-data
• For each (grey) source object in target page
• Copy white targets to grey
• Clear grey header bit
• Shade target page black
• Unlock heap meta-data
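The fast and slow paths above can be sketched in C11 as follows. Everything here is a simplified assumption rather than the actual run-time: pages hold a fixed number of one-field objects, the grey bit lives in an atomic header word, a single mutex guards heap meta-data, and pages from `copy_page` upward are assumed free to receive copies. `load_ref` is the inline fast path; `read_barrier_slow` cleans every grey object on the target's page and then shades that page black.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define OBJS_PER_PAGE 4
#define NUM_PAGES 8
#define GREY_BIT 1u

enum page_colour { PAGE_WHITE, PAGE_GREY, PAGE_BLACK };

struct object {
    atomic_uint header;     /* GREY_BIT set => not yet scanned */
    struct object *forward; /* set once a white object has been copied */
    struct object *field;   /* a single reference field, for brevity */
    int value;
};

struct page {
    enum page_colour colour;
    size_t used;
    struct object objs[OBJS_PER_PAGE];
};

static struct page pages[NUM_PAGES];
static size_t copy_page = 1; /* hypothetical: next page to receive copies */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

static struct page *page_of(struct object *o) {
    size_t offset = (size_t)((unsigned char *)o - (unsigned char *)pages);
    return &pages[offset / sizeof(struct page)];
}

/* Copy a white object onto a grey page, leaving a forwarding pointer. */
static struct object *copy_to_grey(struct object *o) {
    if (o == NULL || page_of(o)->colour != PAGE_WHITE)
        return o; /* not white: stays where it is */
    if (o->forward != NULL)
        return o->forward; /* already copied */
    assert(copy_page < NUM_PAGES);
    if (pages[copy_page].used == OBJS_PER_PAGE) {
        copy_page++;
        assert(copy_page < NUM_PAGES);
    }
    struct page *cp = &pages[copy_page];
    cp->colour = PAGE_GREY;
    struct object *c = &cp->objs[cp->used++];
    c->field = o->field;
    c->value = o->value;
    c->forward = NULL;
    atomic_store_explicit(&c->header, GREY_BIT, memory_order_relaxed);
    o->forward = c;
    return c;
}

/* Out-of-line slow path: clean every grey object on the target's page. */
static void read_barrier_slow(struct object *target) {
    pthread_mutex_lock(&heap_lock); /* lock heap meta-data */
    struct page *pg = page_of(target);
    if (pg->colour == PAGE_GREY) { /* re-check: may already be black */
        if (pg == &pages[copy_page])
            copy_page++; /* don't copy into the page being scanned */
        for (size_t i = 0; i < pg->used; i++) {
            struct object *src = &pg->objs[i];
            if (atomic_load_explicit(&src->header, memory_order_relaxed) & GREY_BIT) {
                src->field = copy_to_grey(src->field); /* copy white targets */
                atomic_fetch_and_explicit(&src->header, ~GREY_BIT,
                                          memory_order_release);
            }
        }
        pg->colour = PAGE_BLACK; /* shade target page black */
    }
    pthread_mutex_unlock(&heap_lock);
}

/* Inline fast path, run on every load of an accurate (tidy) reference. */
static inline struct object *load_ref(struct object **slot) {
    struct object *target = *slot;
    if (target != NULL &&
        (atomic_load_explicit(&target->header, memory_order_acquire) & GREY_BIT))
        read_barrier_slow(target);
    return target;
}
```

The slow path re-checks the page colour after taking the lock, so a mutator that raced with another thread, or saw a stale grey bit, only pays for a redundant lock acquisition. The acquire load in the fast path is a conservative choice for this sketch; it pairs with the release that clears the grey bit.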
Coherence for fast path
• STW phase synchronizes mutators’ views of heap state
• Grey bits are set only in newly-copied objects (i.e., newly-allocated grey pages) since the most recent STW
• Mutators can never see a cleared grey header unless the page is also black
• Seeing a spurious grey header due to weak ordering is benign: slow path will synchronize
Implementation
• Modula-3:
• gcc-based compiler back-end
• No tricky target-specific stack-maps
• Compiler front-end emits barriers
• M3 threads map to preemptively-scheduled POSIX pthreads
• Stop/start threads: signals + semaphores, or OS primitives if available
• Simple to port: Darwin (OS X), Linux, Solaris, Alpha/OSF
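Stopping and restarting threads with signals plus semaphores might look like the following sketch. The choice of `SIGUSR1`/`SIGUSR2`, the `world_stopped` flag, and the toy `mutator` thread are assumptions for illustration. Each stopped thread posts a semaphore from its signal handler (`sem_post` is async-signal-safe) and parks in `sigsuspend` until resumed, then posts again on its way out so the collector knows the restart is complete. macOS does not implement unnamed semaphores (`sem_init`), so this version targets Linux.

```c
#define _DEFAULT_SOURCE /* expose sigaction and pthread_kill under strict C modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define SIG_SUSPEND SIGUSR1 /* assumed signal choices */
#define SIG_RESUME SIGUSR2

static sem_t ack;                /* posted as a thread stops, and again as it resumes */
static atomic_int world_stopped; /* lock-free atomics may be used in handlers */

static void resume_handler(int sig) { (void)sig; } /* only interrupts sigsuspend */

static void suspend_handler(int sig) {
    (void)sig;
    int saved_errno = errno;
    sigset_t wait_mask;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIG_RESUME);
    sem_post(&ack); /* tell the collector this thread is parked */
    /* SIG_RESUME is blocked while this handler runs (sa_mask below), so a
       resume sent before we reach sigsuspend stays pending, not lost. */
    while (atomic_load(&world_stopped))
        sigsuspend(&wait_mask);
    sem_post(&ack); /* tell the collector this thread is running again */
    errno = saved_errno;
}

static void wait_acks(int n) {
    for (int i = 0; i < n; i++) {
        int r;
        while ((r = sem_wait(&ack)) != 0 && errno == EINTR)
            ; /* retry if interrupted */
        assert(r == 0);
    }
}

static void install_stop_handlers(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = resume_handler;
    sigaction(SIG_RESUME, &sa, NULL);
    sigaddset(&sa.sa_mask, SIG_RESUME); /* block resumes inside the suspend handler */
    sa.sa_handler = suspend_handler;
    sigaction(SIG_SUSPEND, &sa, NULL);
    int r = sem_init(&ack, 0, 0);
    assert(r == 0); /* fails where unnamed semaphores are unsupported */
}

static void stop_world(const pthread_t *threads, int n) {
    atomic_store(&world_stopped, 1);
    for (int i = 0; i < n; i++)
        pthread_kill(threads[i], SIG_SUSPEND);
    wait_acks(n); /* every thread is now parked in its handler */
}

static void start_world(const pthread_t *threads, int n) {
    atomic_store(&world_stopped, 0);
    for (int i = 0; i < n; i++)
        pthread_kill(threads[i], SIG_RESUME);
    wait_acks(n); /* every thread has left its wait loop */
}

/* Toy mutator for illustration: spins, bumping a counter. */
static atomic_long work_done;
static atomic_int quit;
static void *mutator(void *arg) {
    (void)arg;
    while (!atomic_load(&quit))
        atomic_fetch_add(&work_done, 1);
    return NULL;
}
```

Waiting for the second acknowledgement in `start_world` ensures no thread is still inside its suspend handler when the next stop request is sent.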
Experiments
• Parallelized GCOld benchmark to permit throughput measurements for multiple mutators
• Measures steady-state GC throughput
• 2 platforms:
• 2 x 2.3 GHz PowerPC Macintosh Xserve running OS X 10.4.4
• 8 x 700 MHz Intel Pentium 3 SMP running Linux 2.6
Read Barriers: STW
[Figure: elapsed time (s) vs. GC ratio (0.1, 0.5, 1, 2, 4, 8) for hardware and software read barriers; 1 user-level mutator thread, work=1]
Elapsed time
[Figure: elapsed time (s) vs. GC ratio (0.1, 0.5, 1, 2, 4, 8), STW vs. INC; 1 system-level mutator thread, work=1]
Heap size
[Figure: maximum heap (MB) vs. GC ratio (0.1, 0.5, 1, 2, 4, 8), STW vs. INC; 1 system-level mutator thread]
BMU
[Figure: BMU; 1 system-level mutator thread, work=1000, ratio=1]
Scalability
[Figure: elapsed time (s) vs. number of mutator threads (1 to 16), STW vs. INC; work=1000, ratio=1, 8xP3]
Java Hotspot server
[Figure: elapsed time (s) vs. number of mutator threads (1 to 8), series Serial and Concurrent MS; work=1000, 8xP3]
Conclusions
• Mostly-concurrent, mostly-copying collection is feasible for multi-processors (proof of existence)
• Performance is good (scalable)
• Portable: changes only to the compiler front-end (to introduce barriers) and to the GC run-time system
• Compiler back-end unchanged: full-blown optimizations enabled, no stack-map overheads
Future work
• Change the read barrier to “clean” only the target object instead of the whole page
Scalability
[Figure: elapsed time (s) vs. number of mutator threads (1 to 16), STW vs. INC; work=10, ratio=1, 8xP3]
Java Hotspot server
[Figure: elapsed time (s) vs. number of mutator threads (1 to 8), series Serial and Concurrent MS; work=10, 8xP3]