Path Profile Estimation and
Superblock Formation
Jeff Pang
Jimeng Sun
Motivation
[Figure: Compile, Optimize, Run, Profile cycle]
Why Continuous Profiling?
– Continuous Optimization
– Dynamic Optimization
– Realistic Profiles
Challenges:
– Automated
– Low overhead
– Accuracy
Related Work:
H. Chen, et al. Dynamic Trace Selection Using Performance
Hardware Sampling. CGO, 2003.
A. Shye, et al. Analysis of Path Profiling Information Gathered
with Performance Monitoring Hardware. ICCA, 2005.
Goals
[Figure: cycle of Run with Simulated PMU, Path Profile Sample, Path Profile Estimation, Superblock Formation]
• Take advantage of modern Performance Monitoring Units (PMUs)
– Available in Pentium 4, Itanium, PPC 970, etc.
– Allow sampling of the last few branches executed
– “Simulated” for our project using instrumentation
• Estimate the full path profile from the samples
• Validate by performing Superblock formation
– An optimization to improve scheduling on VLIW processors
– Path-based Superblocks based on Young (1997)
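The PMU-style sampling described above can be sketched as follows. This is a minimal illustration, not the project's SUIF instrumentation: the function name `sample_branch_traces`, the edge-label encoding of branches, and the sampling `period` are all assumptions made for the example. Only the idea of a small buffer holding the most recent branches comes from the slides.

```python
from collections import Counter, deque

def sample_branch_traces(branch_stream, buffer_size=4, period=4):
    """Simulate a branch trace buffer that is read out periodically.

    branch_stream: iterable of executed branches (any hashable label).
    buffer_size:   how many of the most recent branches the buffer holds.
    period:        hypothetical sampling interval, in executed branches.
    """
    buffer = deque(maxlen=buffer_size)  # oldest branch is dropped automatically
    samples = Counter()
    for i, branch in enumerate(branch_stream, start=1):
        buffer.append(branch)
        # Take a snapshot only on sampling ticks, and only once the buffer is full.
        if i % period == 0 and len(buffer) == buffer_size:
            samples[tuple(buffer)] += 1
    return samples

# Two runs through a small CFG, with branches written as edge labels.
stream = ["AB", "BD", "DE", "EG", "AC", "CD", "DF", "FG"]
samples = sample_branch_traces(stream, buffer_size=4, period=4)
```

A real PMU would trigger the readout from a hardware counter overflow rather than from a software loop index; the loop here only stands in for that trigger.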
Design Overview
[Figure: pipeline diagram with stages: source, frontend, instrument (PMU sim), instrumented program, sampled profile, offline estimator, estimated path profile, superblock, backend, optimized program]
• Implemented PMU simulator and Superblock
optimization as SUIF passes
• Implemented Estimator offline using sampled branch
profiles and SUIF CFG
Path Sampling
• Exact path profile
– Accurate
– But expensive
• Edge profile
– Inaccurate (due to the independence assumption)
– Cheap
• It is hard (impossible) to reconstruct the path information
• Sampling path profile
– Periodically sample 4 consecutive branches (branch trace buffer)
– Cheap to collect and more accurate than edge profile
[Figure: CFG with blocks A, B, C, D, E, F, G; every edge has count 50]
• Example
– Exact paths: ABDEG, ACDFG
– Edge profile: ABDEG, ACDFG and ABDFG, ACDEG
– Sampling: {AB, DE}, {AC, DF} => ABDEG, ACDFG
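The ambiguity of the edge profile in this example can be checked with a short sketch. It assumes an acyclic CFG with single-letter block names and edge counts of 50 on every edge, as in the figure; the helper names `enumerate_paths` and `independent_estimate` are hypothetical.

```python
def enumerate_paths(edge_counts, start, end):
    """List every start-to-end path that uses only profiled edges (acyclic CFG assumed)."""
    succ = {}
    for src, dst in edge_counts:
        succ.setdefault(src, []).append(dst)
    paths = []

    def walk(node, prefix):
        if node == end:
            paths.append(prefix)
            return
        for nxt in sorted(succ.get(node, [])):
            walk(nxt, prefix + nxt)

    walk(start, start)
    return paths

def independent_estimate(edge_counts, path):
    """Estimate a path's frequency by treating every branch as independent."""
    outflow = {}
    for (src, _), count in edge_counts.items():
        outflow[src] = outflow.get(src, 0) + count
    freq = outflow[path[0]]  # number of times the entry block executed
    for src, dst in zip(path, path[1:]):
        freq *= edge_counts[(src, dst)] / outflow[src]
    return freq

edge_profile = {
    ("A", "B"): 50, ("A", "C"): 50, ("B", "D"): 50, ("C", "D"): 50,
    ("D", "E"): 50, ("D", "F"): 50, ("E", "G"): 50, ("F", "G"): 50,
}
candidates = enumerate_paths(edge_profile, "A", "G")
estimates = {p: independent_estimate(edge_profile, p) for p in candidates}
# candidates is ABDEG, ABDFG, ACDEG, ACDFG, each estimated at 25.0,
# although only ABDEG and ACDFG actually ran (50 times each).
```

The edge counts alone cannot separate the two executed paths from the two that never ran, which is the inaccuracy the independence assumption introduces.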
Hot Path Formation
• Sampled paths are short
• Sampled paths => longer paths
– Join 2 paths if they can merge into one simple path and the frequencies of both paths are large
– e.g. 5000 ABCD, 4000 CDEF => 4000 ABCDEF
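A minimal sketch of the join step, under stated assumptions: paths are strings of single-letter block names, two paths "can merge" when a suffix of the first equals a prefix of the second, "large" means at least a hypothetical `min_freq` threshold, and the merged frequency is the smaller of the two (which matches the 4000 in the example). The function name `try_join` is illustrative, and the estimator's actual rules may differ.

```python
def try_join(path_a, freq_a, path_b, freq_b, min_freq=1000):
    """Join two sampled paths if they overlap, stay simple, and are both frequent.

    Returns (merged_path, merged_freq), or None when the paths cannot be joined.
    """
    if freq_a < min_freq or freq_b < min_freq:
        return None
    # Try the longest overlap first: suffix of path_a against prefix of path_b.
    for k in range(min(len(path_a), len(path_b)) - 1, 0, -1):
        if path_a[-k:] == path_b[:k]:
            merged = path_a + path_b[k:]
            # A simple path visits each block at most once.
            if len(set(merged)) == len(merged):
                return merged, min(freq_a, freq_b)
    return None

joined = try_join("ABCD", 5000, "CDEF", 4000)
# joined == ("ABCDEF", 4000)
```

Taking the minimum is a conservative choice: the joined path cannot have executed more often than either of its sampled pieces.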
Path Estimation Accuracy
• We compare the top 100 paths captured by the exact path profile and the estimated path profile
• The success rate is
$$\frac{\sum_{est \cap act} cycle_{act}}{\sum_{act} cycle_{act}}$$
[Figure: Accuracy (0% to 100%) for adpcm_e, 099.go, 132.ijpeg; series: 10k, 100k, 1m]
[Figure: runtime (axis 0 to 30) for adpcm_e, 099.go, 132.ijpeg]
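The success rate can be computed with a few lines of Python. This sketch assumes one reading of the formula: "act" is the set of top-N paths in the exact profile, "est" is the set of top-N paths in the estimated profile, and each path is weighted by its cycle count from the exact profile. The function name `success_rate` and the sample numbers are made up for illustration.

```python
def success_rate(actual_cycles, estimated_paths, top_n=100):
    """Fraction of actual hot-path cycles covered by the estimated hot paths.

    actual_cycles:   dict mapping path -> cycles measured by the exact profile.
    estimated_paths: collection of paths the estimator reports as hot.
    """
    top_actual = sorted(actual_cycles, key=actual_cycles.get, reverse=True)[:top_n]
    estimated = set(estimated_paths)
    total = sum(actual_cycles[p] for p in top_actual)
    covered = sum(actual_cycles[p] for p in top_actual if p in estimated)
    return covered / total if total else 0.0

# Illustrative numbers only, not measurements from the benchmarks.
actual = {"ABDEG": 600, "ACDFG": 300, "ABDFG": 100}
rate = success_rate(actual, ["ABDEG", "ABDFG"])
# rate == 0.7
```

Weighting by actual cycles means that missing a rarely executed path costs little, while missing the hottest path costs a lot.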
Superblock Formation
[Figure: CFG examples of superblock formation; panels: Tail Duplication, Loop Unrolling, Combinations]
• Creates larger regions to schedule over
for hot paths
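Tail duplication, one of the transformations named in the figure, can be sketched on a successor map. This is a textbook-style illustration rather than the project's SUIF pass, and it assumes an acyclic CFG, string block names, and a hot trace given as a list of blocks. The function name `tail_duplicate` and the `'` suffix for copies are hypothetical.

```python
def tail_duplicate(succ, trace):
    """Return a new successor map in which `trace` has no side entrances."""
    preds = {}
    for src, dsts in succ.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)

    def side_preds(i):
        # Predecessors of trace[i] other than the previous block on the trace.
        return preds.get(trace[i], set()) - {trace[i - 1]}

    new_succ = {b: list(dsts) for b, dsts in succ.items()}
    first = next((i for i in range(1, len(trace)) if side_preds(i)), None)
    if first is None:
        return new_succ  # already a superblock: single entry at the head

    tail = trace[first:]
    copy_of = {b: b + "'" for b in tail}
    # Copied blocks branch to copies inside the tail and to originals elsewhere.
    for b in tail:
        new_succ[copy_of[b]] = [copy_of.get(d, d) for d in succ[b]]
    # Redirect every side entrance to the copied tail.
    for i in range(first, len(trace)):
        for p in side_preds(i):
            new_succ[p] = [copy_of[trace[i]] if d == trace[i] else d
                           for d in new_succ[p]]
    return new_succ

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}
result = tail_duplicate(cfg, ["A", "B", "D", "E", "G"])
# C now branches to D' and F to G', so the trace A-B-D-E-G is entered only at A.
```

The duplicated blocks are the source of code growth from this transformation: here three of the seven blocks are copied.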
Superblock Performance
• Performance results pending
– Waiting for CASH simulator setup…
• Superblock formation is not useful on the P4
– Causes 0-5% slowdown on tested benchmarks (probably due to icache misses)
– Need a multi-issue architecture to see scheduling benefits?
[Figure: Code Expansion (x86 ELF); y-axis: Normalized Exe Size (0 to 1.8); x-axis: Application (adpcm_e, 132.ijpeg, 099.go); series: base, exact, estimate]