Computer Systems
Optimizing program performance
University of Amsterdam
Arnoud Visser 1
Computer Systems – optimizing program performance
Performance can make the difference
• Use Pointers instead of array indices
• Use doubles instead of floats
• Optimize inner loops
Recommendations by Patrick van der Smagt (1991) for neural net implementations
Machine-independent versus machine-dependent optimizations
– Machine-Independent Optimizations: do these regardless of processor / compiler
• Code Motion (out of the loop)
• Reducing procedure calls
• Eliminating unneeded memory references
• Sharing common sub-expressions
– Machine-Dependent Optimizations
• Pointer code
• Unrolling
• Enabling instruction level parallelism
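A minimal sketch of two of the machine-independent items (code motion and avoiding unneeded memory references). The function names and the row-sum task are hypothetical, chosen only for illustration, and the sketch assumes that `a` and `sums` do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Before: n * i is recomputed in every inner iteration, and the
   running total is read from and written to memory (sums[i])
   on every step. */
void row_sums_slow(const int *a, int *sums, size_t n)
{
    assert(a && sums);
    for (size_t i = 0; i < n; i++) {
        sums[i] = 0;
        for (size_t j = 0; j < n; j++)
            sums[i] += a[n * i + j];
    }
}

/* After: the row address is computed once per row (code motion)
   and the total is accumulated in a local variable, so memory is
   written once per row. Assumes a and sums do not alias. */
void row_sums_fast(const int *a, int *sums, size_t n)
{
    assert(a && sums);
    for (size_t i = 0; i < n; i++) {
        const int *row = a + n * i;   /* moved out of the inner loop */
        int total = 0;                /* can live in a register */
        for (size_t j = 0; j < n; j++)
            total += row[j];
        sums[i] = total;
    }
}
```

Both versions compute the same sums for non-overlapping arguments; a compiler often cannot make the second change itself because it cannot rule out aliasing.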
Machine dependent
One has to know today’s architectures
• Superscalar (Pentium): often two instructions per cycle
• Dynamic execution (P6): three instructions per cycle, out of order
• Explicit parallelism (Itanium): six execution units
Pentium III Design
[Figure: block diagram of the Pentium III.
Instruction Control: Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File.
Execution: Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and Data Cache.
Instruction Control sends operations to Execution; operation results, register updates and "Prediction OK?" flow back.]
Functional Units of Pentium III
• Multiple Instructions Can Execute in Parallel
– 1 load
– 1 store
– 2 integer (one may be branch)
– 1 FP Addition
– 1 FP Multiplication or Division
Performance of Pentium III operations
• Many instructions can be Pipelined to 1 cycle
Instruction                   Latency   Cycles/Issue
Load / Store                  3         1
Integer Multiply              4         1
Integer Divide                36        36
Double/Single FP Multiply     5         2
Double/Single FP Add          3         1
Double/Single FP Divide       38        38
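One way to read the two columns: when every operation needs the result of the previous one, a loop pays the full latency per element; when several operations are independent, the limit moves toward the issue time. The helper below is a simplified sketch of that reasoning, not a cycle-accurate model. The function name and the `chains` parameter are assumptions made for illustration, and the model ignores other limits such as the number of functional units and registers.

```c
#include <assert.h>

/* Simplified lower bound on cycles per element (CPE) for a loop
   that applies one operation per element.
   latency: cycles until a result is available
   issue:   minimum cycles between two operations on the same unit
   chains:  number of independent dependency chains (hypothetical
            parameter; 1 means each operation waits for the last) */
double cpe_bound(double latency, double issue, int chains)
{
    assert(chains >= 1);
    double by_latency = latency / chains;
    return by_latency > issue ? by_latency : issue;
}
```

With the FP multiply values from the table (latency 5, issue 2), `cpe_bound(5, 2, 1)` gives 5.0 and `cpe_bound(5, 2, 4)` gives 2.0: beyond that point more independent chains no longer help.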
Instruction Control
[Figure: Instruction Control part of the block diagram (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File).]
• Grabs Instructions From Memory
– Based on current PC + predicted targets for predicted branches
– Hardware dynamically guesses whether a branch is taken and (possibly) the branch target
• Translates Instructions Into Operations
– Primitive steps required to perform instruction
– Typical instruction requires 1–3 operations
• Converts Register References Into Tags
– Abstract identifier linking destination of one operation with sources of later operations
Translation Example
• Version of Combine4
– Integer data, multiply operation
.L24: # Loop:
imull (%eax,%edx,4),%ecx # t *= data[i]
incl %edx # i++
cmpl %esi,%edx # i:length
jl .L24 # if < goto Loop
• Translation of First Iteration
.L24:
imull (%eax,%edx,4),%ecx      load (%eax,%edx.0,4) → t.1
                              imull t.1, %ecx.0 → %ecx.1
incl %edx                     incl %edx.0 → %edx.1
cmpl %esi,%edx                cmpl %esi, %edx.1 → cc.1
jl .L24                       jl-taken cc.1
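For orientation, the assembly loop above is the kind of code a compiler might produce for a C loop like the following. This is a sketch on a plain array; the function name `product4` and its signature are assumptions, not the course's combine4 source.

```c
#include <assert.h>

/* Hypothetical stand-in for the inner loop of combine4:
   multiply all elements, accumulating in a local variable. */
int product4(const int *data, int length)
{
    assert(length == 0 || data != 0);
    int t = 1;                      /* accumulator (%ecx in the listing) */
    for (int i = 0; i < length; i++)
        t *= data[i];               /* one load and one imull per element */
    return t;
}
```

Each pass through the loop body maps to the four instructions in the listing: the multiply with a memory operand, the increment, the compare and the conditional jump.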
Visualizing Operations
load (%eax,%edx,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
[Figure: data-flow diagram of these operations (load, imull, incl, cmpl, jl) plotted against time.]
• Operations
– Vertical position denotes time at which executed
• Cannot begin operation until operands available
– Height denotes latency
4 Iterations of Combining Sum
[Figure: scheduling of four iterations (i=0 to 3) of the integer-sum loop (load, addl, incl, cmpl, jl) over cycles 1 to 7; 4 integer ops per cycle.]
• Resource Analysis
• Performance
– Unlimited resources should give CPE of 1.0
– Would require executing 4 integer operations in parallel
Pentium Resource Constraints
[Figure: scheduling of iterations 4 to 8 (i=3 to 7) of the integer-sum loop over cycles 6 to 18 under the actual resource constraints.]
– Only two integer functional units
– Set priority based on program order
• Performance
– Sustain CPE of 2.0
Loop Unrolling – Measured CPE=1.33
• Optimization
– Combine multiple iterations into single loop body
– Amortizes loop overhead across multiple iterations
– Finish extras at end

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+2]
               + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}
Resource distribution with Loop Unrolling
[Figure: scheduling of iterations 3 and 4 (i=6, i=9) of the unrolled loop over cycles 5 to 15; each iteration issues three loads and three addl operations plus one addl/cmpl/jl for the loop index.]
• Predicted Performance
– Can complete iteration in 3 cycles
– Should give CPE of 1.0
Effect of Unrolling
Unrolling Degree    1      2      3      4      8      16
Integer Sum         2.00   1.50   1.33   1.50   1.25   1.06
Integer Product     4.00
FP Sum              3.00
FP Product          5.00
• Only helps integer sum for our examples
– Other cases constrained by functional unit latencies
• Effect is nonlinear with degree of unrolling
• Many subtle effects determine exact scheduling of operations
Unrolling is for long vectors
Unrolling Degree       1      2      3      4      8      16
IntSum (length ∞)      2.00   1.50   1.33   1.50   1.25   1.06
IntSum (length 1024)   2.06   1.56   1.40   1.56   1.31   1.12
IntSum (length 31)     4.02   3.57   3.39   3.84   3.91   3.66
• New source of overhead
– The need to finish the remaining elements when the
vector length is not divisible by the degree of unrolling
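To make the leftover work visible, here is a sketch of 4-way unrolling on a plain array that also reports how many elements fall to the cleanup loop. The function name and the `leftover` output parameter are hypothetical, added only for illustration.

```c
#include <assert.h>

/* Sum with 4-way unrolling. *leftover reports how many elements
   were handled one at a time by the cleanup loop. */
int sum_unrolled4(const int *data, int length, int *leftover)
{
    assert(length >= 0 && leftover != 0);
    int sum = 0;
    int i;
    int limit = length - 3;     /* largest i for which data[i+3] exists, plus 1 */

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += 4)
        sum += data[i] + data[i+1] + data[i+2] + data[i+3];

    *leftover = length - i;     /* elements not covered by the main loop */

    /* Finish any remaining elements */
    for (; i < length; i++)
        sum += data[i];
    return sum;
}
```

For a 31-element vector this leaves 3 elements for the cleanup loop; by the same arithmetic, 16-way unrolling would leave 15 of the 31, which is why short vectors gain little from high unrolling degrees.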
3 Iterations of Combining Product
[Figure: scheduling of three iterations (i=0 to 2) of the integer-product loop over cycles 1 to 15; the imull operations execute one after another.]
• Unlimited Resource Analysis
– Assume operation can start as soon as operands available
• Performance
– Limiting factor becomes latency
– Gives CPE of 4.0
Iteration splitting
• Optimization
– Make operands available by accumulating in two different products (x0,x1)
– Combine at end
• Performance
– CPE = 2.0

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}
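The same splitting can be tried on floating-point data. The sketch below uses a plain array of doubles; the function name `product_2x2` is an assumption. Note that regrouping the multiplications changes the order of evaluation, so for floating-point data the rounding of the result can differ from the single-accumulator version.

```c
#include <assert.h>

/* Product of all elements using two independent accumulators.
   x0 collects the even-indexed elements, x1 the odd-indexed ones,
   so consecutive multiplications do not wait for each other. */
double product_2x2(const double *data, int length)
{
    assert(length == 0 || data != 0);
    double x0 = 1.0;
    double x1 = 1.0;
    int i;
    int limit = length - 1;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++)
        x0 *= data[i];

    return x0 * x1;
}
```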
Resource distribution with Iteration Splitting
[Figure: scheduling of three iterations (i=0, 2, 4) of the two-accumulator loop over cycles 1 to 16; the two products accumulate in %ecx and %ebx, two imull operations per iteration.]
– Predicted Performance
• Make use of both execution units
• Gives CPE of 2.0
Results for Pentium III
Method              Integer            Floating Point
                    +        *         +        *
Abstract -g         42.06    41.86     41.44    160.00
Abstract -O2        31.25    33.25     31.25    143.00
Move vec_length     20.66    21.25     21.15    135.00
data access         6.00     9.00      8.00     117.00
Accum. in temp      2.00     4.00      3.00     5.00
Pointer             3.00     4.00      3.00     5.00
Unroll 4            1.50     4.00      3.00     5.00
Unroll 16           1.06     4.00      3.00     5.00
2X2                 1.50     2.00      2.00     2.50
4X4                 1.50     2.00      1.50     2.50
8X4                 1.25     1.25      1.50     2.00
Theoretical Opt.    1.00     1.00      1.00     2.00
Worst : Best        39.7     33.5      27.6     80.0
– Biggest gain doing basic optimizations
– But, last little bit helps
Results for Pentium 4
Method              Integer            Floating Point
                    +        *         +        *
Abstract -g         35.25    35.34     35.85    38.00
Abstract -O2        26.52    30.26     31.55    32.00
Move vec_length     18.00    25.71     23.36    24.25
data access         3.39     31.56     27.50    28.35
Accum. in temp      2.00     14.00     5.00     7.00
Unroll 4            1.01     14.00     5.00     7.00
Unroll 16           1.00     14.00     5.00     7.00
4X2                 1.02     7.00      2.63     3.50
8X4                 1.01     3.98      1.82     2.00
8X8                 1.63     4.50      2.42     2.31
Worst : Best        35.2     8.9       19.7     19.0
– Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
• Clock runs at 2.0 GHz
• Not an improvement over 1.0 GHz P3 for integer *
– Avoids FP multiplication anomaly
Machine-Dependent Opt. Summary
• Loop Unrolling
– Some compilers do this automatically
– Generally not as clever as what can be achieved by hand
• Exposing Instruction-Level Parallelism
– Very machine dependent
• Warning:
– Benefits depend heavily on particular machine
– Do only for performance-critical parts of code
– Best if performed by compiler
• But GCC on IA32/Linux is not very good
Conclusion
How should I write my programs, given that
I have a good, optimizing compiler?
• Don’t: Smash Code into Oblivion
– Hard to read, maintain, & assure correctness
• Do:
– Select best algorithm & data representation
– Write code that’s readable & maintainable
• Procedures, recursion, without built-in constant limits
• Even though these factors can slow down code
• Focus on Inner Loops
– Detailed optimization means detailed measurement
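A rough sketch of such a measurement: time a kernel at two vector lengths and take the slope as cycles per element. All names here are hypothetical, the clock rate must be supplied by the caller, and `clock()` is coarse, so the result is only meaningful with large lengths and many repetitions. An optimizing compiler may also hoist the repeated call out of the timing loop, so check the generated code before trusting the numbers.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical kernel under test: plain array sum. */
long sum_array(const int *data, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Average CPU seconds for one call on n elements. */
double seconds_per_call(const int *data, long n, int reps)
{
    volatile long sink = 0;          /* keeps the results "used" */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += sum_array(data, n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* CPE estimated as the slope of run time between two lengths
   n1 < n2, converted to cycles with the caller-supplied clock
   rate in Hz (an assumption; it is not measured here). */
double estimate_cpe(const int *data, long n1, long n2,
                    int reps, double clock_hz)
{
    assert(n1 < n2 && reps > 0 && clock_hz > 0.0);
    double t1 = seconds_per_call(data, n1, reps);
    double t2 = seconds_per_call(data, n2, reps);
    return (t2 - t1) * clock_hz / (double)(n2 - n1);
}
```

Using the slope between two lengths rather than a single timing removes the fixed call and loop-setup overhead from the estimate.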
Assignment
• Practice Problems
– Practice Problem 5.5:
'What program variable has been spilled for combine6()?'
– Practice Problem 5.6:
'Indicate the CPE of different associated products.'
• Optimization Lab