Computer Systems



Optimizing program performance







University of Amsterdam





Arnoud Visser 1

Computer Systems – optimizing program performance

Performance can make the difference

• Use Pointers instead of array indices

• Use doubles instead of floats

• Optimize inner loops









Recommendations by Patrick van der Smagt (1991) for neural net implementations


Machine-independent versus Machine-dependent optimizations

– Optimizations you should do regardless of processor / compiler

• Code Motion (out of the loop)

• Reducing procedure calls

• Eliminating unneeded memory references

• Share Common sub-expressions

– Machine-Dependent Optimizations

• Pointer code

• Unrolling

• Enabling instruction level parallelism
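
As a minimal sketch of the "code motion" item above: a value that does not change inside a loop is computed once, before the loop. The function names lower_slow and lower_fast are illustrative and do not come from the slides; whether a given compiler already performs this motion by itself is not assumed here.

```c
#include <stddef.h>
#include <string.h>

/* Before: strlen() appears in the loop test, so unless the compiler
   can prove the string length is unchanged, it is re-evaluated on
   every iteration. */
void lower_slow(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');   /* subtracting -32 adds 32: upper to lower case */
    }
}

/* After code motion: the length is loop-invariant, so it is
   computed once, outside the loop. */
void lower_fast(char *s)
{
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    }
}
```

Both functions produce the same string; only the amount of work per iteration differs.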




Machine dependent

One has to know today’s architectures

• Superscalar (Pentium)

(often two instructions/cycle)

• Dynamic execution (P6)

(three instructions out-of-order/cycle)

• Explicit parallelism (Itanium)

(six execution units)


Pentium III Design

[Figure: Pentium III block diagram. The Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) sends operations to the Execution unit and receives register updates and a "Prediction OK?" signal. The Execution unit contains the functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store); Load and Store exchange addresses and data with the Data Cache, and operation results flow between the units.]






Functional Units of Pentium III

• Multiple Instructions Can Execute in Parallel

– 1 load

– 1 store

– 2 integer (one may be branch)

– 1 FP Addition

– 1 FP Multiplication or Division










Performance of Pentium III operations

• Many instructions can be pipelined to 1 cycle

Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
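
The difference between the two columns can be made concrete with a simplified model, sketched here under the assumption of a single functional unit and no other bottlenecks. The function names are hypothetical. A chain of dependent operations pays the full latency each time, while independent operations can start every "Cycles/Issue" cycles.

```c
/* Cycles for n operations where each one needs the result of the
   previous one: nothing overlaps, so each pays the full latency. */
long cycles_dependent(long n, long latency)
{
    return n * latency;
}

/* Cycles for n independent operations on one pipelined unit:
   a new operation starts every `issue` cycles, and the last one
   still needs `latency` cycles to complete. */
long cycles_independent(long n, long latency, long issue)
{
    if (n <= 0)
        return 0;
    return (n - 1) * issue + latency;
}
```

With the table's numbers, 100 dependent integer multiplies take 400 cycles in this model, while 100 independent ones take 103. For integer divide (latency 36, issue 36) both cases take 3600 cycles, because the unit is not pipelined.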






Instruction Control

[Figure: Instruction Control block, containing Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, and Register File; it fetches instructions by address and emits operations.]

• Grabs Instructions From Memory



– Based on current PC + predicted targets for predicted branches

– Hardware dynamically guesses whether branches are taken and (possibly) the branch target

• Translates Instructions Into Operations

– Primitive steps required to perform instruction

– Typical instruction requires 1–3 operations

• Converts Register References Into Tags

– Abstract identifier linking destination of one operation with sources of later operations






Translation Example

• Version of Combine4

– Integer data, multiply operation

.L24: # Loop:

imull (%eax,%edx,4),%ecx # t *= data[i]

incl %edx # i++

cmpl %esi,%edx # i:length

jl .L24 # if < goto Loop





• Translation of First Iteration

.L24:
imull (%eax,%edx,4),%ecx      load (%eax,%edx.0,4) → t.1
                              imull t.1, %ecx.0 → %ecx.1
incl %edx                     incl %edx.0 → %edx.1
cmpl %esi,%edx                cmpl %esi, %edx.1 → cc.1
jl .L24                       jl-taken cc.1
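
For orientation, the assembly loop above corresponds roughly to C source of the following shape. This is a sketch, not the course's actual code: the vec_rec structure is a hypothetical stand-in for the vector type, and the register annotations are inferred from the comments in the assembly listing.

```c
/* Hypothetical stand-in for the course's vector abstraction. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v) { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Integer data, multiply operation, accumulating in a local variable. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);     /* compared against i (%esi) */
    int *data = get_vec_start(v);   /* base address (%eax) */
    int t = 1;                      /* running product (%ecx) */
    for (i = 0; i < length; i++)    /* i lives in %edx */
        t *= data[i];
    *dest = t;
}
```

The single statement t *= data[i] is what the processor splits into a load operation followed by an imull operation.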






Visualizing Operations

load (%eax,%edx,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1

[Figure: data-flow diagram of one iteration. incl produces %edx.1, which feeds cmpl (producing cc.1) and then jl; load produces t.1, which feeds imull together with %ecx.0 to produce %ecx.1. The vertical axis is time.]

• Operations
– Vertical position denotes time at which executed
• Cannot begin operation until operands available
– Height denotes latency


4 Iterations of Combining Sum

[Figure: scheduling diagram of four overlapped iterations (i=0 to i=3) of the integer-sum loop over cycles 1 to 7. Each iteration consists of load, addl, incl, cmpl, and jl; in the steady state 4 integer ops execute in the same cycle.]

• Resource Analysis
• Performance
– Unlimited resources should give CPE of 1.0
– Would require executing 4 integer operations in parallel






Pentium Resource Constraints

[Figure: scheduling diagram for iterations 4 to 8 (i=3 to i=7) of the integer-sum loop over cycles 6 to 18, with operations delayed when both integer units are busy.]

– Only two integer functional units
– Set priority based on program order

• Performance
– Sustain CPE of 2.0






Loop Unrolling – Measured CPE=1.33



• Optimization
– Combine multiple iterations into single loop body
– Amortizes loop overhead across multiple iterations
– Finish extras at end

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+2]
               + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}






Resource distribution with Loop Unrolling

[Figure: scheduling diagram for iterations 3 and 4 (i=6 and i=9) of the unrolled loop over cycles 5 to 15. Each iteration issues three loads (t.Na, t.Nb, t.Nc) and three dependent addl operations on %ecx, plus one addl, cmpl, and jl for the loop index.]

• Predicted Performance
– Can complete iteration in 3 cycles
– Should give CPE of 1.0


Effect of Unrolling

Unrolling Degree    1      2      3      4      8      16
Integer Sum         2.00   1.50   1.33   1.50   1.25   1.06
Integer Product     4.00
FP Sum              3.00
FP Product          5.00

• Only helps integer sum for our examples
– Other cases constrained by functional unit latencies
• Effect is nonlinear with degree of unrolling
• Many subtle effects determine exact scheduling of operations


Unrolling is for long vectors

Unrolling Degree    1      2      3      4      8      16
IntSum ∞            2.00   1.50   1.33   1.50   1.25   1.06
IntSum 1024         2.06   1.56   1.40   1.56   1.31   1.12
IntSum 31           4.02   3.57   3.39   3.84   3.91   3.66



• New source of overhead

– The need to finish the remaining elements when the vector length is not divisible by the degree of unrolling
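
How many elements fall to the cleanup loop can be worked out directly. The helper names below are hypothetical; the sketch assumes the loop structure of combine5, generalized to degree k (main loop runs while i < length-k+1), with length >= 0 and k >= 1.

```c
/* Elements processed by the unrolled main loop. */
int unrolled_elements(int length, int k)
{
    return (length / k) * k;
}

/* Elements left for the one-at-a-time cleanup loop. */
int cleanup_elements(int length, int k)
{
    return length % k;
}
```

For a vector of 31 elements and degree 16, only 16 elements go through the unrolled loop and 15 through the cleanup loop. For 1024 elements and degree 16, the cleanup loop handles none. This accounts only for the remainder effect named in the slide; other fixed costs per call are not modeled.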








3 Iterations of Combining Product

[Figure: scheduling diagram of three overlapped iterations (i=0 to i=2) of the integer-product loop over cycles 1 to 15. The load, incl, cmpl, and jl operations of later iterations run ahead, while each imull waits for the previous product (%ecx.1, %ecx.2, %ecx.3).]

• Unlimited Resource Analysis
– Assume operation can start as soon as operands available

• Performance
– Limiting factor becomes latency
– Gives CPE of 4.0


Iteration splitting

• Optimization
– Make operands available by accumulating in two different products (x0,x1)
– Combine at end

• Performance
– CPE = 2.0

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}
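
The same splitting can be applied to floating-point data, with one caveat: floating-point multiplication is not associative, so the split version may differ from the serial one in the last bits for arbitrary input. The following is a minimal sketch on a plain array rather than the course's vector type; the function names are illustrative.

```c
#include <stddef.h>

/* One accumulator: each multiply must wait for the previous result. */
double product_serial(const double *data, size_t length)
{
    double x = 1.0;
    size_t i;
    for (i = 0; i < length; i++)
        x *= data[i];
    return x;
}

/* Two accumulators: the two multiplies in the loop body do not
   depend on each other, so they can overlap in the pipeline. */
double product_split(const double *data, size_t length)
{
    double x0 = 1.0;
    double x1 = 1.0;
    size_t i;
    /* Combine 2 elements at a time; i + 1 < length avoids
       unsigned underflow when length is 0. */
    for (i = 0; i + 1 < length; i += 2) {
        x0 *= data[i];
        x1 *= data[i + 1];
    }
    /* Finish any remaining element */
    for (; i < length; i++)
        x0 *= data[i];
    return x0 * x1;
}
```

When all inputs and intermediate products are exactly representable (for example, powers of two), both functions return identical results.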


Resource distribution with Iteration Splitting

[Figure: scheduling diagram of three iterations (i=0, i=2, i=4) of the split loop over cycles 1 to 16. Each iteration issues two loads (t.Na, t.Nb) and two imull operations, one accumulating in %ecx and one in %ebx, plus addl, cmpl, and jl for the loop index.]

– Predicted Performance
• Make use of both execution units
• Gives CPE of 2.0


Results for Pentium III

Method              Integer +   Integer *   FP +     FP *
Abstract -g         42.06       41.86       41.44    160.00
Abstract -O2        31.25       33.25       31.25    143.00
Move vec_length     20.66       21.25       21.15    135.00
data access         6.00        9.00        8.00     117.00
Accum. in temp      2.00        4.00        3.00     5.00
Pointer             3.00        4.00        3.00     5.00
Unroll 4            1.50        4.00        3.00     5.00
Unroll 16           1.06        4.00        3.00     5.00
2X2                 1.50        2.00        2.00     2.50
4X4                 1.50        2.00        1.50     2.50
8X4                 1.25        1.25        1.50     2.00
Theoretical Opt.    1.00        1.00        1.00     2.00
Worst : Best        39.7        33.5        27.6     80.0





– Biggest gain doing basic optimizations

– But, last little bit helps






Results for Pentium 4

Method              Integer +   Integer *   FP +     FP *
Abstract -g         35.25       35.34       35.85    38.00
Abstract -O2        26.52       30.26       31.55    32.00
Move vec_length     18.00       25.71       23.36    24.25
data access         3.39        31.56       27.50    28.35
Accum. in temp      2.00        14.00       5.00     7.00
Unroll 4            1.01        14.00       5.00     7.00
Unroll 16           1.00        14.00       5.00     7.00
4X2                 1.02        7.00        2.63     3.50
8X4                 1.01        3.98        1.82     2.00
8X8                 1.63        4.50        2.42     2.31
Worst : Best        35.2        8.9         19.7     19.0



– Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)

• Clock runs at 2.0 GHz

• Not an improvement over 1.0 GHz P3 for integer *

– Avoids FP multiplication anomaly
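
The remark about integer multiplication follows from converting cycles to time. A small sketch, with a hypothetical helper name, using the clock rates and the "Accum. in temp" CPE values from the two result tables:

```c
/* Nanoseconds per element = cycles per element / clock rate in GHz. */
double ns_per_element(double cpe, double clock_ghz)
{
    return cpe / clock_ghz;
}
```

For integer multiply this gives 4.00 / 1.0 = 4 ns per element on the 1.0 GHz Pentium III and 14.00 / 2.0 = 7 ns per element on the 2.0 GHz Pentium 4, so the doubled clock does not make up for the higher latency.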


Machine-Dependent Opt. Summary

• Loop Unrolling

– Some compilers do this automatically

– Generally not as clever as what you can achieve by hand

• Exposing Instruction-Level Parallelism

– Very machine dependent

• Warning:

– Benefits depend heavily on particular machine

– Do only for performance-critical parts of code

– Best if performed by compiler

• But GCC on IA32/Linux is not very good


Conclusion

How should I write my programs, given that

I have a good, optimizing compiler?

• Don’t: Smash Code into Oblivion

– Hard to read, maintain, & assure correctness

• Do:

– Select best algorithm & data representation

– Write code that’s readable & maintainable

• Procedures, recursion, without built-in constant limits

• Even though these factors can slow down code

• Focus on Inner Loops

– Detailed optimization means detailed measurement


Assignment



• Practice Problems

– Practice Problem 5.5:
'What program variable has been spilled for combine6()?'

– Practice Problem 5.6:
'Indicate the CPE of different associated products.'



• Optimization Lab




