# Program Optimization: Measurement/Loops/Parallelism

```
Carnegie Mellon

Program Optimization:
Measurement/Loops/Parallelism
CSCI 2021: Computer Architecture and Organization

Chapter 5.6-5.9


Today
¢ Measure performance
¢ Loops
¢ Exploiting Instruction-Level Parallelism
¢ Exam

¢ Assignment 4 out later today
§ performance optimization


Last Time
¢ Program optimization
¢ Code motion
¢ Memory aliasing
¢ Procedure calls


Exploiting Instruction-Level Parallelism
¢ Hardware can execute multiple instructions in parallel
§ pipelining and multiple hardware units for execution
¢ Performance limited by data dependencies
¢ Simple transformations can yield dramatic performance improvements
§ Compilers often cannot make these transformations
§ Lack of associativity in floating-point arithmetic


Benchmark Example: Data Type for Vectors
/* data structure for vectors */
typedef struct {
    int len;
    double *data; // double data[MaxLen];
} vec;

[diagram: data points to an array of len elements, indices 0 … len-1]

/* retrieve vector element and store at val */
int get_vec_element(vec *v, int idx, double *val)
{
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}


Benchmark Computation
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT; // 0 or 1
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements.

¢ Data Types: use different declarations for data_t
§ int
§ float
§ double

¢ Operations: use different definitions of OP
§ +
§ *

Cycles Per Element (CPE)
¢ Convenient way to express performance of a program that operates on vectors or lists; O(n) doesn’t tell us enough
¢ In our case: CPE = cycles per OP (*dest = *dest OP val)
§ Sum of cycles in loop, divide by N
¢ T = CPE*n + Overhead
§ CPE is slope of line

[plot: cycles vs. number of elements n; prog1: slope = 4.0, prog2: slope = 3.5. Cycles are measured using special instructions.]
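/*
 * A minimal sketch (hypothetical cycle counts, not measured) of recovering
 * CPE and overhead from the model T = CPE*n + Overhead using two runs:
 */
#include <stdio.h>

int main(void) {
    double n1 = 100,  t1 = 420.0;    /* assumed: cycles for n = 100  */
    double n2 = 1000, t2 = 4020.0;   /* assumed: cycles for n = 1000 */
    double cpe = (t2 - t1) / (n2 - n1);   /* slope of the line      */
    double overhead = t1 - cpe * n1;      /* intercept              */
    printf("CPE = %.1f, overhead = %.1f cycles\n", cpe, overhead);
    return 0;
}
/* Prints: CPE = 4.0, overhead = 20.0 cycles -- matching prog1's slope. */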


Fundamental Limits
¢ Latency
§ how long something takes

¢ Issue time
§ how long to wait before issuing next operation
§ can be < 1 clock cycle due to parallelism

¢ Throughput = 1/issue time (ideal or max)

¢ If latency is 5 cycles, but issue time is 1 cycle, what does that tell us?
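/*
 * A sketch of what latency 5 / issue time 1 implies (hypothetical counts):
 * N dependent ops serialize (each waits on the previous result), while
 * N independent ops pipeline (one issued per cycle).
 */
#include <stdio.h>

int main(void) {
    long L = 5, N = 100;
    long dependent   = N * L;            /* chain: N * latency     */
    long independent = L + (N - 1) * 1;  /* pipeline hides latency */
    printf("dependent: %ld cycles, independent: %ld cycles\n",
           dependent, independent);
    return 0;
}
/* Prints: dependent: 500 cycles, independent: 104 cycles */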


Benchmark Performance
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements.
Why won’t the compiler move vec_length out of the loop?

Method                 Int +   Int *   FP +   FP *
Combine1 unoptimized   29.0    29.2    27.4   27.9
Combine1 -O1           12.0    12.0    12.0   13.0

(highly pipelined)

Basic Optimizations
void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

¢ Move vec_length out of loop
¢ Remove call to get_vec_element
¢ Avoid bounds check on each iteration (in get_vec_element)
¢ Accumulate in temporary

What does the temporary save?
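/*
 * One thing the temporary saves (a sketch, not from the slides): without
 * it, *dest is read and written every iteration, and if dest aliases the
 * vector's data the compiler cannot keep the sum in a register:
 */
#include <stdio.h>

int main(void) {
    int d[3] = {1, 2, 3};
    int *dest = &d[0];   /* dest aliases the first element */
    int i, t;

    *dest = 0;           /* clobbers d[0]! */
    for (i = 0; i < 3; i++)
        *dest = *dest + d[i];       /* combine1 style: sums 0+2+3 */
    printf("through *dest: %d\n", *dest);   /* 5 */

    d[0] = 1;            /* reset the clobbered element */
    t = 0;
    for (i = 0; i < 3; i++)
        t = t + d[i];               /* combine4 style: sums 1+2+3 */
    *dest = t;
    printf("with temporary: %d\n", *dest);  /* 6 */
    return 0;
}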

Effect of Basic Optimizations
void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);   // Drawback?
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

Method         Int +   Int *   FP +   FP *
Combine1 -O1   12.0    12.0    12.0   13.0
Combine4        2.0     3.0     3.0    5.0

¢ Eliminates sources of overhead in loop

Looking at Execution: Modern CPU Design
What does this suggest?

[diagram: modern CPU block diagram. Instruction Control: instruction cache, instruction decode into micro-operations, register file, retirement unit. Execution: multiple functional units (integer/general, FP) with internal pipelines, plus a data cache. Up to ~100 instructions can be “in flight” at once.]

Loop Unrolling
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

Compare with the original loop:
    for (i = 0; i < length; i++)
        t = t OP d[i];

¢ Benefit?
§ Perform 2x more useful work per iteration

Effect of Loop Unrolling
Method          Int +   Int *   FP +   FP *
Combine4        2.0     3.0     3.0    5.0
Unroll 2x       2.0     1.5     3.0    5.0
Latency Bound   1.0     3.0     3.0    5.0

¢ Helps integer multiply only
§ compiler does clever optimization (associativity)
¢ Others don’t improve. Why?
§ Still sequential dependency between iterations:
  x = (x OP d[i]) OP d[i+1];


Loop Unrolling with Reassociation
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x OP (d[i] OP d[i+1]);  // before: x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

¢ Can this change the result of the computation?
¢ Yes, for FP. Why?
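/*
 * Why reassociation can change FP results: IEEE double addition is not
 * associative once values of very different magnitude mix. For example:
 */
#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %.1f\n", (a + b) + c);  /* 1.0                */
    printf("a + (b + c) = %.1f\n", a + (b + c));  /* 0.0: b + c rounds  */
                                                  /* back to -1e16,     */
                                                  /* losing c entirely  */
    return 0;
}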

Effect of Reassociation
Method                  Int +   Int *   FP +   FP *
Combine4                2.0     3.0     3.0    5.0
Unroll 2x               2.0     1.5     3.0    5.0
Unroll 2x, reassociate  2.0     1.5     1.5    3.0
Latency Bound           1.0     3.0     3.0    5.0
Throughput Bound        1.0     1.0     1.0    1.0   (theoretical best)

¢ Nearly 2x speedup for FP +, FP *
§ Reason: Breaks sequential dependency
x = x OP (d[i] OP d[i+1]);

§ Why is that? (next slide)

Reassociated Computation
x = x OP (d[i] OP d[i+1]);

¢ What changed:
§ Ops in the next iteration can be started early (no dependency)

¢ Overall Performance
§ N elements, D cycles latency/op
§ Should be (N/2+1)*D cycles: CPE = D/2

[diagram: computation tree for the reassociated loop; each d[i] OP d[i+1] is off the critical path, leaving a chain of N/2 OPs into the accumulator]

Loop Unrolling with Separate Accumulators
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT; // 0 or 1
    data_t x1 = IDENT; // 0 or 1
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];     // “evens”
        x1 = x1 OP d[i+1];   // “odds”
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

¢ Different form of reassociation: actual parallelism

Effect of Separate Accumulators
Method                  Int +   Int *   FP +   FP *
Combine4                2.0     3.0     3.0    5.0
Unroll 2x               2.0     1.5     3.0    5.0
Unroll 2x, reassociate  2.0     1.5     1.5    3.0
Unroll 2x, parallel 2x  1.5     1.5     1.5    2.5
Latency Bound           1.0     3.0     3.0    5.0
Throughput Bound        1.0     1.0     1.0    1.0

¢ 2x speedup (over unroll2) FP +, FP *
§ Breaks sequential dependency in a “cleaner,” more obvious way
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];


Separate Accumulators
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];

¢ What changed:
§ Two independent “streams” of operations

¢ Overall Performance
§ N elements, D cycles latency/op
§ Should be (N/2+1)*D cycles: CPE = D/2
§ CPE matches prediction!

[diagram: two independent accumulation chains, one over the even elements and one over the odd elements, combined at the end]

Unrolling & Accumulating
¢ Idea
§ Can unroll to any degree L
§ Can expose more potential parallelism

¢ Limitations
§ Diminishing returns
§ Cannot go beyond throughput limitations of execution units
§ Short vector lengths (N < L)
§ Finish off iterations sequentially
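/*
 * Sketch of unrolling to degree L = 4 with 4 separate accumulators
 * (integer sum, so the reassociation is exact; names are illustrative):
 */
#include <stdio.h>

int main(void) {
    int d[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int length = 10, limit = length - 3;
    int x0 = 0, x1 = 0, x2 = 0, x3 = 0, i;

    for (i = 0; i < limit; i += 4) {  /* 4 independent streams    */
        x0 += d[i];
        x1 += d[i+1];
        x2 += d[i+2];
        x3 += d[i+3];
    }
    for (; i < length; i++)           /* finish remainder (N < L) */
        x0 += d[i];

    printf("sum = %d\n", (x0 + x1) + (x2 + x3));  /* 55 */
    return 0;
}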


Amdahl’s Law
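/*
 * The law itself (the slide's figure is not reproduced here): if a
 * fraction f of the run time is sped up by factor s, the overall
 * speedup is S = 1 / ((1 - f) + f/s). Hypothetical numbers below:
 * even with f = 0.9 and s = 10, S is only about 5.3, not 10.
 */
#include <stdio.h>

int main(void) {
    double f = 0.9, s = 10.0;
    double speedup = 1.0 / ((1.0 - f) + f / s);
    printf("overall speedup = %.2f\n", speedup);
    return 0;
}
/* Prints: overall speedup = 5.26 */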


The Exam
¢ Coverage
§ Chapter 3.7 through 4 (up to 4.5); does not include performance
optimization

§ Procedure calls; stack frames; stack/frame pointer; registers
§ know the code that must be generated to carry out a
procedure call including its return
§ be able to manipulate the stack and access variables
§ recursion

§ Arrays and structures
§ know what they are; understand C code; alignment issues
§ understand how they map to assembly (for simple structs and
1D arrays)

Exam
¢ Processor Architecture / ISA
§ X86/Y86 – we will give a cheat sheet; no need to memorize all the assembly
instructions; register layouts; definition of instructions
§ RISC/CISC
§ Know how to specify simple HCL; write simple logic gates
§ Be able to go from assembly instruction<->byte-level encodings;
basic C<->assembly

¢ Seq and Pipelined CPU
§   Hardware components: register file, ALU, etc.
§   Know instruction stages (F, D, E, M, W)
§   Know why pipelining improves over seq
§   Know about data dependencies and hazards
§   Know how to measure basic performance: latency, throughput


Composition
¢ Mix of short answer and work questions (multiple parts)
§ 20%, 80%
¢ Recitation will go over an old exam
¢ Hints:
§ Question about arrays and structs – know the assembly level
§ Question about mapping assembly back to C

¢ To study
§ Review notes, practice problems, homework questions
§ Refer back to things I *said* in class


Next Time
¢ Good luck on the exam
¢ No office hours on Friday (out of town, sorry)
¢ Have a great weekend!


```