Carnegie Mellon
Program Optimization: Measurement / Loops / Parallelism
CSCi 2021: Computer Architecture and Organization
Chapter 5.6-5.9

Today
- Measuring performance
- Loops
- Exploiting instruction-level parallelism
- Exam
- Assignment 4 out later today
  - performance optimization

Last Time
- Program optimization
- Code motion
- Memory aliasing
- Procedure calls

Exploiting Instruction-Level Parallelism
- Hardware can execute multiple instructions in parallel
  - pipelining and multiple hardware units for execution
- Performance is limited by data dependencies
- Simple transformations can yield dramatic performance improvements
  - Compilers often cannot make these transformations themselves
  - Lack of associativity in floating-point arithmetic

Benchmark Example: Data Type for Vectors

    /* data structure for vectors: len elements, indexed 0 .. len-1 */
    typedef struct {
        int len;
        double *data;    // or fixed size: double data[MaxLen];
    } vec;

    /* retrieve vector element and store it at val */
    int get_vec_element(vec *v, int idx, double *val)
    {
        if (idx < 0 || idx >= v->len)
            return 0;
        *val = v->data[idx];
        return 1;
    }

Benchmark Computation

Compute the sum or product of the vector elements:

    void combine1(vec_ptr v, data_t *dest)
    {
        long int i;
        *dest = IDENT;    // 0 for sum, 1 for product
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }

- Data types: use different declarations for data_t
  - int
  - float
  - double
- Operations: use different definitions of OP
  - +
  - *

Cycles Per Element (CPE)
- A convenient way to express the performance of a program that operates on vectors or lists; O(n) alone doesn't tell us enough
- In our case: CPE = cycles per OP (*dest = *dest OP val)
  - sum the cycles spent in the loop, divide by n
- T = CPE*n + Overhead
  - CPE is the slope of the line
- Cycles are measured using special instructions
[figure: cycles vs. number of elements n for two programs; prog1 has slope (CPE) 4.0, prog2 has slope 3.5]

Fundamental Limits
- Latency
  - how long one operation takes
- Issue time
  - how long to wait before issuing the next operation
  - can be < 1 clock cycle due to parallelism
- Throughput = 1 / issue time (ideal or maximum)
- If latency is 5 cycles but issue time is 1 cycle, what does that tell us?

Benchmark Performance

Compute the sum or product of the vector elements using combine1 (shown above). Why won't the compiler move vec_length out of the loop?

    Method           Integer         Double FP
                     Add    Mult     Add    Mult
    Combine1         29.0   29.2     27.4   27.9    (unoptimized)
    Combine1 -O1     12.0   12.0     12.0   13.0    (highly pipelined)

Basic Optimizations

    void combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *d = get_vec_start(v);
        data_t t = IDENT;
        for (i = 0; i < length; i++)
            t = t OP d[i];
        *dest = t;
    }

- Move vec_length out of the loop
- Remove the call to get_vec_element
- Avoid the bounds check on each iteration (in get_vec_element)
- Accumulate in a temporary
  - What does the temporary save?

Effect of Basic Optimizations

combine4 is shown above. Drawback?

    Method           Integer         Double FP
                     Add    Mult     Add    Mult
    Combine1 -O1     12.0   12.0     12.0   13.0
    Combine4          2.0    3.0      3.0    5.0

- Eliminates sources of overhead in the loop

Looking at Execution: Modern CPU Design

What does this suggest?

[figure: block diagram of a modern CPU. The Instruction Control Unit (fetch control, address, instruction cache, instruction decode, register file, retirement unit, branch prediction check) sends operations to the Execution unit and receives register updates back; the functional units (integer/branch, general integer, FP add, FP mult/div, load, store) exchange data and addresses with the data cache.]
- Arithmetic units have internal pipelines
- On the order of 100 instructions can be "in flight"
- Instructions are decoded into micro-operations

Loop Unrolling

    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            x = (x OP d[i]) OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }

- Benefit?
  - Performs 2x more useful work per iteration

Effect of Loop Unrolling

    Method           Integer         Double FP
                     Add    Mult     Add    Mult
    Combine4          2.0    3.0      3.0    5.0
    Unroll 2x         2.0    1.5      3.0    5.0
    Latency Bound     1.0    3.0      3.0    5.0

- Helps integer multiply only
  - the compiler does a clever optimization (associativity)
- The others don't improve. Why?
  - There is still a sequential dependency between iterations:
        x = (x OP d[i]) OP d[i+1];

Loop Unrolling with Reassociation

    void unroll2aa_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            x = x OP (d[i] OP d[i+1]);   // compare to before: x = (x OP d[i]) OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }

- Can this change the result of the computation?
- Yes, for FP. Why?

Effect of Reassociation

    Method                   Integer         Double FP
                             Add    Mult     Add    Mult
    Combine4                  2.0    3.0      3.0    5.0
    Unroll 2x                 2.0    1.5      3.0    5.0
    Unroll 2x, reassociate    2.0    1.5      1.5    3.0
    Latency Bound             1.0    3.0      3.0    5.0
    Throughput Bound          1.0    1.0      1.0    1.0    (theoretical best)

- Nearly 2x speedup for FP + and FP *
  - Reason: breaks the sequential dependency:
        x = x OP (d[i] OP d[i+1]);
  - Why is that?
Reassociated Computation

    x = x OP (d[i] OP d[i+1]);

- What changed:
  - Operations in the next iteration can be started early (no dependency)
- Overall performance
  - N elements, D cycles latency per operation
  - Should be (N/2 + 1)*D cycles: CPE = D/2

[figure: computation tree for the reassociated loop over d0..d7; each pair (d0,d1), (d2,d3), ... is combined first, and only the running combination with x forms the sequential chain]

Loop Unrolling with Separate Accumulators

    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 1;
        data_t *d = get_vec_start(v);
        data_t x0 = IDENT;    // 0 or 1
        data_t x1 = IDENT;    // 0 or 1
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            x0 = x0 OP d[i];      // "evens"
            x1 = x1 OP d[i+1];    // "odds"
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 = x0 OP d[i];
        }
        *dest = x0 OP x1;
    }

- A different form of reassociation: actual parallelism

Effect of Separate Accumulators

    Method                   Integer         Double FP
                             Add    Mult     Add    Mult
    Combine4                  2.0    3.0      3.0    5.0
    Unroll 2x                 2.0    1.5      3.0    5.0
    Unroll 2x, reassociate    2.0    1.5      1.5    3.0
    Unroll 2x, parallel 2x    1.5    1.5      1.5    2.5
    Latency Bound             1.0    3.0      3.0    5.0
    Throughput Bound          1.0    1.0      1.0    1.0

- 2x speedup (over Unroll 2x) for FP + and FP *
  - Breaks the sequential dependency in a "cleaner," more obvious way:
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];

Separate Accumulators

    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];

- What changed:
  - Two independent "streams" of operations
- Overall performance
  - N elements, D cycles latency per operation
  - Should be (N/2 + 1)*D cycles: CPE = D/2
  - CPE matches the prediction!

[figure: two independent computation trees over d0..d7, one accumulating the even-index elements into x0 and one the odd-index elements into x1, joined by a final OP]
Unrolling & Accumulating
- Idea
  - Can unroll to any degree L
  - Can expose more potential parallelism
- Limitations
  - Diminishing returns
  - Cannot go beyond the throughput limitations of the execution units
  - Short vector lengths (N < L)
    - finish off iterations sequentially

Amdahl's Law

The Exam
- Coverage
  - Chapter 3.7 through Chapter 4 (up to 4.5); does not include performance optimization
  - Procedure calls; stack frames; stack/frame pointer; registers
    - know the code that must be generated to carry out a procedure call, including its return
    - be able to manipulate the stack and access variables
    - recursion
  - Arrays and structures
    - know what they are; understand C code; alignment issues
    - understand how they map to assembly (for simple structs and 1D arrays)

Exam
- Processor architecture / ISA
  - x86/Y86: we will give a cheat sheet; no need to memorize all the assembly instructions; register layouts; definitions of instructions
  - RISC/CISC
  - Know how to specify simple HCL; write simple logic gates
  - Be able to go from assembly instruction <-> byte-level encoding; basic C <-> assembly
- SEQ and pipelined CPU
  - Hardware components: register file, ALU, etc.
  - Know the instruction stages (F, D, E, M, W)
  - Know why pipelining improves over SEQ
  - Know about data dependencies and hazards
  - Know how to measure basic performance: latency, throughput

Composition
- Mix of short-answer and work questions (multiple parts)
  - 20%, 80%
- Recitation will go over an old exam
- Hints:
  - Question about arrays and structs: know the assembly level
  - Question about SEQ/PIPE
  - Question about mapping assembly back to C
- To study
  - Review notes, practice problems, homework questions
  - Refer back to things I *said* in class

Next Time
- Good luck on the exam
- No office hours on Friday (out of town, sorry)
- Have a great weekend!