					                                                    Carnegie Mellon




Program Optimization:
Measurement/Loops/Parallelism
CSCi 2021: Computer Architecture and Organization


    Chapter 5.6-5.9








Today
¢ Measure performance
¢ Loops
¢ Exploiting Instruction-Level Parallelism
¢ Exam



¢ Assignment 4 out later today
   § performance optimization








Last Time
¢ Program optimization
¢ Code motion
¢ Memory aliasing
¢ Procedure calls








Exploiting Instruction-Level Parallelism
¢ Hardware can execute multiple instructions in parallel
   § pipelining and multiple hardware units for execution
¢ Performance limited by data dependencies
¢ Simple transformations can yield dramatic performance
  improvements
   § Compilers often cannot make these transformations
   § Lack of associativity in floating-point arithmetic







Benchmark Example: Data Type for Vectors
/* data structure for vectors */
typedef struct{
    int len;
    double *data; // double data[MaxLen];
} vec;

                 (diagram: data points to an array of len elements, indexed 0 … len-1)

/* retrieve vector element and store at val */
int get_vec_element(vec *v, int idx, double *val)
{
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}







Benchmark Computation
 /* Compute sum or product of vector elements */
 void combine1(vec_ptr v, data_t *dest)
 {
     long int i;
     *dest = IDENT; // 0 or 1
     for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
     }
 }

 ¢ Data Types
    § Use different declarations for data_t: int, float, double
 ¢ Operations
    § Use different definitions of OP: + and *



 Cycles Per Element (CPE)
¢ Convenient way to express the performance of a program that operates on
  vectors or lists; O(n) doesn’t tell us enough
¢ In our case: CPE = cycles per OP (*dest = *dest OP val)
    § sum the cycles spent in the loop, divide by n
¢ T = CPE*n + Overhead
    § CPE is the slope of the line


                                 (plot: cycles vs. number of elements n;
                                  prog1: slope = 4.0, prog2: slope = 3.5;
                                  cycles measured using special instructions)




Fundamental Limits
¢ Latency
   § how long something takes


¢ Issue time
   § how long to wait before issuing next operation
   § can be < 1 clock cycle due to parallelism


¢ Throughput = 1/issue time (ideal or max)

¢ If latency is 5 cycles, but tput is 1 cycle, what does that
  tell us?






Benchmark Performance
  /* Compute sum or product of vector elements */
  void combine1(vec_ptr v, data_t *dest)
  {
      long int i;
      *dest = IDENT;
      for (i = 0; i < vec_length(v); i++) {
         data_t val;
         get_vec_element(v, i, &val);
         *dest = *dest OP val;
      }
  }

  Why won’t the compiler move vec_length out of the loop?

Method                   Integer            Double FP
Operation                Add       Mult     Add       Mult
Combine1 (unoptimized)   29.0      29.2     27.4      27.9
Combine1 –O1             12.0      12.0     12.0      13.0

                              (multipliers are highly pipelined)




Basic Optimizations
      void combine4(vec_ptr v, data_t *dest)
      {
        int i;
        int length = vec_length(v);
        data_t *d = get_vec_start(v);
        data_t t = IDENT;
        for (i = 0; i < length; i++)
          t = t OP d[i];
        *dest = t;
      }


¢ Move vec_length out of loop
¢ Remove call to get_vec_element
¢ Avoid bounds check on each iteration (in get_vec_element)
¢ Accumulate in temporary
   What does the temporary save?




Effect of Basic Optimizations
         void combine4(vec_ptr v, data_t *dest)
         {
           int i;
           int length = vec_length(v);
           data_t *d = get_vec_start(v);
           data_t t = IDENT;
           for (i = 0; i < length; i++)
             t = t OP d[i];
           *dest = t;
         }

         Drawback?

Method           Integer          Double FP
Operation        Add      Mult    Add      Mult
Combine1 –O1     12.0     12.0    12.0     13.0
Combine4          2.0      3.0     3.0      5.0

¢ Eliminates sources of overhead in loop


Modern CPU Design
Looking at Execution: What does this suggest?

(block diagram of a superscalar CPU)
   Instruction Control: Fetch Control sends addresses to the Instruction
   Cache; Instruction Decode turns instructions into (micro-)operations;
   the Retirement Unit and Register File commit results; branch outcomes
   are checked (“Prediction OK?”) and register updates fed back
   Execution: functional units – Integer/Branch, General Integer, FP Add,
   FP Mult/Div, Load, Store – consume operations and produce results;
   Load/Store compute addresses and access the Data Cache
   Arithmetic units have internal pipelines; ~100 instructions “in flight”



 Loop Unrolling
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

Compare with the original loop:
    for (i = 0; i < length; i++)
        t = t OP d[i];


 ¢ Benefit?
     § Perform 2x more useful work per iteration




Effect of Loop Unrolling
         Method          Integer          Double FP
         Operation       Add      Mult    Add      Mult
         Combine4        2.0      3.0     3.0      5.0
         Unroll 2x       2.0      1.5     3.0      5.0
         Latency Bound   1.0      3.0     3.0      5.0


¢ Helps integer multiply only
   § compiler does clever optimization (associativity)
¢ Others don’t improve. Why?
   § Still a sequential dependency between iterations:
     x = (x OP d[i]) OP d[i+1];






Loop Unrolling with Reassociation
  void unroll2aa_combine(vec_ptr v, data_t *dest)
  {
      int length = vec_length(v);
      int limit = length-1;
      data_t *d = get_vec_start(v);
      data_t x = IDENT;
      int i;
      /* Combine 2 elements at a time */
      for (i = 0; i < limit; i+=2) {
          x = x OP (d[i] OP d[i+1]);
      }
      /* Finish any remaining elements */
      for (; i < length; i++) {
          x = x OP d[i];
      }
      *dest = x;
  }

  Compare to before:  x = (x OP d[i]) OP d[i+1];


¢ Can this change the result of the computation?
¢ Yes, for FP. Why?




  Effect of Reassociation
              Method                  Integer          Double FP
              Operation               Add      Mult    Add      Mult
              Combine4                2.0      3.0     3.0      5.0
              Unroll 2x               2.0      1.5     3.0      5.0
              Unroll 2x, reassociate  2.0      1.5     1.5      3.0
              Latency Bound           1.0      3.0     3.0      5.0
              Throughput Bound        1.0      1.0     1.0      1.0   (theoretical best)

 ¢ Nearly 2x speedup for FP +, FP *
      § Reason: Breaks sequential dependency
          x = x OP (d[i] OP d[i+1]);

      § Why is that? (next slide)




Reassociated Computation
x = x OP (d[i] OP d[i+1]);

¢ What changed:
   § Ops in the next iteration can be started early (no dependency)

¢ Overall Performance
   § N elements, D cycles latency/op
   § Should be (N/2 + 1)*D cycles: CPE = D/2

(diagram: dependency tree – each pair d[i] OP d[i+1] is computed off
the critical path; only the N/2 combining OPs into x form a chain)



Loop Unrolling with Separate Accumulators
   void unroll2a_combine(vec_ptr v, data_t *dest)
   {
       int length = vec_length(v);
       int limit = length-1;
       data_t *d = get_vec_start(v);
       data_t x0 = IDENT; // 0 or 1
       data_t x1 = IDENT; // 0 or 1
       int i;
       /* Combine 2 elements at a time */
       for (i = 0; i < limit; i+=2) {
          x0 = x0 OP d[i];     // “evens”
          x1 = x1 OP d[i+1];   // “odds”
       }
       /* Finish any remaining elements */
       for (; i < length; i++) {
           x0 = x0 OP d[i];
       }
       *dest = x0 OP x1;
   }

¢ Different form of reassociation: actual parallelism




Effect of Separate Accumulators
Method                  Integer          Double FP
Operation               Add      Mult    Add      Mult
Combine4                2.0      3.0     3.0      5.0
Unroll 2x               2.0      1.5     3.0      5.0
Unroll 2x, reassociate  2.0      1.5     1.5      3.0
Unroll 2x, parallel 2x  1.5      1.5     1.5      2.5
Latency Bound           1.0      3.0     3.0      5.0
Throughput Bound        1.0      1.0     1.0      1.0

¢ 2x speedup (over unroll2) FP +, FP *
    § Breaks sequential dependency in a “cleaner,” more obvious way
         x0 = x0 OP d[i];
         x1 = x1 OP d[i+1];






Separate Accumulators
 x0 = x0 OP d[i];
 x1 = x1 OP d[i+1];

¢ What changed:
   § Two independent “streams” of operations

¢ Overall Performance
   § N elements, D cycles latency/op
   § Should be (N/2 + 1)*D cycles: CPE = D/2
   § CPE matches prediction!

(diagram: two independent dependency chains – even elements
accumulate into x0, odd elements into x1, combined at the end)




Unrolling & Accumulating
¢ Idea
   § Can unroll to any degree L
   § Can expose more potential parallelism


¢ Limitations
   § Diminishing returns
      § Cannot go beyond throughput limitations of execution units
   § Short vector lengths (N < L)
      § Finish off iterations sequentially








Amdahl’s Law
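Amdahl’s Law bounds the overall speedup from optimizing only part of a program; a sketch of the standard statement, with f the fraction of original run time affected and s the speedup of that fraction:

```latex
S = \frac{T_{\text{old}}}{T_{\text{new}}}
  = \frac{1}{(1-f) + f/s},
\qquad
S < \frac{1}{1-f} \quad \text{even as } s \to \infty
```

For example, speeding up code that accounts for half the run time (f = 0.5) by any factor yields an overall speedup of less than 2x.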








The Exam
¢ Coverage
   § Chapter 3.7 through 4 (up to 4.5); does not include performance
     optimization

   § Procedure calls; stack frames; stack/frame pointer; registers
      § know the code that must be generated to carry out a
        procedure call including its return
      § be able to manipulate the stack and access variables
      § recursion

   § Arrays and structures
      § know what they are; understand C code; alignment issues
      § understand how they map to assembly (for simple structs and
         1D arrays)


 Exam
¢ Processor Architecture ISA
   § X86/Y86 – we will give a cheat sheet; no need to memorize all the assembly
     instructions; register layouts; definition of instructions
   § RISC/CISC
   § Know how to specify simple HCL; write simple logic gates
   § Be able to go from assembly instruction <-> byte-level encodings;
     basic C <-> assembly


¢ SEQ and Pipelined CPU
   §   Hardware components: register file, ALU, etc.
   §   Know instruction stages (F, D, E, M, W)
   §   Know why pipelining improves over SEQ
   §   Know about data dependencies and hazards
   §   Know how to measure basic performance: latency, throughput





Composition
¢ Mix of short-answer and work questions (multiple parts)
   § weighted 20% and 80%, respectively
¢ Recitation will go over an old exam
¢ Hints:
   § Question about arrays and structs – know the assembly level
   § Question about SEQ/PIPE
   § Question about mapping assembly back to C


¢ To study
   § Review notes, practice problems, homework questions
   § Refer back to things I *said* in class







Next Time
¢ Good luck on the exam
¢ No office hours on Friday (out of town, sorry)
¢ Have a great weekend!





				