Analytical models and intelligent search
for program generation and optimization

                   David Padua
       Department of Computer Science
   University of Illinois at Urbana-Champaign


                                1
Program optimization today

 The optimization phase of a compiler applies a
  series of transformations to achieve its
  objectives.
 The compiler uses program analysis to
  determine which transformations are
  correctness-preserving.
 Compiler transformation and analysis
  techniques are reasonably well-understood.
 Since many of the compiler optimization
  problems have “exponential complexity”,
  heuristics are needed to drive the application
  of transformations.
                                                   2
Optimization drivers

 Developing driving heuristics is
  laborious.
 One reason for this is the lack of
  methodologies and tools to build
  optimization drivers.
 As a result, although there is much
  in common among compilers, their
  optimization phases are usually re-
  implemented from scratch.
                                        3
Optimization drivers (Cont.)

  As a result, machines and languages that are not
   widely popular usually lack good compilers
   (and so do some popular systems).
   – DSP, network-processor, and embedded-system
     programming is often done in assembly language.
   – Evaluation of new architectural features that
     require compiler involvement is not always
     meaningful.
   – Languages such as APL, MATLAB, LISP, … suffer
     from chronically low performance.
   – New languages are difficult to introduce (although
     compilers are only a part of the problem).


                                                       4
A methodology based on the notion
of search space
 Program transformations often have several
  possible target versions.
  –   Loop unrolling: how many times to unroll.
  –   Loop tiling: the size of the tile.
  –   Loop interchange: the order of the loop headers.
  –   Register allocation: which registers are spilled to
      memory to make room for new values.
 The process of optimization can be seen as a
  search in the space of possible program
  versions.


                                                           5
Empirical search
Iterative compilation

 Perhaps the simplest application of the search-space model is
  empirical search, where several versions are generated and
  executed on the target machine. The fastest version is selected.


  T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff.
  Iterative compilation in program optimization. In Proc. CPC2000,
  pages 35-44, 2000.                                                       6
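As an illustration, a minimal sketch of such an empirical-search driver is shown below. The candidate parameter values, file names, build command, and timing output format are hypothetical placeholders for illustration only; they are not taken from ATLAS or from the cited work.

  /* Hypothetical empirical-search driver: build each candidate version,
   * run it, and keep the fastest.  kernel.c, timer.c, and the -DUNROLL
   * convention are assumed for this sketch. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      const int unroll_factors[] = {1, 2, 4, 8, 16};   /* candidate versions */
      int best = -1;
      double best_time = 1e30;

      for (int i = 0; i < 5; i++) {
          char cmd[256];
          /* generate and compile one version */
          snprintf(cmd, sizeof cmd,
                   "cc -O2 -DUNROLL=%d kernel.c timer.c -o version_%d",
                   unroll_factors[i], i);
          if (system(cmd) != 0) continue;

          /* run it; assume the binary prints its execution time in seconds */
          snprintf(cmd, sizeof cmd, "./version_%d > time_%d.txt", i, i);
          if (system(cmd) != 0) continue;

          snprintf(cmd, sizeof cmd, "time_%d.txt", i);
          FILE *f = fopen(cmd, "r");
          double t;
          if (f && fscanf(f, "%lf", &t) == 1 && t < best_time) {
              best_time = t;
              best = unroll_factors[i];
          }
          if (f) fclose(f);
      }
      printf("best unroll factor: %d (%.6f s)\n", best, best_time);
      return 0;
  }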
Empirical search and traditional
compilers
 Searching is not a new approach and
  compilers have applied it in the past,
  but using architectural prediction
  models instead of actual runs:
  – KAP searched for best loop header
    order
  – SGI’s MIPS-pro and IBM PowerPC
    compilers select the best degree of
    unrolling.

                                          7
Limitations of empirical search

 Empirical search is conceptually simple and
  portable.
 However,
  – the search space tends to be too large, especially
    when several transformations are combined.
  – It is not clear how to apply this method when
    program behavior is a function of the input data set.
 Need heuristics/search strategies.
 Availability of performance “formulas” could
  help evaluate transformations across input
  data sets and facilitate search.

                                                       8
Program/library generators

 An on-going effort at Illinois focuses on
  program generators.
 The objectives are:
  – To develop better program generators.
  – To improve our understanding of the
    optimization process without the need to
    worry about program analysis.



                                               9
Compilers and Library Generators

  [Diagram: Algorithm, Internal representation, and Source Program,
   linked by a Program Generation step and a Program Transformation step.]

                                                  10
Empirical search in program/library
generators


 Examples:
  – FFTW [M. Frigo, S. Johnson]
  – Spiral (FFT/signal processing) [J. Moura
    (CMU), M. Veloso (CMU), J. Johnson
    (Drexel)]
  – ATLAS (linear algebra)[R. Whaley, A.
    Petitet, J. Dongarra]
  – PHiPAC[J. Demmel et al]
  – Sorting [X. Li, M. Garzaran (Illinois)]
                                               11
Techniques presented in the rest of
the talk

 Analytical models (ATLAS)
  – Pure
  – Combined with search
 Pure search strategies
  – Data independent performance (Spiral)
  – Data dependent performance (Sorting)



                                       12
I. Analytical models and ATLAS



  Joint work with G. DeJong (Illinois),
  M. Garzaran, and K. Pingali (Cornell)



                           13
ATLAS

 A Linear Algebra Library Generator.
  ATLAS = Automatically Tuned Linear Algebra
  Software
 At installation time, searches for the best
  parameters of a Matrix-Matrix Multiplication
  routine.
 We studied ATLAS and modified the system
  to replace the search with an analytical model
  that identifies the best MMM parameters
  without the need for search.


                                               14
The modified version of ATLAS

 Original ATLAS Infrastructure

   Detect Hardware Parameters (L1Size, NR, MulAdd, Latency)
     -> ATLAS Search Engine (MMSearch)
     -> NB, MU, NU, KU, xFetch, MulAdd, Latency
     -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source
     -> Compile, Execute, Measure -> MFLOPS (fed back to the search engine)

 Model-Based ATLAS Infrastructure

   Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, Latency)
     -> Model
     -> NB, MU, NU, KU, xFetch, MulAdd, Latency
     -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source
     (no measurement feedback loop)

                                                                                  15
Detecting Machine Parameters

 Micro-benchmarks
  – L1Size: L1 Data Cache size
     • Similar to Hennessy-Patterson book
  – NR: Number of registers
     • Use several FP temporaries repeatedly
  – MulAdd: Fused Multiply Add (FMA)
     • “c+=a*b” as one instruction, as opposed to “t=a*b; c+=t”
  – Latency: Latency of FP Multiplication
     • Needed for scheduling multiplies and adds in the
       absence of FMA

                                                      16
Compiler View

 ATLAS Code Generation

   Detect Hardware Parameters (L1Size, NR, MulAdd, Latency)
     -> ATLAS Search Engine (MMSearch)
     -> NB, MU, NU, KU, xFetch, MulAdd, Latency
     -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source
     -> Compile, Execute, Measure -> MFLOPS

 Focus on MMM (as part of BLAS-3)
  – Very good reuse: O(N²) data, O(N³) computation
  – Many optimization opportunities
     • Few “real” dependencies
  – Will run poorly on modern machines
     • Poor use of cache and registers
     • Poor use of processor pipelines

     for (int j = 0; j < N; j++)
       for (int i = 0; i < N; i++)
         for (int k = 0; k < N; k++)
           C[i][j] += A[i][k] * B[k][j];

                                                                                  17
Characteristics of the code as
generated by ATLAS
 Cache-level blocking (tiling)
  – Atlas blocks only for L1 cache
 Register-level blocking
  – Highest level of memory hierarchy
  – Important to hold array values in registers
 Software pipelining
  – Unroll and schedule operations
 Versioning
  – Dynamically decide which way to compute
 Back-end compiler optimizations
  – Scalar Optimizations
  – Instruction Scheduling



                                                  18
Cache Tiling for MMM

Tiling in ATLAS:
 Only square tiles: NB x NB x NB
 Working set of a tile must fit in the L1 cache
 Tiles are usually copied first into a contiguous buffer
 Special clean-up code is generated for the boundaries

Mini-MMM:
 for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
       for (int k = 0; k < NB; k++)
         C[i][j] += A[i][k] * B[k][j];

 [Figure: the NB x NB tiles of A, B, and C involved in one mini-MMM.]

   Optimization parameter: NB
                                                                        19
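To make the tiling concrete, here is a minimal sketch (not ATLAS's generated code) of a full MMM built from square NB x NB mini-MMMs. For simplicity it assumes N is a multiple of NB and omits the tile copying and boundary clean-up code mentioned above.

  /* Sketch: blocked MMM built from NB x NB mini-MMMs.
   * Assumes N is a multiple of NB; real ATLAS code also copies tiles
   * into contiguous buffers and generates clean-up code for boundaries. */
  #define N   1024
  #define NB  64

  static double A[N][N], B[N][N], C[N][N];

  static void mini_mmm(int jb, int ib, int kb) {
      /* multiply one NB x NB tile of A by one tile of B into a tile of C */
      for (int j = jb; j < jb + NB; j++)
          for (int i = ib; i < ib + NB; i++)
              for (int k = kb; k < kb + NB; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  void mmm(void) {
      for (int jb = 0; jb < N; jb += NB)
          for (int ib = 0; ib < N; ib += NB)
              for (int kb = 0; kb < N; kb += NB)
                  mini_mmm(jb, ib, kb);
  }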
  IJK version (large cache)

 DO I = 1, N      // row-major storage
  DO J = 1, N
   DO K = 1, N
     C(I,J) = C(I,J) + A(I,K)*B(K,J)

 Large cache scenario:
    – Matrices are small enough to fit into the cache
    – Only cold misses, no capacity misses
    – Miss ratio:
       • Data size = 3N²
       • Each miss brings in b floating-point numbers
       • Miss ratio = (3N²/b) / (4N³) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)


                                                                        20
  IJK version (small cache)

 DO I = 1, N
  DO J = 1, N
   DO K = 1, N
     C(I,J) = C(I,J) + A(I,K)*B(K,J)

 Small cache scenario:
    – Matrices are large compared to the cache
       • reuse distance is not O(1) => miss
    – Cold and capacity misses
    – Miss ratio:
       • C: N²/b misses (good temporal locality)
       • A: N³/b misses (good spatial locality)
       • B: N³ misses (poor temporal and spatial locality)
       • Miss ratio ≈ 0.25 (b+1)/b = 0.3125 (for b = 4)


                                                                      21
     Register tiling for Mini-MMM

Micro-MMM:
    MU x 1 sub-matrix of A
    1 x NU sub-matrix of B
    MU x NU sub-matrix of C
    MU*NU + MU + NU <= NR

 [Figure: the MU x 1 strip of A, 1 x NU strip of B, and MU x NU block of C
  involved in one micro-MMM inside an NB x NB tile.]

Mini-MMM code after register tiling and unrolling:
 for (int j = 0; j < NB; j += NU)
   for (int i = 0; i < NB; i += MU)
     load C[i..i+MU-1, j..j+NU-1] into registers
     for (int k = 0; k < NB; k++)
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A’s and B’s and add to C’s
     store C[i..i+MU-1, j..j+NU-1]

      Unroll the k loop KU times
      Optimization parameters: MU, NU, KU
                                                                             22
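As an illustration, here is a hedged sketch of the register-tiled inner kernel for the specific case MU = 2, NU = 2 (so MU*NU + MU + NU = 8 <= NR on most machines). It is hand-written C for row-major arrays, not the code ATLAS actually emits.

  /* Sketch of one register-tiled micro-MMM with MU = NU = 2.
   * c00..c11 hold a 2x2 tile of C in registers across the whole k loop. */
  void micro_mmm_2x2(int NB, const double *A, const double *B, double *C,
                     int i, int j) {
      double c00 = C[(i+0)*NB + j+0], c01 = C[(i+0)*NB + j+1];
      double c10 = C[(i+1)*NB + j+0], c11 = C[(i+1)*NB + j+1];

      for (int k = 0; k < NB; k++) {
          double a0 = A[(i+0)*NB + k];      /* MU x 1 column of A  */
          double a1 = A[(i+1)*NB + k];
          double b0 = B[k*NB + j+0];        /* 1 x NU row of B     */
          double b1 = B[k*NB + j+1];
          c00 += a0 * b0;  c01 += a0 * b1;  /* MU*NU multiply-adds */
          c10 += a1 * b0;  c11 += a1 * b1;
      }
      C[(i+0)*NB + j+0] = c00;  C[(i+0)*NB + j+1] = c01;
      C[(i+1)*NB + j+0] = c10;  C[(i+1)*NB + j+1] = c11;
  }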
     Scheduling
           ……
           for (int k = 0; k < NB; k += KU)
                   load A[i..i+MU-1, k] into registers
                   load B[k, j..j+NU-1] into registers      (one micro-MMM)
                   multiply A’s and B’s and add to C’s
                   ………
    If the processor has a combined multiply-add, use it
    Otherwise, schedule the multiplies and adds separately
     –   interleave the multiplies M1 M2 … M(MU*NU) with the additions
         A1 A2 … A(MU*NU), skewing the additions by Latency
    Schedule IFetch initial loads for one micro-MMM at the end of the
     previous micro-MMM
    Schedule the remaining loads for a micro-MMM in blocks of NFetch
     –   the memory pipeline can support only a small number of outstanding loads


     Optimization parameters: MulAdd, Latency, xFetch


                                                                                          23
 High-level picture
 Multi-dimensional optimization problem:
  – Independent parameters: NB,MU,NU,KU,…
  – Dependent variable: MFlops
  – Function from parameters to variables is given
    implicitly; can be evaluated repeatedly
 One optimization strategy: orthogonal range
  search
  – Optimize along one dimension at a time, using
    reference values for parameters not yet optimized
  – Not guaranteed to find optimal point, but might
    come close


                                                     24
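A minimal sketch of such an orthogonal (one-dimension-at-a-time) search is shown below. The parameter array, the search intervals, and the measure() routine are hypothetical stand-ins, not ATLAS's actual search engine.

  /* Sketch of orthogonal search: optimize one dimension at a time,
   * holding the others at their reference (or already optimized) values.
   * measure() is assumed to build and time a mini-MMM for the given
   * parameter vector and return MFLOPS; it is a placeholder here. */
  extern double measure(const int params[], int nparams);

  void orthogonal_search(int params[], int nparams,
                         const int lo[], const int hi[]) {
      /* params[] initially holds the reference value of every dimension */
      for (int d = 0; d < nparams; d++) {
          int best_val = params[d];
          double best_perf = -1.0;
          for (int v = lo[d]; v <= hi[d]; v++) {   /* range search on dimension d */
              params[d] = v;
              double perf = measure(params, nparams);
              if (perf > best_perf) { best_perf = perf; best_val = v; }
          }
          params[d] = best_val;                    /* freeze before moving on */
      }
  }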
Specification of OR Search


  Order in which dimensions are optimized
  Reference values for un-optimized
   dimensions at any step
  Interval in which range search is done for
   each dimension




                                                25
Search strategy


 1.   Find Best NB
 2.   Find Best MU & NU
 3.   Find Best KU
 4.   Find Best xFetch
 5.   Find Best Latency (lat)
 6.   Find non-copy version tile size (NCNB)



                                               26
Find Best NB

 Search in the following range
  – 16 <= NB <= 80
  – NB² <= L1Size
 In this search, use simple estimates for the
  other parameters
  – (e.g.) KU: test each candidate NB with
     • Full K unrolling (KU = NB)
     • No K unrolling (KU = 1)


                                         27
 Finding other parameters

 Find best MU, NU: try all MU & NU that satisfy
              1 <= MU, NU <= NB
              MU*NU + MU + NU <= NR
    – In this step, use the best NB from the previous step

 Find best KU

 Find best Latency
             values between 1 and 6
 Find best xFetch
              IFetch in [2, MU+NU], NFetch in [1, MU+NU-IFetch]



                                                           28
Model-based estimation of
optimization parameter values

 Search-based (original ATLAS):
   Detect Hardware Parameters (L1Size, NR, MulAdd, Latency)
     -> ATLAS Search Engine (MMSearch)
     -> NB, MU, NU, KU, Latency, xFetch, MulAdd
     -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source
     -> Execute & Measure -> MFLOPS (fed back to the search engine)

 Model-based:
   Detect Hardware Parameters (L1Size, L1 I-Cache size, NR, MulAdd, Latency)
     -> Model Parameter Estimator (MMModel)
     -> NB, MU, NU, KU, Latency, xFetch, MulAdd
     -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source

                                                                                     29
 High-level picture
 NB: hierarchy of models
   – Find largest NB for which there are no capacity or
     conflict misses
   – Find largest NB for which there are no capacity misses,
     assuming optimal replacement
   – Find largest NB for which there are no capacity misses,
     assuming LRU replacement
 MU, NU: estimate from the number of registers,
  making them roughly equal
              MU*NU + MU + NU <= NR
 KU: maximize subject to I-cache size
 Latency: from the hardware parameter
 xFetch: set to 2

                                                               30
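Below is a rough sketch of what such a model-based estimator might look like. The exact formulas ATLAS's model uses are more refined (see the following slides), and the per-iteration code-size constant and parameter names here are placeholders.

  /* Rough sketch of model-based parameter estimation (not the actual
   * MMModel code).  Inputs are the measured machine parameters. */
  #include <math.h>

  typedef struct { int NB, MU, NU, KU, Latency, xFetch; } MMMParams;

  MMMParams estimate(int l1_size_bytes, int l1_icache_bytes,
                     int NR, int fp_mul_latency) {
      MMMParams p;

      /* Simplest cache model: three NB x NB tiles of doubles fit in L1 */
      int elems = l1_size_bytes / sizeof(double);
      p.NB = (int)floor(sqrt(elems / 3.0));

      /* Registers: MU*NU + MU + NU <= NR, with MU and NU roughly equal */
      p.MU = 1;
      while ((p.MU + 1) * (p.MU + 1) + 2 * (p.MU + 1) <= NR) p.MU++;
      p.NU = p.MU;

      /* KU: unroll as far as the I-cache allows (placeholder estimate:
       * assume ~64 bytes of code per unrolled iteration) */
      p.KU = l1_icache_bytes / 64;
      if (p.KU > p.NB) p.KU = p.NB;

      p.Latency = fp_mul_latency;   /* straight from the hardware parameter */
      p.xFetch = 2;
      return p;
  }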
 Largest NB for no capacity/conflict misses

 Tiles are copied into contiguous memory
 Condition for cold misses only:
   – 3*NB² <= L1Size

 [Figure: the three NB x NB tiles of A, B, and C that must fit together
  in the L1 cache.]

                                                               31
Largest NB for no capacity misses

 MMM:
 for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        c[i][j] += a[i][k] * b[k][j];

 Cache model:
   – Fully associative
   – Line size of 1 word
   – Optimal replacement
 Bottom line:
  NB² + NB + 1 <= C
   – One full matrix
   – One row / column
   – One element

                                                                    32
Extending the Model

 Line size B > 1
  – Spatial locality
  – Array layout in memory matters
 Bottom line: depending on the loop order, either

     ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 <= C/B

  or

     ⌈NB²/B⌉ + NB + 1 <= C/B

                                          33
Extending the Model (cont.)
 LRU (not optimal replacement)
 MMM sample:
 for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        c[i][j] += a[i][k] * b[k][j];
 Bottom line (depending on loop order and layout):

     ⌈NB²/B⌉ + 3·NB + 1 <= C/B
     ⌈NB²/B⌉ + 3·⌈NB/B⌉ + 1 <= C/B
     ⌈NB²/B⌉ + 2·NB + ⌈NB/B⌉ + 1 <= C/B
     ⌈NB²/B⌉ + 2·⌈NB/B⌉ + NB + 1 <= C/B

 [Figure: the rows of A, the column of B, and the elements of C that must
  stay resident in the cache under LRU while one column of C is computed.]

                                                                                                 34
Experiments

 Architectures:
  – SGI R12K, 270MHz
  – Sun UltraSparcIII, 900MHz
  – Intel PIII, 550MHz
 Measure
  – Mini-MMM performance
  – Complete MMM performance
  – Sensitivity to variations on parameters




                                              35
Installation time of ATLAS and Model

 [Bar chart: installation time in seconds (0–10000) on the SGI, Sun, and
  Intel machines for ATLAS and for the model-based version, broken down
  into Detect Machine Parameters, Optimize MMM, Generate Final Code, and
  Build Library.]

                                                                                                  36
 Parameter values

ATLAS
  Architecture  Tile Size (Copy / Non-Copy)  Unroll (MU / NU / KU)  Fetch (F / I / N)  L
  SGI           64 / 64                      4 / 4 / 64             0 / 5 / 1          3
  Sun           48 / 48                      5 / 3 / 48             0 / 3 / 5          5
  Intel         40 / 40                      2 / 1 / 40             0 / 3 / 1          4

Model
  Architecture  Tile Size (Copy / Non-Copy)  Unroll (MU / NU / KU)  Fetch (F / I / N)  L
  SGI           62 / 45                      4 / 4 / 62             1 / 2 / 2          6
  Sun           88 / 78                      4 / 4 / 78             1 / 2 / 2          4
  Intel         42 / 39                      2 / 1 / 42             1 / 2 / 2          3
                                                          37
Mini-MMM Performance

    Architecture ATLAS Model Difference
                (MFLOPS) (MFLOPS)   (%)

    SGI          457       453      1
    Sun          1287     1052      20
    Intel        394       384      2




                                          38
SGI Performance

 [Plot: MMM performance in MFLOPS (0–600) versus matrix size (0–5000) on
  the SGI, for F77, ATLAS, Model, and vendor BLAS.]

                                                                39
 [Plots (SGI, continued): MMM performance in MFLOPS and TLB misses in
  billions versus matrix size (0–5000) for F77, ATLAS, Model, and BLAS.]

TLB effects are important when the matrix size is large.
                                               40
Sun Performance




                  41
Pentium III Performance

 [Plot: MMM performance in MFLOPS (0–600) versus matrix size (0–5000) on
  the Pentium III, for G77, ATLAS, Model, and vendor BLAS.]

                                                                42
Sensitivity to tile size (SGI)

 [Plot: mini-MMM performance in MFLOPS versus tile size (20–820) on the SGI.
  Marked points: B = best tile, A = ATLAS's tile, M = the model's tile; the
  L2-cache conflict-free tile and an L2-cache model tile are also indicated.]

 Higher levels of the memory hierarchy cannot be ignored.
                                                                                           43
Sensitivity to tile size: Sun

 [Plot: mini-MMM performance in MFLOPS (0–1600) versus tile size (20–140) on
  the Sun.  Marked points: B = best tile, A = ATLAS's tile, M = the model's tile.]

                                                                           44
But .. Results are not always perfect

 We recently conducted several experiments on
  other machines.
 We considered this a blind test to check the
  effectiveness of our approach.
 In these experiments, the search strategy
  sometimes does better than the model.




                                            45
Recent experiments: Itanium 2




                                46
Recent experiments: Pentium 4




                                47
Hybrid approaches

 We are studying two strategies that combine the
  model with search.
 First, use the model to find a first approximation
  to the parameter values, and then refine them with
  hill climbing.
 Second, assume a general shape for the performance
  curve and use curve fitting to find the optimal point.




                                                 48
II. Intelligent Search and
Sorting


          Joint work with
     Xiaoming Li and M. Garzaran



                         49
Sorting

 Generating sorting libraries is an interesting
  problem for several reasons.
  – It differs from the problems of ATLAS and Spiral
    in that performance depends on the characteristics
    of the input data.
  – It is not as clearly decomposable as the linear
    algebra problems




                                                    50
Outline Sorting

 Part I: Selecting one of several possible
  “pure” sorting algorithms at runtime
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Building a hybrid sorting algorithm
  – Primitives
  – Searching approaches



                                                 51
Motivation

  Theoretical complexity does not suffice to
   evaluate sorting algorithms
   – Cache effects
   – Instruction count
  The performance of sorting algorithms
   depends on the characteristics of the input
   – Number of records
   – Input distribution
   – …

                                                52
What we accomplished in this work

 Identified architectural and runtime factors
  that affect the performance of sorting
  algorithms.
 Developed an empirical search strategy to
  identify the best shape and parameter values
  of each sorting algorithm.
 Developed a strategy to choose at runtime the
  best sorting algorithm for a specific input data
  set.


                                                53
Performance vs. Distribution




                               54
Performance vs. Distribution




                               55
Performance vs. Sdev




                       56
Performance vs. Sdev




                       57
Outline Sorting

 Part I: Select the best algorithm
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Build the best algorithm
  – Primitives
  – Searching approaches


                                                58
Quicksort

   Set guardians at both ends of the input array.
   Eliminate recursion.
   Choose the median of three as the pivot.
   Use insertion sort for small partitions.




                                                 59
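As a sketch of what such a tuned quicksort might look like (illustrative C, not the library's actual code), the version below eliminates recursion with an explicit stack, chooses the median of three as the pivot, and finishes small partitions with insertion sort; the cutoff value, the stack size, and the sentinel handling are simplified assumptions.

  /* Illustrative tuned quicksort: explicit stack (no recursion),
   * median-of-three pivot, insertion sort for small partitions. */
  #define CUTOFF 16   /* assumed threshold for switching to insertion sort */

  static void insertion_sort(int *a, int lo, int hi) {
      for (int i = lo + 1; i <= hi; i++) {
          int v = a[i], j = i - 1;
          while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
          a[j + 1] = v;
      }
  }

  static void swap(int *a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

  void quicksort(int *a, int n) {
      int stack[128], top = 0;          /* fixed stack, sized for illustration */
      stack[top++] = 0; stack[top++] = n - 1;
      while (top > 0) {
          int hi = stack[--top], lo = stack[--top];
          if (hi - lo < CUTOFF) { insertion_sort(a, lo, hi); continue; }

          /* median of three: order a[lo], a[mid], a[hi]; pivot ends at hi-1 */
          int mid = lo + (hi - lo) / 2;
          if (a[mid] < a[lo]) swap(a, mid, lo);
          if (a[hi]  < a[lo]) swap(a, hi,  lo);
          if (a[hi]  < a[mid]) swap(a, hi, mid);
          swap(a, mid, hi - 1);
          int pivot = a[hi - 1];

          int i = lo, j = hi - 1;       /* a[lo] and a[hi] act as guardians */
          for (;;) {
              while (a[++i] < pivot) ;
              while (a[--j] > pivot) ;
              if (i >= j) break;
              swap(a, i, j);
          }
          swap(a, i, hi - 1);           /* restore the pivot */

          stack[top++] = lo;     stack[top++] = i - 1;   /* push partitions */
          stack[top++] = i + 1;  stack[top++] = hi;
      }
  }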
Radix sort

 Non-comparison algorithm

 [Example: one pass of radix sort on a small vector.  A counter records how
  many keys have each digit value, the counts are accumulated into starting
  offsets, and the keys are then scattered into the destination vector.]

                                          60
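To make the idea concrete, here is a small, hedged sketch of one least-significant-digit pass in C; the 8-bit digit width and the helper names are illustrative choices, not necessarily what the library uses.

  /* One LSD radix pass: distribute keys by one 8-bit digit.
   * 'shift' selects which byte of the key is used on this pass. */
  void radix_pass(const unsigned *src, unsigned *dst, int n, int shift) {
      int count[256] = {0};
      int accum[256];

      for (int i = 0; i < n; i++)               /* count keys per digit value  */
          count[(src[i] >> shift) & 0xFF]++;

      accum[0] = 0;                             /* exclusive prefix sum: start */
      for (int d = 1; d < 256; d++)             /* offset of each digit bucket */
          accum[d] = accum[d - 1] + count[d - 1];

      for (int i = 0; i < n; i++)               /* scatter keys (stable)       */
          dst[accum[(src[i] >> shift) & 0xFF]++] = src[i];
  }

  /* A full sort calls radix_pass for shift = 0, 8, 16, 24,
   * ping-ponging between the source and destination buffers. */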
Cache Conscious Radix Sort

 CC-radix(bucket)
    if bucket fits in the cache then
          Radix sort(bucket)
    else
      sub-buckets = Reverse sorting(bucket)
      for each sub-bucket in sub-buckets
           CC-radix(sub-bucket)
       endfor
    endif
             Pseudocode for CC-radix

                                               61
Multiway Merge Sort




                      62
Sorting algorithms for small partitions

 Insertion sort
 Apply register blocking to the sorting algorithm ->
  register sorting network




                                                63
Outline

 Part I: Select the best algorithm
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Build the best algorithm
  – Primitives
  – Searching approaches


                                                64
Cache Size/TLB Size

 Quicksort: Using multiple pivots to tile
 CC-radix:
  – Fit each partition into cache
  – The number of active partitions < TLB size
 Multiway Merge Sort:
  – The heap should fit in the cache
  – Sorted runs should fit in the cache




                                                 65
Number of Registers

 Register Blocking




                      66
Cache Line Size

 To optimize shift-down operation




                                     67
Amount of Data to Sort

 Quicksort
  – Cache misses will increase with the amount of data.
 CC-radix
  – As amount of data increases, CC-radix needs more
    partitioning passes. After certain threshold, the
    performance drops dramatically.
 Multiway Merge Sort
  – Only profitable for large amounts of data, when the
    reduction in the number of cache misses can
    compensate for the increased number of operations
    with respect to Quicksort.


                                                        68
Distribution of the Data

 The goal is to distinguish the performance of the
  comparison-based algorithms versus the radix-based
  ones.
 Distribution shapes: Uniform, Normal, Exponential, …
   – Not a good criterion.
 Distribution width:
   – Standard deviation (sdev):
      • Only good for one-peak distributions
      • Expensive to calculate
   – Entropy
      • Represents the distribution of each bit


                                                         69
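A hedged sketch of how such an entropy vector could be computed from a sample of the input is shown below; the 8-bit digit grouping and the fixed key width are illustrative assumptions, not necessarily what the library uses.

  /* Sketch: entropy of each 8-bit digit position, computed from a sample
   * of the keys.  Higher entropy means the keys spread evenly over that
   * digit's values, which favors radix-based sorting. */
  #include <math.h>

  #define DIGITS 4            /* 32-bit keys, 8 bits per digit (assumed) */

  void entropy_vector(const unsigned *keys, int n, double entropy[DIGITS]) {
      for (int d = 0; d < DIGITS; d++) {
          int count[256] = {0};
          for (int i = 0; i < n; i++)
              count[(keys[i] >> (8 * d)) & 0xFF]++;

          double h = 0.0;
          for (int v = 0; v < 256; v++) {
              if (count[v] == 0) continue;
              double p = (double)count[v] / n;
              h -= p * log2(p);           /* Shannon entropy of this digit */
          }
          entropy[d] = h;
      }
  }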
Outline

 Part I: Select the best algorithm
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Build the best algorithm
  – Primitives
  – Searching approaches


                                                70
Library adaptation

 Architectural factors (handled by empirical search):
  – Cache / TLB size
  – Number of registers
  – Cache line size

 Runtime factors:
   – Distribution shape of the data: does not matter
   – Amount of data to sort: machine learning and runtime adaptation
   – Distribution: machine learning and runtime adaptation

                                                       71
The Library

 Building the library (installation time)
  – Empirical Search
  – Learning Procedure
     • Use of training data
 Running the library (runtime)
  – Runtime Procedure

 (The learning procedure and the runtime procedure together implement
  the runtime adaptation.)

                                                     72
Runtime Adaptation

 Has two parts: one at installation time and one at
  runtime
 Goal function: f: (N, E) -> {Multiway Merge
  Sort, Quicksort, CC-radix}
   – N: amount of input data
   – E: the entropy vector
 For a given (N, E), identify the best
  configuration for Multiway Merge Sort as a
  function of size_of_heap and fanout.

                                               73
Runtime Adaptation

 f: (N, E) is a linearly separable problem.
  – A linearly separable decision problem f(x1, x2, …, xn) is one
    for which there exists a weight vector w and a threshold θ
    such that

              f(x) = true if w · x >= θ, and false otherwise

 The runtime adaptation code is generated at
  the end of installation to implement the
  learned f: (N, E) and select the best
  configuration for Multiway Merge Sort.


                                                               74
Runtime Adaptation: Learning
Procedure
 Goal function:

   f: (N, E) -> {Multiway Merge Sort, Quicksort, CC-radix}

   N: amount of input data
   E: the entropy vector

   – Use the entropy to learn the best algorithm between CC-
     radix and one of the other two
      • Output: a weight vector (w) and a threshold (θ) for
        each value of N
   – Then, use N to choose between Multiway Merge Sort and Quicksort


                                                                75
Runtime Adaptation: Runtime
Procedure

 Sample the input array
 Compute the entropy vector
 Compute S = ∑i wi * entropyi

 If S ≥ θ
    choose CC-radix
  else
    choose between Multiway Merge Sort and Quicksort based on N
                                 76
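The following is a small sketch of that runtime procedure in C; the sampling rate, the learned weights and threshold, and the entropy_vector() helper (from the earlier sketch) are illustrative assumptions, not the library's actual interface.

  /* Sketch of the runtime selection step: sample the input, compute the
   * entropy vector, and apply the learned weights and threshold. */
  typedef enum { CC_RADIX, MULTIWAY_MERGE, QUICKSORT } SortAlg;

  extern void entropy_vector(const unsigned *keys, int n, double entropy[4]);

  SortAlg select_algorithm(const unsigned *keys, int n,
                           const double w[4],      /* learned weight vector */
                           double theta,           /* learned threshold     */
                           int merge_threshold) {  /* learned size cutoff   */
      /* Sample at most 1024 keys spread across the input (assumed rate). */
      unsigned sample[1024];
      int m = n < 1024 ? n : 1024;
      int stride = n / m;
      for (int i = 0; i < m; i++) sample[i] = keys[i * stride];

      double e[4];
      entropy_vector(sample, m, e);

      double s = 0.0;                              /* S = sum_i w_i * entropy_i */
      for (int d = 0; d < 4; d++) s += w[d] * e[d];

      if (s >= theta) return CC_RADIX;
      return n >= merge_threshold ? MULTIWAY_MERGE : QUICKSORT;
  }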
Outline

 Part I: Select the best algorithm
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Build the best algorithm
  – Primitives
  – Searching approaches


                                                77
Setup




        78
Performance Comparison

 [Plot: execution time in cycles (4000–7000) versus the standard deviation
  of the input (100 to 10,000,000) on a Pentium III Xeon sorting 16 M float
  keys; series: Intel MKL, Quicksort.]

                                                                                                        79
Sun UltraSparcIII




                    80
IBM Power3




             81
Intel PIII Xeon




                  82
SGI R12000




             83
Conclusion
 Identified the architectural and runtime factors

 Used empirical search to find the best parameter
  values

 Our machine learning techniques proved to be quite
  effective:
   – Always selects the best algorithm.
   – A wrong choice would have introduced a 37% average
     performance degradation.
   – Overhead (average 5%, worst case 7%)



                                                       84
Outline

 Part I: Select the best algorithm
  – Motivation
  –   Sorting Algorithms
  –   Factors
  –   Empirical Search and Runtime Adaptation
  –   Experiment Results
 Part II: Build the best algorithm
  – Primitives
  – Searching approaches


                                                85
Primitives

 Categorize sorting algorithms
   – Partition by some pivots: Quicksort, Bubble Sort, …
   – Partition by size: Merge Sort, Select Sort
   – Partition by radix: Radix Sort, CC-radix
 Construct a sorting algorithm using these primitives.

 [Figure: example composite sorting algorithms represented as small trees
  of DP and DV primitives.]

                                                           86
Searching approaches

 The composite sorting algorithms have the
  shape of trees.
 Every primitive has parameters.
 The search mechanism must be able to
  search over both the shape and the parameter values.
 A genetic algorithm is a good choice (though perhaps
  not the only one).




                                               87
Results




          88
Results




          89
90
SPIRAL

 The approach:
  – Mathematical formulation of signal processing algorithms
  – Automatically generate algorithm versions
  – A generalization of the well-known FFTW
  – Use compiler techniques to translate formulas into
    implementations
  – Adapt to the target platform by searching for the optimal
    version




                                                                91
92
Fast DSP Algorithms As Matrix
Factorizations
 Computing y = F4 x is carried out as:
     t1 = A4 x ( permutation )
     t2 = A3 t1 ( two F2’s )
     t3 = A2 t2 ( diagonal scaling )
     y = A1 t3 ( two F2’s )
 The cost is reduced because A1, A2, A3 and A4
  are structured sparse matrices.




                                              93
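To make the factorization concrete, here is a small C sketch that computes y = F4 x through the four sparse stages listed above (stride permutation, two F2 butterflies, diagonal twiddle scaling, two F2 butterflies). The twiddle value -i assumes the usual DFT sign convention and is supplied here only for illustration.

  /* Sketch: y = F4 x computed stage by stage instead of by a dense
   * 4x4 matrix-vector product.  Uses the factorization
   * F4 = (F2 (x) I2) T (I2 (x) F2) L, with T = diag(1, 1, 1, -i). */
  #include <complex.h>

  void dft4_factored(const double complex x[4], double complex y[4]) {
      double complex t1[4], t2[4], t3[4];

      /* t1 = L x : stride-2 permutation */
      t1[0] = x[0];  t1[1] = x[2];  t1[2] = x[1];  t1[3] = x[3];

      /* t2 = (I2 (x) F2) t1 : two F2 butterflies on adjacent pairs */
      t2[0] = t1[0] + t1[1];  t2[1] = t1[0] - t1[1];
      t2[2] = t1[2] + t1[3];  t2[3] = t1[2] - t1[3];

      /* t3 = T t2 : diagonal twiddle scaling */
      t3[0] = t2[0];  t3[1] = t2[1];  t3[2] = t2[2];  t3[3] = -I * t2[3];

      /* y = (F2 (x) I2) t3 : two F2 butterflies at stride 2 */
      y[0] = t3[0] + t3[2];  y[1] = t3[1] + t3[3];
      y[2] = t3[0] - t3[2];  y[3] = t3[1] - t3[3];
  }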
Tensor Product Formulation of
Cooley-Tukey
 Theorem

      F_rs = (F_r ⊗ I_s) T^rs_s (I_r ⊗ F_s) L^rs_r

      where T^rs_s is a diagonal (twiddle) matrix
      and L^rs_r is a stride permutation

 Example

      F_4 = (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2

  [The example expands F_4 into the product of four sparse 4x4 matrices:
   the butterfly F_2 ⊗ I_2, the diagonal twiddle matrix T^4_2, the
   butterfly I_2 ⊗ F_2, and the stride permutation L^4_2.]


                                                                                            94
Formulas for Matrix
Factorizations

   F_4 = (F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2                              R1

   F_rs = (F_r ⊗ I_s) T^rs_s (I_r ⊗ F_s) L^rs_r                           R2

   F_n = [ ∏_{i=1..k} (I_{ni-} ⊗ F_{ni} ⊗ I_{ni+}) (I_{ni-} ⊗ T^{ni ni+}_{ni+}) ]
         [ ∏_{i=k..1} (I_{ni-} ⊗ L^{ni ni+}_{ni}) ]

 where n = n1…nk, ni- = n1…ni-1, ni+ = ni+1…nk
                                                                          95
Factorization Trees

 Different factorization trees for the same transform have different
 computation orders and different data access patterns, and therefore
 different performance.

 [Figure: three factorization trees for F_8: F_8 -> (F_2, F_4 -> (F_2, F_2))
  by rule R1, F_8 -> (F_4 -> (F_2, F_2), F_2) by rule R1, and
  F_8 -> (F_2, F_2, F_2) by rule R2.]

                                                        96
Walsh-Hadamard Transform


                           97
Optimal Factorization Trees

 Depend on the platform
 Difficult to deduce
 Can be found by empirical search
  – The search space is very large
  – Different search algorithms
     • Random, DP, GA, hill-climbing, exhaustive




                                                   98
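The experiments later in the talk use dynamic programming as the search strategy; a minimal sketch of such a DP search over two-way factorization trees is shown below, with the measure_split() timing routine as a hypothetical placeholder.

  /* Sketch of a dynamic-programming search over factorization trees for
   * F_(2^n): assume the best formula for a transform of size 2^m is built
   * from the best formulas already found for its factors.
   * measure_split(m, s) is a placeholder that would generate and time the
   * formula splitting 2^m into 2^s and 2^(m-s) using the best sub-formulas. */
  extern double measure_split(int m, int s);   /* returns execution time */

  #define MAX_EXP 20

  static int best_split[MAX_EXP + 1];          /* 0 means "leaf: do not split" */
  static double best_time[MAX_EXP + 1];

  void dp_search(int n) {
      best_split[1] = 0;                       /* F_2 is always a leaf */
      for (int m = 2; m <= n; m++) {
          best_time[m] = 1e30;
          best_split[m] = 0;
          for (int s = 1; s < m; s++) {        /* try every two-way split */
              double t = measure_split(m, s);  /* uses best sub-formulas so far */
              if (t < best_time[m]) { best_time[m] = t; best_split[m] = s; }
          }
      }
  }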
99
100
Size of Search Space

  N        # of formulas      N         # of formulas
  2^1                  1      2^9               20793
  2^2                  1      2^10             103049
  2^3                  3      2^11             518859
  2^4                 11      2^12            2646723
  2^5                 45      2^13           13649969
  2^6                197      2^14           71039373
  2^7                903      2^15          372693519
  2^8               4279      2^16         1968801519

                                             101
102
103
More Search Choices

 Programming:
  – Loop unrolling
  – Memory allocation
  – In-lining
 Platform choices:
  – Compiler optimization options




                                    104
The SPIRAL System

  DSP Transform
     -> Formula Generator
     -> SPL Program
     -> SPL Compiler
     -> C/FORTRAN Programs
     -> Performance Evaluation on the target machine
     -> DSP Library
  (the Search Engine drives the Formula Generator and the SPL Compiler
   using the measured performance)

                                            105
Spiral

 Spiral does the factorization at installation
  time and generates one library routine for
  each size.
 FFTW only generates codelets (input size <=
  64) and performs the factorization at run
  time.




                                                  106
A Simple SPL Program

 ; This is a simple SPL program          <- comment
 (define A (matrix(1 2)(2 1)))           <- definition
 (define B (diagonal(3 3)))              <- definition
 #subname simple                         <- directive
 (tensor (I 2)(compose A B))             <- formula
 ;; This is an invisible comment

                                                107
Templates

 (template
   (F n) [ n >= 1 ]                ; pattern and condition
   ( do i=0,n-1                    ; i-code
       y(i)=0
       do j=0,n-1
         y(i)=y(i)+W(n,i*j)*x(j)
       end
     end ))

                                              108
SPL Compiler

  SPL Formula + Template Definition
     -> Parsing -> Abstract Syntax Tree + Template Table
     -> Intermediate Code Generation -> I-Code
     -> Intermediate Code Restructuring -> I-Code
     -> Optimization -> I-Code
     -> Target Code Generation -> FORTRAN, C

                                                                  109
Intermediate Code Restructuring
 Loop unrolling
   – Degree of unrolling can be controlled globally or case by case
 Scalar function evaluation
   – Replace scalar functions with constant value or array access
 Type conversion
   – Type of input data: real or complex
   – Type of arithmetic: real or complex
   – Same SPL formula, different C/Fortran programs




                                                                 110
111
Optimizations

  Formula Generator:
     * High-level scheduling
     * Loop transformation

  SPL Compiler:
     * High-level optimizations
         - Constant folding
         - Copy propagation
         - CSE
         - Dead code elimination

  C/Fortran Compiler:
     * Low-level optimizations
         - Instruction scheduling
         - Register allocation
                                                        112
Basic Optimizations
(FFT, N=2^5, SPARC, f77 –fast –O5)




                                    113
Basic Optimizations
(FFT, N=2^5, MIPS, f77 –O3)




                             114
Basic Optimizations
(FFT, N=2^5, PII, g77 –O6 –malign-double)




                                           115
Performance Evaluation

 Evaluation of the performance of the code
  generated by the SPL compiler
 Platforms: SPARC, MIPS, PII
 Search strategy: dynamic programming




                                           116
Pseudo MFlops

  Pseudo MFlops = (# of FP operations in the algorithm) / (execution time in µs)

 Estimation of the # of FP operations:
   – FFT (radix-2):      5 n log2(n) – 10n + 16

                                                  117
FFT Performance   (N = 2^1 to 2^6)

 [Plots: pseudo MFlops of the code generated by the SPL compiler on SPARC,
  MIPS, and PII for small transform sizes.]

                                  118
FFT Performance   (N = 2^7 to 2^20)

 [Plots: pseudo MFlops of the code generated by the SPL compiler on SPARC,
  MIPS, and PII for large transform sizes.]

                                   119

				