Analytical models and intelligent search for program generation and optimization
David Padua
Department of Computer Science
University of Illinois at Urbana-Champaign

Program optimization today
The optimization phase of a compiler applies a series of transformations to achieve its objectives.
The compiler uses program analysis to determine which transformations are correctness-preserving.
Compiler transformation and analysis techniques are reasonably well understood.
Since many compiler optimization problems have “exponential complexity”, heuristics are needed to drive the application of transformations.

Optimization drivers
Developing driving heuristics is laborious.
One reason for this is the lack of methodologies and tools to build optimization drivers.
As a result, although there is much in common among compilers, their optimization phases are usually reimplemented from scratch.

Optimization drivers (Cont.)
As a result, machines and languages that are not widely popular usually lack good compilers (as do some popular systems).
  – DSP, network processor, and embedded system programming is often done in assembly language.
  – Evaluation of new architectural features requiring compiler involvement is not always meaningful.
  – Languages such as APL, MATLAB, LISP, … suffer from chronic low performance.
  – New languages are difficult to introduce (although compilers are only a part of the problem).

A methodology based on the notion of search space
Program transformations often have several possible target versions.
  – Loop unrolling: how many times to unroll
  – Loop tiling: size of the tile
  – Loop interchanging: order of loop headers
  – Register allocation: which registers are stored in memory to make room for new values
The process of optimization can be seen as a search in the space of possible program versions.
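
The search-space view can be illustrated with a minimal sketch in C. It assumes a space made of just two parameters (an unrolling factor and a tile size) and a caller-supplied cost function; the names version_t, cost_fn, search_best_version, and toy_cost are hypothetical and do not come from the talk. The cost function stands in for either a timed run on the target machine or a performance model.

```c
#include <assert.h>
#include <stddef.h>

/* One point in the search space: a hypothetical pair of transformation
   parameters (illustrative names, not taken from the talk). */
typedef struct {
    int unroll; /* loop unrolling factor */
    int tile;   /* loop tile size */
} version_t;

/* Cost of a version; lower is better. An empirical driver would time a
   real run here, a model-based driver would evaluate a formula. */
typedef double (*cost_fn)(version_t v);

/* Exhaustively search the cross product of the candidate values and
   return the cheapest version found. */
version_t search_best_version(const int *unrolls, size_t n_unrolls,
                              const int *tiles, size_t n_tiles,
                              cost_fn cost)
{
    assert(n_unrolls > 0 && n_tiles > 0 && cost != NULL);
    version_t best = { unrolls[0], tiles[0] };
    double best_cost = cost(best);
    for (size_t u = 0; u < n_unrolls; u++) {
        for (size_t t = 0; t < n_tiles; t++) {
            version_t v = { unrolls[u], tiles[t] };
            double c = cost(v);
            if (c < best_cost) {
                best_cost = c;
                best = v;
            }
        }
    }
    return best;
}

/* Toy stand-in cost with its minimum at unroll = 4, tile = 32. */
double toy_cost(version_t v)
{
    double du = (double)(v.unroll - 4);
    double dt = (double)(v.tile - 32);
    return du * du + dt * dt / 64.0;
}
```

Exhaustive enumeration is used only because the toy space is tiny; with several combined transformations the cross product grows quickly, which is why the choice of search strategy matters.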
Empirical search
Iterative compilation
  – Perhaps the simplest application of the search space model is empirical search, where several versions are generated and executed on the target machine. The fastest version is selected.
T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff. Iterative compilation in program optimization. In Proc. CPC2000, pages 35-44, 2000.

Empirical search and traditional compilers
Searching is not a new approach and compilers have applied it in the past, but using architectural prediction models instead of actual runs:
  – KAP searched for the best loop header order
  – SGI’s MIPS-pro and IBM PowerPC compilers select the best degree of unrolling.

Limitations of empirical search
Empirical search is conceptually simple and portable. However,
  – the search space tends to be too large, especially when several transformations are combined.
  – it is not clear how to apply this method when program behavior is a function of the input data set.
Heuristics/search strategies are needed.
Availability of performance “formulas” could help evaluate transformations across input data sets and facilitate search.

Program/library generators
An ongoing effort at Illinois focuses on program generators. The objectives are:
  – To develop better program generators.
  – To improve our understanding of the optimization process without the need to worry about program analysis.

Compilers and Library Generators
[Figure: diagram with the labels Algorithm, Program Generation, Internal representation, Program Transformation, Source Program]

Empirical search in program/library generators
Examples:
  – FFTW [M. Frigo, S. Johnson]
  – Spiral (FFT/signal processing) [J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel)]
  – ATLAS (linear algebra) [R. Whaley, A. Petitet, J. Dongarra]
  – PHiPAC [J. Demmel et al.]
  – Sorting [X. Li, M. Garzaran (Illinois)]

Techniques presented in the rest of the talk
Analytical models (ATLAS)
  – Pure
  – Combined with search
Pure search strategies
  – Data independent performance (Spiral)
  – Data dependent performance (Sorting)

I. Analytical models and ATLAS
Joint work with G. DeJong (Illinois), M. Garzaran, and K. Pingali (Cornell)

ATLAS
A Linear Algebra Library Generator. ATLAS = Automatically Tuned Linear Algebra Software
At installation time, it searches for the best parameters of a Matrix-Matrix Multiplication (MMM) routine.
We studied ATLAS and modified the system to replace the search with an analytical model that identifies the best MMM parameters without the need for search.

The modified version of ATLAS
[Figure: Original ATLAS infrastructure. Blocks: Detect Hardware Parameters (L1Size, NR, MulAdd, Latency); ATLAS Search Engine (MMSearch) with a Compile, Execute, Measure loop returning MFLOPS; parameters NB, MU, NU, KU, xFetch, MulAdd, Latency; ATLAS MM Code Generator (MMCase); MiniMMM Source]
[Figure: Model-Based ATLAS infrastructure. Blocks: Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, Latency); Model; parameters NB, MU, NU, KU, xFetch, MulAdd, Latency; ATLAS MM Code Generator (MMCase); MiniMMM Source]

Detecting Machine Parameters
Micro-benchmarks
  – L1Size: L1 data cache size
    • Similar to the Hennessy-Patterson book
  – NR: number of registers
    • Use several FP temporaries repeatedly
  – MulAdd: fused multiply-add (FMA)
    • “c+=a*b” as opposed to “c+=t; t=a*b”
  – Latency: latency of FP multiplication
    • Needed for scheduling multiplies and adds in the absence of FMA

Compiler View
[Figure: the original ATLAS infrastructure diagram, labeled ATLAS Code Generation]
Focus on MMM (as part of BLAS-3)
  – Very good reuse: O(N^2) data, O(N^3) computation
  – Many optimization opportunities
    • Few “real” dependencies
  – Will run poorly on modern machines
    • Poor use of cache and registers
    • Poor use of processor pipelines
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j]

Characteristics of the code as generated by ATLAS
Cache-level blocking (tiling)
  – ATLAS blocks only for L1 cache
Register-level blocking
  – Highest level of memory hierarchy
  – Important to hold array values in registers
Software pipelining
  – Unroll and schedule operations
Versioning
  – Dynamically decide which way to compute
Back-end compiler optimizations
  – Scalar optimizations
  – Instruction scheduling

Cache Tiling for MMM
Tiling in ATLAS:
  – Only square tiles: NBxNBxNB
  – Working set of tile must fit in L1 cache
  – Tiles usually copied first into contiguous buffer
  – Special clean-up code generated for boundaries
Mini-MMM:
  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j]
[Figure: NB x NB tiles of A, B, and C with the loop indices i, j, k]
Optimization parameter: NB

IJK version (large cache)
  DO I = 1, N   // row-major storage
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
[Figure: matrices A, B, and C with the K dimension marked]
Large cache scenario:
  – Matrices are small enough to fit into cache
  – Only cold misses, no capacity misses
  – Miss ratio:
    • Data size = 3N^2
    • Each miss brings in b floating-point numbers
    • Miss ratio = 3N^2 / (b * 4N^3) = 0.75/(bN) = 0.019 (b = 4, N = 10)

IJK version (small cache)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
[Figure: matrices A, B, and C with the K dimension marked]
Small cache scenario:
  – Matrices are large compared to cache
    • reuse distance is not O(1) => miss
  – Cold and capacity misses
  – Miss ratio:
    • C: N^2/b misses (good temporal locality)
    • A: N^3/b misses (good spatial locality)
    • B: N^3 misses (poor temporal and spatial locality)
    • Miss ratio ≈ 0.25 (b+1)/b = 0.3125 (for b = 4)

Register tiling for Mini-MMM
Micro-MMM:
  – MUx1 sub-matrix of A
  – 1xNU sub-matrix of B
  – MUxNU sub-matrix of C
  – MU + NU + MU*NU <= NR
Mini-MMM code after register tiling and unrolling:
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU)
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++)
        load A[i..i+MU-1,k] into registers
        load B[k,j..j+NU-1] into registers
        multiply A’s and B’s and add to C’s
      store C[i..i+MU-1, j..j+NU-1]
[Figure: the MU x NU register tile of C with the matching MU x 1 piece of A and 1 x NU piece of B inside the NB x NB tiles]
Unroll k loop KU times
Optimization parameters: MU, NU, KU

Scheduling
  ……
  for (int k = 0; k < NB; k += KU)
    load A[i..i+MU-1,k] into registers
    load B[k,j..j+NU-1] into registers
    multiply A’s and B’s and add to C’s      (Micro-MMM)
  ………
If processor has combined Multiply-add, use it
Otherwise, schedule multiplies and adds first
  – interleave M1 M2 … M(MU*NU) and A1 A2 … A(MU*NU) after skewing additions by Latency
Schedule IFetch number of initial loads for one micro-MMM at the end of previous micro-MMM
Schedule remaining loads for micro-MMM in blocks of NFetch
  – memory pipeline can support only a small number of outstanding loads
Optimization parameters: MulAdd, Latency, xFetch

High-level picture
Multi-dimensional optimization problem:
  – Independent parameters: NB, MU, NU, KU, …
  – Dependent variable: MFlops
  – Function from parameters to variables is given implicitly; can be evaluated repeatedly
One optimization strategy: orthogonal range search
  – Optimize along one dimension at a time, using reference values for parameters not yet optimized
  – Not guaranteed to find optimal point, but might come close

Specification of OR Search
Order in which dimensions are optimized
Reference values for un-optimized dimensions at any step
Interval in which range search is done for each dimension

Search strategy
1. Find Best NB
2. Find Best MU & NU
3. Find Best KU
4. Find Best xFetch
5. Find Best Latency (lat)
6.
Find non-copy version tile size (NCNB)

Find Best NB
Search in following range
  – 16 <= NB <= 80
  – NB^2 <= L1Size
In this search, use simple estimates for other parameters
  – (e.g.) KU: Test each candidate for
    • Full K unrolling (KU = NB)
    • No K unrolling (KU = 1)

Finding other parameters
Find best MU, NU: try all MU & NU that satisfy
  – 1 <= MU, NU <= NB
  – MU*NU + MU + NU <= NR
  – In this step, use best NB from previous step
Find best KU
Find best Latency
  – values between 1 and 6
Find best xFetch
  – IFetch: [2, MU+NU], NFetch: [1, MU+NU-IFetch]

Model-based estimation of optimization parameters values
[Figure: the original (MMSearch) and model-based (MMModel) ATLAS infrastructure diagrams. In the model-based version a Model Parameter Estimator, fed by L1Size, L1 I-Cache size, NR, MulAdd, and Latency, takes the place of the ATLAS Search Engine and its Execute & Measure loop]

High-level picture
NB: hierarchy of models
  – Find largest NB for which there are no capacity or conflict misses
  – Find largest NB for which there are no capacity misses, assuming optimal replacement
  – Find largest NB for which there are no capacity misses, assuming LRU replacement
MU, NU: estimate from number of registers, making them roughly equal
  – MU*NU + MU + NU <= NR
KU: maximize subject to I-cache size
Latency: from hardware parameter
xFetch: set to 2

Largest NB for no capacity/conflict misses
Tiles are copied into contiguous memory
Condition for cold misses only:
  – 3*NB^2 <= L1Size
[Figure: NB x NB tiles of A, B, and C with the loop indices i, j, k]

Largest NB for no capacity misses
MMM:
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j]
[Figure: matrices A (M x K), B (K x N), and C (M x N) with the loop dimensions I, J marked]
Cache model:
  – Fully associative
  – Line size 1 word
  – Optimal replacement
Bottom line: N^2 + N + 1 < C
  – One full matrix
  – One row / column
  – One element

Extending the Model
Line Size > 1
  – Spatial locality
  – Array layout in memory matters
Bottom line: depending on loop order
  – either $\lceil NB^2/B \rceil + \lceil NB/B \rceil + 1 \le C/B$
  – or $\lceil NB^2/B \rceil + NB + 1 \le C/B$

Extending the Model (cont.)
LRU (not optimal replacement)
MMM sample:
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j]
Bottom line:
[Figure: access sequences over the tile, such as A(i,1) B(1,j) A(i,2) B(2,j) … A(i,NB) B(NB,j) C(i,j), each paired with one of the conditions below]
  – $\lceil NB^2/B \rceil + 3NB + 1 \le C/B$
  – $\lceil NB^2/B \rceil + 3\lceil NB/B \rceil + 1 \le C/B$
  – $\lceil NB^2/B \rceil + 2NB + \lceil NB/B \rceil + 1 \le C/B$
  – $\lceil NB^2/B \rceil + 2\lceil NB/B \rceil + NB + 1 \le C/B$

Experiments
Architectures:
  – SGI R12K, 270MHz
  – Sun UltraSparcIII, 900MHz
  – Intel PIII, 550MHz
Measure
  – Mini-MMM performance
  – Complete MMM performance
  – Sensitivity to variations on parameters

Installation time of ATLAS and Model
[Figure: bar chart of installation time (s), 0 to 10000, for SGI Atlas, SGI Model, Sun Atlas, Sun Model, Intel Atlas, and Intel Model, broken down into Detect Machine Parameters, Optimize MMM, Generate Final Code, and Build Library]

Parameter values
ATLAS
  Architecture   Tile Size (Copy / Non-Copy)   Unroll (MU / NU / KU)   Fetch (F / I / N)   L
  SGI            64 / 64                       4 / 4 / 64              0 / 5 / 1           3
  Sun            48 / 48                       5 / 3 / 48              0 / 3 / 5           5
  Intel          40 / 40                       2 / 1 / 40              0 / 3 / 1           4
Model
  Architecture   Tile Size (Copy / Non-Copy)   Unroll (MU / NU / KU)   Fetch (F / I / N)   L
  SGI            62 / 45                       4 / 4 / 62              1 / 2 / 2           6
  Sun            88 / 78                       4 / 4 / 78              1 / 2 / 2           4
  Intel          42 / 39                       2 / 1 / 42              1 / 2 / 2           3

Mini-MMM Performance
  Architecture   ATLAS (MFLOPS)   Model (MFLOPS)   Difference (%)
  SGI            457              453              1
  Sun            1287             1052             20
  Intel          394              384              2

SGI Performance
[Figure: MFLOPS (0 to 600) versus matrix size (0 to 5000) for F77, ATLAS, Model, and BLAS]

[Figure: the same MFLOPS plot, together with TLB misses (billions, 0 to 5) versus matrix size (0 to 5000) for Model and ATLAS]
TLB effects are important when matrix size is large.
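
The two simplest model rules stated earlier, the cold-miss-only tile condition 3*NB^2 <= L1Size and the register constraint MU*NU + MU + NU <= NR with MU and NU roughly equal, can be sketched in C as follows. This is only an illustration under stated assumptions: the cache capacity is taken in matrix elements rather than bytes, the function names largest_nb_cold_only and register_tile are hypothetical, and the refinements for line size, LRU replacement, and registers reserved for other uses are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Largest NB with 3*NB*NB <= l1_elems, where l1_elems is the L1 data
   cache capacity counted in matrix elements (an assumed unit). */
int largest_nb_cold_only(long l1_elems)
{
    assert(l1_elems >= 3);
    int nb = 1;
    while (3L * (nb + 1) * (nb + 1) <= l1_elems)
        nb++;
    return nb;
}

/* Pick MU and NU roughly equal subject to MU*NU + MU + NU <= NR:
   first the largest square tile, then grow MU while it still fits. */
void register_tile(int nr, int *mu, int *nu)
{
    assert(nr >= 3 && mu != NULL && nu != NULL);
    int u = 1;
    while ((u + 1) * (u + 1) + 2 * (u + 1) <= nr)
        u++;
    int m = u;
    while ((m + 1) * u + (m + 1) + u <= nr)
        m++;
    *mu = m;
    *nu = u;
}
```

For example, a cache holding 4096 elements gives NB = 36 under this rule, and NR = 32 gives MU = 5, NU = 4; these are outputs of the sketch, not values reported in the talk.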

Sun Performance
[Figure]

Pentium III Performance
[Figure: MFLOPS (0 to 600) versus matrix size (0 to 5000) for G77, ATLAS, Model, and BLAS]

Sensitivity to tile size (SGI)
[Figure: MFLOPS (0 to 600) versus tile size (20 to 820), marking B: Best, A: ATLAS, M: Model, the L2 cache conflict-free tile, and the L2 cache Model tile]
Higher levels of memory hierarchy cannot be ignored.

Sensitivity to tile size: Sun
[Figure: MFLOPS (0 to 1600) versus tile size (20 to 140), marking B: Best, A: ATLAS, M: Model]

But .. Results are not always perfect
We recently conducted several experiments on other machines.
We considered this a blind test to check the effectiveness of our approach.
In these experiments, the search strategy sometimes does better than the model.

Recent experiments: Itanium 2
[Figure]

Recent experiments: Pentium 4
[Figure]

Hybrid approaches
We are studying two strategies that combine model with search.
  – First, the model can be used to find a first approximation to the value of the parameters, and hill climbing can then be used to refine this value.
  – Second, a general shape of the performance curve can be assumed and curve fitting used to find the optimal point.

II. Intelligent Search and Sorting
Joint work with Xiaoming Li and M. Garzaran

Sorting
Generating sorting libraries is an interesting problem for several reasons.
  – It differs from the problems of ATLAS and Spiral in that performance depends on the characteristics of the input data.
  – It is not as clearly decomposable as the linear algebra problems

Outline
Sorting
Part I: Selecting one of several possible “pure” sorting algorithms at runtime
  – Motivation
  – Sorting Algorithms
  – Factors
  – Empirical Search and Runtime Adaptation
  – Experiment Results
Part II: Building a hybrid sorting algorithm
  – Primitives
  – Searching approaches

Motivation
Theoretical complexity does not suffice to evaluate sorting algorithms
  – Cache effect
  – Instruction number
The performance of sorting algorithms depends on the characteristics of input
  – Number of records
  – Input distribution
  – …

What we accomplished in this work
Identified architectural and runtime factors that affect the performance of sorting algorithms.
Developed an empirical search strategy to identify the best shape and parameter values of each sorting algorithm.
Developed a strategy to choose at runtime the best sorting algorithm for a specific input data set.

Performance vs. Distribution
[Figures: two slides]

Performance vs. Sdev
[Figures: two slides]

Outline
Sorting
Part I: Select the best algorithm
  – Motivation
  – Sorting Algorithms
  – Factors
  – Empirical Search and Runtime Adaptation
  – Experiment Results
Part II: Build the best algorithm
  – Primitives
  – Searching approaches

Quicksort
Set guardians at both ends of the input array.
Eliminate recursion.
Choose the median of three as the pivot.
Use insertion sort for small partitions.

Radix sort
Non-comparison algorithm
[Figure: radix sort example showing the vector to sort, the counter, the accumulated counter, and the destination vector]

Cache Conscious Radix Sort
  CC-radix(bucket)
    if fits in cache L (bucket) then
      Radix sort (bucket)
    else
      sub-buckets = Reverse sorting(bucket)
      For each sub-bucket in sub-buckets
        CC-radix(sub-bucket)
      endfor
    endif
Pseudocode for CC-radix

Multiway Merge Sort
[Figure]

Sorting algorithms for small partitions
Insertion sort
Apply register blocking to sorting algorithm -> register sorting network

Cache Size/TLB Size
Quicksort: Using multiple pivots to tile
CC-radix:
  – Fit each partition into cache
  – The number of active partitions < TLB size
Multiway Merge Sort:
  – The heap should fit in the cache
  – Sorted runs should fit in the cache

Number of Registers
Register Blocking

Cache Line Size
To optimize shift-down operation

Amount of Data to Sort
Quicksort
  – Cache misses will increase with the amount of data.
CC-radix
  – As the amount of data increases, CC-radix needs more partitioning passes. After a certain threshold, the performance drops dramatically.
Multiway Merge Sort
  – Only profitable for large amounts of data, when the reduction in the number of cache misses can compensate for the increased number of operations with respect to Quicksort.

Distribution of the Data
The goal is to distinguish the performance of the comparison-based algorithms versus the radix-based ones.
Distribution shapes: Uniform, Normal, Exponential, …
  – Not a good criterion.
Distribution width:
  – Standard deviation (sdev):
    • Only good for one-peak distribution
    • Expensive to calculate
  – Entropy
    • Represents the distribution of each bit

Library adaptation
Architectural Factors (handled by empirical search)
  – Cache / TLB size
  – Number of Registers
  – Cache Line Size
Runtime Factors
  – Distribution shape of the data: does not matter
  – Amount of data to sort, and Distribution: handled by machine learning and runtime adaptation

The Library
Building the library (installation time)
  – Empirical Search
  – Learning Procedure
    • Use of training data
Running the library (runtime)
  – Runtime Procedure
[Figure label: Runtime Adaptation spans the Learning Procedure and the Runtime Procedure]

Runtime Adaptation
Has two parts: at installation time and at runtime
Goal function: f:(N,E) -> {Multiway Merge Sort(sh, f), Quicksort, CC-radix}
  – N: amount of input data
  – E: the entropy vector
For given (N,E), identify the best configuration for Multiway Merge Sort as a function of size_of_heap and fanout.

Runtime Adaptation
f:(N,E) is a linearly separable problem.
  – A linearly separable problem f(x1, x2, …, xn) is a decision problem for which there exist a weight vector $\vec{w}$ and a threshold $\Theta$ such that $f(\vec{x})$ is true if $\vec{w} \cdot \vec{x} \ge \Theta$ and false otherwise.
The runtime adaptation code is generated at the end of installation to implement the learned f:(N,E) and select the best configuration for Multiway Merge Sort.
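
A decision function of this linearly separable kind can be sketched in a few lines of C. The sketch assumes the usual threshold form of a linear classifier, in which the answer is true when the weighted sum of the inputs reaches a threshold; the function name linear_decision and its parameters are hypothetical, and in the library the inputs would be the entries of the entropy vector with weights learned at installation time.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 ("true") when the weighted sum w . x reaches the threshold
   theta, and 0 ("false") otherwise. The threshold form is an assumption;
   w, x, and theta are illustrative parameter names. */
int linear_decision(const double *w, const double *x, size_t n, double theta)
{
    assert(w != NULL && x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += w[i] * x[i];
    return s >= theta;
}
```

Evaluating such a function costs one multiply-add per input, which is what makes it cheap enough to run before every sort.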

Runtime Adaptation: Learning Procedure
Goal function: f:(N,E) -> {Multiway Merge Sort, Quicksort, CC-radix}
  – N: amount of input data
  – E: the entropy vector
  – Use the entropy to learn the best algorithm between CC-radix and one of the other two
    • Output: weight vector (w) and threshold (Ө) for each value of N
  – Then, use N to choose between Multiway Merge Sort and Quicksort

Runtime Adaptation: Runtime Procedure
Sample the input array
Compute the entropy vector
Compute S = ∑i wi * entropyi
If S ≥ Ө choose CC-radix, else choose one of the others

Setup
[Figure]

Performance Comparison
Pentium III Xeon, 16 M keys (float)
[Figure: execution time (cycles, 4000 to 7000) versus standard deviation (100 to 10000000) for Intel MKL and Quicksort]

Sun UltraSparcIII
[Figure]

IBM Power3
[Figure]

Intel PIII Xeon
[Figure]

SGI R12000
[Figure]

Conclusion
Identified the architectural and runtime factors
Used empirical search to find the best parameter values
Our machine learning techniques proved to be quite effective:
  – Always selects the best algorithm.
  – The wrong decision introduces a 37% average performance degradation
  – Overhead (average 5%, worst case 7%)

Primitives
Categorize sorting algorithms
  – Partition by some pivots: Quicksort, Bubble Sort, …
  – Partition by size: Merge Sort, Select Sort
  – Partition by radix: Radix Sort, CC-Radix
Construct a sorting algorithm using these primitives.
[Figure: tree of primitives with nodes labeled DP and DV]

Searching approaches
The composite sorting algorithms are in the shape of trees.
Every primitive has parameters.
The searching mechanism must be able to search both the shape and the value.
A genetic algorithm is a good choice (it may not be the only one).

Results
[Figures: two slides]

SPIRAL
The approach:
  – Mathematical formulation of signal processing algorithms
  – Automatically generate algorithm versions
  – A generalization of the well-known FFTW
  – Use compiler technique to translate formulas into implementations
  – Adapt to the target platform by searching for the optimal version

Fast DSP Algorithms As Matrix Factorizations
Computing y = F4 x is carried out as:
  t1 = A4 x    (permutation)
  t2 = A3 t1   (two F2’s)
  t3 = A2 t2   (diagonal scaling)
  y  = A1 t3   (two F2’s)
The cost is reduced because A1, A2, A3 and A4 are structured sparse matrices.

Tensor Product Formulation of Cooley-Tukey
Theorem: $F_{rs} = (F_r \otimes I_s)\, T^{rs}_{s}\, (I_r \otimes F_s)\, L^{rs}_{r}$
  – $T^{rs}_{s}$ is a diagonal matrix
  – $L^{rs}_{r}$ is a stride permutation
Example: $F_4 = (F_2 \otimes I_2)\, T^{4}_{2}\, (I_2 \otimes F_2)\, L^{4}_{2}$
[Figure: the four 4 x 4 factor matrices of the example written out explicitly]

Formulas for Matrix Factorizations
Example: $F_4 = (F_2 \otimes I_2)\, T^{4}_{2}\, (I_2 \otimes F_2)\, L^{4}_{2}$
R1: $F_{rs} = (F_r \otimes I_s)\, T^{rs}_{s}\, (I_r \otimes F_s)\, L^{rs}_{r}$
R2: $F_n = \prod_{i=1}^{k} (I_{n_{i-}} \otimes F_{n_i} \otimes I_{n_{i+}})\,(I_{n_{i-}} \otimes T^{n_i n_{i+}}_{n_{i+}}) \; \prod_{i=k}^{1} (I_{n_{i-}} \otimes L^{n_i n_{i+}}_{n_i})$
where $n = n_1 \cdots n_k$, $n_{i-} = n_1 \cdots n_{i-1}$, $n_{i+} = n_{i+1} \cdots n_k$

Factorization Trees
[Figure: three factorization trees for F8. Two use rule R1 (F8 : R1 with children F2 and F4 : R1, the F4 node on either side); one uses rule R2 (F8 : R2 with three F2 children)]
Different computation order
Different data access pattern
Different performance

Walsh-Hadamard Transform
[Figure]

Optimal Factorization Trees
Depend on the platform
Difficult to deduce
Can be found by empirical search
  – The search space is very large
  – Different search algorithms
    • Random, DP, GA, hill-climbing, exhaustive

Size of Search Space
  N       # of formulas      N       # of formulas
  2^1     1                  2^9     20793
  2^2     1                  2^10    103049
  2^3     3                  2^11    518859
  2^4     11                 2^12    2646723
  2^5     45                 2^13    13649969
  2^6     197                2^14    71039373
  2^7     903                2^15    372693519
  2^8     4279               2^16    1968801519

More Search Choices
Programming:
  – Loop unrolling
  – Memory allocation
  – In-lining
Platform choices:
  – Compiler optimization options

The SPIRAL System
[Figure: system diagram with the blocks DSP Transform, Formula Generator, SPL Program, SPL Compiler, Search Engine, C/FORTRAN Programs, Performance Evaluation, Target machine, DSP Library]

Spiral
Spiral does the factorization at installation time and generates one library routine for each size.
FFTW only generates codelets (input size <= 64) and at run time performs the factorization.

A Simple SPL Program
  ; This is a simple SPL program          (comment)
  (define A (matrix(1 2)(2 1)))           (definition)
  (define B (diagonal(3 3)))
  #subname simple                         (directive)
  (tensor (I 2)(compose A B))             (formula)
  ;; This is an invisible comment

Templates
  (template (F n)[ n >= 1 ]
    ( do i=0,n-1
        y(i)=0
        do j=0,n-1
          y(i)=y(i)+W(n,i*j)*x(j)
        end
      end ))
[Callouts on the slide: Pattern, Condition, I-code]

SPL Compiler
[Figure: compiler pipeline. Inputs: SPL Formula, Template Definition. Stages: Parsing (producing the Abstract Syntax Tree and Template Table), Intermediate Code Generation (I-Code), Intermediate Code Restructuring (I-Code), Optimization (I-Code), Target Code Generation (FORTRAN, C)]

Intermediate Code Restructuring
Loop unrolling
  – Degree of unrolling can be controlled globally or case by case
Scalar function evaluation
  – Replace scalar functions with constant value or array access
Type conversion
  – Type of input data: real or complex
  – Type of arithmetic: real or complex
  – Same SPL formula, different C/Fortran programs

Optimizations
Formula Generator
  * High-level scheduling
  * Loop transformation
SPL Compiler
  * High-level optimizations
    - Constant folding
    - Copy propagation
    - CSE
    - Dead code elimination
C/Fortran Compiler
  * Low-level optimizations
    - Instruction scheduling
    - Register allocation

Basic Optimizations (FFT, N=2^5, SPARC, f77 -fast -O5)
[Figure]

Basic Optimizations (FFT, N=2^5, MIPS, f77 -O3)
[Figure]

Basic Optimizations (FFT, N=2^5, PII, g77 -O6 -malign-double)
[Figure]

Performance Evaluation
Evaluation of the performance of the code generated by the SPL compiler
Platforms: SPARC, MIPS, PII
Search strategy: dynamic programming

Pseudo MFlops
$\text{Pseudo MFlops} = \dfrac{\text{\# of FP operations in the algorithm}}{\text{Execution time } (\mu s)}$
Estimation of the # of FP operations:
  – FFT (radix-2): $5n \log_2 n - 10n + 16$

FFT Performance (N=2^1 to 2^6)
[Figure: panels for SPARC, MIPS, PII]

FFT Performance (N=2^7 to 2^20)
[Figure: panels for SPARC, MIPS, PII]
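
The pseudo MFlops metric defined above can be computed with a short C sketch. It reads the radix-2 operation estimate as 5n log2 n - 10n + 16; the n in the linear term is an assumption about the garbled formula. The function names fft_radix2_flops and pseudo_mflops are hypothetical, and the size is passed as log2(n) so that no floating-point logarithm is needed.

```c
#include <assert.h>

/* Estimated FP operation count for a radix-2 FFT of size n = 2^log2n,
   using 5 n log2(n) - 10 n + 16 (the "10 n" term is an assumed reading
   of the formula). */
double fft_radix2_flops(unsigned log2n)
{
    assert(log2n >= 1 && log2n < 31);
    double n = (double)(1UL << log2n);
    return 5.0 * n * (double)log2n - 10.0 * n + 16.0;
}

/* Pseudo MFlops: estimated FP operations divided by the measured
   execution time in microseconds. */
double pseudo_mflops(double flops, double time_us)
{
    assert(time_us > 0.0);
    return flops / time_us;
}
```

Because the operation count is fixed by the transform size rather than by the generated code, two implementations of the same FFT are ranked by pseudo MFlops exactly as they are ranked by execution time.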
