       An Open64-based Compiler Approach to
       Performance Prediction and Performance
        Sensitivity Analysis for Scientific Codes

                        Jeremy Abramson and Pedro C. Diniz
              University of Southern California / Information Sciences Institute
                               4676 Admiralty Way, Suite 1001
                              Marina del Rey, California 90292

Motivation




    • Performance analysis is conceptually easy
              – Just run the program!
                   • The “what” of performance. Is this interesting?
              – Is that realistic?
                   • Huge programs with large data sets
                        – “Uncertainty principle” and intractability of profiling/instrumenting

    • Performance prediction and analysis is in practice very hard
              – Not just interested in wall clock time
                   • The “why” of performance is a big concern
                   • How to accurately characterize program behavior?
                   • What about architecture effects?
                        – Can’t reuse wall clock time
                        – Can reuse program characteristics

Motivation (2)




    • What about the future?
                  • Different architecture = better results?
                  • Compiler transformations (loop unrolling)
    • Need a fast, scalable, automated way of determining program
      characteristics
              – Determine what causes poor performance
                  • What does profiling tell us?
                  • How can the programmer use profiling (low-level) information?

Overview




    • Approach
              – High level / low level synergy
              – Not architecture-bound
    • Experimental results
              – CG core
    • Caveats and future work
    • Conclusion

Low versus High level information




     la   $r0, a
     lw   $r1, i
     mult $offset, $r1, 4
     add  $offset, $offset, $r0        or        b = a[i] + 1
     lw   $r2, 0($offset)
     add  $r3, $r2, 1
     la   $r4, b
     sw   $r3, 0($r4)



    • Which can provide meaningful performance information to
      a programmer?
    • How do we capture the information at a low level while
      maintaining the structure of high level source?

Low versus High level information (2)




    • Drawbacks of looking at low-level
              – Too much data!
              – You found a “problem” spot. What now?
                  • How do programmers relate information back to source level?
    • Drawbacks of looking at source-level
              – What about the compiler?
                  • Code may look very different
              – Architecture impacts?
    • Solution: Look at the high-level structure, and try to anticipate the compiler

Experimental Approach



    • Goal: Derive performance expectations from source code for different
      architectures
              – What should the performance be and why?
              – What is limiting the performance?
                  • Data dependencies?
                  • Architecture limitations?
    • Use high level information
              – WHIRL intermediate representation in Open64
                  • Arrays not lowered
    • Construct DFG
              – Decorate graph with latency information
    • Schedule the DFG
              – Compute as-soon-as-possible schedule
              – Variable number of functional units
                  • ALU, Load/Store, Registers
                  • Pipelining of operations
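
    A minimal C++ sketch of this scheduling step, for illustration only (not the actual
    SLOPE/Open64 code): an as-soon-as-possible list schedule of a latency-annotated DFG
    with a configurable number of ALUs and load/store units, assuming fully pipelined
    units. Node names, latencies, and unit counts below are assumptions.

      #include <algorithm>
      #include <cstdio>
      #include <map>
      #include <string>
      #include <vector>

      enum class Unit { ALU, LoadStore };

      struct Node {
          std::string what;            // operation (e.g. the WHIRL operator it came from)
          Unit unit;                   // functional-unit class that executes it
          int latency;                 // cycles, decorated onto the DFG node
          std::vector<int> preds;      // indices of data-dependence predecessors
      };

      // As-soon-as-possible schedule under a limited number of fully pipelined
      // functional units: a node issues once its operands are ready and a unit of
      // the right class has a free issue slot. Nodes are in topological order.
      int schedule(const std::vector<Node>& dfg, std::map<Unit, int> units) {
          std::vector<int> finish(dfg.size(), 0);
          std::map<Unit, std::vector<int>> next_free;   // per-unit next free issue cycle
          for (const auto& [u, count] : units) next_free[u].assign(count, 0);

          int total = 0;
          for (size_t i = 0; i < dfg.size(); ++i) {
              int ready = 0;                            // cycle when all operands are ready
              for (int p : dfg[i].preds) ready = std::max(ready, finish[p]);

              auto& slots = next_free[dfg[i].unit];     // pick the earliest-free unit
              size_t best = 0;
              for (size_t s = 1; s < slots.size(); ++s)
                  if (slots[s] < slots[best]) best = s;

              int issue = std::max(ready, slots[best]);
              slots[best] = issue + 1;                  // pipelined: one issue per cycle
              finish[i] = issue + dfg[i].latency;
              total = std::max(total, finish[i]);
          }
          return total;                                 // cycles for this DFG
      }

      int main() {
          // Illustrative DFG for "B = A[i] + 1" with assumed latencies.
          std::vector<Node> dfg = {
              {"load i (register)",     Unit::ALU,       1, {}},
              {"array address calc",    Unit::ALU,       1, {0}},
              {"load A[i] (cache hit)", Unit::LoadStore, 3, {1}},
              {"add 1",                 Unit::ALU,       1, {2}},
              {"store B",               Unit::LoadStore, 3, {3}},
          };
          printf("2 ALUs, 1 LSU: %d cycles\n",
                 schedule(dfg, {{Unit::ALU, 2}, {Unit::LoadStore, 1}}));
      }

    Re-running schedule() with different unit counts is the kind of architectural
    sensitivity sweep reported for CG later in the talk.
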

Compilation process




1. Source (C/Fortran):

       for (i = 0; i < n; …)
          …
          B = A[i] + 1
          …

2. Open64 WHIRL (High-level):

       OPR_STID: B
          OPR_ADD
             OPR_ARRAY
                OPR_LDA:  A
                OPR_LDID: i
             OPR_CONST: 1

3. Annotated DFG

Memory modeling approach




[Figure: annotated DFG with memory-modeling callouts]
    • The Array node represents the address calculation at a high level
    • i is a loop induction variable: register hit? Assign latency
    • The array expression is affine: assume a cache hit, and assign latency accordingly
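
    A minimal sketch of this latency-assignment rule; the cycle counts and the fallback
    to a full memory latency for non-affine references are illustrative assumptions, not
    the tool's values.

      #include <cstdio>
      #include <string>

      // Assumed cycle counts for the cases the model distinguishes.
      constexpr int kRegisterLatency = 1;   // operand already lives in a register
      constexpr int kCacheHitLatency = 3;   // affine array reference: assume a cache hit
      constexpr int kMemoryLatency   = 60;  // assumed fallback for non-affine references

      struct Operand {
          std::string name;        // e.g. "i" or "A[i]"
          bool is_induction_var;   // scalar loop induction variable
          bool is_array_ref;       // array element access
          bool affine_subscript;   // subscript is affine in the loop induction variables
      };

      // Decide which latency to decorate the corresponding DFG node with.
      int assign_latency(const Operand& op) {
          if (op.is_induction_var)                     // "i is a loop induction variable"
              return kRegisterLatency;                 //   -> register hit
          if (op.is_array_ref && op.affine_subscript)  // "array expression is affine"
              return kCacheHitLatency;                 //   -> assume a cache hit
          return kMemoryLatency;                       // assumed: charge memory latency
      }

      int main() {
          Operand refs[] = {
              {"i",    true,  false, false},   // induction variable -> register
              {"A[i]", false, true,  true },   // affine subscript   -> cache hit
          };
          for (const Operand& r : refs)
              printf("%-6s -> %d cycles\n", r.name.c_str(), assign_latency(r));
      }
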

Example: CG




    do 200 j = 1, n
      xj = x(j)
      do 100 k = colstr(j), colstr(j+1)-1
        y(rowidx(k)) = y(rowidx(k)) + a(k)*xj
  100 continue
  200 continue
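
    Read against the memory-modeling rule above, the references in this kernel would
    plausibly be classified as sketched below; the treatment of the indirect access
    y(rowidx(k)) is assumed here, since only the affine case is stated explicitly.

      #include <cstdio>

      // Assumed classification of the CG inner-loop references under the affine test.
      struct RefNote { const char* ref; const char* assumed_class; };

      int main() {
          const RefNote refs[] = {
              {"x(j)",         "affine in j        -> assume cache hit"},
              {"colstr(j)",    "affine in j        -> assume cache hit"},
              {"colstr(j+1)",  "affine in j        -> assume cache hit"},
              {"rowidx(k)",    "affine in k        -> assume cache hit"},
              {"a(k)",         "affine in k        -> assume cache hit"},
              {"y(rowidx(k))", "indirect subscript -> no cache-hit assumption"},
          };
          for (const RefNote& r : refs)
              printf("%-14s %s\n", r.ref, r.assumed_class);
      }
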

CG Analysis Results



[Figure 4. Validation results of CG on a MIPS R10000 machine: bar chart of cycles per
 iteration, for all iterations and for the outer loop only, comparing the optimized
 code (-O3 compiler switch), the non-optimized code, and the generated prediction.]



     Prediction results consistent with un-optimized version of the code

CG Analysis Results (2)



[Figure 5. Cycle time for an iteration of CG with varying architectural configurations:
 cycles versus the number of load/store units (1-5 LSUs), with one curve for 1 floating
 point unit and one for 5 floating point units.]




    • What’s the best way to use processor space?
              – Pipelined ALUs?
              – Replicate standard ALUs?

Caveats, Future Work



    • More compiler-like features are needed to improve accuracy
              – Control flow
                   • Implement trace scheduling
                   • Multiple paths can give upper/lower performance bounds
              – Simple compiler transformations (a strength-reduction sketch follows this list)
                   • Common sub-expression elimination
                   • Strength reduction
                   • Constant folding
              – Register allocation
                   • “Distance”-based methods?
                   • Anticipate cache for spill code
              – Software pipelining?
                   • Unrolling exploits ILP
    • Run-time data?
              – Array references, loop trip counts, access patterns from performance
                skeletons
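
    As a concrete example of the "simple compiler transformations" item, a strength-reduction
    sketch (illustrative code, not taken from the paper): anticipating this kind of rewrite
    keeps the predicted operation mix closer to what an optimizing compiler actually emits.

      #include <cstdio>

      // Before: each access computes its address as a + k*sizeof(double),
      // i.e. a multiply (or shift) plus an add per iteration.
      double sum_before(const double* a, int lo, int hi) {
          double s = 0.0;
          for (int k = lo; k < hi; ++k)
              s += a[k];
          return s;
      }

      // After strength reduction: the multiply is gone; the address is a pointer
      // bumped by one element per iteration.
      double sum_after(const double* a, int lo, int hi) {
          double s = 0.0;
          for (const double* p = a + lo; p != a + hi; ++p)
              s += *p;
          return s;
      }

      int main() {
          double a[4] = {1, 2, 3, 4};
          printf("%g %g\n", sum_before(a, 0, 4), sum_after(a, 0, 4));
      }
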

Conclusions




    • SLOPE provides very fast performance prediction and
      analysis results
    • High-level approach gives more meaningful information
              – Still try to anticipate compiler and memory hierarchy
    • More compiler transformations to be added
              – Maintain high-level approach, refine low-level accuracy