Profile Guided Optimizations in Visual C++ 2005
Andrew Pardoe
Phoenix Team (C++ Optimizer)
What do optimizers do?
int setArray(int a, int *array)
{
    int x;
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}
   The compiler knows nothing about the value of 'a'
   The compiler knows nothing about the array's alignment
   The compiler doesn't look at all the source files together
   The compiler doesn't know how the program will execute
What is PGO (pronounced PoGO)?
   A “profile” details a program's behavior in a specific scenario
   Profile-guided optimizations use the profile to guide the
    optimizer for that given scenario
   PGO tells the optimizer which areas of the application
    were most frequently executed
   This information lets the optimizer be more selective in
    optimizing the program
   PGO has its own set of optimizations as well as
    improving traditional optimizations
Example of a PGO win
   Compiler optimizations make assumptions based on static
    analysis and standard heuristics
       For example, we assume that a loop executes multiple times
         for (p = list; *p; p = p->next)
           p->f = sqrt(F);
       The optimizer would hoist the call to the loop-invariant sqrt(F)
         tmp = sqrt(F);
         for (p = list; *p; p = p->next)
           p->f = tmp;
       If the profile shows that p is typically zero (the loop rarely
        executes), we will not hoist the call
How is PGO used?

Source code + PGO probes  →  instrumented binary
Run scenarios             →  profile
Source code + profile     →  optimized binary
How is PGO used?
   PGO is built on top of Link-Time Code Generation
   Must link object files twice: once for instrumented build,
    once for optimized build
   Can be used on almost all native code
       exe, dll, lib
       COM/MFC
       Windows services
   Cannot be used on system or managed code
       Drivers or kernel mode code
       No code compiled with /CLR
   Incorrect scenarios could cause worse optimizations!
PGO profile gathering
   Two major themes of PGO profile gathering
     Identify “hot paths” in program execution path and
      optimize to make these paths perform well
     Likewise, identify “cold paths” to separate cold code—
      or dead code—from hot code
     Identify “typical” values such as switch values, loop
      induction variables and targets of indirect calls and
      optimize code for these values
PGO main optimizations: inlining
   Improved inlining heuristics
     Inline based on frequency of call, not function size
      or depth of call stack
      “Hot” call sites: inline aggressively
     “Cold” call sites: only inline if there are other
      optimization opportunities (such as folding)
     “Dead” call sites: only inline the trivial cases
PGO main optimizations: inlining
   Speculative inlining: used for virtual call speculation
     Indirect calls are profiled to find typical targets
     An indirect call heavily biased toward certain
      target(s) can be multi-versioned
     The new sequence contains direct call(s) to
      typical target(s), which can be inlined
   Partial inlining: only inline the portions of the callee
     we execute. If the cold code is reached, call the
     non-inlined function.
PGO main optimizations: code size
   Choice of favoring size versus speed made on a per-
    function basis
           Program execution should be dominated by functions optimized for
            speed and less-frequently used functions should be small
   PGO computes a dynamic instruction count for each
    profiled function.
       Inlining effects are taken into account.
   Sorts functions in descending order by count.
   Functions in the upper 99% of total dynamic instruction
    count are optimized for speed. Others are compressed.
   In large applications (Vista, SQL) most functions are
    optimized for size.
PGO main optimizations: locality
   Reorder the code to “fall through” wherever possible
   Intra-function layout reorders basic blocks so that the
    major trace falls through whenever possible.
   Inter-function layout tries to place frequent caller-callee
    pairs near one another in the image.
   Extract “dead” code from the .text section and put it in a
    remote section of the image
   Dead code can be entire functions that are not called or
    basic blocks inside a function
   Penalty for being wrong is very large so the profile must
    be accurate!
What code benefits most?
   C++ programs: many virtual calls can be inlined once the
    target is determined through profiling
   Large applications where size and speed are important
   Code with frequent branches that are difficult to predict
    at compile time
   Code which can be separated by profiling into “hot” and
    “cold” blocks to help instruction cache locality
   Code for which you know the typical usage patterns and
    can produce accurate profiling scenarios
Scenario 1
   Customer compiles with /O2 and gets pretty good
    performance but wants to take advantage of advanced
    optimizations like LTCG and PGO
   Code is tested by the dev team throughout development
    cycle using unit and bug regression tests
   Customer has done performance measurements of the
    code. Customer has no automated tests to measure
    performance but believes it can improve.
   Is this customer ready to try PGO? Probably not.
Scenario 2
   Customer has well-defined performance goals and tests
    set up to measure performance
   Customer knows typical usage patterns for the application
   Application is being built with LTCG
   Most of the execution time is spent in tightly-nested
    loops doing heavy floating-point calculations
   Is this customer ready to use PGO? Maybe…
Scenario 3
   Customer has well-defined performance goals and tests
    set up to measure performance
   Customer knows typical usage patterns for the application
   Application is being built with LTCG
   Application spends most of its time in branches and calls
   Application is fairly large and makes use of inheritance
   Is this customer ready to use PGO? Definitely.
Scenario 4
   Customer has a build lab and wants to enable PGO in
    nightly builds
   But profiling every night seems too expensive
   Solution: PGO Incremental Update
       Avoid running profile scenarios at every build
       PGU uses “stale” profile data
       Can check in profile data and refresh weekly
   PGU restricts optimizations
       Functions which have changed will not be optimized
       Effects of localized changes are usually negligible
PGO sweeper
   Some scenarios are difficult to collect profile data for
       Profile scenario may not begin and end with application launch
        and shutdown
       Some components cannot write a file
       Some components cannot link to the PGO runtime DLL
   PGO sweeper collects profile data from running
    instrumented processes
   This allows you to close a currently open .pgc file and
    create a new one without exiting the instrumented binary
   You get one .pgc file per run or sweep. You can delete any
    .pgc files you do not want reflected in your scenario.
PGO Manager
   PGO manager adds profile data from one or more .pgc
    files into the .pgd file
   The .pgd file is the main profile database
   Allows you to profile multiple scenarios (.pgc) for a single
    codebase into one profile database (.pgd)
   PGO manager also lets you generate reports from the
    .pgd file to see that your scenarios “feel right” in the code
   Information in the reports includes
       Module count, function count, arc and value count
       Static (all) instruction count, dynamic (hot) instruction count
       Basic block count, average basic block size
       Function entry count
How much performance does PGO get?
   Performance gain is architecture and application specific
       IA64 sees biggest gains
        x64 benefits more than x86
       Large applications benefit more than small: SQL server saw
        over 30% gains through PGO
       Many parts of Windows use PGO to balance size vs. speed
   If you understand your real-world scenarios and have
    adequate, repeatable tests PGO is almost always a win
   Once your testing is in-place integrating PGO into your
    build process should be easy
Performance gains over LTCG
Call-graph profiling
   Given this call graph, determine which code paths are hot
    and which are cold


          foo  →  bar  →  baz

Call-graph profiling continued
   Measure the frequency of calls

          a   --10-->   bar  --75-->  baz
          foo --20-->   bar  --50-->  baz
          bat --100-->  bar  --15-->  baz
Call-graph profiling after inlining
   Inline functions based on call profile
       Highest-frequency calls are (bar, baz) and (bat, bar)


          foo --20-->   bar  --125--> baz
          bat --100-->  bar  --15-->  baz
Reordering basic blocks
   Change code layout to improve instruction cache locality

        Execution profile:       Default layout:    Optimized layout:

                 A                     A                  A
           100 /   \ 100               B                  C
              B     C                  C                  D
           100 \   / 100               D                  B
                 D
Speculative inlining of virtual calls
   Profiling shows the dynamic type of object A in function
    Func was almost always Foo (and almost never Bar)
            class Base {
              …
              virtual void call();
            };

            class Foo : Base {        class Bar : Base {
              …                         …
              void call();              void call();
            };                        };

            void Func(Base *A)
            {
              while (true) {
                …
                if (type(A) == Foo)
                  A->call();    // inline of A->call() for Foo
                else
                  A->call();    // virtual dispatch
                …
              }
            }
Partial inlining
Profiling shows that condition Cond favors the left branch
over the right branch:

               Basic Block 1
               /           \
         Hot Code        Cold Code
               \           /
                More Code
Partial inlining concluded
We can inline the hot path, and not the cold path. We can
make different decisions at each call site!

               Basic Block 1
               /           \
         Hot Code        Cold Code
               \           /
                More Code
 Using PGO (in more detail)
Source code   --(compile with /GL and opts)-->   object files

Object files  --(link with /LTCG:PGI)-->   instrumented binary + .PGD file

Scenarios     --(run instrumented binary)-->   .PGC file(s)

Object files + .PGC files + .PGD file
              --(link with /LTCG:PGO)-->   optimized binary
PGO tips
   The scenarios used to generate the profile data should be real-
     world scenarios. The scenarios are NOT an attempt to do
    code coverage.
   Using scenarios to train with that are not representative of
    real-world use can result in code that performs worse than if
    PGO was not used.
   Name the optimized code something different from the
    instrumented code, for example, app.opt.exe and app.inst.exe.
    This way you can rerun the instrumented application to
    supplement your set of scenario profiles without rerunning
    everything again.
   To tweak results, use the /clear option of pgomgr to clear out
    a .PGD file.
PGO tips
   If you have two scenarios that run for different amounts
    of time, but would like them to be weighted equally, you
    can use the weight switch (/merge:weight in pgomgr) on
    .PGC files to adjust them.
   You can use the speed switch to change the speed/size trade-off
   You can control the inlining threshold with a switch but
    use it with care. The values from 0-100 aren't linear.
   Integrate PGO into your build process and update
    scenarios frequently for the most consistent results and
    best performance increases.
In summary
   Using PGO is very easy, with four simple steps
       CL to parse the source files
           cl /c /O2 /GL *.cpp
       LINK / PGI to generate instrumented image
           link /ltcg:pgi /pgd:appname.pgd *.obj *.lib
           Also generates a PGD file (PGO database)
       Run your program on representative scenarios
           Generates PGC files (PGO profile data)
       LINK / PGO to generate optimized image
           Implicitly uses the generated PGC files
           link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
More information
   Matt Pietrek's Under the Hood column from May 2002 has
    a fantastic explanation of LTCG internals
   Multiple articles on PGO located on MSDN
       The links are long: just search for PGO on MSDN
   Look through articles by Kang Su Gatlin on his blog or on MSDN
   Improvements are coming in the new VC++ backend
       Based on the Phoenix optimization framework
       Profiling is a major scenario for the Phoenix-based optimizer
       There will be a talk on Phoenix later today
