            Performance Programming: Theory, Practice and
                          Case Studies



                              Module I:
                    Measuring Program Performance




Performance Programming    Module I: Measuring Program Performance
                               Outline
       Measuring methodology and guidelines
       Measurement tools
           Timing Tools
           Profiling Tools
           Process monitoring and tracing tools
           System monitoring tools

       Hardware counter measurements
           Monitoring tools
           Code instrumentation

       Parallel performance measurements
           Guidelines and recommendations
           Tools for parallel monitoring

       Summary
             Measurement Methodology
       Quantifying performance is the first step in the
       application tuning process
       Set reasonable expectations for optimization
       Measure repeatedly to identify the parts of the
       program that need to be optimized
       Choose measurement characteristics that suit the
       particular application
       Compare measurements to theoretical peak values
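
The advice to measure repeatedly can be sketched as a small helper that summarizes several timings of the same run. The names `timing_summary` and `summarize_timings`, and the choice of minimum, median and maximum as summary statistics, are assumptions made for illustration; the slides do not prescribe them.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: summary of repeated timings, in seconds. */
typedef struct {
    double min;     /* fastest run: least disturbed by other activity */
    double median;  /* typical run: robust against outliers */
    double max;     /* slowest run: a large min-max gap signals noise */
} timing_summary;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n >= 1 timings without modifying the input array.
 * Returns 0 on success, -1 on bad input or allocation failure. */
int summarize_timings(const double *t, size_t n, timing_summary *out) {
    if (t == NULL || out == NULL || n == 0) return -1;
    double *s = malloc(n * sizeof *s);
    if (s == NULL) return -1;
    memcpy(s, t, n * sizeof *s);
    qsort(s, n, sizeof *s, cmp_double);
    out->min = s[0];
    out->max = s[n - 1];
    out->median = (n % 2) ? s[n / 2] : 0.5 * (s[n / 2 - 1] + s[n / 2]);
    free(s);
    return 0;
}
```

A run whose maximum is far above its median probably overlapped with other system activity and is worth repeating before any tuning decision is based on it.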
                              What to Measure
       Timing measurements
           Wall clock time for a single job (turnaround time)
           Wall clock time for multiple jobs (throughput measurements)
           Wall clock time for parallel runs (scalability measurements)

       Execution and computation rates
           MFLOPS (million floating point operations per second)
           MIPS (million instructions per second)
           IPC (instructions per cycle)

       Resource utilization
           Memory usage
           I/O utilization
           Network usage
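
The execution and computation rates listed above follow directly from raw counts and the elapsed time. The sketch below shows the arithmetic; the type and function names (`rates`, `compute_rates`, `fraction_of_peak`) are hypothetical, and the counts are assumed to come from a hardware counter tool or from an operation count of the algorithm.

```c
#include <stddef.h>

/* Derived rates from raw counts; type and function names are illustrative. */
typedef struct {
    double mflops;  /* million floating-point operations per second */
    double mips;    /* million instructions per second */
    double ipc;     /* instructions per cycle */
} rates;

/* Compute MFLOPS, MIPS and IPC.
 * Returns 0 on success, -1 if seconds or cycles is not positive. */
int compute_rates(double flops, double instructions, double cycles,
                  double seconds, rates *out) {
    if (out == NULL || seconds <= 0.0 || cycles <= 0.0) return -1;
    out->mflops = flops / seconds / 1.0e6;
    out->mips = instructions / seconds / 1.0e6;
    out->ipc = instructions / cycles;
    return 0;
}

/* Fraction of a theoretical peak achieved, e.g. measured MFLOPS / peak MFLOPS. */
double fraction_of_peak(double measured, double peak) {
    return peak > 0.0 ? measured / peak : 0.0;
}
```

With made-up inputs of 2e9 floating-point operations, 6e9 instructions and 4e9 cycles in 4 seconds, this gives 500 MFLOPS, 1500 MIPS and an IPC of 1.5.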



                Benchmarking Guidelines
       Benchmark runs should adequately represent the
       use of the application
       Change only one parameter at a time where possible
       Account for the overhead of the measurement itself
       Run from tmpfs or from a locally mounted UFS file system
       Monitor system activity during the runs
       The system should not run any other computational
       jobs during benchmarking
       Document system parameters and settings together
       with the results of the runs
                          Measurement Tools
       Functionality
           Timing tools
           Profiling tools
           Monitoring tools

       Usage requirements
           Tools that can operate on optimized binaries
           Tools that require recompilation
           Tools that require source code instrumentation

       Parallel / serial measurement tools
           Tools measuring serial performance
           Tools measuring parallel performance




                    Timing Entire Program
       Measuring the elapsed (wall-clock) time that
       passes during program execution
       Example: Solaris time, timex, and ptime




                 Timing Program Portions
       Fortran 77: etime, dtime (neither is thread-safe)
       C, C++, Fortran 90/95: gethrtime
           High resolution timer (nanoseconds)
           Can be called via a C wrapper from Fortran 77
           Can be used for multithreaded applications

       Platform-specific tools and methods
           Solaris microstate accounting
           Fine-grain timing measurements by accessing UltraSPARC
           TICK register directly
         .inline readtick,1
         rd      %tick, %o1
         stx     %o1, [%o0]
         .end
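
gethrtime and the TICK register are platform-specific. As a rough portable counterpart, the sketch below times a code portion with the POSIX `clock_gettime` call and the `CLOCK_MONOTONIC` clock. This is a swapped-in technique, not the method shown on the slide, and the names `now_ns`, `time_call_ns` and `example_work` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds, similar in spirit to gethrtime().
 * Returns -1 if the clock cannot be read. */
int64_t now_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time one call of fn(arg); returns the elapsed wall-clock nanoseconds. */
int64_t time_call_ns(void (*fn)(void *), void *arg) {
    int64_t start = now_ns();
    fn(arg);
    return now_ns() - start;
}

/* Illustrative workload: sums the integers below *arg (a long). */
void example_work(void *arg) {
    long n = *(const long *)arg;
    volatile double sum = 0.0;  /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++) sum += (double)i;
}
```

A caller would pass the region of interest as a function, for example `time_call_ns(example_work, &n)`, and repeat the call several times before drawing conclusions from the result.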



                   Measurement Overhead
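       Computing the overhead of a gethrtime() call

      #include <stdio.h>
      #include <sys/time.h>          /* gethrtime(), hrtime_t (Solaris) */

      int main(void) {
         hrtime_t start, end;        /* nanosecond timestamps */
         int i, iters = 100000;
         for (i = 0; i < iters; i++) {
            start = gethrtime();
            end = gethrtime();       /* back-to-back calls: the difference is the call overhead */
            (void)printf("%lld\n", (long long)(end - start));
         }
         return 0;
      }

       [Figure: "Distribution" histogram of the gethrtime() call overhead.
       x-axis: call overhead in 5 ns bins from 180-185 ns to 225-230 ns;
       y-axis: number of calls, 0 to 22500]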
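
The same overhead experiment can be sketched so that it builds the histogram in memory instead of printing every sample. To keep the example runnable outside Solaris it uses the POSIX `clock_gettime` call in place of gethrtime, which is an assumption of this sketch; the function names and the policy of clamping outliers into the first and last bins are also illustrative choices.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() under strict -std modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (stand-in for gethrtime()). */
static int64_t stamp_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Fill counts[0..nbins-1] with a histogram of back-to-back timer deltas.
 * Bin i covers [lo + i*width, lo + (i+1)*width) ns; values outside the
 * range are clamped into the first or last bin.
 * Returns the smallest delta seen, or -1 on bad arguments. */
int64_t timer_overhead_histogram(int iters, int64_t lo, int64_t width,
                                 int nbins, long *counts) {
    if (iters <= 0 || width <= 0 || nbins <= 0 || counts == NULL) return -1;
    for (int i = 0; i < nbins; i++) counts[i] = 0;
    int64_t best = INT64_MAX;
    for (int i = 0; i < iters; i++) {
        int64_t start = stamp_ns();
        int64_t end = stamp_ns();
        int64_t d = end - start;           /* cost of one timer call, roughly */
        if (d < best) best = d;
        int64_t bin = (d - lo) / width;
        if (d < lo) bin = 0;               /* clamp low outliers */
        if (bin >= nbins) bin = nbins - 1; /* clamp high outliers */
        counts[bin]++;
    }
    return best;
}
```

Calling it with `lo = 180`, `width = 5` and `nbins = 10` reproduces the bin layout of the figure; the counts themselves depend on the machine and the timer implementation.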

          Program Profiling with gprof
       Application profiling
           Special form of timing measurement that shows which functions
           account for large parts of the application run time
           Should be used on multiple, representative test cases

       gprof - standard UNIX profiling utility
           Can be used for profiling executables and shared libraries
           Based on Program Counter (PC) sampling at periodic intervals
           Requires recompilation with -pg (Linux, Solaris, Tru64) or -G
           (HP-UX)
           After the run, the data is written to the gmon.out file
           Profiling results are displayed with the gprof command




                              gprof Output
       Output includes
           Absolute time spent in a function
           Percentage of total run time spent in a function
           Number of calls to the function
           Average time per call

       Functions can be sorted by
           time they consume together with their descendants (cumulative
           or inclusive time)
           time spent executing the function itself (self or exclusive time)
           %   cumulative     self                self     total
          time    seconds   seconds     calls  ms/call   ms/call  name
          66.4      65.70     65.70    186116     0.35      0.35  dmmch_ [4]
          15.2      80.72     15.02     20448     0.73      0.73  dmake_ [8]
          10.9      91.51     10.79     16924     0.64      0.64  dgemm_ [9]
         ...
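
The columns of the flat profile are related by simple arithmetic, which is a useful sanity check when reading a listing. The sketch below recomputes two of them; the helper names are hypothetical and the input numbers are taken from the rows shown above.

```c
/* Average self time per call in milliseconds; returns 0 if calls is 0. */
double self_ms_per_call(double self_seconds, unsigned long calls) {
    return calls ? self_seconds * 1000.0 / (double)calls : 0.0;
}

/* Total profiled run time implied by one row:
 * self seconds divided by that row's share of the total time. */
double implied_total_seconds(double self_seconds, double percent_time) {
    return percent_time > 0.0 ? self_seconds * 100.0 / percent_time : 0.0;
}
```

For dmmch_, 65.70 seconds over 186116 calls gives about 0.35 ms per call, matching the listing, and 65.70 seconds at 66.4% of the time implies a total profiled run time of roughly 99 seconds.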




     Profiling Using Coverage Analysis
       Coverage analysis tools annotate source code
       with the number of times each line was executed
           Basic block profiling
           Results can be accumulated for multiple runs
           Information about hot loops in the code and branches taken
           Code coverage for quality assurance
                               DO 350 L = LL, LL+ LSEC- 1
         150483840 ->             F11 = F11 + T1( L- LL+ 1, I- II+ 1 )*
                              $ T2( L- LL+ 1, J- JJ+ 1 )


       Available on UNIX platforms
           Linux/GNU: gcov
           Solaris: tcov
           IRIX: cvcov, cvxcov
           Tru64: pixie
           AIX: tprof

                 Advanced Profiling Tools
       Measurement parameters and features
           Measurements based on hardware counters
           Profiling by
               functions
               basic blocks
               lines of high level code
               assembly instructions
           Source code annotation
           Capabilities to work with parallel programs
               synchronization overhead
               load balancing monitoring

       Available tools
                Tool             Vendor                            Platforms
                VTune            Intel                             NT
                Analyzer         Sun                               Solaris
                SpeedShop        SGI                               IRIX
                DCPI             DEC / Compaq / HP                 Tru64, NT
     Example: Sun Performance Analyzer (1 of 3)
            Profiling by function and module (no recompilation)




     Example: Sun Performance Analyzer (2 of 3)
            Annotated source (recompilation with -g) and disassembly (no
            recompilation)




     Example: Sun Performance Analyzer (3 of 3)
            Hardware counter overflow profiling




                Process Monitoring Tools
       Tracing tools
           Linux: strace (ltrace for dynamic library calls)
           Solaris: truss (sotruss for dynamic library calls)
           IRIX: par
           Tru64: atom -tool ptrace

       procfs-based tools
           pmap: prints the address space of the program
           pldd: lists the dynamic shared objects linked into the process
           (including ones explicitly attached using dlopen)
           pstack: prints a stack trace for each LWP in the process
           pflags: prints the /proc tracing flags
           ptree: prints the process trees containing the specified pids or users
           pwait: waits for the specified processes to terminate
           pcred: prints the credentials (effective, real, saved UIDs and
           GIDs)
       Example: profiling system calls
           truss on Solaris
           Reports the number of system calls for a process and associated
           time




                 System Monitoring Tools
       Tools for various UNIX platforms
           vmstat, vm_stat, memvis - virtual memory and CPU statistics
           mpstat, mpvis - parallel memory/CPU statistics
           netstat, nfsstat, nfsvis - network status and statistics
           iostat, dkvis - I/O statistics
           sar - system activity report
           top, prstat - list of most active processes
           systat - system activity stats
           lockstat - kernel lock statistics
           dkstat - file status information




     vmstat - Virtual Memory Statistics
           Available on HP-UX, Tru64, Solaris, Linux, FreeBSD, etc.
           Example on Alpha/Tru64
           [Figure: vmstat output with column groups labeled Memory Usage,
           Paging Activity and CPU Usage (System, Idle)]




     Hardware Counter Measurements
       Hardware performance counters allow low-overhead
       runtime measurement of various hardware events
           Cache references
           Cache misses
           Pipeline stalls
           Branch misprediction statistics
           D-TLB (Data Translation Lookaside Buffer) misses
           I-TLB (Instruction Translation Lookaside Buffer) misses
           Bus statistics, including DMA and cache coherency
           transactions on multiprocessor systems
           Others

       Only a few events can be monitored at the same
       time
                      Code Instrumentation
       APIs can be used directly in the code
           High-resolution timing of performance-critical parts of the
           program
           Access to HW performance counters

       Example (Solaris)
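         /* Sample the counters before the loop of interest */
         if (cpc_take_sample(&before) == -1) exit(-1);
         for (k = 0; k < N-1; k++)
            sum = sum + a[k]*b[k];
         /* Sample again; the difference between the samples is the loop's event count */
         if (cpc_take_sample(&after) == -1) exit(-1);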

           Counters specified by setting PERFEVENTS environment
           variable
         example% setenv PERFEVENTS pic0=Load_use,pic1=Load_use_RAW

           Works on UltraSPARC CPUs


     Parallel Measurement Methodology
       Same guidelines as in the serial case
           Parallel benchmarks should be representative of typical uses of
           applications
           Benchmarks must be run so that the results are repeatable and
           consistent
           Probe effects and tool overheads should be minimized

       Specifics of parallel benchmarking
           Parallelism vs. Concurrency
           Dedicated mode of benchmarking
           Number of processors
           Choice of timer and time criterion
           Processor-set configuration
           Processor allocation in clusters



      Timing a Parallel Threaded Program
           timex can be used for parallel timing




           Note that the real time decreases, but the user time, which
           represents combined CPU usage, stays constant
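
The relation between real time and user time can be turned into the usual scalability figures. The sketch below is a minimal illustration; the function names are hypothetical, and the inputs are assumed to be the real, user and sys times reported by a timing tool such as timex.

```c
/* Speedup: serial wall-clock time divided by parallel wall-clock time. */
double speedup(double t_serial, double t_parallel) {
    return t_parallel > 0.0 ? t_serial / t_parallel : 0.0;
}

/* Parallel efficiency: speedup divided by the number of processors used. */
double parallel_efficiency(double t_serial, double t_parallel, int nprocs) {
    return nprocs > 0 ? speedup(t_serial, t_parallel) / (double)nprocs : 0.0;
}

/* Average number of busy CPUs during a run: CPU time over wall-clock time. */
double avg_busy_cpus(double user, double sys, double real) {
    return real > 0.0 ? (user + sys) / real : 0.0;
}
```

With made-up numbers, a job that takes 100 seconds of real time on one thread and 25 seconds on four threads has a speedup of 4 and an efficiency of 1.0; if its user time stays at 100 seconds in the four-thread run, about four CPUs were busy on average.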


                   Specific Parallel Timers
       Timing MPI programs
           The time or timex timers can be used in combination with MPI
           job launch commands (mprun, mpirun, etc.)
           For timing portions of an MPI program, one can use the
           MPI_Wtime function, available in the Fortran, C and C++
           bindings (typically highly accurate)

       Threaded applications can use gethrvtime (Solaris,
       Tru64 with Solaris Compatibility Library)
           Shows the user time on a per-thread basis
           Can be used in combination with gethrtime, which returns
           the elapsed real (wallclock) time on a per-thread basis
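
Outside Solaris, a comparable pairing of per-thread CPU time and wall-clock time can be sketched with the POSIX `clock_gettime` call, using `CLOCK_THREAD_CPUTIME_ID` and `CLOCK_MONOTONIC`. This is an assumed stand-in for gethrvtime and gethrtime, not an equivalent API, and the function names below are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock in nanoseconds; returns -1 if the clock is unavailable. */
static int64_t read_clock_ns(clockid_t id) {
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0) return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* CPU time consumed so far by the calling thread (role of gethrvtime). */
int64_t thread_cpu_ns(void) {
    return read_clock_ns(CLOCK_THREAD_CPUTIME_ID);
}

/* Elapsed wall-clock time from an arbitrary origin (role of gethrtime). */
int64_t wall_ns(void) {
    return read_clock_ns(CLOCK_MONOTONIC);
}
```

Reading both clocks before and after a region inside each thread shows how much of the elapsed time the thread actually spent on a CPU; a large gap between the two suggests waiting on locks, I/O or the scheduler.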




               Parallel System Monitoring
        mpstat - multiprocessor monitoring
           [Figure: mpstat output, one row per CPU ID, with columns for
           cross calls, interrupts, context switches, thread migrations,
           mutex info, system calls and CPU usage. The first snapshot
           shows averages since boot; later snapshots are sample
           measurements.]




                      Kernel Lock Statistics
       Tools that report kernel lock statistics
           lockstat - Solaris, IRIX, AIX, Linux
           lockinfo - Tru64

       Allows one to specify what events to monitor
           spin on adaptive mutex
           block on read access to rwlock due to waiting writers

       On some platforms generates gprof-like output
             # lockstat -IWk example_tnf 24
             ...
             Profiling interrupt: 151649 events in 130.282 seconds (1164 events/sec)
             Count indv cuml rcnt nsec Hottest CPU+PIL          Caller
             --------------------------------------------------------------------
             85698   57% 57% 1.00     188 cpu[12]               mutex_vector_enter
             14247    9% 66% 1.00     160 cpu[9]+10             disp_getwork
             12792    8% 74% 1.00     746 cpu[14]               mutex_tryenter
             10359    7% 81% 1.00     280 cpu[5]                (usermode)
              1951    1% 82% 1.00      59 cpu[1]                splx
              1648    1% 84% 1.00     365 cpu[5]+10             _resume_from_idle
              1510    1% 85% 1.00     490 cpu[9]+10             disp
              1259    1% 85% 1.00     255 cpu[15]+10            setfrontdq
     Binding a Program To a Set of Processors

       Process monitoring can be difficult on multiprocessor
       systems due to process migration
       Single-threaded programs
           One can bind to a processor

       For multithreaded programs
           One can use processor sets

       Commands to set up and use processor sets
           psrset (HP-UX, Solaris)
           pset (IRIX)
           pset_create, pset_assign_cpu,
           pset_assign_pid, etc. (Tru64)



                              Summary
       Monitoring performance is essential to optimization
           If you cannot measure it, you cannot improve it

       Important to select benchmarks carefully and
       identify parameters to measure
       Select tools suitable for the task
           System-wide or process-specific?
           Parallel or serial?
           Require recompilation or instrumentation?
           Need source-level information?
           Need hardware counter information?



