Characterization of GCC 2.96 and GCC 3.1 generated code with OProfile
William Cohen
Red Hat, Inc. wcohen@redhat.com

OProfile, a low-overhead, system-wide profiler, was used to collect data while building GCC 3.1 from source with the Red Hat GCC 2.96 and FSF GCC 3.1 compilers. The execution times of programs do not give insight into why the code generated by one compiler is faster or slower than the code generated by another. OProfile uses the performance monitoring hardware of the processor to identify the performance changes caused by data memory references, cache misses, branch mispredictions, and processor pipeline stalls. OProfile shows that the GCC 3.1 compiler generates better code than the Red Hat GCC 2.96 compiler: metrics computed from data collected by OProfile, such as Clocks Per Instruction (CPI), clocks per micro-operation, and instruction cache misses, are improved. OProfile also shows that the processor spends a significant fraction of its time (more than 15.3%) waiting for outstanding instruction cache misses, and that GCC 3.1 generated code makes fewer memory references.

Table of Contents
Introduction
OProfile
Experiment Environment
GCC Characterization
Future Work
Summary
Bibliography

Introduction
The complexity of computer systems makes it likely that no single engineer understands the entire system in great detail. That complexity also leads to unexpected interactions, so an engineer's intuition about the underlying cause of a performance problem, and about its magnitude, is often wrong. Engineers need accurate system-wide instrumentation to help characterize system performance and to speed the identification of the underlying causes of performance problems. OProfile [OProfile02] is a low-overhead, system-wide profiling tool. It makes use of the performance monitoring hardware in the Intel Pentium Pro/II/III [Intel97] and AMD Athlon [AMD02] processors to measure system performance. OProfile data characterizes the quality of code generated by a compiler; the build of GCC 3.1 was used as a test case. OProfile shows that the GCC 3.1 compiler generates better code than the GCC 2.96 compiler. Metrics computed from data collected by OProfile, such as Clocks Per Instruction (CPI), clocks per micro-operation, and instruction cache misses, are improved. OProfile shows that a significant fraction of time (more than 15.3%) is spent waiting for instruction cache misses, and that GCC 3.1 generated code makes fewer memory references.

OProfile
OProfile works in a manner similar to the DEC Continuous Profiling Infrastructure (DCPI) [DEC02]. OProfile installs a Linux kernel module that collects profiling data with the aid of the performance monitoring hardware of the Intel Pentium Pro/II/III or AMD Athlon processors. OProfile's data collection mechanism programs the performance monitoring hardware to raise an interrupt after a certain number of events have been counted, for example 100,000 instructions retired. The interrupt causes OProfile to record the current program counter. This data is recorded on a per-executable basis: for each executable with samples, there is a file listing those samples. OProfile also provides tools to do simple analysis on the sample files. The tools calculate the total number of events for each executable and the percentage of total events for each executable.

There are two major advantages of OProfile over profiling tools such as gprof [Fenlason98]: OProfile does not require changes to an executable to collect data on it, and OProfile is a system-wide profiler. For profiling complex processes such as bootstrapping the GCC compiler, these characteristics are essential. Because OProfile does not require code to be compiled with special profiling options, the ordinary system code on a machine can be profiled. This approach also avoids artifacts created by instrumentation code. When the compiler instruments code to produce data for gprof, it produces different code than the uninstrumented build. This can make a significant difference for leaf functions. An uninstrumented leaf function makes no calls to other functions, so the compiler can aggressively optimize it and avoid saving frame pointers and registers. When the same function is compiled with gprof instrumentation, a call to another function, mcount, is made. This additional call requires the compiler to produce additional prologue and epilogue code to save and restore registers in a frame. For a small function, these additional instructions can significantly change its characteristics.
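As an illustration, consider a small leaf function (a minimal sketch; the function and file name are hypothetical):

/* leaf.c: a leaf function; it calls no other function, so an
 * optimizing compiler can keep the work in registers and emit
 * no frame set-up at all. */
int clamp_nonnegative(int x)
{
    return x < 0 ? 0 : x;
}

Compiling this with gcc -O2 -S leaf.c and again with gcc -O2 -pg -S leaf.c shows the difference: the -pg version gains a call to mcount plus the prologue and epilogue code needed to support it, which can dwarf the two or three instructions that do the real work.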

The gprof tool only allows data collection on a single thread of an application, and the profiling data is saved only when the program exits successfully. This prevents data collection for processes spawned by a thread and for programs that normally do not exit, such as daemons. OProfile collects samples on a system-wide basis and stores the data in a repository; thus, OProfile collects samples for all processes running on the system. OProfile also allows the user to flush data to the repository, so data from processes that are still running can be examined.

Experiment Environment
The data collection presented in this document was performed on a Dell Inspiron 4100 with 256MB of memory and a 1GHz Pentium III mobile processor with a 512KB L2 cache. Differences in IA32 architectures may limit the applicability of this data to other processors such as the Intel Pentium 4 and AMD Athlon. Table 1 lists the hardware and software used in this environment.

Table 1. Data Collection Environment

Hardware:
    Dell Inspiron 4100
    Intel(R) Pentium(R) III Mobile CPU, 1000MHz
    L1 I cache: 16K, L1 D cache: 16K
    L2 cache: 512K
    256MB DRAM
    40GB hard disk (Model IC25N040ATCS04-0)

Software:
    Red Hat Linux 7.2
    GCC version 2.96 20000731 (Red Hat Linux 7.2 default compiler)
    Linux kernel 2.4.9-21
    XFree86-4.1.99 (snapshot 20011120)
    OProfile snapshot from 20020522
    GCC 3.1 release

The Red Hat Linux 7.2 operating system, which includes the GCC 2.96 compiler, was installed on the experimental machine. A new kernel (2.4.9-21) was configured, built, and installed on the machine for the OProfile software. A new version of the X server was required to support the Radeon Mobility chipset in this machine. The OProfile software was then configured and installed. Power management was turned off in the BIOS because the non-maskable interrupts (NMIs) used to trigger sampling are not compatible with it. The GCC 3.1 release was checked out via anonymous CVS from gcc.gnu.org. A separate build directory, /home/wcohen/gcc31/native/, was made and the following configuration was done in the build directory:
../gcc/configure --prefix=/home/wcohen/gcc31/native.install

The compiler was built using the following command:
time make bootstrap >& problems

The compiler was installed using the following command:
make install

The newly created GCC 3.1 compiler executable is used in a number of the experiments. It is used to gauge both the efficiency with which GCC 3.1 compiles code and the efficiency of the code it generates. The source code for the experiments is the GCC 3.1 source code itself. Because the Pentium III has only two performance monitoring registers, multiple data collection runs were made, with an "rm -rf *" performed in the build directory before each run.

GCC Characterization
The GCC 3.1 build was characterized to gauge the level of improvement in GCC since the earlier GCC 2.96 compiler included with Red Hat Linux 7.2, and to determine whether there were any obvious changes that could be made to improve performance. Overall system performance was measured using the following commands, run in an empty directory:
../gcc/configure --prefix=/home/wcohen/gcc31/data.install; make >& problems

OProfile was started immediately before the command above was entered. Each time, OProfile measured two events using the processor's performance monitoring hardware. After the build completed, OProfile data collection was stopped. In most cases, pairs of measurements were taken so that ratios, such as cache miss rates and Clocks Per Instruction, could be calculated. It is also useful to compare the absolute counts between the build made with the default GCC 2.96 compiler and the build made with the newly built GCC 3.1 compiler. The largest consumers of the processor's time are cc1 (the main C to assembly language translator), jc1 (the Java to assembly language translator), and cc1plus (the C++ to assembly language translator). The jc1 and cc1plus programs are created during the build process for GCC 3.1. The collected data lists two cc1 programs: the one used for the build process and the one built during the build process. Because we are mainly concerned with the quality of the code generated by the GCC 2.96 and GCC 3.1 compilers, we will only examine the metrics for the executables built and run during the build process (the jc1, cc1plus, and cc1 built from the GCC 3.1 sources).

Overall System Profile
One significant advantage of OProfile over gprof is that OProfile can provide a system-wide view. For complex tasks such as a build of the compiler, which uses multiple processes, this is a significant advantage. Table 2 shows the top ten executables that consume processor time building the GCC 3.1 compiler. Two builds were done for this experiment: the first used the default GCC 2.96 compiler and the second used the GCC 3.1 compiler built and installed in a separate directory, /home/wcohen/gcc31/native.install/. The first column in the table lists the number of samples recorded; each sample represents 498,000 clock cycles on the processor (CPU_CLK_UNHALTED). The second column is the percentage of the total samples for that data run. The last column is the executable. This data is the output of OProfile's op_time command. It is not surprising that the main portion of the GCC compiler, cc1, consumes more time in the GCC 3.1 compiler than in the GCC 2.96 compiler. Much of the work done on GCC 3.1 has been to improve optimization and the quality of the generated code [gcc02]. There has been little work done to tune the compiler itself and reduce its runtime.

Table 2. CPU_CLK_UNHALTED (each sample 498,000 clock cycles) for GCC 3.1 builds using GCC 2.96 and GCC 3.1

Build using GCC 2.96:
Count    %Total    Executable
874127   24.6806   /home/wcohen/gcc31/data/gcc/jc1
667554   18.8481   /usr/lib/gcc-lib/i386-redhat-linux/2.96/cc1
502884   14.1987   /home/wcohen/gcc31/data/gcc/cc1plus
456297   12.8833   /lib/modules/2.4.9-21custom/build/vmlinux
238561    6.7357   /bin/bash
184439    5.2075   /home/wcohen/gcc31/data/gcc/cc1
178924    5.0518   /bin/sed
146445    4.1348   /usr/bin/as
66654     1.8819   /usr/lib/gcc-lib/i386-redhat-linux/2.96/cpp0
35570     1.0043   /usr/bin/ld

Build using GCC 3.1:
Count    %Total    Executable
989124   26.5881   /home/wcohen/gcc31/native.install/lib/gcc-lib/i686-pc-linux-gnu/3.1/cc1
825818   22.1984   /home/wcohen/gcc31/data/gcc/jc1
474786   12.7625   /home/wcohen/gcc31/data/gcc/cc1plus
458134   12.3148   /lib/modules/2.4.9-21custom/build/vmlinux
240797    6.4727   /bin/bash
180797    4.8599   /bin/sed
173775    4.6711   /home/wcohen/gcc31/data/gcc/cc1
156436    4.2051   /usr/bin/as
33054     0.8885   /usr/bin/ld
29886     0.8033   /home/wcohen/gcc31/data/gcc/genattrtab

The GCC build process uses some of the newly created executables, such as cc1plus and jc1, to build libraries. Table 2 shows that the cc1plus and jc1 generated by GCC 3.1 are slightly more efficient than the cc1plus and jc1 generated by GCC 2.96. The GCC 3.1 generated versions have lower counts: 28,098 (5.6%) lower for cc1plus, 10,664 (5.8%) lower for cc1, and 48,309 (5.5%) lower for jc1. GCC 3.1 normally incorporates the preprocessing performed by a separate executable, cpp0, into cc1. A cpp0 is still available in GCC 3.1 and is used in some cases. However, the amount of time spent in the GCC 3.1 cpp0 is reduced compared to the GCC 2.96 build, where cpp0 accounted for 1.8% of build time; cpp0 does not make the top ten consumers of processor time for the build using GCC 3.1. There are many other metrics that can be collected by OProfile or computed from OProfile data. The following subsections discuss these metrics and the observations for the GCC 3.1 builds with the GCC 2.96 and GCC 3.1 compilers.

Clocks Per Instruction (CPI)
Hennessy and Patterson [Hennessy90] describe a simple (although very approximate) formula to estimate the number of clock cycles required to execute a program. It ignores data dependent variations in instruction cost (for example, branch taken versus not taken) and interactions between instructions. The formula assumes n distinct instruction types; the Cycles Per Instruction for instruction type i is CPIi, and the relative count of that instruction type is Ii:

    CPU clock cycles = Σ (i = 1 to n) CPIi × Ii

There are two basic approaches to reducing the number of clock cycles: keep CPI constant while reducing the number of instructions executed, or keep the number of instructions executed constant while reducing CPI. The formula can be simplified by lumping all instructions into a single category:

    CPU clock cycles = CPIavg × I

With algebraic manipulation, the formula can be rewritten as:

    CPIavg = CPU clock cycles / I

CPIavg can be estimated by computing the ratio of CPU_CLK_UNHALTED events to INST_RETIRED events on the Pentium III processor. The Pentium III can retire a maximum of three instructions per cycle, a CPI of .333. The estimated CPIs for cc1, cc1plus, and jc1, shown in Table 3, are uniformly better for the executables built with GCC 3.1 than for those built with GCC 2.96. However, the CPI is still considerably higher than the minimum of .333. Further analysis is required to determine the contributing factors.

Table 3. CPI Estimates

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   1.42              1.32
cc1plus               1.56              1.47
jc1                   1.32              1.24

The x86 instruction set allows relatively complex operations, such as read-modify-write operations, to be encoded as a single instruction. The Pentium Pro and later IA32 processors (Pentium II and III) convert x86 instructions into micro-operations (UOPS). The decoder that performs this conversion has restrictions: in a single cycle, it can decode one instruction of up to four micro-operations followed by two instructions of a single micro-operation each. If an instruction requires more than four micro-operations, the decoder can decode only that instruction in the cycle. In some cases, better performance can be obtained by using a sequence of simpler x86 instructions in place of a more complex x86 instruction, and the GCC x86 code generator takes this into account. A clearer picture of performance may therefore be obtained by estimating the clocks required per micro-operation. The Pentium Pro/II/III hardware allows measurement of UOPS retired. Table 4 shows the estimated clocks per UOP. The clocks per UOP retired is slightly higher for the GCC 3.1 cc1plus and jc1. However, this is because GCC 3.1 eliminates more UOPS while the execution time does not drop linearly with the eliminated UOPS, not because GCC 3.1 generates worse code. Still, the code is not executing UOPS at the maximum rate of three UOPS per cycle. The Section called Memory Hierarchy and the Section called Branch Misprediction examine some of the causes of the increased CPI and clocks per UOP.

Table 4. Clock cycles per UOP estimates

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .887              .869
cc1plus               .953              .970
jc1                   .803              .812
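As an illustration of the trade-off (a hypothetical function; the assembly in the comments sketches possible encodings rather than actual compiler output):

/* bump.c: the increment below can be encoded two ways on IA32.
 *
 * Complex form, one read-modify-write instruction of roughly four
 * micro-operations, which monopolizes the decoders for a cycle:
 *     addl $1, (%edx)
 *
 * Simple form, three single-micro-operation instructions that fit
 * the decode pattern alongside neighboring instructions:
 *     movl (%edx), %eax
 *     addl $1, %eax
 *     movl %eax, (%edx)
 */
void bump(int *count)
{
    *count += 1;
}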

Memory Hierarchy
The Pentium III processor used for these experiments has three caches to avoid accessing the much slower DRAM memory. There are separate L1 data and instruction caches, each 16KB in size, and a much larger unified L2 cache of 512KB. The observed performance characteristics of the three caches are discussed in the following sections.

Instruction Cache

The Pentium III has a 16KB L1 instruction cache organized as 32 bytes per line, 512 lines total, grouped four lines per set. A build of GCC 3.1 was performed with OProfile measuring the number of instructions retired (INST_RETIRED) and the number of icache misses (IFETCH_MISS). Table 5 shows the ratio of icache misses to instructions retired. The GCC 3.1 compiler generates code with a slightly lower icache miss rate for the programs examined. The icache miss rates are less than 1.1%. This rate may seem small, but it can affect overall performance because of the cost of an instruction cache miss.
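The stated geometry can be sanity-checked with a little arithmetic (a sketch assuming the conventional set-index mapping these numbers imply):

/* icache_geom.c: 512 lines of 32 bytes, four lines per set, gives a
 * 16KB cache with 128 sets; addresses 4KB apart share a set. */
#include <assert.h>
#include <stdint.h>

enum { LINE = 32, LINES = 512, WAYS = 4, SETS = LINES / WAYS };

static unsigned set_index(uint32_t addr)
{
    return (addr / LINE) % SETS;  /* drop offset bits, keep set bits */
}

int main(void)
{
    assert(LINE * LINES == 16 * 1024);               /* 16KB total  */
    assert(SETS == 128);
    assert(set_index(0x1000) == set_index(0x2000));  /* 4KB aliases */
    return 0;
}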

Table 5. Instruction cache misses per instruction (IFETCH_MISS/INST_RETIRED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .0109             .0100
cc1plus               .0103             .0098
jc1                   .0081             .0080

The Pentium III performance monitoring hardware can also measure the number of cycles that the processor stalls due to icache misses (IFU_MEM_STALL). Another set of experiments measured the processor cycles (CPU_CLK_UNHALTED) and the number of cycles the processor stalled due to icache misses (IFU_MEM_STALL). The ratio of icache stall cycles to total clock cycles shows the impact of the relatively low cache miss rate. Table 6 shows the fraction of cycles in which the processor has outstanding icache misses. Between 15.3% and 18.1% of the time, the processor is waiting for an icache fetch. This is a significant portion of the total time.

Table 6. Estimate of fraction of cycles with an outstanding L1 instruction cache miss (IFU_MEM_STALL/CPU_CLK_UNHALTED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .181              .177
cc1plus               .155              .157
jc1                   .153              .157

Table 7 estimates the cost of an instruction cache miss on the Pentium III processor; the costs range from 15.0 to 19.6 cycles per instruction cache miss. Thus, even modest reductions in the instruction cache miss rate can improve overall performance.

Table 7. Estimate of cycles for L1 icache misses (Table 6/Table 5)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   16.6              17.7
cc1plus               15.0              16.0
jc1                   18.9              19.6
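The division in Table 7 works out as follows; note that the quotient is stall cycles per miss scaled by instructions per clock, so it approximates rather than equals the true per-miss penalty:

    (IFU_MEM_STALL / CPU_CLK_UNHALTED) / (IFETCH_MISS / INST_RETIRED)
        = (IFU_MEM_STALL / IFETCH_MISS) × (INST_RETIRED / CPU_CLK_UNHALTED)

For example, for cc1 built with GCC 2.96, .181/.0109 gives the 16.6 in Table 7.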

Data Cache

The Pentium III L1 data cache has the same configuration as the instruction cache: 32 bytes per line, 512 lines total, grouped four lines per set. Two OProfile data collection runs were made with each build compiler. The first run recorded data memory references (DATA_MEM_REFS) and the number of cache lines pulled into the L1 data cache (DCU_LINES_IN). Table 8 shows the ratio of DCU_LINES_IN to DATA_MEM_REFS; in general, a lower value is better. In Table 8, the GCC 3.1 compiled executable ratios are uniformly higher than those of the GCC 2.96 compiled executables. This is because the GCC 3.1 compiler is more effective at eliminating redundant memory references, which cache well, than at removing the initial cache misses. In all three executables, the GCC 3.1 generated code had lower counts for both DCU_LINES_IN and DATA_MEM_REFS than the GCC 2.96 generated code. The data cache miss rate for cc1plus is significantly higher than that of cc1 or jc1 and should be examined more closely.

Table 8. L1 data cache misses per memory reference (DCU_LINES_IN/DATA_MEM_REFS)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .0153             .0156
cc1plus               .0222             .0231
jc1                   .0142             .0151

The second set of data collected on the L1 data cache was the number of clock cycles (CPU_CLK_UNHALTED) and the number of cycles in which the processor had outstanding data cache misses, weighted by the number of misses to be satisfied in each clock cycle (DCU_MISS_OUTSTANDING). Table 9 shows the ratio between these counts. Because the GCC 3.1 compiler is more effective at removing cycles than at removing data cache misses, the fraction of cycles spent servicing data cache misses increases from the GCC 2.96 generated code to the GCC 3.1 generated code.

Table 9. Estimate of fraction of cycles with an outstanding dcache miss (DCU_MISS_OUTSTANDING/CPU_CLK_UNHALTED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .0682             .0829
cc1plus               .1610             .1680
jc1                   .0740             .0828

The number of cycles for an L1 data cache miss can be estimated from Table 8 and Table 9; the resulting estimates are in Table 10. The cost ranges from 4.46 cycles for the GCC 2.96 generated cc1 to 7.27 cycles for the GCC 3.1 generated cc1plus. The cycles required for cc1plus are higher than for cc1 or jc1. One will notice that the costs listed in Table 10 for L1 data cache misses are significantly lower than the costs listed in Table 7 for L1 instruction cache misses. Because the L1 caches are assumed to be implemented in the same technology, this difference is probably caused by other effects, such as branching costs. The effects of branches are discussed in the Section called Branch Misprediction.

Table 10. Estimate of cycles for an L1 dcache miss (Table 9/Table 8)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   4.46              5.31
cc1plus               7.25              7.27
jc1                   5.21              5.48

L2 Cache

The Pentium III processor has a unified L2 cache to reduce external bus traffic. The L2 cache has connections to both the L1 caches and the system bus. Measurements were taken to determine how saturated each of the connections to the L2 cache was. Because the L1 cache miss rates are relatively low for cc1, cc1plus, and jc1, use of these busses is expected to be low, as shown in Table 11 and Table 12. Again, the ratios are higher for the GCC 3.1 generated code because the compiler is more effective at reducing clock cycles than at eliminating cache misses.

Table 11. Fraction of cycles L1-L2 connection busy (L2_DBUS_BUSY/CPU_CLK_UNHALTED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .0364             .0370
cc1plus               .0379             .0389
jc1                   .0335             .0340

Table 12. Fraction of cycles L2-system bus connection busy (BUS_DRDY_CLOCKS/CPU_CLK_UNHALTED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .0617             .0657
cc1plus               .0952             .1000
jc1                   .0609             .0652

The OProfile data shows that the kernel is a much larger consumer of the bus bandwidth between the L2 cache and the other subsystems. This is due to loading programs from the disk drive into memory. The fraction of cycles the kernel (vmlinux) kept the L1-L2 cache connection busy was .097 for the build using GCC 2.96 and .103 for the build using GCC 3.1. The fraction of cycles the kernel kept the external bus connection busy was .261 for the build using GCC 2.96 and .267 for the build using GCC 3.1.

Branch Misprediction
Branch instructions are relatively expensive when compared to other instructions.
The processor needs to fetch instructions from other locations in memory. For conditional branches, there are two possible locations from which the processor may execute instructions: the condition-true path and the condition-false (fall-through) path. Which path is taken may depend on immediately preceding instructions that are still being executed. Rather than incur a penalty waiting for the branch's data dependencies to be resolved, most processors now predict the most likely outcome of the branch and speculatively fetch and execute instructions along the predicted path. If the branch is mispredicted, the speculative work is scrapped and a penalty for the mispredicted branch is incurred. On a Pentium III, this penalty is approximately a dozen clock cycles. On processors with deeper pipelines, for example the Pentium 4, the penalty can be even greater. The accuracy of the processor's branch predictions greatly influences the overall performance of a program. The Pentium III has static prediction rules and a 512-entry Branch Target Buffer (BTB) to improve the performance of code with branches. The static prediction rules are used when a branch is not in the BTB: backward conditional branches are predicted taken and forward conditional branches are predicted not taken. Once a branch is encountered, it is placed in the BTB. Three sets of experiments related to branches were performed to estimate the following metrics (an illustration of the static rules follows the list):
• The fraction of branch instructions not handled by the BTB
• The fraction of branches statically predicted
• The fraction of branches mispredicted
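As a minimal illustration of the static rules (a hypothetical function; the loop-closing branch being backward is the usual compilation of such loops):

/* Static prediction on the Pentium III predicts backward conditional
 * branches taken and forward ones not taken. A counted loop compiles
 * to a backward conditional branch at the loop bottom, so even before
 * the branch enters the BTB the static rule predicts its common case
 * (keep looping) correctly. */
int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)  /* backward branch, usually taken */
        s += a[i];
    return s;
}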

Table 13 shows the fraction of branch instructions for which the BTB provided no branch prediction information. The BTB misses a significant fraction of the branch instructions. Large programs with many branches may overflow the BTB, causing a higher miss rate. Computed branches, used for switch statements and indirect branches, can also cause the BTB to miss. Another possible cause is context switches between processes, but the details of the BTB mechanism on the Pentium III need to be examined more closely to determine whether that is the case.

Table 13. Fraction of branch instructions not predicted by the BTB (BR_BTB_MISSES/BR_INST_RETIRED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .420              .445
cc1plus               .402              .407
jc1                   .359              .385
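For example (a hypothetical function; whether GCC emits a jump table here depends on case density and compiler version):

/* A dense switch is often compiled to a jump table: an indirect jump
 * through a table of label addresses. Its target varies from call to
 * call, so the BTB entry for it is frequently stale or absent, unlike
 * an ordinary two-way conditional branch. */
int kind(int op)
{
    switch (op) {
    case 0:  return 10;
    case 1:  return 11;
    case 2:  return 12;
    case 3:  return 13;
    case 4:  return 14;
    default: return -1;
    }
}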

The Pentium III also attempts to statically predict branches that are not in the BTB. The ratio of static branch predictions (BACLEAR) to branch instructions was also examined; Table 14 shows the ratios for these experiments. The static predictor was used for less than 8% of the branch instructions.

Table 14. Fraction of branch instructions statically predicted (BACLEAR/BR_INST_RETIRED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .076              .076
cc1plus               .073              .078
jc1                   .054              .060

The GCC 3.1 code has a more favorable ratio of branch mispredictions to branch instructions than the GCC 2.96 code. In Table 15, each of the GCC 3.1 built executables has a lower ratio than the equivalent GCC 2.96 built executable. GCC 3.1 estimates the relative frequency with which basic blocks are executed and reorders the basic blocks based on those estimates. This reordering could certainly affect the numbers in Table 15.

Table 15. Fraction of branch instructions mispredicted (BR_MISS_PRED_RETIRED/BR_INST_RETIRED)

GCC 3.1 Executable    GCC 2.96 Built    GCC 3.1 Built
cc1                   .103              .093
cc1plus               .100              .088
jc1                   .076              .072

Future Work
Future work can be broken into three areas: examining other aspects of program performance, characterizing other programs, and examining the effect of optimization flags on the code. A relatively small subset of characteristics was analyzed for GCC. The ones selected for this paper were instruction counts, time required, cache performance, bus performance, and branch performance. The performance monitoring hardware can monitor many other events besides the ones examined in this paper. For example, IA32 processors can have pipeline stalls due to mixing different-sized memory operations to the same region of memory. The performance monitoring hardware can also measure floating point performance. The only program examined for this paper was GCC 3.1, which has very little floating point code. Other programs may have very different characteristics and should be characterized to determine what improvements can be made. The OProfile measurements show that GCC 3.1 generates better code than GCC 2.96. However, there is still room for improvement. Over 15% of the processor time was spent waiting for outstanding instruction cache misses. It may be possible to improve overall performance with a better layout of the basic blocks in functions. GCC 3.1 can collect data about the paths taken through code. It can use that information to reorder the basic blocks to reduce the number of branches taken and to improve the utilization of data in instruction cache lines. GCC 3.1 improves the ratio of branch mispredictions to branches over GCC 2.96. GCC can also generate conditional move instructions, which eliminate some branch instructions. These instructions do not exist on the original Pentium processors, so they are not generated by default. It would be worthwhile to see whether these instructions can reduce the number of branches and branch mispredictions observed in the code.
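A sketch of the kind of code affected (a hypothetical function; whether a conditional move is actually emitted depends on the target flags, such as -march=i686, and the compiler's cost model):

/* On targets with conditional moves (Pentium Pro and later), the
 * ternary below can compile to a cmov instead of a conditional
 * branch, removing one potential misprediction. */
int max2(int a, int b)
{
    return a > b ? a : b;
}

Compiling with gcc -O2 -march=i686 -S and inspecting the assembly shows whether the branch was removed.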

Summary
OProfile shows that GCC 3.1 generates better code than the GCC 2.96 compiler for the limited application scope examined by this paper. For the program examined, the new compiler generates code that is approximately 5% faster than the code from GCC 2.96. This improved code generation comes with some additional compile-time cost. OProfile also gives some insight into the underlying causes of the performance improvements: the GCC 3.1 generated code executes fewer UOPS, incurs fewer instruction cache misses, and makes fewer data memory references.

Bibliography
• [AMD02] AMD Athlon Processor x86 Code Optimization Guide, Publication 22007, Revision K, February 2002.
• [DEC02] DCPI Publications and Presentations, http://www.tru64unix.compaq.com/dcpi/publications.htm.
• [Fenlason98] Jay Fenlason and Richard Stallman, GNU gprof, 1998, http://www.gnu.org/manual/gprof-2.9.1/html_mono/gprof.html.
• [gcc02] Using the GNU Compiler Collection (GCC), 2002, http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/.
• [Hennessy90] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, 1990.
• [Intel97] Intel Architecture Optimization Manual, Order Number 242816-003, 1997.
• [OProfile02] SourceForge: Project Info: OProfile, http://sourceforge.net/projects/oprofile/.
