Compiler Features


Compiler Techniques for Single Processor Tuning
An Introduction


The Compiler
The compiler manages processor resources:
• registers
• integer/floating-point execution units
• load/store/prefetch for data flow in/out of the processor
• the implementation details of the processor and system architecture are built into the compiler
User Program (C/C++/Fortran, etc.)
– high-level representation

Compilation process, solving:
– data dependencies
– control flow dependencies
– parallelization
– compactification of code
– optimal scheduling of the code

Machine instructions
– low-level representation


MIPSpro Compiler Components

Figure: MIPSpro compilation flow. Components shown: f77/f90 and cc/CC driver, macro preprocessor, front-end (source to WHIRL format), InterProcedural Analyzer, loop nest optimizer, parallel optimizer, global optimizer, code generator, executable object.

• There are no source-to-source optimizers or parallelizers
• Source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language):
– same IR for different levels of representation
– whirl2f and whirl2c translate back into Fortran or C from the IR

• Inter-Procedural analyzer requires final translation at link time


Compiler Optimizations
• Global Optimizer:
– dead code elimination
– copy propagation
– loop normalization
  • stride one loops
  • single induction variable
– memory alias analysis
– strength reduction

• Loop Nest Optimizer:
– loop unrolling (outer)
– loop interchange
– loop fusion/fission
– loop blocking
– memory prefetch
– padding local variables

• Inter-Procedural Analyzer:
– cross-file function inlining
– dead function elimination
– dead variable elimination
– padding of variables in common blocks
– inter-procedural constant propagation

• Code Generator:
– software pipelining
– inner loop unrolling
– if-conversion
– read/write optimization
– recurrence breaking
– instruction scheduling inside basic blocks

• Automatic Parallelizer:
– loop level work distribution


SGI Architecture, ABI, Languages
• Instruction Set Architecture (ISA):
– -mips4 (R1x000, R8000, R5000 processors)
– -mips3 (R4400)
– -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)

• ABI (Application Binary Interface):
– -n32 (32 bit pointers, 4 byte integers, 4 byte real)
– -64 (64 bit pointers, 4 byte integers, 4 byte real)

Variable          C size [bit]     Fortran size [bit]
                  -n32    -64      -n32    -64
char/character       8      8         8      8
short               16     16
int/integer         32     32        32     32
long                32     64
long long           64     64
logical                              32     32
float/real          32     32        32     32
double              64     64        64     64
pointer             32     64

• Languages:
– C
– C++
– Fortran 77
– Fortran 90


Options: ABI & ISA
-n32          invoke the MIPSpro compiler, use 32 bit addressing
-64           invoke the MIPSpro compiler, use 64 bit addressing
-o32/-32      invoke the old ucode compiler, 32 bit addressing
-mips[1234]   ISA; -mips[12] implies the ucode compiler

There are two more ways to define the ABI and ISA:
• the environment variable “SGI_ABI” can be set to -n32 or -64
• the ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by the “COMPILER_DEFAULTS_PATH” environment variable. The file should contain a line like:
DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3

There is a way to find which compiler flags were used:

dwarfdump -i file.o | grep DW_AT_producer


Optimization Levels
Compilation speed degrades with higher optimization levels:
• -O0         turn off all optimizations
• -O1         only local optimizations
• -O2 or -O   extensive but conservative optimizations
• -O3         aggressive optimizations, LNO, software pipelining
• -ipa        inter-procedural analysis (only at -O2 and -O3)
• -apo        automatic parallelization option (same as -pfa)
• -g[0|3]     debugging switch: -g0 forces -O0; -g3 to debug with -O3


Options: Performance
Option          Functionality
-r10000         generate optimal instruction schedule for the R10000 processor
-r8000          generate optimal instruction schedule for the R8000 processor
-O[0|1|2|3]     set optimization level to 0, 1, 2, 3
-Ofast=[ipXX]   select best optimization for the given architecture

                XX   machine (output of the hinv -c processor command)
                27   Origin2000 (all CPU frequencies and cache sizes)
                35   Origin3000 (all CPU frequencies and cache sizes)

                The optimizations may differ depending on the version of the compiler. Currently:
                -O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed
                (thus the -Ofast switch invokes the Interprocedural Analyzer)

-mp             enable multi-processing directives
-mpio           support I/O from a parallel region
-apo            invoke automatic parallelization option


Options: Porting
Option         Functionality
-d8/d16        double precision variables as 8 or 16 bytes
-r8            convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)
-i8            convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
-static        local variables will be initialized in fixed locations on the heap
               (-static_threadprivate makes static variables private to each thread)
-col[72|120]   source line is 72 or 120 columns
-Dname         define name for the pre-processor
-Idir          define include directory dir
-alignN        assume alignment on the N=8,16,32,64,128 bit boundary
-G0            put all static data into indirect address area
-xgot          make big tables for static data and program addresses
-multigot      automatic choice of table sizes for static variables and addresses
-version       show compiler version
-show          put the compiler in verbose mode: all switches are displayed

(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit


Options: Debugging
Option    Functionality
-g        disable optimization and keep all symbol tables
-DEBUG:   the DEBUG group option (man DEBUG_GROUP):
• check_div=n
    n=1 (default) check integer divide by zero
    n=2 check integer overflow
    n=3 check integer divide by zero and overflow
• subscript_check (default ON) to check for subscripts out of range
    C/C++: produces trap #8
    f77: aborts run and dumps core
    f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT
• verbose_runtime (default OFF) to give source line number of failures
• trap_uninitialized (default OFF) initialize all variables to 0xFFFA5A5A
    when used as pointer: access violation
    when used as fp values: NaN causes fp trap

Example:
f77 -n32 -mips4 -g file.f \
    -DEBUG:subscript_check:verbose_runtime=ON \
    -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON


Compilation Examples
1. Produce executable a.out with default compilation options:
f77 source.f
cc source.c

– be aware of the defaults setting (e.g. /etc/compiler.defaults)
– same flags for Fortran and C

2. Options for debugging:
f77/cc -o prog -n32 -g -static source.f

3. Explicit setting of ABI/ISA/Processor, highest opt:
f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f

4. Detailed control of the optimization process with the group options:
f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...


Fine Tuning Compiler Actions
The compiler performs many sophisticated optimizations on the source code under certain assumptions about the program. Typically:
• program data is large (does not fit into the cache)
• program does not violate the language standard
• program is insensitive to roundoff errors
• all data in the program is aliased, unless it can be proved otherwise

If one or more of these assumptions does not hold, the compiler should be tuned to the program with the compiler options. Most important:
• OPT for general optimization assumptions
• LNO for the Loop Nest Optimizer options
• IPA for the Inter-Procedural Analyzer options

Additional options that help to tune the compiler properly:
• TENV, TARG for the target machine and environment description
-TENV:align_aggregates=x (bytes)

• LIST, DEBUG for the listing and debugging options


Group Options
Compiler options can be set with key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon-separated; the same group heading can be specified several times, and the effects are cumulative.

E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.

Group Heading      Reference page         Usage comments
-OPT:key=val       cc(1) f77(1) opt(5)    optimizations
-TENV:key=val      cc(1) f77(1)           control target environment
-TARG:key=val      cc(1) f77(1)           control target architecture
-FLIST/CLIST       cc(1) f77(1)           listing control
-LIST:key=val      cc(1) f77(1)           options to control listing
-DEBUG:key=val     debug_group(5)         debugging options
-IPA:key=val       ipa(5)                 Inter-Procedural Analyzer control
-INLINE:key=val    ipa(5)                 procedure inliner control
-LNO:key=val       lno(5)                 Loop Nest Optimizer control
-MP:key=val        cc(1) f77(1)           parallelization control
-LANG:             cc(1) f77(1)           language compatibility features
-CG:               cc(1) f77(1)           code generation
-WOPT:             cc(1) f77(1)           global optimizer


Compiler man Pages
• Primary man pages:
man f77(1) f90(1) cc(1) CC(1) ld(1)

• Some of the compiler option groups are rather large and deserve their own man pages:
man opt(5)
man lno(5)
man ipa(5)
man DEBUG_GROUP(5)
man mp(3F)
man pe_environ(5)
man sigfpe(3C)


The Run-Time Library Structure
Figure: directory tree of the run-time libraries under /usr. The lib32/ and lib64/ directories each contain *.a and *.so files, a Cmplrs/mongoose-compiler directory, nonshared/*.a, and per-ISA subdirectories (mips3, mips4) that again hold *.a, *.so and nonshared/*.a.



The Scientific Libraries
Standard scientific libraries containing:
• Basic Linear Algebra operations and algorithms:
– BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3)
– LAPACK (see man intro_lapack)

• Fast Fourier Transformations (FFT):
– 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)

• Convolutions (Signal Processing, e.g. man SIIR2D)
• Sparse Solvers (see man solvers; man PSLDLT)

To use:
– -lscs for serial versions (-lscs_i8, -lscs_i8_mp for long integers)
– -lscs_mp -mp for parallel versions
– man intro_scsl for detailed description
– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
– man complib.sgimath for detailed description


Computational Domain
Range of numbers (from /usr/include/limits.h):
FLT_DIG   6    /* decimal digits of precision of a float */
FLT_MAX   3.40282347E+38F
FLT_MIN   1.17549435E-38F

DBL_DIG   15   /* decimal digits of precision of a double */
DBL_MAX   1.7976931348623157E+308
DBL_MIN   2.2250738585072014E-308

-9223372036854775807LL-1LL    (minimum long long)
9223372036854775807LL         (maximum long long)
18446744073709551615LLU       (maximum unsigned long long)

Extended precision (REAL*16) is available and supported by the compiler, but this mode of calculation is slow (by a factor of ~40).


Underflow and Denormal Numbers
When de-normalized numbers emerge in a computation (i.e. nonzero numbers with |x| < DBL_MIN), they are flushed to zero by default. The program

      Program denorm
      real*8 a,b
      a = 2.2250738585072014D-308
      b = a/10.0D0
      write(6,*) b
      end

will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate the status register on the R1x000 CPU:

#include <sys/fpu.h>
void no_flush_()
{
    union fpc_csr f;
    f.fc_word = get_fpc_csr();
    f.fc_struct.flush = 0;
    set_fpc_csr(f.fc_word);
}

Calling no_flush at the beginning of the program will print the de-normalized value of b (about 2.225D-309) instead of zero.

The flush-to-zero property can lead to x-y=0 while x≠y. Keeping de-normalized numbers in computations avoids that condition, but causes an fp exception that must be processed in software.

It is a performance issue: avoid manipulating de-normalized numbers in calculations.


Overflow Example
Program example that generates overflows and underflows:

      Parameter (N=20)
      INCLUDE "/usr/include/limits.h"
      Real*8 A(N),B(N)
      complex*16 C(N)
      do I=1,N
         A(I) = (FLT_MAX/10)*I   ! single precision range
         B(I) = (FLT_MIN*10)/I   ! will fit into double
! Standard requires passing from base precision: real*4 !
         C(I) = CMPLX(A(I),B(I))
      enddo
      write (0,'(I3,2(2G22.15/))') (I,A(I),B(I),C(I),I=1,N)

Compile with: f77 -n32 -mips4 -O3
Note: Compilation with -r8 avoids the error.

Output with all exceptions ignored by default (for each I, the first line is A,B and the second line is Cr,Ci):

 10  0.340282347000000E+39  0.117549435000000E-37
     0.340282346638529E+39  0.117549435082229E-37
 11  0.374310581700000E+39  0.106863122727273E-37
     Infinity               0.000000000000000
 12  etc…

For I=11 the real part of C overflows (Overflow!) and the imaginary part is flushed to zero (Flush to zero!).

setenv TRAP_FPE "UNDERFL=TRACE; OVERFL=TRACE" will trap at overflow and underflow and produce a traceback (link with -lfpe).


Floating Point Exceptions
An fp status register flag is set when the FPU encounters an illegal condition:
• division by zero
• overflow
• underflow
• invalid
• inexact

By default, all exceptions are ignored!
(e.g. for 1/0 the result is set to Infinity, for 0/0 to NaN, and execution continues)
The status register can be programmed to raise a Floating Point Exception (FPE). If an FPE occurs, the system can take a specified action:
• abort
• ignore the exception
• repair the illegal condition

You can manipulate the status register to select the action:
• with calls to the FPE library, link with -lfpe
• with the environment variable TRAP_FPE

see man handle_sigfpes


Compiler-Generated Exceptions
• The compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4):
X = 0  no speculative code motion
X = 1  IEEE-754 underflow and inexact FPE disabled (default at -O0 and -O2)
X = 2  all IEEE-754 exceptions disabled except 1/0 (default at -O3)
X = 3  all IEEE-754 exceptions are disabled
X = 4  memory access exceptions are disabled

IF-conversion with conditional moves for Software Pipelining (with -O3):

      Do i=1,N
         if(a(i) .lt. eps) then
            x = x + 1/eps
         else
            x = x + 1/a(i)
         endif
      enddo

#put eps in $f1 and (1/eps) in $f0
ldc1     $f5,-8($2)       load a(i)
c.lt.d   $fcc0,$f5,$f1    if(a(i) < eps)
recip.d  $f2,$f5          1/a(i)
movt.d   $f2,$f0,$fcc0    y = 1/a or 1/eps
add.d    $f1,$f1,$f2      x = x + y

Removing the IF(…) can cause a divide by zero! In this case the exception must be ignored.
Note: this transformation is already applied at -O3 with X=1!


IEEE_754 Compliance
The MIPS4 instruction set contains IEEE-754 non-compliant instructions:
• recip.s/d (reciprocal: 1/x) instruction is accurate to 1 ulp
• rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction is accurate to 2 ulp

-OPT:IEEE_arithmetic=X specifies the degree of non-compliance and what to do with inf and NaN operands:
X = 1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2)
X = 2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)
X = 3 any mathematically valid transformation is allowed, including recip & rsqrt instr.

Source loop, compiled with -O3 (21 cycles/iteration; 4% peak):

      do i=1,n
         x = x + a(i)/y
      enddo

Transformed loop with -O3 -OPT:IEEE_arithmetic=3 (1 cycle/iteration; 100% peak):

      y_tmp = 1/y
      do i=1,n
         x = x + a(i)*y_tmp
      enddo

Note: X=3 is required!


Rounding Accuracy
The permitted round-off behaviour can be specified with the -OPT:roundoff=X switch:
X = 0 no optimizations that affect fp behaviour (default at -O1 -O2)
X = 1 allows simple transformations with limited round-off and overflow differences
X = 2 allows reordering of reduction loops (default at -O3)
X = 3 any mathematically valid transformation is allowed

Source loop, with -O3 -OPT:roundoff=1 (2 cycles/iter; 25% peak):

      do i=1,n
         x = x + a(i)
      enddo

Reordered reduction with -O3 -OPT:roundoff=2, the default at -O3 (1 cycle/iter; 50% peak):

      do i=1,n,8
         x0 = x0 + a(i)
         x1 = x1 + a(i+1)
         …
      enddo
      x = x0 + x1 + …

Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3


Compiler is the primary tool of program optimization
• Compilation is the process of lowering the code representation from high level to low, i.e. processor level
• The MIPSpro compiler targets the MIPS R1x000 processor and has the features of the processor and the Origin architecture built in
• A large number of options exist to steer the compilation process:
– ABI, ISA and optimization options selections
– setting of assumptions about the program behaviour

• There are optimized and parallelized libraries of subroutines for scientific computation
• When programming for a digital computer, it is important to remember the limitations due to the limited validity range of floating point calculations
