Embedded ISA Support for Enhanced Floating-Point to Fixed-Point ANSI C Compilation
Tor Aamodt and Paul Chow
University of Toronto
{ aamodt, pc }@eecg.utoronto.ca
3rd ACM International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Nov. 17-18th, 2000, San Jose CA

What is this presentation about?
- FOCUS: Signal processing applications developed using a high-level language representation and floating-point data types...
- WANT: Faster fixed-point software development...
- QUESTION: Are there “better” fixed-point DSP instruction sets in terms of runtime, power, or roundoff-noise performance?

Tor Aamodt & Paul Chow, University of Toronto
Embedded ISA Support for Enhanced Floating-Point to Fixed-Point ANSI C Compilation
Slide 2 of 32

Presentation Outline
- Motivation & Background
- Focus on...
  - Automatic Conversion to Fixed-Point
  - Architectural Enhancements
- Some Experimental Results
- Summary / Future Directions

Motivation
- 80% of DSPs in use are fixed-point. Why? Because fixed-point hardware is cheaper and uses less power...
- ...however, it is much harder to develop signal-processing software for.

Background
- UTDSP Project: DSP Compiler/Architecture Co-design
  - Traditional DSP architectures are hard for compilers to generate efficient code for, e.g. extended-precision accumulators
- First Generation Silicon, Sept. 30, 1999: 108-pin PGA, 0.35 µm CMOS / 63 MHz (Sean Peng's M.A.Sc.)
  - 16-bit Fixed-Point VLIW DSP with novel 2-level instruction fetching architecture (reduced pin count)
- June 2000: Synopsys CoCentric Fixed-Point Designer Tool
  - First commercial tool for transforming floating-point ANSI C programs into fixed-point ($20,000 US)

Background: Fixed-Point versus Floating-Point
- 32-bit Floating-Point (IEEE), explicit binary point: sign bit | 8-bit exponent (excess 127) | 23+1-bit normalized mantissa
- Fixed-Point, implied binary point: sign bit | integer part | fractional part

Background: Using Fixed-Point Arithmetic
- Floating-Point: y_n = y_{n-1} + x_n
- Fixed-Point: y_n = ((•y_{n-1} >> 3) + x_n) << 1
- The shifts in the fixed-point version are explicit scaling operations.

Automatic Conversion Process
Traditional Optimizing Compiler: Input Program -> Parser -> Optimizer -> Code Generator -> Processor
- CONSTRAINT: Input/Output Invariance
- GOAL: Application Speedup
i.e. make code faster, but do not break anything!!!

Automatic Conversion Process
Traditional Optimizing Compiler: Input Program -> Parser -> Optimizer -> Code Generator -> Processor
The diagram now also shows Sample Inputs and a Floating-Point to Fixed-Point Translator.
- “RELAX” CONSTRAINTS...
- GOALS:
  - “Good” Input/Output Fidelity (e.g.
    good signal-to-noise ratio)
  - Fast/Low-Power Operation (10-500x faster than FP emulation)

Floating-Point to Fixed-Point Translation
Floating-point source:
    float a, b, x[N];
    y = a*x[i] + b*x[i+1];
Fixed-point result:
    int a, b, x[N];
    y = (a•x[i] >> 2) + b•x[i+1];
1. Type Conversion (float to int)
2. Scaling Operations (>> 2)
3. Fractional Fixed-Point Operations (•)

Floating-Point to Fixed-Point Translator
Flow diagram blocks: SUIF Parser*, Optimizer, Identifier Assignment, Fixed-Point Conversion, Instrument Code, Sample Inputs, Profile
*SUIF = Stanford University Intermediate Format. See: http://suif.stanford.edu

Collecting Dynamic Range Information
Consider the ANSI C code:
    float a, b, x[N];
    y = a*x[i] + b*x[i+1];
Equivalent Expression Tree, with ID Assignment:
    “0” : y      (the + node)
    “1” : tmp_1  (a * x[i])
    “2” : tmp_2  (b * x[i+1])
Code Instrumentation:
    tmp_1 = a*x[i];
    profile(tmp_1,1);
    tmp_2 = b*x[i+1];
    profile(tmp_2,2);
    y = tmp_1 + tmp_2;
    profile(y,0);

Generating Scaling Operations
Signal Scaling: Integer Word Length (IWL) definition:
    IWL[x] = log2(max(x)) + 1
Word layout: sign bit | integer part (IWL bits) | fractional part

Generating Scaling Operations
Example: “A op B”
Figure (expression-tree diagram for “A op B”): the result carries IWL_{A op B} measured and IWL_{A op B} current; operand A carries IWL_A measured and IWL_A current; the op node is marked “?”;
operand B carries IWL_B measured and IWL_B current. A and B are converted sub-expressions.

Automatic Conversion Process: IRP: Using Intermediate Result Profile Data
Previous Algorithms:
- “Worst-Case Evaluation”: Markus Willems et al. FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign. ICASSP, April 1997. (a.k.a. predecessor to the Synopsys CoCentric Fixed-Point Designer Tool)
- A “Statistical” Approach: Ki-Il Kum, Jiyang Kang, and Wonyong Sung. A Floating-Point to Fixed-Point C Converter for Fixed-Point Digital Signal Processors. In Proc. 2nd SUIF Compiler Workshop, August 1997.
Neither uses Intermediate Result Profile data; instead, they combine range information from leaf nodes.
Is Useful Information Lost?

IRP: Additive Operations
“A ± B”
For example, assume |A| > |B| and IWL_{A+B} measured ≤ IWL_A measured. In the diagram, B is shifted right by n bits to line up with A.
    “A ± B” becomes “(A << n_A) ± (B >> [n - n_B])”
where:
    n_A = IWL_A current - IWL_A measured
    n_B = IWL_B current - IWL_B measured
    n   = IWL_A measured - IWL_B measured
    IWL_{A+B} current = IWL_A measured

IRP: Multiplication
    “A • B” becomes “(A << n_A) • (B << n_B)”
where:
    n_A = IWL_A current - IWL_A measured
    n_B = IWL_B current - IWL_B measured
    IWL_{A•B} current = IWL_A measured + IWL_B measured

IRP: Division
    “A / B” becomes “(A >> [n_dividend - n_A]) / (B << n_B)”
where:
    n_A = IWL_A current - IWL_A measured
    n_B = IWL_B current - IWL_B measured
    n_diff = IWL_{A/B} measured - IWL_A measured + IWL_B measured
    n_dividend = n_diff if n_diff ≥ 0, and 0 otherwise

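The additive and multiplicative IRP rules can be sketched in C as a small shift planner. This is an illustrative sketch, not the authors' translator: the names `iwl_measured`, `irp_plan`, `irp_add`, and `irp_mul` are hypothetical, n_B is taken to be IWL_B current minus IWL_B measured, and the IWL is assumed to be floor(log2(max|x|)) + 1 with the rounding made explicit.

```c
#include <assert.h>

/* Integer word length for the largest magnitude seen while profiling.
 * Assumption: IWL = floor(log2(max|x|)) + 1, computed by repeated
 * halving/doubling so that no math library is needed. */
int iwl_measured(double max_abs)
{
    int iwl = 1;                 /* correct for 1.0 <= max_abs < 2.0 */
    assert(max_abs > 0.0);
    while (max_abs >= 2.0) { max_abs /= 2.0; iwl++; }
    while (max_abs < 1.0)  { max_abs *= 2.0; iwl--; }
    return iwl;
}

/* Shift plan for one binary operation (hypothetical helper type).
 * Positive counts are left shifts, negative counts are right shifts. */
typedef struct {
    int shift_a;      /* applied to operand A */
    int shift_b;      /* applied to operand B */
    int iwl_result;   /* current IWL of the result */
} irp_plan;

/* "A +/- B", assuming IWL_A measured >= IWL_B measured and a sum that
 * fits in IWL_A measured: (A << nA) +/- (B >> (n - nB)). */
irp_plan irp_add(int a_cur, int a_meas, int b_cur, int b_meas)
{
    irp_plan p;
    int nA = a_cur - a_meas;     /* redundant leading bits in A */
    int nB = b_cur - b_meas;     /* redundant leading bits in B */
    int n  = a_meas - b_meas;    /* alignment distance */
    assert(n >= 0);
    p.shift_a = nA;
    p.shift_b = -(n - nB);       /* right shift by (n - nB) */
    p.iwl_result = a_meas;
    return p;
}

/* "A * B": (A << nA) * (B << nB); the IWLs of the operands add. */
irp_plan irp_mul(int a_cur, int a_meas, int b_cur, int b_meas)
{
    irp_plan p;
    p.shift_a = a_cur - a_meas;
    p.shift_b = b_cur - b_meas;
    p.iwl_result = a_meas + b_meas;
    return p;
}
```

For example, `irp_add(4, 3, 2, 1)` plans A << 1 and B >> 1 with a result IWL of 3, while `irp_mul(4, 3, 2, 1)` plans A << 1 and B << 1 with a result IWL of 4.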
IRP-SA: Using “Shift Absorption”
Example:
    y = (a*x[i] + (b*x[i+1]>>1)) << 1
Question: Is information discarded unnecessarily here? Consider the following alternative:
    y = (a*x[i]<<1) + b*x[i+1]
BUT: Can we really discard most significant bits and get roughly the same answer? YES!

Architectural Support
Common occurrence (using IRP-SA): A•B << n
Fractional Multiplication with internal Left Shift
Figure (word diagram): A has integer word length IWL_A, B has IWL_B, and the product A*B has IWL_A + IWL_B, with the left-shift distance n marked on the product.

Experimental Results
Benchmarks:
- 4th Order Cascaded/Parallel IIR Filter (IIR-C, IIR-P)
- (Normalized) Lattice Filter (LAT, NLAT)
- 128-Point Radix-2 Decimation-in-Time FFT (FFT-NR, FFT-MW)
- Levinson-Durbin Recursion (LEVDUR)
- 10x10 Matrix-Multiply (MMUL10)
- Nonlinear Control (INVPEND)
- Trig Function (SIN)

SQNR Enhancement: FMLS and/or IRP-SA
Figure (bar chart): SQNR enhancement in equivalent bits (axis from -0.5 to 2) for IRP-SA, FMLS, and IRP-SA w/ FMLS, across IIR4-C, IIR4-P, NLAT, LAT, FFT-NR, FFT-MW, INVPEND, LEVDUR, MMUL10, and SIN.

What Is The Effect of “Shift Absorption”?
Figure (bar chart): Distribution of Fractional Multiply Output Shifts. Relative frequency (axis from 0 to 0.8) of each FMLS output shift distance (3 left, 2 left, 1 left, none, 1 right) for IRP and IRP-SA.

Experimental Results: Rotational Inverted Pendulum
U of T System Control Group Non-linear Testbench

Closed-Loop System Response: Rotational Inverted Pendulum
12-bit Controller Comparison
- WC: 32.8 dB
- IRP-SA: 41.1 dB
- IRP-SA w/ FMLS: 48.0 dB

128-Point Radix-2 FFT (Generated by MATLAB Real-Time Workshop)

Speedup?
Rotational Inverted Pendulum: Fractional Multiply Output Shift Relative Frequencies

...Yup!

Speedup* Using FMLS
Figure (bar chart): relative speedup (axis from 1 to 1.4) for each output shift set below, across LAT, NLAT, LEVDUR, FFT-NR, INVPEND, FFT-MW, MMUL10, IIR4-P, IIR4-C, and SIN.
- Limiting
- 8-FMUL = { 4 left thru 3 right }
- 4-FMUL = { 2 left thru 1 right }
- 2-FMUL = { one left, no shift }

SQNR Enhancement for Various Output Shift Sets
Figure (bar chart): SQNR enhancement in equivalent bits (axis from 0 to 2) for Limiting, 8-FMUL, 4-FMUL, and 2-FMUL, across IIR4-C, IIR4-P, NLAT, LAT, FFT-NR, FFT-MW, LEVDUR, MMUL10, INVPEND, and SIN.

Summary
- The Fractional Multiply with internal Left Shift (FMLS) operation can improve runtime and signal-to-noise performance.
- Speedups of up to 35% and SQNR enhancement equivalent to up to 2 bits, maybe even 4 bits (depending on how you choose to measure it).
- Easy VLSI implementation, and easy for the compiler to use.

Future Directions
- Higher Level Transformations:
  - Automatic Generation of Block-Floating-Point...
  - Quantization Error Feedback...
  - BOTH need a signal-flow-graph representation... therefore probably need a better DSP language than ANSI C
- Variable Precision Arithmetic (How much precision does each operation need?)
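The FMLS operation named in the summary can be pictured with a short C model. This is only a sketch under stated assumptions: the 16-bit word size, the convention that a fractional multiply returns the upper half of the 32-bit product, and the function names `fmul` and `fmls` are chosen for illustration, since the slides do not specify the hardware datapath. It uses C99 `<stdint.h>` types rather than strict ANSI C.

```c
#include <assert.h>
#include <stdint.h>

/* Plain fractional multiply (assumed model): the upper 16 bits of the
 * 32-bit product of two 16-bit operands. The final conversion to
 * int16_t relies on two's complement wraparound, which mainstream
 * compilers provide. */
int16_t fmul(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (int16_t)(uint16_t)(p >> 16);
}

/* FMLS sketch: shift the 32-bit product left by n inside the
 * multiplier, then keep 16 bits. This selects product bits
 * [31-n .. 16-n], so the top n bits are discarded and n extra
 * low-order bits are retained. */
int16_t fmls(int16_t a, int16_t b, int n)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    assert(n >= 0 && n <= 16);
    return (int16_t)(uint16_t)(p >> (16 - n));
}
```

With a = 768 and b = 341 the full product is 261888. `fmul(a, b)` returns 3, and shifting that result left by 2 afterwards gives 12, whereas `fmls(a, b, 2)` returns 15: the internal shift keeps two low-order bits that the separate shift would have filled with zeros.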