# Faster Math Functions

Robin Green
R&D Programmer
Sony Computer Entertainment America
   This is an Advanced Lecture
• There will be equations
• Programming experience is assumed

   Writing your own Math functions
•   Optimize for Speed
•   Optimize for Accuracy
•   Optimize for Space
Running Order
   Part One – 10:00 to 11:00
• Floating Point Recap
• Measuring Error
• Incremental Methods
- Sine and Cosine

   Part Two – 11:15 to 12:30
• Table Based Methods
• Range Reduction
• Polynomial Approximation
Running Order
   Part Three – 2:00 to 4:00
• Fast Polynomial Evaluation
• Higher Order functions
- Tangent
- Arctangent, Arcsine and Arccosine

   Part Four – 4:15 to 6:00
• More Functions
- Exponent and Logarithm
- Raising to a Power
• Q&A
Floating Point Formats
32-bit Single Precision Float
Floating Point Standards
   IEEE 754 is undergoing revision.
• In process right now.

   Get to know the issues.
• Quiet and Signaling NaNs.
• Specifying Transcendental Functions.
History of IEEE 754
   IEEE754 ratified in 1985 after 8 years of
meetings.

   A story of pride, ignorance, political
intrigue, industrial secrets and genius.

   A battle of Good Enough vs. The Best.
Timeline: The Dark Ages
   Tower of Babel
• On one machine, values acted as non-zero for add/subtract
and zero for multiply-divide.
b = b * 1.0;
if(b==0.0) error;
else return a/b;

• On another platform, some values would overflow if
multiplied by 1.0, but could grow by addition.
• On another platform, multiplying by 1.0 would remove the
lowest 4 bits of your value.
• Programmers got used to storing numbers like this
b = (a + a) - a;
Timeline: 8087 needs “The Best”
   Intel decided the 8087 has to appeal to
the new mass market.

• Help “normal” programmers avoid the
counterintuitive traps.

• Full math library in hardware, using only 40,000
gates.

• Kahan, Coonen and Stone prepare draft spec, the
K-C-S document.
Timeline: IEEE Meetings
   Nat Semi, IBM, DEC, Zilog, Motorola,
Intel all present specifications.
• Cray and CDC do not attend…
   DEC with VAX has largest installed base.
• Double float had 8-bit exponent.
• Added an 11-bit “G” format to match K-C-S, but
with a different exponent bias.
   K-C-S has mixed response.
• Looks complicated and expensive to build.
• But there is a rationale behind every detail.
Timeline: The Big Argument
   The big argument was over Gradual Underflow:
K-C-S specified it, DEC didn’t.
Timeline: The Big Argument
   Both Cray and VAX had no way of
detecting flush-to-zero.

   Programmers had to add extra code to
handle these exceptions.

   How to measure the Cost/Benefit ratio?
Timeline: Trench Warfare
   DEC vs. Intel
• DEC argued that Gradual Underflow was
impossible to implement on VAX and too
expensive.
• Intel had cheap solutions that they couldn’t share
(similar to a pipelined cache miss).

   Advocates fought for every inch
• George Taylor from U.C.Berkeley built a drop-in
VAX replacement FPU.
• The argument for “impossible to build” was
broken.
Timeline: Trench Warfare
   DEC turned to theoretical arguments
• If DEC could show that GU was unnecessary then
K-C-S would be forced to be identical to VAX.

• Prof Donald Knuth, programming guru.
• Dr. J.H. Wilkinson, error-analysis & FORTRAN.

   Then DEC decided to force the
impasse…
Timeline: Showdown
   DEC found themselves a hired gun
• U.Maryland Prof G.W.Stewart III, a highly respected
numerical analyst and independent researcher

   In 1981 in Boston, he delivered his verdict
verbally…

“On balance, I think Gradual Underflow is
the right thing to do.”
Timeline: Aftermath
   By 1984, IEEE 754 had been
implemented in hardware by:
• Intel
• AMD
• Apple
• IBM
• Nat. Semi.
• Weitek
• Zilog
• AT&T

   It was the de facto standard long before
being a published standard.
Why IEEE 754 is best
The Format
   Sign, Exponent, Mantissa
• The Mantissa is also called the “Significand”

   Why base 2?
• Base 2 has the smallest “wobble” (spread of relative error).
• Base 2 also has the hidden bit.
- More accuracy than any other base for N bits.
- Arguments for base 3 always assume fixed-point values
   Why 32, 64 and 80-bit formats?
• Because 8087 could only do 64-bits of carry
propagation in a cycle!
Why A Biased Exponent?
   For sorting.
   Biased towards underflow:

exp_max = 127;
exp_min = -126;

•   Reciprocals of small numbers will never overflow.
•   Reciprocals of large numbers will use Gradual Underflow.
The Format
   Note the Symmetry
1   11111111   ???????????????????????   Not A Number

1   11111111   00000000000000000000000   Negative Infinity

1   11111110   ???????????????????????   Negative Numbers

1   00000000   ??????????????????????1   Negative Denormal

1   00000000   00000000000000000000000   Negative Zero

0   00000000   00000000000000000000000   Positive Zero

0   00000000   ??????????????????????1   Positive Denormal

0   00000001   ???????????????????????   Positive Numbers

0   11111111   00000000000000000000000   Positive Infinity

0   11111111   ???????????????????????   Not A Number
Rounding
   IEEE says operations must be “exactly
rounded towards even”.

   Why towards even?
• To stop iterations slewing towards infinity.
• Cheap to do using hidden “guard digits”.

   Why support different rounding modes?
• Used in special algorithms, e.g. decimal to binary
conversion.
Rounding
   How to round irrational numbers?
• Impossible to round an infinitely long result exactly.
• Called the Table Maker’s Dilemma.
- In order to calculate the correct rounding, you need to
calculate worst case values to very high precision.
- E.g. Sin(x) = 0.02310000000000000007, where the trailing
digits sit right on a rounding boundary.

   IEEE 754 just doesn’t specify transcendental functions
• Recent work is looking into worst case values
Special Values
   Zero
• 0.0 = 0x00000000

   NaN
• Not a Number.
• NaN = sqrt(-x), 0*infinity, 0/0, etc.
• Propagates into later expressions.
Special Values
   ±Infinity
• Allows calculation to continue without overflow.

   Why does 0/0=NaN when ±x/0=±infinity?
• Because of limit values.
• a/b can approach many values, e.g.

$$\frac{\sin x}{x} \to 1, \qquad \frac{1-\cos x}{x} \to 0 \qquad \text{as } x \to 0$$
Signed Zero
   Basically, WTF?
• Guaranteed that +0 = -0, so no worries.

   Used to recover the sign of an
overflowed value
• Allows 1/(1/x) = x as x→±inf
• Allows log(0)=-inf and log(-x)=NaN
• In complex math, sqrt(1/-1) = 1/sqrt(-1) only
works if you have signed zero
Destructive Cancellation
   The nastiest problem in floating point.
   Caused by subtracting two very similar values
• For example, in the quadratic equation when b² ≈ 4ac
• In fixed point…

1.10010011010010010011101
- 1.10010011010010010011100

0.00000000000000000000001

• Which gets renormalised with no signal that almost all digits
have been lost.
Compiler “Optimizations”
   Floating Point does not obey the laws of
algebra.
• Replace x/2 with 0.5*x – good
• Replace x/10 with 0.1*x – bad
• Replace x*y-x*z with x*(y-z) – bad if y≈z
• Replace (x+y)+z with x+(y+z) – bad

   A good compiler will not alter or reorder
floating point expressions.
• Compilers should flag bad constants, e.g.

float x = 1.0e-40;
Decimal to Binary Conversion
   In order to reconstruct the correct binary value
from a decimal constant you need:

Single float : 9 decimal digits
Double float : 17 decimal digits

• Loose proof in the Proceedings
- works by analyzing the number of representable values in sub-
ranges of the number line, showing a need for between 6 and
9 decimal digits for single precision
Approximation Error
Measuring Error
   Absolute Error
• Measures the size of the deviation, but tells us nothing
about the significance of the error
• The abs() is often ignored for graphing

$$\text{error}_{\text{abs}} = \left| f_{\text{actual}} - f_{\text{approx}} \right|$$
Measuring Error
   Absolute Error is sometimes written in ULPs
• Units in the Last Place

| Approx | Actual | ULPs |
|---|---|---|
| 0.0312 | 0.0314 | 2 |
| 0.0314 | 0.0314159 | 0.159 |
Measuring Error
   Relative Error
• A measure of how important the error is.

$$\text{error}_{\text{rel}} = \left| 1 - \frac{f_{\text{approx}}}{f_{\text{actual}}} \right|$$
Example: Smoothstep Function
   Used for ease-in ease-out animations
and anti-aliasing hard edges
• Flat tangents at x=0 and x=1

$$f(x) = \frac{1 - \cos(\pi x)}{2}$$
Smoothstep Function
Smoothstep Approximation
   A cheap polynomial approximation
• From the family of Hermite blending functions.

$$f_{\text{approx}}(x) = 3x^2 - 2x^3$$
Smoothstep Approximation
Absolute Error
Relative Error
Relative Error Detail
Incremental Algorithms
Incremental Methods
Q: What is the fastest method to calculate
sine and cosine of an angle?

A: Just two instructions.
There are however two provisos.

1. You have a previous answer to the problem.
2. You are taking equally spaced steps.
Resonant Filter
   Example using 64 steps per cycle.
   NOTE: each new s uses the previously updated c.

int N = 64;
float a = sin(2*PI/N);
float c = 1.0f;
float s = 0.0f;
for(int i=0; i<M; ++i) {
    output_sin = s;
    output_cos = c;
    c = c - s*a;
    s = s + c*a;
    ...
}
Resonant Filter Graph
Resonant Filter Quarter Circle
Goertzels Algorithm
   A more accurate algorithm
• Uses two previous samples (Second Order)
   Calculates x = sin(a+n*b) for all integer n

float cb = 2*cos(b);
float s2 = sin(a+b);
float s1 = sin(a+2*b);
float c2 = cos(a+b);
float c1 = cos(a+2*b);
float s,c;
for(int i=0; i<m; ++i) {
    s = cb*s1-s2;
    c = cb*c1-c2;
    s2 = s1; c2 = c1;
    s1 = s; c1 = c;
    output_sin = s;
    output_cos = c;
    ...
}
Goertzels Algorithm Graph
Goertzels Initialization
   Needs careful initialization
• You must account for a three iteration lag

// N steps over 2PI radians
float b = 2*PI/N;

// subtract three steps from initial value
float new_a = a - 3.0f * b;
Goertzels Algorithm Quarter Circle
Table Based Solutions
Table Based Algorithms
   Traditionally the sine/cosine table was the
fastest possible algorithm
• With slow memory accesses, it no longer is

   New architectures resurrect the technique
• Vector processors with closely coupled memory
• Large caches with small tables forced in-cache

   Calculate point samples of the function
• Hash off the input value to find the nearest samples
• Interpolate these closest samples to get the result
Table Based Sine
Table Based Sine Error
   Given an index i and fractional position α,
the approximation is…

$$\sin(x) \approx \text{table}[i] + \alpha\left(\text{table}[i+1] - \text{table}[i]\right) = \text{table}[i] + \alpha \cdot \text{gradient}[i]$$

   Which fits nicely into a 4-vector…
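A minimal sketch of that lookup in C, assuming a precomputed table[] of
TABLE_SIZE+1 samples over [0..2π) and a matching gradient[] array (both
array names are illustrative, not from the slides):

```
#define TABLE_SIZE 256
extern float table[TABLE_SIZE+1];   /* table[i] = sin(i*2PI/TABLE_SIZE)   */
extern float gradient[TABLE_SIZE];  /* gradient[i] = table[i+1] - table[i] */

float table_sin_lerp(float x)        /* assumes x already in [0..2PI)      */
{
    float fi = x * (TABLE_SIZE / 6.28318530718f); /* map angle to index    */
    int   i  = (int)fi;                           /* sample below x        */
    float a  = fi - (float)i;                     /* fractional position   */
    return table[i] + a*gradient[i];              /* linear interpolation  */
}
```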

How Accurate Is My Table?
   The largest error occurs when x falls exactly
between two samples
• Given a stepsize of ∆x, the error E is:

$$E = 1 - \cos\!\left(\frac{\Delta x}{2}\right)$$

• e.g. for 16 samples over the full circle, the error will be:

$$1 - \cos\!\left(\frac{\pi}{16}\right) = 0.0192147$$
How Big Should My Table Be?
   Turning the problem around, how big should a
table be for an accuracy of E?
• We just invert the expression…

$$\begin{aligned} E &\leq 1\% \\ 1 - \cos(\pi/N) &\leq 0.01 \\ \cos(\pi/N) &\geq 1 - 0.01 \\ \pi/N &\leq \arccos(0.99) \\ N &\geq 22.19587\ldots \\ N &= 23 \end{aligned}$$
How Big Should My Table Be?
   We can replace the arccos() with a small angle
approximation, giving us a looser bound.

$$N \geq \frac{\pi}{\sqrt{2E}}$$
   Applying this to different accuracies gives us a
feel for where tables are best used.
Table Sizes
| Accuracy | E | 360° | 45° |
|---|---|---|---|
| 1% accurate | 0.01 | 23 | 3 |
| 0.1% accurate | 0.001 | 71 | 9 |
| 0.01% accurate | 0.0001 | 223 | 28 |
| 1 degree | 0.01745 | 17 | 3 |
| 0.1 degree | 0.001745 | 54 | 7 |
| 8-bit int | 2^-7 | 26 | 4 |
| 16-bit int | 2^-15 | 403 | 51 |
| 24-bit float | 10^-5 | 703 | 88 |
| 32-bit float | 10^-7 | 7025 | 880 |
| 64-bit float | 10^-17 | ~infinite | 8.7e+8 |
Range Reduction
   We need to map an infinite range of input
values x onto a finite working range [0..C].

   For most transcendentals we use a technique
called additive range reduction
• Works like y = x mod C but without a divide.
• We just work out how many copies of C to subtract from x
to get it within the target range.
1.   We remap 0..C onto 0..1 by scaling

const float C = range;
const float invC = 1.0f/C;

2.   We then truncate towards zero (e.g. convert to int)

int k = (int)(x*invC);
// or (int)(x*invC + 0.5f) to round to nearest

3.   We then subtract k copies of C from x.

float y = x - (float)k*C;
High Accuracy Range Reduction
   Notice that y = x-k*C has a destructive
subtraction.

   Avoid this by encoding C in several constants.
• First constant C1 is a rational that has M bits of C’s
mantissa, e.g. PI = 201/64 = 3.140625
• Second constant C2 = C - C1
• Overall effect is to encode C using more bits than machine
accuracy.

float n = (float)k;
float y = (x - n*C1) - n*C2;
Truncation Towards Zero
   Another method for truncation
• Add a large “magic” constant to your float, then
subtract it again
• You will have lost the fractional bits of the mantissa

A = 123.45     = 1111011.01110011001100110
B = 1.5*2^24   = 1100000000000000000000000.
A = A+B        = 1100000000000000001111011.
A = A-B        = 1111011.00000000000000000

• This technique requires you know the range of your input
parameter…
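A minimal sketch of the trick, assuming non-negative inputs well inside
float integer range; note that under the default IEEE round-to-nearest
mode this actually rounds rather than truncating toward zero:

```
/* B = 1.5 * 2^24, the magic constant from the example above.
 * volatile stops the compiler folding (x + B) - B away.     */
float round_via_magic(float x)
{
    const float B = 1.5f * 16777216.0f;
    volatile float a = x + B;  /* fraction falls off the mantissa */
    return a - B;              /* recover the integer part        */
}
```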
   Instead of range reducing to a whole cycle,
let’s use C=Pi/2 - a quarter cycle
• The lower bits of k now hold which quadrant our angle is in

   Why is this useful?
• Because we can use the double angle formulas
• A is our range reduced angle.
• B is our quadrant offset angle.

$$\sin(A+B) = \sin(A)\cos(B) + \cos(A)\sin(B)$$
$$\cos(A+B) = \cos(A)\cos(B) - \sin(A)\sin(B)$$
Double Angle Formulas
   With four quadrants, the double angle
formulas collapse into this useful form

$$\begin{aligned} \sin(y + 0\cdot\pi/2) &= \sin(y) \\ \sin(y + 1\cdot\pi/2) &= \cos(y) \\ \sin(y + 2\cdot\pi/2) &= -\sin(y) \\ \sin(y + 3\cdot\pi/2) &= -\cos(y) \end{aligned}$$
Four Segment Sine
A Sine Function
   Leading to code like this:

float table_sin(float x)
{
    const float C = PI/2.0f;
    const float invC = 2.0f/PI;
    int k = (int)(x*invC);
    float y = x-(float)k*C;
    switch(k&3) {
        case 0: return  sintable(y);
        case 1: return  sintable(C-y);
        case 2: return -sintable(y);
        default: return -sintable(C-y);
    }
}
   Why stop at just four quadrants?
• With more, narrower segments we need to calculate both the
sine and the cosine of y.
• This is called the reconstruction phase.

$$\sin\!\left(y + 3\cdot\tfrac{2\pi}{16}\right) = \sin(y)\cos\!\left(3\cdot\tfrac{2\pi}{16}\right) + \cos(y)\sin\!\left(3\cdot\tfrac{2\pi}{16}\right)$$

• Precalculate and store these constants.
• For little extra effort, why not return both the sine AND
cosine of the angle at the same time?
• This function is traditionally called sincos() in FORTRAN
libraries
Sixteen Segment Sine
float table_sin(float x)
{
    const float C = 2.0f*PI/16.0f;
    const float invC = 16.0f/(2.0f*PI);
    int k = (int)(x*invC);
    float y = x-(float)k*C;
    float s = sintable(y);
    float c = costable(y);
    switch(k&15) {
        case 0: return s;
        case 1: return s*0.923879533f + c*0.382683432f;
        case 2: return s*0.707106781f + c*0.707106781f;
        case 3: return s*0.382683432f + c*0.923879533f;
        case 4: return c;
        ...
    }
    return 0;
}
Math Function Forms
   Most math functions follow three phases of
execution

1. Range Reduction
2. Approximation
3. Reconstruction

   This is a pattern you will see over and over
•   Especially when we meet Polynomial Approximations
Polynomial Approximation
Infinite Series
   Most people learn about approximating
functions from Calculus and Taylor series

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots$$

   If we had infinite time and infinite storage,
this would be the end of the lecture.
Taylor Series
   Taylor series are generated by repeated
differentiation
• More strictly, the Taylor Series around x=0 is called the
Maclaurin series

$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots$$

   Usually illustrated by graphs of successive
approximations fitting to a sine curve.
Taylor Approx of Sine
Properties Of Taylor Series
$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots$$

   This series shows all the signs of convergence
• Alternating signs
• Rapidly increasing divisor

   If we truncate at the 7th order, we get:

$$\sin(x) \approx x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7 = x - 0.16667x^3 + 0.0083333x^5 - 0.00019841x^7$$
Graph of Taylor Series Error
   The Taylor Series, however, has problems
• The problem lies in the error
• Very accurate for small values but is exponentially bad for
larger values.

   So we just reduce the range, right?
• This improves the maximal error.
• Bigger reconstruction cost, large errors at boundaries.
• The distribution of error remains the same.

   What about reducing to the asymmetric range 0..Pi/2?
• Improves the maximal error.
• But now you have twice as many coefficients (all powers,
not just the odd ones).
Taylor 7th Order for –Pi/2..Pi/2
Taylor 7th Order for 0..Pi/2
Taylor 7th Order for 0..Pi/2
   And now the bad news.

$$\begin{aligned} \sin(x) \approx {}& -0.0000023014110 \\ &+ 1.000023121x \\ &- 0.00010117322x^2 \\ &- 0.1664154429x^3 \\ &- 0.00038530806x^4 \\ &+ 0.008703147018x^5 \\ &- 0.0002107589082x^6 \\ &- 0.0001402989645x^7 \end{aligned}$$
Taylor Series Conclusion
   For our purposes a Taylor series is next
to useless
• Wherever you squash error it pops back up
somewhere else.
• Sine is a well behaved function, the general case
is much worse.

   We need a better technique.
• Make the worst case nearly as good as the best
case.
Orthogonal Polynomials
   Families of polynomials with interesting
properties.
• Named after the mathematicians who discovered them
• Chebyshev, Laguerre, Jacobi, Legendre, etc.

   Integrating the product of two O.P.s returns
zero if the two functions are different.

$$\int w(x)\, P_i(x)\, P_j(x)\, dx = \begin{cases} c_j & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

• Where w(x) is a weighting function.
Orthogonal Polynomials
   Why should we care?
• If we replace Pi(x) with an arbitrary function f(x), we end up
with a scalar value that states how similar f(x) is to Pj(x).
• This process is called projection and is often notated as

$$\left\langle f \mid P_j \right\rangle = \int f(x)\, P_j(x)\, w(x)\, dx$$

   Orthogonal polynomials can be used to
approximate functions
• Much like a Fourier Transform, they can break functions into
approximating components.
Chebyshev Polynomials
   Lets take a concrete example
• The Chebyshev Polynomials Tn(x)

T0  x   1
T1  x   x
T2  x   2 x 2  1
T3  x   4 x 3  3 x
T4  x   8 x 4  8 x 2  1
T4  x   16 x 5  20 x 3  5 x

Tn 1  x   2 xTn  x   Tn 1  x 
Chebyshev Plots
   The first five Chebyshev polynomials
Chebyshev Approximation
   A worked example.
• Let’s approximate f(x) = sin(x) over [-π..π] using Chebyshev
Polynomials.
• First, transform the domain into [-1..1]

$$a = -\pi, \qquad b = \pi$$

$$g(x) = f\!\left(\frac{b-a}{2}\,x + \frac{a+b}{2}\right) = \sin(\pi x)$$
Chebyshev Approximation
   Calculate coefficient kn for each Tn(x)

$$k_n = \frac{1}{c_n} \int_{-1}^{1} g(x)\, T_n(x)\, w(x)\, dx$$

Where the constant cn and weighting function w(x) are

$$c_n = \begin{cases} \pi & \text{if } n = 0 \\ \pi/2 & \text{otherwise} \end{cases} \qquad w(x) = \frac{1}{\sqrt{1 - x^2}}$$
Chebyshev Coefficients
   The resulting coefficients

$$\begin{aligned} k_0 &= 0.0 \\ k_1 &= 0.5692306864 \\ k_2 &= 0.0 \\ k_3 &= -0.666916672 \\ k_4 &= 0.0 \\ k_5 &= 0.104282369 \\ k_6 &= \cdots \end{aligned}$$

• This is an infinite series, but we truncate it to produce an
approximation to g(x)
Chebyshev Reconstruction
   Reconstruct the polynomial in x
• Multiply through using the coefficients kn

$$\begin{aligned} g(x) = {}& k_0 \cdot 1 \\ &+ k_1 \cdot x \\ &+ k_2 \cdot (2x^2 - 1) \\ &+ k_3 \cdot (4x^3 - 3x) \\ &+ k_4 \cdot (8x^4 - 8x^2 + 1) \\ &+ k_5 \cdot (16x^5 - 20x^3 + 5x) \end{aligned}$$
Chebyshev Result
   Finally rescale the domain back to [-π..π]

$$f(x) = g\!\left(\frac{2}{b-a}\,x - \frac{a+b}{b-a}\right) = g\!\left(\frac{x}{\pi}\right)$$

• Giving us the polynomial approximation

$$f(x) = 0.984020813x - 0.153301672x^3 + 0.00545232216x^5$$
Chebyshev Result
   The approximated function f(x)
Chebyshev Absolute Error
   The absolute error sin(x)-f(x)
Chebyshev Relative Error
   The relative error tells a different story…
Chebyshev Approximation
   Good points
•   Approximates an explicit, fixed range
•   Uses easy to generate polynomials
•   Integration is numerically straightforward
•   Orthogonal Polynomials used as basis for new techniques
- E.g. Spherical Harmonic Lighting

   Bad points
• Imprecise control of error
• No clear way of deciding where to truncate series
• Poor relative error performance
Minimax Polynomials
Pafnuty Chebyshev 1821-1894
   Chebyshev described what the
error for the best approximating
polynomial should look like.

   An Order-N approximation error
curve should:

a. Change sign N+1 times
b. Touch the maximal error N+2 times

   This function minimizes the
maximal error, so it is called the
Minimax Polynomial
Evgeny Remez 1896-1975
   The algorithm to generate
these polynomials is the
Remez Exchange Algorithm

   The algorithm requires:
•   A function to be approximated
•   A range of approximation
•   The required order of approximation
•   A weighting function
Remez Exchange Algorithm
   The algorithm works by generating a set of
zero crossings for the error curve, e.g.

$$\begin{aligned} \sin(x_0) - (a + bx_0 + cx_0^2) &= 0 \\ \sin(x_1) - (a + bx_1 + cx_1^2) &= 0 \\ &\;\;\vdots \\ \sin(x_n) - (a + bx_n + cx_n^2) &= 0 \end{aligned}$$

   These are solved for a,b and c as a matrix
inversion. The maximal error is found and one
xn is altered. Repeat until minimized.
Remez Exchange Algorithm
   The algorithm is very sensitive to error and
requires large number representations
(e.g. 50 digits or more) and numerical solvers.

• The best place to find these tools is in a mathematical
package like Maple or Mathematica.

• In Maple, a raw call to minimax() will use all coefficients
available, so we will have to set up the calculation…
Forming The Minimax Inequality
   Start by looking at the Taylor Expansion of
sine to get a feel for what the end result
should look like
• Every odd power, first coefficient is 1.0

$$\sin(x) \approx x - 0.16667x^3 + 0.0083333x^5 - 0.00019841x^7$$

   Transform the problem into finding P(x²) in

$$\sin(x) \approx x + x^3 P(x^2)$$
The Minimax Inequality
   Next, we form the minimax inequality
expressing our desire to minimize the relative
error

$$\left| \frac{\sin(x) - x - x^3 P(x^2)}{\sin(x)} \right| \leq \text{error}$$
The Minimax Inequality
   Divide through by x³

$$\left| \frac{\dfrac{\sin(x)}{x^3} - \dfrac{1}{x^2} - P(x^2)}{\dfrac{\sin(x)}{x^3}} \right| \leq \text{error}$$
The Minimax Inequality
   We want only every other power, so
substitute y = x²

$$\left| \frac{\dfrac{\sin(\sqrt{y})}{y^{3/2}} - \dfrac{1}{y} - P(y)}{\dfrac{\sin(\sqrt{y})}{y^{3/2}}} \right| \leq \text{error}$$
The Function to Approximate
   We solve for P(y) giving us

$$P(y) = \frac{\sin(\sqrt{y})}{y^{3/2}} - \frac{1}{y}$$

   With the weighting function

$$w(y) = \frac{y^{3/2}}{\sin(\sqrt{y})}$$
Final Details
   Before we can run this through the minimax algorithm,
we need to convert P(y) to a Taylor Series
• This stops the sin() function being evaluated and greatly
speeds the calculation

$$P(y) = -\frac{1}{6} + \frac{y}{120} - \frac{y^2}{5040} + \frac{y^3}{362880} - \cdots$$

   We also transform our approximation range

$$x \in \left[0..\tfrac{\pi}{2}\right] \;\Rightarrow\; y = x^2 \in \left[0..\tfrac{\pi^2}{4}\right]$$
Calculating The Minimax Poly
   We run these values through the minimax
algorithm:

minimax(p(y), y=0..Pi^2/4, [2,0], w(y));

   It returns (to 5 d.p.):

$$P(y) = -0.16666 + 0.0083143y - 0.00018542y^2$$

   Finally, we resubstitute back into the original
equation

$$\sin(x) \approx x - 0.16666x^3 + 0.0083143x^5 - 0.00018542x^7$$
Minimax Polynomial Error
   Maximal error = 1.108e-6 at x=1.5707
Magic Numbers
   Comparing the results

$$\begin{aligned} \text{taylor:}\quad & x - 0.16667x^3 + 0.0083333x^5 - 0.00019841x^7 \\ \text{minimax:}\quad & x - 0.16666x^3 + 0.0083143x^5 - 0.00018542x^7 \end{aligned}$$

   An enormous increase in accuracy with a tiny
change in coefficients.
• We must be very sure of our coefficients.
• We must pay attention to floating point rounding.
Side Note: PS2 ESIN Instruction
   The manual entry for ESIN has a small error in it.
Improving On Perfection
   We can optimize polynomials for floating point
representation
• This will guarantee better accuracy for small angles, as the
first term has the most effect in these cases.
• Requires a smaller range than Pi/2, so we will use 0..Pi/4

   Using a different form of minimax inequality,
we can specify the first coefficient:

$$\sin(x) \approx x + kx^3 + x^5 P(x^2)$$
Optimizing For Floating Point
   Now we have to specify the first coefficient as
a single-precision float.

$$k = -0.1666666666\ldots = -\tfrac{1}{6}$$

   Multiply k by 2^24, round and divide by
2^24 to create an exactly representable number.

$$k = \frac{-2796203}{2^{24}} = \frac{-2796203}{16777216} = -0.16666668653488159179687500000$$
Optimizing For Floating Point
   Solving the Minimax Inequality for P(x²) and
resubstituting gives

$$\sin(x) \approx x - \frac{2796203}{16777216}x^3 + 0.00833282412x^5 - 0.000195878418x^7$$

   Maximal error = 6.275e-9 at x=0.7855
• About twice the error of the unoptimized version but with
better accuracy for small angles.
Float Optimized Sine Minimax
Optimizing for Small Angles
   Most functions are bound by the accuracy of
their first coefficient
• In the same way we split the range reduction constant into two
parts, we can do the same for the first coefficient of a
polynomial, e.g.

$$\ln(x) \approx a_0 x + a_1 x + x^2 P(x)$$

• Where a0+a1 = k, the leading coefficient.
Fast Polynomial Evaluation
Polynomial Evaluation
   Now we have our coefficients, we need to
evaluate them efficiently.

• When presented with a polynomial of the form

$$y = cx^2 + bx + a$$

• Most of us would produce code like

y = c*x*x + b*x + a;
Large Polynomials
   Faced with a larger polynomial…

$$y = Cx^7 + Bx^5 + Ax^3 + x$$

… you can see the need for better methods

y = C*x*x*x*x*x*x*x +
    B*x*x*x*x*x +
    A*x*x*x +
    x;
Horner Form
   Most good programmers will use Horner Form
to encode the polynomial

z = x*x;
y = (((C*z + B)*z + A)*z + 1)*x;

   This also codes well if you have multiply-add
instructions available

   Horner Form is used by nearly all libraries,
with coefficients stored as arrays of values
The Problem With Horner Form
   Horner Form leads to very serial computations
• Later instructions are waiting for earlier results.
• A lot of instructions are stalled in the pipeline.
• Makes no use of available SIMD instructions.

   There is a better way – Estrin’s Method.
• Comes from parallel computation research
• Only documented in Knuth – nowhere online!
Estrin’s Method
   Stems from the observation that by
bracketing, we can produce regular sub-
expressions

$$f(x) = Dx^3 + Cx^2 + Bx + A = (Dx + C)x^2 + (Bx + A) = Px^2 + Q$$

The trick is to keep a running copy of $x^{2^N}$ as
“glue” between each sub-expression…
Estrin’s Method Table
p0(x) =                                                             A
p1(x) =                                                         (Bx+A)
p2(x) =                                                   (C)x2+(Bx+A)
p3(x) =                                               (Dx+C)x2+(Bx+A)
p4(x) =                                       (E)x4+((Dx+C)x2+(Bx+A))
p5(x) =                                    (Fx+E)x4+((Dx+C)x2+(Bx+A))
p6(x) =                            ((G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A))
p7(x) =                        ((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A))
p8(x) =                 (I)x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
p9(x) =             (Jx+I)x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
p10(x)=     ((K)x2+(Jx+I))x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
p11(x)= ((Lx+K)x2+(Jx+I))x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
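As a hedged sketch, here is the p7 row coded directly in C; the four inner
multiply-adds are mutually independent, which is what a superscalar or
SIMD machine can exploit:

```
/* Estrin evaluation of A + Bx + Cx^2 + Dx^3 + Ex^4 + Fx^5 + Gx^6 + Hx^7 */
float estrin7(float x, float A, float B, float C, float D,
                       float E, float F, float G, float H)
{
    float x2 = x*x;            /* the "glue" powers x^2 and x^4     */
    float x4 = x2*x2;
    float p01 = B*x + A;       /* these four pairs have no mutual   */
    float p23 = D*x + C;       /* dependencies and can overlap      */
    float p45 = F*x + E;
    float p67 = H*x + G;
    float q03 = p23*x2 + p01;  /* combine pairs with x^2            */
    float q47 = p67*x2 + p45;
    return q47*x4 + q03;       /* final combine with x^4            */
}
```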
Estrin’s Method Chart
Packing Estrin’s Into SIMD
   Evaluating the polynomial takes O(log N) time
• Proves that you can improve on Horner Form

   Each row should be evaluated in parallel
• Estrin’s assumes N processing units
• We only have four-way SIMD instructions
• We will still have serialization stalls

   Many architectures don’t allow interaction
between vector elements
• We have to split calculations across SIMD registers
• We can often use redundant calculations to speed evaluation
Brute Force Searching
   One tool can find the perfect evaluation order
for every architecture – Brute Force
• We generate trees of expressions
• Flatten the tree into an ordered sequence of instructions
• Evaluate the sequence w.r.t a particular architecture

   Each run can generate millions of trees to test
• Run the program overnight, keeping the best result so far
• Distribute the calculation over many machines
Expression Tree
f ( x)  Ax  Bx  Cx  Dx
3     5        7
Common Expression Removal
Tree After Rotations

$$f(x) = x\left( (A + Bz) + z^2 (C + Dz) \right) \qquad \text{where } z = x^2$$
Coding Using SIMD
   Using SIMD instructions to code this
• Temptation is to gather all coefficients into a 4-vector
• Calculate [x, x3, x5, x7] and do one big multiply
• Finally sum the result

...
calculate [ x, x3, x5, x7 ]
...
N :=   [ A, B , C , D ]
* [ x, x3, x5, x7 ]
...
sum the elements of N
...
Coding With SIMD
   This turns out to be very slow
• You are only using one parallel instruction!
• Better to think of calculating in two pairs
...
calculate [ x, x2, x, x2 ]
...
N :=   [ A, B, C, D ]
* [ x, x2, x, x2 ]

N :=     [ Ax, Bx2, Cx, Dx2 ]
* [ -, -, x2, x2 ]
...
sum the elements of N
...
Summing Elements
   Summing the elements of a vector is very
serial if you don’t have a special instruction
• Better to split the problem over two registers.

Serial (each add waits          Split over two registers (the
for the previous result):       first two adds run in parallel):

A.x = A.x + A.y                 A.x = A.x + A.y
A.x = A.x + A.z                 B.x = B.x + B.y
A.x = A.x + A.w                 A.x = A.x + B.x
Finished.                       Finished.
SIMD Insights
   The more work you have to do, the more
opportunities for parallelisation
• Easier to code higher power polynomials as O(log n) starts
to be more and more effective
• Use redundant elements as “free duplicate values” to avoid
moving elements around
• You end up distributing coefficients in small groups

   SinCos is a good example
• More efficient to do both sine and cosine at the same time

Coeff1 := [A, A, E, E]
Coeff2 := [B, B, F, F]
Coeff3 := [C, C, G, G]
Coeff4 := [D, D, H, H]
Higher Order Functions
More Transcendental Functions
   What about the other Transcendentals?
• Tangent, Arctangent, Logarithm, Exponent, Roots, Powers

   Let’s take a look at how well they are
approximated by polynomials
• For each power, make an approximation (e.g. minimax
between [0..1])
• Calculate the maximum error within the range
• Find the negative base-2 log of the error

   This value tells us how many bits of accuracy
we have for each approximation
Table Of Error In Bits

Bits of accuracy by polynomial order:

| Function | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| Exp(x) | 6.8 | 10.8 | 15.1 | 19.8 | 24.6 | 29.6 | 34.7 |
| Sin(x) | 7.8 | 12.7 | 16.1 | 21.6 | 25.5 | 31.3 | 35.7 |
| Ln(1+x) | 8.2 | 11.1 | 14.0 | 16.8 | 19.6 | 22.3 | 25.0 |
| Atan(x) | 8.7 | 9.8 | 13.2 | 15.5 | 17.2 | 21.2 | 22.3 |
| Tan(x) | 4.8 | 6.9 | 8.9 | 10.9 | 12.9 | 14.9 | 16.9 |
| Asin(x) | 3.4 | 4.0 | 4.4 | 4.7 | 4.9 | 5.1 | 5.3 |
| Sqrt(x) | 3.9 | 4.4 | 4.8 | 5.2 | 5.4 | 5.6 | 5.8 |
Interpretation
   Sine and Exponent give us 4 bits per power
• Will only need a 7th order approximation for single floats

   Why are tangent and arcsine so badly
behaved?
• We will find out shortly…

   Square Root will never be approximated by a
polynomial.
• Less than half a bit of accuracy per power
• We must use Newton’s Method with good initial guesses
Hardware Square Root
   You will never outperform a hardware square root
• To avoid serializing the Newton’s Iterations, hardware
designs use a redundant signed-digit number system.

$$\begin{aligned} 637 &= 6{\cdot}10^2 + 3{\cdot}10^1 + 7{\cdot}10^0 \\ &= 1{\cdot}10^3 - 4{\cdot}10^2 + 4{\cdot}10^1 - 3{\cdot}10^0 = 1\bar{4}4\bar{3} \\ &= 1{\cdot}10^3 - 4{\cdot}10^2 + 3{\cdot}10^1 + 7{\cdot}10^0 = 1\bar{4}37 \end{aligned}$$

• Carry propagation happens only once, at the end of the
calculation
Tangent
Tangent Definitions
   Tangent is defined as the ratio of sine over
cosine

$$\tan(x) = \frac{\sin(x)}{\cos(x)}$$

   For infrequent use this calculation is good
enough
• Remember to test for cos(x)=0 and return BIGNUM
• Has twice the inaccuracy of sine or cosine
Using Trig Identities
   We must range reduce as much as possible
• Using trigonometric identities to help us

tan  x    tan x 

   We record the sign, force x to positive
• We will replace the sign during reconstruction
• We have halved our input range
Range Reduction
   The next identity is the fun one

tan x  
1
     
tan   x 
2     
 cot  2  x 

   This tells us two things
1. The range 0..Pi/2 is repeated over and over.
2. There is a sweet spot at Pi/4 where both sides are equal.
Tangent and Cotangent
Tangent and Cotangent Detail
Minimax Approximations
   Tangent between 0..Pi/4 is simple enough
• Looking at the Taylor Series, it uses odd powers.

$$\tan(x) = x + \frac{x^3}{3} + \frac{2x^5}{15} + \frac{17x^7}{315} + \cdots$$

• Apart from being strangely resistant to polynomial
approximation, it is a straightforward minimax.

$$\tan(x) \approx x + x^3 P(x^2)$$
Cotangent Minimax
   But how to approximate Cotangent?
• It has infinite values and high curvature

   There is a trick using small angles
• The cotangent is just cosine/sine
• As x→0, sin(x)→x and cos(x)→1
• Meaning that as x→0, cotangent→1/x
Cotangent and 1/x
Cotangent Minus 1/x
Cotangent Minimax
   We can use a different minimax form

$$\cot(x) \approx \frac{1}{x} + xP(x^2)$$
   This has an interesting property
• The reciprocal is based on nothing but x
• We can start a reciprocal instruction way up the pipeline and
use the result after calculating the difference polynomial
Final Tangent Form
   The final form of the tangent function

$$\tan(x) = \begin{cases} \tan(x) & 0 \le x < \tfrac{\pi}{4} \\[2pt] \cot\!\left(\tfrac{\pi}{2} - x\right) & \tfrac{\pi}{4} \le x < \tfrac{\pi}{2} \\[2pt] -\cot\!\left(x - \tfrac{\pi}{2}\right) & \tfrac{\pi}{2} \le x < \tfrac{3\pi}{4} \\[2pt] -\tan(\pi - x) & \tfrac{3\pi}{4} \le x < \pi \end{cases}$$

• where the tan and cot on the right are the minimax
approximations, each evaluated on 0..π/4
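A minimal sketch of that form in C, assuming x has already been folded
into [0..π) and that tan_poly()/cot_poly() are the minimax approximations
just described (both names are illustrative):

```
#define PI 3.14159265f

float tan_poly(float x);   /* assumed: minimax tan on [0..PI/4]  */
float cot_poly(float x);   /* assumed: minimax cot on (0..PI/4]  */

float tangent(float x)     /* x already folded into [0..PI)      */
{
    if (x < PI*0.25f) return  tan_poly(x);
    if (x < PI*0.50f) return  cot_poly(PI*0.5f - x);
    if (x < PI*0.75f) return -cot_poly(x - PI*0.5f);
    return                   -tan_poly(PI - x);
}
```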
Tangent Error Plot 0..2Pi
Rational Polynomials
   Minimax can also produce rational polynomials
• A pair of polynomials that are divided by each other to
produce the final result

$$\tan(x) \approx \frac{xP(x^2)}{Q(x^2)}$$
   Rational Polynomials can be very accurate for
low powers
• If you can take the cost of a divide or reciprocal at the end
of the approximation process
• Popular for high accuracy functions
Rational Tangent
   Rational approximation to tangent
• Gives 1.3e-8 accuracy using fewer operations

$$\tan(x) \approx \frac{1.15070768x - 0.110238971x^3}{1.15070770 - 0.49380897x^2 + 0.0111809593x^4}$$

• Can improve this by dividing through by the denominator
constant, saving us a load-constant instruction

$$\tan(x) \approx \frac{0.999999986x - 0.0958010197x^3}{1 - 0.429135022x^2 + 0.00971659383x^4}$$
Arctangent
Inverse Trig Functions
   Arctangent
• The inverse of a tangent
• Maps [-inf..+inf] to [-π/2..π/2]

   Arcsine and Arccosine
• Asin maps [-1..+1] to [-π/2..π/2]
• Acos maps [-1..+1] to [π..0]
Why only Arctangent?
   Atan, Acos and Asin are closely related

$$\arccos(x) = \arctan\!\left(\frac{\sqrt{1-x^2}}{x}\right)$$

$$\arcsin(x) = \arctan\!\left(\frac{x}{\sqrt{1-x^2}}\right)$$

$$\arccos(x) = \frac{\pi}{2} - \arcsin(x) = 2\arcsin\!\left(\sqrt{\frac{1-x}{2}}\right)$$
Range Reduction
   Arctan has these interesting properties

arctan  x    arctan( x)
       1
arctan x    arctan  
2          x

   We only need to define arctan between 0..1
• That is one huge range reduction
Arctan Taylor Series
   The Taylor series for Arctangent looks
strangely familiar, yet different…
• It converges very, very slowly
• Doesn’t even converge for |x|>1

$$\arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots$$

• For 1e-8 accuracy you need to evaluate 7071 terms!
• We need polynomial approximations to make the function a
viable calculation
Minimax Inequality
   The minimax form is very straightforward

arctan x   x  x P x3
 
2

   But gives us very poor accuracy
• A 13th power minimax approximation over 0..1 gives a
maximal error of 0.9!
• We need to reduce the range a lot more
Double Angle Formulation
   Arctangent is the inverse of a tangent
• So let’s assume x = tan(a+b) so that arctan(x) = a+b.
What are a and b?

$$x = \tan(a+b) = \frac{\tan(a) + \tan(b)}{1 - \tan(a)\tan(b)}$$

$$z = \tan(a) \Rightarrow a = \arctan(z) \qquad k = \tan(b) \Rightarrow b = \arctan(k)$$

$$x = \frac{z+k}{1-zk} \;\Rightarrow\; z = \frac{x-k}{1+kx}$$

$$\arctan(x) = \arctan(k) + \arctan(z) = b + \arctan(z)$$
The X-Z Mapping

 xk 
arctan x   b  arctan         
 1  kx 

   How do we use this?
• It remaps input x into the smaller z-range
• We can subdivide 0..1 into, say, 2 segments, each with their
own b offset and k parameter.

   What value of k to choose?
• If z(0) = -z(1) we have equal distribution either side of the
graph, which happens when k = sqrt(2)-1
• Or divide by equal input values – ¼, ½, ¾
Some X-Z Mappings
   Now you have a set of tools to play with:
• Range reduction to 0..1
• Complement if x>1.0
• Choice of X-Z map
- Equal-Angle maps are most accurate
- Equal-Input maps are faster to range reduce
• Choice of minimax order

float arctan(float x)
{
    float k, b, sign = +1.0f;
    bool compl = false;
    // remove sign
    if(x<0.0f) { sign = -1.0f; x = -x; }
    // complement if x>1.0
    if(x>1.0f) { compl = true; x = 1.0f/x; }
    // x-z range reduce
    if(x<0.5f) { k = 0.25f; b = 0.24497866f; }  // b = arctan(0.25), precomputed
    else       { k = 0.75f; b = 0.64350111f; }  // b = arctan(0.75), precomputed
    float z = (x-k)/(1.0f+k*x);
    // poly approx
    float y = b + arctan_poly(z);
    // finish complement
    if(compl) y = PI/2.0f - y;
    // reinstate sign
    return y*sign;
}
Using Arctan for Arccosine
   The ideal example for using IEEE754 flags
• To convert from arctan to arccosine we need to cope as x
tends towards zero

$$\arccos(x) = \arctan\!\left(\frac{\sqrt{1-x^2}}{x}\right)$$

• Arccos(0.0) = Arctan(+inf) = Pi/2

• We must remember to reset the divide-by-zero flag before
returning from arccos(0.0)
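A minimal sketch, assuming the arctan() built above and IEEE 754 division
semantics (1.0/0.0 = +inf); this form is valid for 0 <= x <= 1, and
negative x would need an extra π (not shown):

```
#include <math.h>

float arctan(float x);   /* the arctan built above */

float arccos_via_arctan(float x)
{
    /* at x == 0 the divide gives +inf and arctan(+inf) = PI/2, which is
       exactly arccos(0); remember to clear the divide-by-zero flag
       afterwards, as noted above */
    return arctan(sqrtf(1.0f - x*x) / x);
}
```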
Exponent
Exponent: The Anti-Log
   All exponents can be expressed as powers of
the Euler constant e.

$$10 = e^{\ln 10} \;\Rightarrow\; 10^x = e^{x\ln 10}$$
$$2 = e^{\ln 2} \;\Rightarrow\; 2^x = e^{x\ln 2}$$
Range Reduction
   Exponent also uses additive range reduction,
much like sine and cosine

$$x = N\ln(2)/2 + r = NC + r$$

   This means we break x into two parts
• 2 raised to some exponent plus a fractional residual.
• Which floating point numbers already do!

$$\exp(x) = 2^{N/2}\, e^r$$
Polynomial Approximation
   Looking at the Taylor series

$$\exp(x) = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \cdots$$

   We can subtract the 1 and minimax it over the
range [0..ln(2)/2] with

$$\exp(x) \approx 1 + x + x^2 P(x)$$
The Exponent Function
   Two methods of coding:

1. Breaking your input into exponent and mantissa and
reassembling at the end
2. Using tables for 2^N to keep all work in float registers,
but only supporting a small range of input

float fast_exp(float x)
{
    const float C = 2.0f/ln(2);
    const float invC = ln(2)/2.0f;
    // range reduction
    int N = (int)(x*C);
    float y = x - N*invC;
    // approximation
    y = exp_poly(y) + 1.0f;
    // reconstruction
    return power2[N>>1]*y;
}
Semi-Table Based Reconstruction
   New research has better methods for
reconstructing the power of two
• There are a family of algorithms by Tang where K > 1 in

$$\exp(x) = 2^{N/2^K} e^r$$

• To reconstruct, say, 2^(N/128) we calculate

$$N = 128M + 16K + J$$
$$2^{N/128} = 2^M \cdot 2^{K/8} \cdot 2^{J/128}$$

• Where the last two terms come from 8 and 16 entry tables
Fast Float Exponent
   How about just constructing a float directly
from an integer power?
• Remember how the IEEE exponent works.

01011100000100000000000000000000

• float = sign * 2^(exponent - 127) * 1.mantissa
• Note also that:

$$e^x = 2^{x/\ln 2} = 2^{x\cdot\frac{1}{\ln 2}} = 2^{kx}$$
Fast Float Exponent
   Take an integer in [-126..127], e.g. 12

00000000000000000000000000001100

   Integer multiply it by the integer 2^23 / ln(2)

*  00000000101110001010101000111011
=  00001000101001111111101011000101

   Add in the bias offset integer 127 * 2^23

+  00111111100000000000000000000000
=  01001000001001111111101011000110
=  172011.09375
Fast Float Exponent
   Plotting the function 1/(1-exp(-x)) shows an offset:
Fast Exponent Fixed
   Subtracting 0.7*2^23 = 5872025 removes the offset
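Assembling those steps into one hedged sketch (assumes 32-bit ints and
IEEE 754 floats; the 5872025 offset is the slides' empirical correction):

```
#include <string.h>

static float int_as_float(int i)   /* reinterpret the bits */
{
    float f; memcpy(&f, &i, sizeof f); return f;
}

float fast_exp_hack(float x)       /* rough approximation of e^x */
{
    int i = (int)(x * 12102203.0f) /* x * 2^23/ln(2) into the exponent field */
          + (127 << 23)            /* add the IEEE exponent bias             */
          - 5872025;               /* subtract 0.7*2^23 to remove the offset */
    return int_as_float(i);
}
```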
Increasing Resolution
   The problem with single precision floats
• there are only 255 exponent values we can use, hence the
staircase effect.

   Solution:
• Allow the input power to have some fractional bits.
• Take the float, multiply it by 2^3 = 8 and floor it,
e.g. 12.125 → 97

00000000000000000000000001100001

• Multiply by 2^(23-3) / ln(2), add the IEEE bias 127*2^23

• Pick how many fractional bits you want, but the bias
changes each time
- e.g. if bits = 6 then bias = 3*2^17
High Resolution Fast Exponent
High Res Fast Exponent Detail
High Res Relative Error
Logarithm
Logarithms
   Ln(x) returns exponents
• Defined in the range (0..+infinity)

   There are two main logarithm functions
• log(x) refers to the logarithm base 10
• ln(x) refers to the Natural Logarithm base e

   They are freely convertible

$$\log(x) = \frac{\ln(x)}{\ln(10)}$$
Taylor Series
   The Taylor Series comes in two forms
• First is closely related to the arctangent series

$$\ln(x) = 2\left(z + \frac{z^3}{3} + \frac{z^5}{5} + \frac{z^7}{7} + \cdots\right) \quad \text{where } z = \frac{x-1}{x+1}$$

• Second is simpler but converges more slowly

$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \frac{x^5}{5} - \frac{x^6}{6} + \cdots$$
Range Reduction
   Range reduction for ln uses multiplicative
range reduction
• Given that we store floating point numbers as exponent and
mantissa

$$x = \text{sign} \cdot 2^n \cdot f \qquad \text{where } 1.0 \le f < 2.0$$

• we find that the logarithm of such a number is

$$\ln(x) = \ln(2^n \cdot f) = \ln(2^n) + \ln(f) = n\ln 2 + \ln f$$
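A hedged sketch of the reduction on a PC, using frexpf() from the C
library to do the exponent/mantissa split (the VU bit tricks come next);
ln_poly() stands in for an assumed polynomial approximation of ln(f):

```
#include <math.h>

float ln_poly(float f);     /* assumed: approximates ln(f) on [1..2) */

float fast_ln(float x)
{
    int n;
    float f = frexpf(x, &n); /* x = f * 2^n, with f in [0.5..1)      */
    f *= 2.0f; n -= 1;       /* renormalize so f is in [1.0..2.0)    */
    return n*0.69314718f     /* n * ln(2)                            */
         + ln_poly(f);       /* plus the approximation of ln(f)      */
}
```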
Dissecting Floats
   How do we split a float into exponent and
mantissa?
• On a PC it’s easy, you just mask and shift bits.
• Remember to subtract 127 from the exponent.

   On the PS2 Vector Units, we have a problem.
• Only 16 bit integer registers – not enough to hold even the
entire mantissa.
• No way to shift the exponent into lower bits
• Float-to-int conversion loses the exponent

   But there is a trick…
Extracting The Exponent
   What happens if you convert a float to a float?
• Pretend the contents of a floating-point register is an integer
value and convert this “integer” into a float.
• For example, take the value 2^57 * 1.125
• sign = 0
exponent = 57 + 127 = 184
mantissa = 1.125

01011100000100000000000000000000

• This bit pattern represents the integer 1544552448
Extracting The Exponent
• Multiply 1544552448 by 1/2^23 to rescale the decimal point

$$1544552448 \times \frac{1}{2^{23}} = 184.125$$

• Subtract the exponent bias of 127

$$184.125 - 127 = 57.125$$

• Convert to integer to obtain the true exponent
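The same steps as a minimal C sketch for a PC, where reinterpreting the
bit pattern is just a memcpy (on the VU it is the float-to-float trick):

```
#include <string.h>

static int float_as_int(float x)      /* reinterpret the bits */
{
    int i; memcpy(&i, &x, sizeof i); return i;
}

/* returns exponent plus a small fraction, e.g. 57.125 for 2^57 * 1.125 */
float extract_exponent(float x)
{
    float e = (float)float_as_int(x); /* convert the bit pattern itself */
    return e * (1.0f/8388608.0f)      /* rescale by 1/2^23              */
             - 127.0f;                /* remove the exponent bias       */
}
```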
A Mathematical Oddity: Bitlog
   A real mathematical oddity
•   The integer log2 of a 16 bit integer
•   Given an N-bit value, locate the leftmost nonzero bit.
•   b = the bitwise position of this bit, where 0 = LSB.
•   n = the NEXT three bits (ignoring the highest 1)

$$\text{bitlog}(x) = 8(b - 1) + n$$

   Bitlog is exactly 8 times larger than log2(x)-1
Bitlog Example
   For example take the number 88

88 = 1011000₂
b = 6
n = 011₂ = 3
bitlog(88) = 8*(6-1)+3 = 43

• (43/8)+1 = 6.375
• log2(88) = 6.4594…

   This relationship holds down to bitlog(8)
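A minimal sketch of bitlog for 16-bit values, assuming x >= 8 so the
three bits below the leading 1 exist:

```
int bitlog(unsigned x)
{
    int b = 15;
    while (!((x >> b) & 1)) --b;   /* locate the leftmost nonzero bit */
    int n = (x >> (b - 3)) & 7;    /* the next three bits below it    */
    return 8*(b - 1) + n;          /* 8*(b-1) + n, as above           */
}
```

For 88 this finds b = 6, n = 3 and returns 43, matching the worked example.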
Bitlog Graph
X To The Power N
Integer Powers
   Raising to an integer power
• multiply together the powers of x where N has a bit set

$$N = 23 = 10111_2$$
$$x^{23} = x^{16} \cdot x^4 \cdot x^2 \cdot x^1$$

   If N<0 then calculate 1/result
   If x<0 then record the sign and reinstate it later
• Note that even powers of a negative number lose the sign
Integer Power Function
   As simple as it sounds
• Loop through the bits of N
• Keep squaring x each loop
• Careful noting and reinstating of signs

float powi(float x, int N)
{
    float y, w;
    bool ns = false, xs = false;
    // record signs
    if(N<0) ns = true, N = -N;
    if(x<0.0f) xs = true, x = -x;
    // setup initial powers
    if(N&1) y = x;
    else y = 1.0f, xs = false;
    w = x; N >>= 1;
    while(N) {
        w = w * w;
        if(N&1) y *= w;
        N >>= 1;
    }
    // reinstate signs
    if(xs) y = -y;
    if(ns) y = 1.0f/y;
    return y;
}
X^N Approximation
   From Graphics Gems 4

$$\text{pow}(x, n) \approx \frac{x}{n - nx + x} \qquad \text{for } 0 \le x \le 1$$

(the graph compares the approximation against the actual curve)
Raising to a Power
   The canonical method is to do the calculation
in two parts – log and exponent

$$\text{pow}(x, N) = e^{N \ln x} = 2^{N \log_2 x}$$

• Range reduction only needs to be done once.
• MUST use extended accuracy in internal workings.
- Log2(x) must have 8+ extra bits of accuracy
• The most expensive operation in the toolbox.
• Also the most difficult to write.
- e.g. calculating 56^156 naively gives an error of 1052 ULPs!
Floating Point Power Hack
   The log and power can be faked
• If we force a float through the int-to-float conversion, the
result is approximately logarithmic.

$$\text{int\_to\_float}(x) \approx A\log_2 x + B$$

• And conversely the inverse operation approximates

$$\text{float\_to\_int}(x) \approx 2^{\frac{x - B}{A}}$$

   Credited to Matthew Jones, Infogrames
Expanding Out
   Combine these two operations to approximate
the power function

$$x^N \approx 2^{\frac{N(A\log_2 x + B) - B}{A}} = 2^{N\log_2 x + \frac{(N-1)B}{A}}$$

• So we have an extra term to subtract before we can use the
approximate power.
Floating Point Power Hack
   So, the algorithm goes like this
• First, convert our float into logarithmic space

float a = (float)float_as_int(x);

• We multiply this value by the power we want to raise it to.

a = a * power;

• Then we subtract the magic constant

magic = float_as_int(1.0f) * (power-1.0f);
a = a - magic;

• And finally reconvert back to a float

x = int_as_float((int)a);
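The whole hack rolled into one hedged sketch (float_as_int/int_as_float
reinterpret bits via memcpy; accuracy is rough, as discussed above):

```
#include <string.h>

static int float_as_int(float f) { int i; memcpy(&i, &f, sizeof i); return i; }
static float int_as_float(int i) { float f; memcpy(&f, &i, sizeof f); return f; }

float fast_pow(float x, float power)   /* rough approximation of x^power */
{
    float a = (float)float_as_int(x);              /* into log space        */
    a = a*power                                    /* multiply by the power */
      - (float)float_as_int(1.0f)*(power - 1.0f);  /* subtract the magic    */
    return int_as_float((int)a);                   /* back out of log space */
}
```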
Scaling the Result
   We can scale the resulting value
• by reformulating the magic number

magic = float_as_int(1.0f) * (power-1.0f);
a = a - magic;

• Can be rewritten to give us 255 * x^N for no additional work

magic = float_as_int(255.0f) -
        float_as_int(1.0f) * power;
a = a + magic;
Cos16(x) Graph
Cos16(x) Absolute Error
Cos16(x) Relative Error
Another Approach
   An interesting effect
• When raising pixel intensities in the range [0..255] to a power,
for most of the range the result is less than one intensity level

• As noted by Beaudoin & Guardado in “Direct3D ShaderX: Vertex
and Pixel Shader Tips and Tricks”
Low Power Approximation
   Higher powers look a bit like lower powers
scaled and offset
• Example: Using x⁴ to approximate x¹⁶
Low Power Approximation
   Our function is zero until it is greater than one
part in 255
Low Power Approximation
   New power approximation is:

$$x^m \approx \max(ax + b,\, 0)^n$$

• Where n << m and a and b are:

$$a = \frac{1}{1 - 256^{-1/n}}, \qquad b = 1 - a$$
Low Power Approx Error
Improved Error
   Using a binary search, improved values of a and b can
minimize the maximal error
Low Power Approx Problem…
   There is however a problem

• The paper doesn’t seem to have obvious “best” values
for a and b.
- The results are just presented as a table of values with errors.

• Closer powers don’t seem to reduce the error
- so there is no obvious way to control error, relative or absolute.

   What is going on?
• Let’s plot the maximal error for every power approximating
every other as a surface…
Low Power Approx Problem…

x⁴ approx of x¹⁶

   The original example was a global minimum!
Conclusions
   Get to know IEEE 754
• It’s being updated as we speak
• If your implementation differs, you will understand
its weaknesses

   Always test both absolute and relative
error
• Functions may look accurate on a graph
• Relative error will show the tough weaknesses
Conclusions
   Incremental Algorithms
• Used for generating sequences of trig operations
• Some of the fastest techniques around

   Table Based Algorithms
• Don’t discount them just yet
• Tables are usually far too big – check the size
• Use range reduction and reconstruction on tables
Conclusions
   Polynomial Approximations
• Some of the most flexible techniques around
• Range Reduce, Approximate, Reconstruct
• Choose between
-   Aggressive or simple range reduction
-   Polynomial order
-   Parallel polynomial evaluation techniques
-   Expensive or simple reconstruction
-   Semi table-based reconstruction

Conclusions
   New techniques
• Floating Point Hacks
- An exciting new set of techniques

• Gaining support from hardware designers
- IA64 architecture uses 8-bit mantissa lookup for 1/x

• Combine with semi table-based reconstruction
- Accuracy over a limited range
- Research to be done on accurate approximations
References
   Many of the original books are out of print
- Researchers work with piles of photocopies!

• Hart, J.F., Computer Approximations, John Wiley & Sons, 1968
• Cody & Waite, Software Manual for the Elementary
Functions, Prentice Hall, 1980
• Goldberg, David, What Every Computer Scientist Should
Know About Floating Point Arithmetic, ACM Computing
Surveys, March, 1991
• Muller, J.M., Elementary Functions: Algorithms and
Implementations, Birkhaüser, 1997
• Tang, Ping Tak Peter, Table Lookup Algorithms for
Elementary Functions and Their Error Analysis, Proceedings
of 10th Symposium on Computer Arithmetic, 1991
Website
   Notes available from:

http://research.scea.com/

robin_green@playstation.sony.com
Questions?
