Faster Math Functions
Robin Green, R&D Programmer, Sony Computer Entertainment America

What Is This Talk About?
This is an Advanced Lecture
• There will be equations
• Programming experience is assumed
Writing your own Math functions
• Optimize for Speed
• Optimize for Accuracy
• Optimize for Space
• Understand the trade-offs

Running Order
Part One – 10:00 to 11:00
• Floating Point Recap
• Measuring Error
• Incremental Methods - Sine and Cosine
Part Two – 11:15 to 12:30
• Table Based Methods
• Range Reduction
• Polynomial Approximation

Running Order
Part Three – 2:00 to 4:00
• Fast Polynomial Evaluation
• Higher Order Functions - Tangent - Arctangent, Arcsine and Arccosine
Part Four – 4:15 to 6:00
• More Functions - Exponent and Logarithm - Raising to a Power
• Q&A

Floating Point Formats
32-bit Single Precision Float

Floating Point Standards
IEEE 754 is undergoing revision.
• In process right now. Get to know the issues.
• Quiet and Signaling NaNs.
• Specifying Transcendental Functions.
• Fused Multiply-Add instructions.

History of IEEE 754
IEEE 754 was ratified in 1985 after 8 years of meetings. A story of pride, ignorance, political intrigue, industrial secrets and genius. A battle of Good Enough vs. The Best.

Timeline: The Dark Ages
Tower of Babel
• On one machine, values acted as non-zero for add/subtract and zero for multiply/divide:
    b = b * 1.0;
    if(b==0.0) error;
    else return a/b;
• On another platform, some values would overflow if multiplied by 1.0, but could grow by addition.
• On another platform, multiplying by 1.0 would remove the lowest 4 bits of your value.
• Programmers got used to storing numbers like this:
    b = (a + a) - a;

Timeline: 8087 Needs “The Best”
Intel decided the 8087 has to appeal to the new mass market.
• Help “normal” programmers avoid the counterintuitive traps.
• Full math library in hardware, using only 40,000 gates.
• Kahan, Coonen and Stone prepare a draft spec, the K-C-S document.
Timeline: IEEE Meetings
Nat Semi, IBM, DEC, Zilog, Motorola and Intel all present specifications.
• Cray and CDC do not attend…
DEC with VAX has the largest installed base.
• Double float had an 8-bit exponent.
• Added an 11-bit “G” format to match K-C-S, but with a different exponent bias.
K-C-S has a mixed response.
• Looks complicated and expensive to build.
• But there is a rationale behind every detail.

Timeline: The Big Argument
K-C-S specified Gradual Underflow. DEC didn’t.

Timeline: The Big Argument
Both Cray and VAX had no way of detecting flush-to-zero. Experienced programmers could add extra code to handle these exceptions. How to measure the Cost/Benefit ratio?

Timeline: Trench Warfare
DEC vs. Intel
• DEC argued that Gradual Underflow was impossible to implement on VAX and too expensive.
• Intel had cheap solutions that they couldn’t share (similar to a pipelined cache miss).
Advocates fought for every inch
• George Taylor from U.C. Berkeley built a drop-in VAX replacement FPU.
• The argument of “impossible to build” was broken.

Timeline: Trench Warfare
DEC turned to theoretical arguments
• If DEC could show that Gradual Underflow was unnecessary then K-C-S would be forced to be identical to VAX.
K-C-S had hard-working advocates
• Prof. Donald Knuth, programming guru.
• Dr. J.H. Wilkinson, error-analysis & FORTRAN.
Then DEC decided to force the impasse…

Timeline: Showdown
DEC found themselves a hired gun
• U. Maryland Prof. G.W. Stewart III, a highly respected numerical analyst and independent researcher.
In 1981 in Boston, he delivered his verdict verbally…
“On balance, I think Gradual Underflow is the right thing to do.”

Timeline: Aftermath
By 1984, IEEE 754 had been implemented in hardware by: Intel, Nat. Semi., AMD, Weitek, Apple, Zilog, IBM and AT&T. It was the de facto standard long before being a published standard.

Why IEEE 754 Is Best

The Format
Sign, Exponent, Mantissa
• The mantissa is now officially called the “significand”.
Why base 2?
• Base 2 has the smallest “wobble”.
• Base 2 also has the hidden bit.
 - More accuracy than any other base for N bits.
 - Base 3 arguments always argue using fixed-point values.
Why 32, 64 and 80-bit formats?
• Because the 8087 could only do 64 bits of carry propagation in a cycle!

Why A Biased Exponent?
For sorting. Biased towards underflow.
    exp_max = 127; exp_min = -126;
• Small number reciprocals will never overflow.
• Large numbers will use Gradual Underflow.

The Format
Note the symmetry:
  1 11111111 ??????????????????????? Not A Number
  1 11111111 00000000000000000000000 Negative Infinity
  1 11111110 ??????????????????????? Negative Numbers
  1 00000000 ??????????????????????1 Negative Denormal
  1 00000000 00000000000000000000000 Negative Zero
  0 00000000 00000000000000000000000 Positive Zero
  0 00000000 ??????????????????????1 Positive Denormal
  0 00000001 ??????????????????????? Positive Numbers
  0 11111111 00000000000000000000000 Positive Infinity
  0 11111111 ??????????????????????? Not A Number

Rounding
IEEE says operations must be “exactly rounded towards even”.
Why towards even?
• To stop iterations slewing towards infinity.
• Cheap to do using hidden “guard digits”.
Why support different rounding modes?
• Used in special algorithms, e.g. decimal to binary conversion.

Rounding
How to round irrational numbers?
• Impossible to round infinitely long numbers accurately.
• Called the Table Maker’s Dilemma.
 - In order to calculate the correct rounding, you need to calculate worst case values to infinite precision.
 - E.g. sin(x) = 0.02310000000000000007…
IEEE 754 just doesn’t specify transcendental functions.
• Recent work is looking into worst case values.

Special Values
Zero
• 0.0 = 0x00000000
NaN
• Not a Number.
• NaN = sqrt(-x), 0*infinity, 0/0, etc.
• Propagates into later expressions.

Special Values
±Infinity
• Allows calculation to continue without overflow.
Why does 0/0 = NaN when ±x/0 = ±infinity?
• Because of limit values: a/b can approach many different values, e.g.
    sin(x)/x → 1  and  (1 − cos(x))/x → 0  as  x → 0
• Both look like 0/0 at the limit, so no single answer is correct.

Signed Zero
Basically, WTF?
• Guaranteed that +0 = -0, so no worries.
Used to recover the sign of an overflowed value.
• Allows 1/(1/x) = x as x → ±inf
• Allows log(0) = -inf and log(-x) = NaN
• In complex math, sqrt(1/-1) = 1/sqrt(-1) only works if you have signed zero.

Destructive Cancellation
The nastiest problem in floating point.
Caused by subtracting two very similar values
• For example, in the quadratic equation if b² ≈ 4ac
• In fixed point:
      1.10010011010010010011101
    - 1.10010011010010010011100
      0.00000000000000000000001
• Which gets renormalized with no signal that almost all digits have been lost.

Compiler “Optimizations”
Floating point does not obey the laws of algebra.
• Replace x/2 with 0.5*x – good
• Replace x/10 with 0.1*x – bad
• Replace x*y-x*z with x*(y-z) – bad if y≈z
• Replace (x+y)+z with x+(y+z) – bad
A good compiler will not alter or reorder floating point expressions.
• Compilers should flag bad constants, e.g. float x = 1.0e-40;

Decimal to Binary Conversion
In order to reconstruct the correct binary value from a decimal constant:
    Single float: 9 digits
    Double float: 17 digits
• Loose proof in the Proceedings – works by analyzing the number of representable values in sub-ranges of the number line, showing a need for between 6 and 9 decimal digits for single precision.

Approximation Error

Measuring Error
Absolute Error
• Measures the size of the deviation, but tells us nothing about its significance.
• The abs() is often ignored for graphing.
    error_abs = | f_actual − f_approx |

Measuring Error
Absolute Error is sometimes written in ULPs
• Units in the Last Place

    Approx    Actual      ULPs
    0.0312    0.0314      2
    0.0314    0.0314159   0.159

Measuring Error
Relative Error
• A measure of how important the error is.
    error_rel = | 1 − f_approx / f_actual |

Example: Smoothstep Function
Used for ease-in ease-out animations and anti-aliasing hard edges
• Flat tangents at x=0 and x=1
    f(x) = (1 − cos(πx)) / 2

Smoothstep Function

Smoothstep Approximation
A cheap polynomial approximation
• From the family of Hermite blending functions.
    f_approx(x) = 3x² − 2x³

Smoothstep Approximation
Absolute Error
Relative Error
Relative Error Detail

Incremental Algorithms

Incremental Methods
Q: What is the fastest method to calculate the sine and cosine of an angle?
A: Just two instructions.
There are however two provisos:
1. You have a previous answer to the problem.
2. You are taking equally spaced steps.

Resonant Filter
Example using 64 steps per cycle. NOTE: the new s uses the previously updated c.

    int N = 64;
    float a = sin(2*PI/N);
    float c = 1.0f;
    float s = 0.0f;
    for(int i=0; i<M; ++i) {
      output_sin = s;
      output_cos = c;
      c = c - s*a;
      s = s + c*a;
      ...
    }

Resonant Filter Graph
Resonant Filter Quarter Circle

Goertzel’s Algorithm
A more accurate algorithm
• Uses two previous samples (Second Order)
• Calculates x = sin(a+n*b) for all integer n

    float cb = 2*cos(b);
    float s2 = sin(a+b);
    float s1 = sin(a+2*b);
    float c2 = cos(a+b);
    float c1 = cos(a+2*b);
    float s,c;
    for(int i=0; i<m; ++i) {
      s = cb*s1 - s2;
      c = cb*c1 - c2;
      s2 = s1; c2 = c1;
      s1 = s;  c1 = c;
      output_sin = s;
      output_cos = c;
      ...
    }

Goertzel’s Algorithm Graph

Goertzel’s Initialization
Needs careful initialization
• You must account for a three iteration lag

    // N steps over 2PI radians
    float b = 2*PI/N;
    // subtract three steps from initial value
    float new_a = a - 3.0f * b;

Goertzel’s Algorithm Quarter Circle

Table Based Solutions

Table Based Algorithms
Traditionally the sine/cosine table was the fastest possible algorithm
• With slow memory accesses, it no longer is.
New architectures resurrect the technique
• Vector processors with closely coupled memory
• Large caches with small tables forced in-cache
Calculate point samples of the function
• Hash off the input value to find the nearest samples
• Interpolate these closest samples to get the result

Table Based Sine
Table Based Sine Error

Precalculating Gradients
Given an index i and the fractional position t between samples, the approximation is:
    sin(x) ≈ table[i] + t * (table[i+1] − table[i]) = table[i] + t * gradient[i]
Which fits nicely into a 4-vector:
    [ sine, cosine, sin-grad, cos-grad ]

How Accurate Is My Table?
The largest error occurs when two samples straddle the point of highest curvature.
• Given a stepsize of ∆x, the error E is:
    E = 1 − cos(∆x/2)
• E.g. for 16 samples over the whole circle, the error will be:
    1 − cos(π/16) = 0.0192147

How Big Should My Table Be?
Turning the problem around, how big should a table be for an accuracy of E? We just invert the expression. For E = 1%:
    1 − cos(π/N) = 0.01
    cos(π/N) = 0.99
    N = π / arccos(0.99)
    N = 22.19587…
    N = 23

How Big Should My Table Be?
We can replace the arccos() with a small angle approximation, giving us a looser bound:
    N ≥ π / sqrt(2E)
Applying this to different accuracies gives us a feel for where tables are best used.

Table Sizes
                     E          360°       45°
  1% accurate        0.01       23         3
  0.1% accurate      0.001      71         9
  0.01% accurate     0.0001     223        28
  1 degree           0.01745    17         3
  0.1 degree         0.001745   54         7
  8-bit int          2^-7       26         4
  16-bit int         2^-15      403        51
  24-bit float       10^-5      703        88
  32-bit float       10^-7      7025       880
  64-bit float       10^-17     ~infinite  8.7e+8

Range Reduction
We need to map an infinite range of input values x onto a finite working range [0..C].
For most transcendentals we use a technique called “Additive Range Reduction”
• Works like y = x mod C but without a divide.
• We just work out how many copies of C to subtract from x to get it within the target range.

Additive Range Reduction
1. We remap 0..C onto 0..1 by scaling:
     const float C = range;
     const float invC = 1.0f/C;
2. We truncate towards zero (e.g. convert to int):
     int k = (int)(x*invC);    // or (int)(x*invC+0.5f);
3. We subtract k copies of C from x:
     float y = x - (float)k*C;

High Accuracy Range Reduction
Notice that y = x-k*C has a destructive subtraction. Avoid this by encoding C in several constants.
• The first constant C1 is a rational that has M bits of C’s mantissa, e.g. PI ≈ 201/64 = 3.140625
• The second constant C2 = C - C1
• The overall effect is to encode C using more bits than machine accuracy.
     float n = (float)k;
     float y = (x - n*C1) - n*C2;

Truncation Towards Zero
Another method for truncation
• Add the infamous 1.5 * 2^24 constant to your float
• Subtract it again
• You will have lost the fractional bits of the mantissa
    A = 123.45    = 1111011.01110011001100110
    B = 1.5*2^24  = 1100000000000000000000000.
    A = A+B       = 1100000000000000001111011.
    A = A-B       = 1111011.00000000000000000
• This technique requires that you know the range of your input parameter…

Quadrant Tests
Instead of range reducing to a whole cycle, let’s use C = Pi/2 – a quarter cycle.
• The lower bits of k now hold which quadrant our angle is in.
Why is this useful?
• Because we can use the double angle formulas, where
• A is our range reduced angle, and
• B is our quadrant offset angle.
    sin(A+B) = sin(A)cos(B) + cos(A)sin(B)
    cos(A+B) = cos(A)cos(B) − sin(A)sin(B)

Double Angle Formulas
With four quadrants, the double angle formulas collapse into this useful form:
    sin(y + 0·π/2) =  sin(y)
    sin(y + 1·π/2) =  cos(y)
    sin(y + 2·π/2) = −sin(y)
    sin(y + 3·π/2) = −cos(y)

Four Segment Sine
A Sine Function
Leading to code like this:

    float table_sin(float x)
    {
      const float C = PI/2.0f;
      const float invC = 2.0f/PI;
      int k = (int)(x*invC);
      float y = x-(float)k*C;
      switch(k&3) {
        case 0:  return  sintable(y);
        case 1:  return  sintable(TABLE_SIZE-y);
        case 2:  return -sintable(y);
        default: return -sintable(TABLE_SIZE-y);
      }
      return 0;
    }

More Quadrants
Why stop at just four quadrants?
• If we have more segments we need to calculate both the sine and the cosine of y.
• This is called the reconstruction phase:
    sin(y + 3π/16) = sin(y)·cos(3π/16) + cos(y)·sin(3π/16)
• Precalculate and store these constants.
• For little extra effort, why not return both the sine AND cosine of the angle at the same time?
• This function is traditionally called sincos() in FORTRAN libraries.

Sixteen Segment Sine

    float table_sin(float x)
    {
      const float C = PI/8.0f;
      const float invC = 8.0f/PI;
      int k = (int)(x*invC);
      float y = x-(float)k*C;
      float s = sintable(y);
      float c = costable(y);
      switch(k&15) {
        case 0: return s;
        case 1: return s*0.923879533f + c*0.382683432f;
        case 2: return s*0.707106781f + c*0.707106781f;
        case 3: return s*0.382683432f + c*0.923879533f;
        case 4: return c;
        ...
      }
      return 0;
    }

Math Function Forms
Most math functions follow three phases of execution:
1. Range Reduction
2. Approximation
3. Reconstruction
This is a pattern you will see over and over
• Especially when we meet Polynomial Approximations.

Polynomial Approximation

Infinite Series
Most people learn about approximating functions from Calculus and Taylor series:
    sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9! − …
If we had infinite time and infinite storage, this would be the end of the lecture.
Taylor Series
Taylor series are generated by repeated differentiation
• More strictly, the Taylor series around x=0 is called the Maclaurin series:
    f(x) = f(0) + f′(0)x + f″(0)x²/2! + f‴(0)x³/3! + …
Usually illustrated by graphs of successive approximations fitting to a sine curve.

Taylor Approx of Sine

Properties Of Taylor Series
    sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9! − …
This series shows all the signs of convergence
• Alternating signs
• Rapidly increasing divisor
If we truncate at the 7th order, we get:
    sin(x) ≈ x − (1/6)x³ + (1/120)x⁵ − (1/5040)x⁷
           = x − 0.16667x³ + 0.0083333x⁵ − 0.00019841x⁷

Graph of Taylor Series Error
The Taylor series, however, has problems
• The problem lies in the error.
• Very accurate for small values but exponentially bad for larger values.
So we just reduce the range, right?
• This improves the maximal error.
• Bigger reconstruction cost, large errors at the range boundaries.
• The distribution of error remains the same.
How about generating the series about x=Pi/4?
• Improves the maximal error.
• Now you have twice as many coefficients.

Taylor 7th Order for –Pi/2..Pi/2
Taylor 7th Order for 0..Pi/2

Taylor 7th Order for 0..Pi/2
And now the bad news: expanded in powers of x, the re-centered series needs all eight coefficients, with a constant term and even powers appearing:
    sin(x) ≈ 0.0000023014110 + 1.000023121x − 0.00010117322x²
           − 0.1664154429x³ − 0.00038530806x⁴ + 0.008703147018x⁵
           + 0.0002107589082x⁶ + 0.0001402989645x⁷

Taylor Series Conclusion
For our purposes a Taylor series is next to useless
• Wherever you squash the error it pops back up somewhere else.
• Sine is a well behaved function; the general case is much worse.
We need a better technique.
• Make the worst case nearly as good as the best case.

Orthogonal Polynomials

Orthogonal Polynomials
Families of polynomials with interesting properties.
• Named after the mathematicians who discovered them:
• Chebyshev, Laguerre, Jacobi, Legendre, etc.
Integrating the product of two O.P.s returns zero if the two functions are different:
    ∫ w(x)·P_i(x)·P_j(x) dx = c_j if i=j, 0 otherwise
• Where w(x) is a weighting function.

Orthogonal Polynomials
Why should we care?
• If we replace P_i(x) with an arbitrary function f(x), we end up with a scalar value that states how similar f(x) is to P_j(x).
• This process is called projection and is often notated as:
    ⟨f, P_j⟩_w = ∫ f(x)·P_j(x)·w(x) dx
Orthogonal polynomials can be used to approximate functions
• Much like a Fourier Transform, they can break functions into approximating components.

Chebyshev Polynomials
Let’s take a concrete example
• The Chebyshev polynomials T_n(x):
    T0(x) = 1
    T1(x) = x
    T2(x) = 2x² − 1
    T3(x) = 4x³ − 3x
    T4(x) = 8x⁴ − 8x² + 1
    T5(x) = 16x⁵ − 20x³ + 5x
    T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x)

Chebyshev Plots
The first five Chebyshev polynomials.

Chebyshev Approximation
A worked example.
• Let’s approximate f(x) = sin(x) over [-π..π] using Chebyshev polynomials.
• First, transform the domain [a..b] into [-1..1]:
    g(x) = f( (b−a)/2 · x + (a+b)/2 ) = sin(πx)

Chebyshev Approximation
Calculate a coefficient k_n for each T_n(x):
    k_n = (1/c_n) ∫ from -1 to 1 of g(x)·T_n(x)·w(x) dx
Where the constant c_n and weighting function w(x) are:
    c_n = π if n=0, π/2 otherwise
    w(x) = 1/sqrt(1 − x²)

Chebyshev Coefficients
The resulting coefficients:
    k0 = 0.0
    k1 = 0.5692306864
    k2 = 0.0
    k3 = −0.666916672
    k4 = 0.0
    k5 = 0.104282369
    k6 = 0.0
• This is an infinite series, but we truncate it to produce an approximation to g(x).

Chebyshev Reconstruction
Reconstruct the polynomial in x
• Multiply through using the coefficients k_n:
    g(x) = k0·1 + k1·x + k2·(2x² − 1) + k3·(4x³ − 3x)
         + k4·(8x⁴ − 8x² + 1) + k5·(16x⁵ − 20x³ + 5x)

Chebyshev Result
Finally rescale the domain back to [-π..π]:
    f(x) = g( (2x − (a+b)) / (b−a) ) = g(x/π)
• Giving us the polynomial approximation:
    f(x) = 0.984020813x − 0.153301672x³ + 0.00545232216x⁵

Chebyshev Result
The approximated function f(x).
Chebyshev Absolute Error
The absolute error sin(x) − f(x).
Chebyshev Relative Error
The relative error tells a different story…

Chebyshev Approximation
Good points
• Approximates an explicit, fixed range
• Uses easy to generate polynomials
• Integration is numerically straightforward
• Orthogonal Polynomials are used as the basis for new techniques
 - E.g.
Spherical Harmonic Lighting
Bad points
• Imprecise control of error
• No clear way of deciding where to truncate the series
• Poor relative error performance

Minimax Polynomials

Pafnuty Chebyshev 1821–1894
Chebyshev described what the error for the best approximating polynomial should look like. An order-N approximation error curve should:
a. Change sign N+1 times
b. Touch the maximal error N+2 times
This function minimizes the maximal error, so it is called the Minimax Polynomial.

Evgeny Remez 1896–1975
The algorithm to generate these polynomials is the Remez Exchange Algorithm.
The algorithm requires:
• A function to be approximated
• A range of approximation
• The required order of approximation
• A weighting function

Remez Exchange Algorithm
The algorithm works by generating a set of zero crossings for the error curve, e.g.
    sin(x₀) − (a + b·x₀ + c·x₀²) = 0
    sin(x₁) − (a + b·x₁ + c·x₁²) = 0
    …
    sin(xₙ) − (a + b·xₙ + c·xₙ²) = 0
These are solved for a, b and c as a matrix inversion. The maximal error is found, one xₙ is altered, and the process repeats until the error is minimized.

Remez Exchange Algorithm
The algorithm is very sensitive to error and requires large number representations (e.g. 50 digits or more) and numerical solvers.
• The best place to find these tools is in a mathematical package like Maple or Mathematica.
• In Maple, a raw call to minimax() will use all coefficients available, so we will have to set up the calculation…

Forming The Minimax Inequality
Start by looking at the Taylor expansion of sine to get a feel for what the end result should look like:
    sin(x) ≈ x − 0.16667x³ + 0.0083333x⁵ − 0.00019841x⁷
• Every power is odd, and the first coefficient is 1.0.
Transform the problem into finding P(x²) in:
    sin(x) ≈ x + x³·P(x²)

The Minimax Inequality
Next, we form the minimax inequality expressing our desire to minimize the relative error:
    | (sin(x) − (x + x³·P(x²))) / sin(x) | ≤ error

The Minimax Inequality
Divide through by x³:
    | (sin(x)/x³ − 1/x² − P(x²)) · x³/sin(x) | ≤ error
We want only every other power, so substitute y = x²:
    | (sin(√y)/y^(3/2) − 1/y − P(y)) · y^(3/2)/sin(√y) | ≤ error

The Function to Approximate
We solve for P(y), giving us:
    P(y) = sin(√y)/y^(3/2) − 1/y
With the weighting function:
    w(y) = y^(3/2)/sin(√y)

Final Details
Before we can run this through the minimax algorithm, we need to convert P(y) to a Taylor series
• This stops the sin() function being evaluated and greatly speeds the calculation:
    P(y) = −1/6 + y/120 − y²/5040 + y³/362880 − …
We also transform our approximation range:
    x ∈ [0..π/2]  →  y = x² ∈ [0..π²/4]

Calculating The Minimax Poly
We run these values through the minimax algorithm:
    minimax(p(y), y=0..Pi^2/4, [2,0], w(y));
It returns (to 5 d.p.):
    P(y) = −0.16666 + 0.0083143y − 0.00018542y²
Finally, we resubstitute back into the original equation:
    sin(x) ≈ x − 0.16666x³ + 0.0083143x⁵ − 0.00018542x⁷

Minimax Polynomial Error
Maximal error = 1.108e-6 at x=1.5707

Magic Numbers
Comparing the results:
    taylor(x)  = x − 0.16667x³ + 0.0083333x⁵ − 0.00019841x⁷
    minimax(x) = x − 0.16666x³ + 0.0083143x⁵ − 0.00018542x⁷
An enormous increase in accuracy with a tiny change in coefficients.
• We must be very sure of our coefficients.
• We must pay attention to floating point rounding.

Side Note: PS2 ESIN Instruction
The manual entry for ESIN has a small error in it.
Improving On Perfection
We can optimize polynomials for floating point representation
• This will guarantee better accuracy for small angles, as the first term has the most effect in these cases.
• Requires a smaller range than Pi/2, so we will use 0..Pi/4.
Using a different form of the minimax inequality, we can specify the first coefficient:
    sin(x) ≈ x + kx³ + x⁵·P(x²)

Optimizing For Floating Point
Now we have to specify the first coefficient as a single-precision float:
    |k| ≈ 1/6 = 0.1666666666…
Multiply k by 2^24, round to an integer and divide by 2^24 to create an exactly representable number:
    |k| = 2796203 / 2^24 = 2796203 / 16777216
        = 0.16666668653488159179687500000

Optimizing For Floating Point
Solving the minimax inequality for P(x²) and reinserting the answer gives us:
    sin(x) ≈ x − (2796203/16777216)x³ + 0.00833282412x⁵ − 0.000195878418x⁷
Maximal error = 6.275e-9 at x=0.7855
• About twice the error of the unoptimized version but with better accuracy for small angles.

Float Optimized Sine Minimax

Optimizing for Small Angles
Most functions are bound by the accuracy of their first coefficient
• In the same way we split the range reduction constant into two parts, we can do the same for the first coefficient of a polynomial, e.g.
    ln(x) ≈ a0·x + a1·x + x²·P(x)
• Where a0 + a1 = k, the leading coefficient.

Fast Polynomial Evaluation

Polynomial Evaluation
Now we have our coefficients, we need to evaluate them efficiently.
• When presented with a polynomial of the form
    y = cx² + bx + a
• Most of us would produce code like
    y = c*x*x + b*x + a;

Large Polynomials
Faced with a larger polynomial…
    y = Cx⁷ + Bx⁵ + Ax³ + x
…you can see the need for better methods:
    y = C*x*x*x*x*x*x*x + B*x*x*x*x*x + A*x*x*x + x;

Horner Form
Most good programmers will use Horner Form to encode the polynomial:
    z = x*x;
    y = (((C*z + B)*z + A)*z + 1)*x;
This also codes well if you have multiply-add instructions available.
Horner Form is used by nearly all libraries, with coefficients stored as arrays of values.

The Problem With Horner Form
Horner Form leads to very serial computations
• Later instructions are waiting for earlier results.
• A lot of instructions are stalled in the pipeline.
• Makes no use of available SIMD instructions.
There is a better way – Estrin’s Method.
• Comes from parallel computation research
• Only documented in Knuth – nowhere online!

Estrin’s Method
Stems from the observation that by bracketing, we can produce regular sub-expressions:
    f(x) = Dx³ + Cx² + Bx + A
         = (Dx + C)x² + (Bx + A)
         = P·x² + Q
The trick is to keep a running copy of x^(2N) as “glue” between each sub-expression…

Estrin’s Method Table
    p0(x)  = A
    p1(x)  = (Bx+A)
    p2(x)  = (C)x2+(Bx+A)
    p3(x)  = (Dx+C)x2+(Bx+A)
    p4(x)  = (E)x4+((Dx+C)x2+(Bx+A))
    p5(x)  = (Fx+E)x4+((Dx+C)x2+(Bx+A))
    p6(x)  = ((G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A))
    p7(x)  = ((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A))
    p8(x)  = (I)x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
    p9(x)  = (Jx+I)x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
    p10(x) = ((K)x2+(Jx+I))x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))
    p11(x) = ((Lx+K)x2+(Jx+I))x8+(((Hx+G)x2+(Fx+E))x4+((Dx+C)x2+(Bx+A)))

Estrin’s Method Chart

Packing Estrin’s Into SIMD
Evaluating the polynomial takes O(log N) time
• Proves that you can improve on Horner Form
Each row should be evaluated in parallel
• Estrin’s assumes N processing units
• We only have four-way SIMD instructions
• We will still have serialization stalls
Many architectures don’t allow interaction between vector
elements
• We have to split calculations across SIMD registers
• We can often use redundant calculations to speed evaluation

Brute Force Searching
One tool can find the perfect evaluation order for every architecture – Brute Force
• We generate trees of expressions
• Flatten the tree into an ordered sequence of instructions
• Evaluate the sequence w.r.t. a particular architecture
Each run can generate millions of trees to test
• Run the program overnight, keeping the best result so far
• Distribute the calculation over many machines

Expression Tree
    f(x) = Ax + Bx³ + Cx⁵ + Dx⁷
Common Expression Removal
Tree After Rotations
    f(x) = x(A + Bz + z²(C + Dz))   where z = x²

Coding Using SIMD
Using SIMD instructions to code this
• The temptation is to gather all coefficients into a 4-vector
• Calculate [x, x3, x5, x7] and do one big multiply
• Finally sum the result
    ... calculate [ x, x3, x5, x7 ] ...
    N := [ A, B, C, D ] * [ x, x3, x5, x7 ]
    ... sum the elements of N ...

Coding With SIMD
This turns out to be very slow
• You are only using one parallel instruction!
• Better to think of calculating in two pairs
    ... calculate [ x, x2, x, x2 ] ...
    N := [ A, B, C, D ] * [ x, x2, x, x2 ]
    N := [ Ax, Bx2, Cx, Dx2 ] * [ -, -, x2, x2 ]
    ... sum the elements of N ...

Summing Elements
Summing the elements of a vector is very serial if you don’t have a special instruction
• Better to split the problem over two registers.
    Serial:              Split over two registers:
    A.x = A.x + A.y      A.x = A.x + A.y    B.x = B.x + B.y
    A.x = A.x + A.z      A.x = A.x + B.x
    A.x = A.x + A.w      Finished.
    Finished.
SIMD Insights
The more work you have to do, the more opportunities for parallelisation
• Easier to code higher power polynomials as O(log n) starts to be more and more effective
• Use redundant elements as “free duplicate values” to avoid moving elements around
• You end up distributing coefficients in small groups
SinCos is a good example
• More efficient to do both sine and cosine at the same time:
    Coeff1 := [A, A, E, E]
    Coeff2 := [B, B, F, F]
    Coeff3 := [C, C, G, G]
    Coeff4 := [D, D, H, H]

Higher Order Functions

More Transcendental Functions
What about the other transcendentals?
• Tangent, Arctangent, Logarithm, Exponent, Roots, Powers
Let’s take a look at how well they are approximated by polynomials
• For each power, make an approximation (e.g. minimax between [0..1])
• Calculate the maximum error within the range
• Find the negative base-2 log of the error
This value tells us how many bits of accuracy we have for each approximation.

Table Of Error In Bits
    Order:    2     3     4     5     6     7     8
    Exp(x)    6.8   10.8  15.1  19.8  24.6  29.6  34.7
    Sin(x)    7.8   12.7  16.1  21.6  25.5  31.3  35.7
    Ln(1+x)   8.2   11.1  14.0  16.8  19.6  22.3  25.0
    Atan(x)   8.7   9.8   13.2  15.5  17.2  21.2  22.3
    Tan(x)    4.8   6.9   8.9   10.9  12.9  14.9  16.9
    Asin(x)   3.4   4.0   4.4   4.7   4.9   5.1   5.3
    Sqrt(x)   3.9   4.4   4.8   5.2   5.4   5.6   5.8

Interpretation
Sine and Exponent give us about 4 bits per power
• We will only need a 7th order approximation for single floats.
Why are tangent and arcsine so badly behaved?
• We will find out shortly…
Square Root will never be approximated by a polynomial.
• Less than half a bit of accuracy per power.
• We must use Newton’s Method with good initial guesses.

Hardware Square Root
You will never outperform a hardware square root
• To avoid serializing the Newton iterations, hardware designs use a redundant signed-digit number system.
In a signed-digit system the same value has many representations, e.g. with decimal digits allowed to run from -9 to 9:
    637  = 6·10² + 3·10¹ + 7·10⁰
         = 1·10³ − 4·10² + 4·10¹ − 3·10⁰    (digit string ⟨1, -4, 4, -3⟩)
    1437 = 1·10³ + 4·10² + 3·10¹ + 7·10⁰
• Carry propagation happens only once, at the end of the calculation.

Tangent

Tangent Definitions
Tangent is defined as the ratio of sine over cosine:
    tan(x) = sin(x) / cos(x)
For infrequent use this calculation is good enough
• Remember to test for cos(x)=0 and return BIGNUM
• Has twice the inaccuracy of sine or cosine

Using Trig Identities
We must range reduce as much as possible, using trigonometric identities to help us:
    tan(−x) = −tan(x)
We record the sign and force x to positive
• We will replace the sign during reconstruction
• We have halved our input range

Range Reduction
The next identity is the fun one:
    tan(x) = 1 / tan(π/2 − x) = cot(π/2 − x)
This tells us two things:
1. The range 0..Pi/2 is repeated over and over.
2. There is a sweet spot at Pi/4 where both sides are equal.

Tangent and Cotangent
Tangent and Cotangent Detail

Minimax Approximations
Tangent between 0..Pi/4 is simple enough
• Looking at the Taylor series, it uses odd powers:
    tan(x) = x + x³/3 + 2x⁵/15 + 17x⁷/315 + …
• Apart from being strangely resistant to polynomial approximation, it is a straightforward minimax:
    tan(x) ≈ x + x³·P(x²)

Cotangent Minimax
But how to approximate Cotangent?
• It has infinite values and high curvature.
There is a trick using small angles
• The cotangent is just cosine/sine
• As x→0, sin(x)→x and cos(x)→1
• Meaning that as x→0, cot(x) → 1/x

Cotangent and 1/x
Cotangent Minus 1/x

Cotangent Minimax
We can use a different minimax form:
    cot(x) ≈ 1/x + x·P(x²)
This has an interesting property
• The reciprocal is based on nothing but x
• We can start a reciprocal instruction way up the pipeline and use the result after calculating the difference polynomial

Final Tangent Form
The final form of the tangent function:
    tan(x) =  tan(x)            0   ≤ x < π/4
           =  cot(π/2 − x)      π/4 ≤ x < π/2
           = −cot(x − π/2)      π/2 ≤ x < 3π/4
           = −tan(π − x)        3π/4 ≤ x ≤ π

Tangent Error Plot 0..2Pi

Rational Polynomials
Minimax can also produce rational polynomials
• A pair of polynomials that are divided by each other to produce the final result:
    tan(x) ≈ x·P(x²) / Q(x²)
Rational polynomials can be very accurate for low powers
• If you can take the cost of a divide or reciprocal at the end of the approximation process
• Popular for high accuracy functions

Rational Tangent
Rational approximation to tangent
• Gives 1.3e-8 accuracy using fewer operations:
    tan(x) ≈ (1.15070768x − 0.110238971x³) /
             (1.15070770 − 0.49380897x² + 0.0111809593x⁴)
• Can improve this by dividing through by the denominator constant, saving us a load-constant instruction:
    tan(x) ≈ (0.999999986x − 0.0958010197x³) /
             (1 − 0.429135022x² + 0.00971659383x⁴)

Arctangent

Inverse Trig Functions
Arctangent
• The inverse of a tangent
• Maps [−inf..+inf] to [−π/2..π/2]
Arcsine and Arccosine
• Asin maps [-1..+1] to [−π/2..π/2]
• Acos maps [-1..+1] to [π..0]
Why only Arctangent?
Atan, Acos and Asin are closely related:
    arccos(x) = arctan( sqrt(1 − x²) / x )
    arcsin(x) = arctan( x / sqrt(1 − x²) )
    arccos(x) = π/2 − arcsin(x)

Range Reduction
Arctan has these interesting properties:
    arctan(−x) = −arctan(x)
    arctan(1/x) = π/2 − arctan(x)
We only need to define arctan between 0..1
• That is one huge range reduction

Arctan Taylor Series
The Taylor series for Arctangent looks strangely familiar, yet different…
• It converges very, very slowly
• Doesn’t even converge for |x|>1
    arctan(x) = x − x³/3 + x⁵/5 − x⁷/7 + …
• For 1e-8 accuracy you need to evaluate 7071 terms!
• We need polynomial approximations to make the function a viable calculation

Minimax Inequality
The minimax form is very straightforward:
    arctan(x) ≈ x + x³·P(x²)
But gives us very poor accuracy
• A 13th power minimax approximation over 0..1 gives a maximal error of 0.9!
• We need to reduce the range a lot more

Double Angle Formulation
Arctangent is the inverse of a tangent
• So let’s assume x = tan(a+b) so that arctan(x) = a+b.
What are a and b?
    tan(a+b) = (tan a + tan b) / (1 − tan a · tan b)
Let z = tan a and k = tan b, so a = arctan(z) and b = arctan(k):
    x = (z + k) / (1 − zk)   →   z = (x − k) / (1 + kx)
    arctan(x) = arctan(k) + arctan(z) = b + arctan(z)

The X-Z Mapping
    arctan(x) = b + arctan( (x − k) / (1 + kx) )
How do we use this?
• It remaps input x into the smaller z-range
• We can subdivide 0..1 into, say, 2 segments, each with their own b offset and k parameter.
What value of k to choose?
• If z(0) = -z(1) we have equal distribution either side of the graph, which happens when k = √2 - 1
• Or divide by equal input values – ¼, ½, ¾

Some X-Z Mappings

Designing Your Function
Now you have a set of tools to play with:
• Range reduction to 0..1
• Complement if x > 1.0
• Choice of X-Z map
  - Equal-Angle maps are most accurate
  - Equal-Input maps are faster to range reduce
• Choice of minimax order

float arctan(float x)
{
    float k, b, sign = +1.0f;
    bool compl = false;
    // remove sign
    if (x < 0.0f) { sign = -1.0f; x = -x; }
    // complement if x > 1.0
    if (x > 1.0f) { compl = true; x = 1.0f / x; }
    // x-z range reduce
    if (x < 0.5f) { k = 0.25f; b = 0.24497866f; }  // b = arctan(0.25), precomputed
    else          { k = 0.75f; b = 0.64350111f; }  // b = arctan(0.75), precomputed
    x = (x - k) / (1.0f + k * x);
    // poly approx
    float y = b + arctan_poly(x);
    // finish complement
    if (compl) y = PI / 2 - y;
    // reinstate sign
    return y * sign;
}

Using Arctan for Arccosine
The ideal example for using IEEE754 flags
• To convert from arctan to arccosine we need to cope as x tends towards zero

  arccos(x) = arctan(√(1-x²) / x)

• Arccos(0.0) = Arctan(+inf) = π/2
• We must remember to reset the divide-by-zero flag before returning from arccos(0.0)

Exponent

Exponent: The Anti-Log
All exponents can be expressed as powers of the Euler constant e.

  10 = e^ln(10)      10^x = e^(x·ln(10))
  2  = e^ln(2)       2^x  = e^(x·ln(2))

Range Reduction
Exponent also uses additive range reduction, much like sine and cosine

  x = N·C + r   where C = ln(2)/2

This means we break x into two parts
• 2 raised to some exponent plus a fractional residual.
• Which floating point numbers already do!

  exp(x) = 2^(N/2) · e^r

Polynomial Approximation
Looking at the Taylor series

  exp(x) = 1 + x + x²/2 + x³/6 + x⁴/24 + …

We can subtract the 1 and minimax it over the range [0..ln(2)/2] with

  exp(x) - 1 ≈ x + x²·P(x)

The Exponent Function
Two methods of coding
1. Breaking your input into exponent and mantissa and reassembling at the end
2. Using tables for 2^(N/2) to keep all work in float registers, but only supporting a small range of input

float fast_exp(float x)
{
    const float LN2 = 0.69314718f;
    const float C   = 2.0f / LN2;
    const float iC  = LN2 / 2.0f;
    // range reduction
    int N = (int)(x * C);
    float y = x - N * iC;
    // approximation
    y = exp_poly(y) + 1.0f;
    // reconstruction
    return power2[N >> 1] * y;
}

Semi-Table Based Reconstruction
New research has better methods for reconstructing the power of two
• There is a family of algorithms by Tang where K > 1 in

  exp(x) = 2^(N/2^K) · e^r

• To reconstruct, say, 2^(N/128) we split N into

  N = 128·M + 16·J + K

  2^(N/128) = 2^M · 2^(J/8) · 2^(K/128)

• Where the last two terms come from 8- and 16-entry tables

Fast Float Exponent
How about just constructing a float directly from an integer power?
• Remember how the IEEE exponent works.

  01011100000100000000000000000000

  float = sign · 2^(exponent - 127) · 1.mantissa

• Note also that:

  e^x = 2^(x/ln(2)) = 2^(k·x)   where k = 1/ln(2)

Fast Float Exponent
Take an integer in [-126..127], e.g. 12

    00000000000000000000000000001100

Integer multiply it by the integer 2²³/ln(2)

  × 00000000101110001010101000111011
  = 00001000101001111111101011000101

Add in the bias offset integer 127·2²³

  + 00111111100000000000000000000000
  = 01001000001001111111101011000110
  = 172011.09375

Fast Float Exponent
Plotting the function 1/(1-exp(-x)) shows an offset:

Fast Exponent Fixed
Subtracting 0.7·2²³ = 5872025 removes the offset

Increasing Resolution
The problem with single precision floats
• There are only 255 exponent values we can use, hence the staircase effect.
Solution:
• Allow the input power to have some fractional bits.
• Take the float, multiply it by 2³ = 8 and floor it, e.g. 12.125 → 97

    00000000000000000000000001100001

• Multiply by 2^(23-3)/ln(2), add the IEEE bias 127·2²³
• Pick how many fractional bits you want, but the bias changes each time - e.g.
if bits = 6 then bias = 3·2¹⁷

High Resolution Fast Exponent
High Res Fast Exponent Detail
High Res Relative Error

Logarithm

Logarithms
Ln(x) returns exponents
• Defined in the range (0..+infinity)
There are two main logarithm functions
• log(x) refers to the logarithm base 10
• ln(x) refers to the Natural Logarithm base e
They are freely convertible

  log(x) = ln(x) / ln(10)

Taylor Series
The Taylor Series comes in two forms
• First is closely related to the arctangent series

  ln(x) = 2·(z + z³/3 + z⁵/5 + z⁷/7 + …)   where z = (x-1)/(x+1)

• Second is simpler but converges more slowly

  ln(x) = z - z²/2 + z³/3 - z⁴/4 + z⁵/5 - z⁶/6 + …   where z = x - 1

Range Reduction
Range reduction for ln uses multiplicative range reduction
• Given that we store floating point numbers as exponent and mantissa

  x = sign · 2ⁿ · f   where 1.0 ≤ f < 2.0

• we find that the logarithm of such a number is

  ln(x) = ln(2ⁿ · f) = ln(2ⁿ) + ln(f) = n·ln(2) + ln(f)

Dissecting Floats
How do we split a float into exponent and mantissa?
• On a PC it's easy, you just mask and shift bits.
• Remember to subtract 127 from the exponent.
On the PS2 Vector Units, we have a problem.
• Only 16-bit integer registers – not enough to hold even the entire mantissa.
• No way to shift the exponent into lower bits
• Float-to-int conversion loses the exponent
But there is a trick…

Extracting The Exponent
What happens if you convert a float to a float?
• Pretend the contents of a floating-point register is an integer value and convert this "integer" into a float.
• For example, take the value 2⁵⁷ · 1.125
• sign = 0, exponent = 57 + 127 = 184, mantissa = 1.125

    01011100000100000000000000000000

• This bit pattern represents the integer 1544552448

Extracting The Exponent
• Multiply 1544552448 by 1/2²³ to rescale the decimal point

  1544552448 · (1/2²³) = 184.125

• Subtract the exponent bias of 127

  184.125 - 127 = 57.125

• Convert to integer to obtain the true exponent

A Mathematical Oddity: Bitlog
A real mathematical oddity
• The integer log2 of a 16 bit integer
• Given an N-bit value, locate the leftmost nonzero bit.
• b = the bitwise position of this bit, where 0 = LSB.
• n = the NEXT three bits (ignoring the highest 1)

  bitlog(x) = 8·(b-1) + n

Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
For example take the number 88

  88 = 1011000₂
  b = 6th bit
  n = 011₂ = 3
  bitlog(88) = 8·(6-1) + 3 = 43

• (43/8)+1 = 6.375
• log2(88) = 6.4594…
This relationship holds down to bitlog(8)

Bitlog Graph

X To The Power N

Integer Powers
Raising to an integer power
• Multiply together the powers of x where N has a bit set

  N = 23 = 10111₂
  x²³ = x¹⁶ · x⁴ · x² · x

If N<0 then calculate 1/result
If x<0 then record the sign and reinstate it later
• Note that even powers of a -ve number lose the sign

Integer Power Function
As simple as it sounds
• Loop through the bits of N
• Keep squaring x each loop
• Careful noting and reinstating of signs

float powi(float x, int N)
{
    float y, w;
    bool ns = false, xs = false;
    // record signs
    if (N < 0) ns = true, N = -N;
    if (x < 0.0f) xs = true, x = -x;
    // setup initial powers
    if (N & 1) y = x;
    else y = 1.0f, xs = false;   // even powers lose the sign
    w = x;
    N >>= 1;
    while (N) {
        w = w * w;
        if (N & 1) y *= w;
        N >>= 1;
    }
    // reinstate signs
    if (xs) y = -y;
    if (ns) y = 1.0f / y;
    return y;
}

XN Approximation
From Graphics Gems 4

  pow(x, n) ≈ x / (n - n·x + x)   for 0 ≤ x ≤ 1

Raising to a Power
The canonical method is to do the calculation in two parts – log and exponent

  pow(x, N) = e^(N·ln(x)) = 2^(N·log2(x))

• Range reduction only needs to be done once.
• MUST use extended accuracy in internal workings.
- Log2(x) must have 8+ extra bits of accuracy
• The most expensive operation in the toolbox.
• Also the most difficult to write.
- e.g. calculating 5.6^156 naively gives an error of 1052 ULPS!

Floating Point Power Hack
The log and power can be faked
• If we force a float through the int-to-float conversion, the result is approximately logarithmic.

  float_as_int(x) ≈ A·log2(x) + B

• And conversely the inverse operation approximates the power

  int_as_float(y) ≈ 2^((y - B)/A)

Credited to Matthew Jones, Infogrames

Expanding Out
Combine these two operations to approximate the power function

  N·(A·log2(x) + B) = A·log2(x^N) + N·B
                    = (A·log2(x^N) + B) + (N-1)·B

• So we have an extra term, (N-1)·B, to subtract before we can use the approximate power.

Floating Point Power Hack
So, the algorithm goes like this
• First, convert our float into logarithmic space
    float a = float_as_int(x);
• We multiply this value by the power we want to raise it to.
    a = a * power;
• Then we subtract the magic constant
    magic = float_as_int(1.0f) * (power - 1.0f);
    a = a - magic;
• And finally reconvert back to a float
    a = int_as_float(a);

Scaling the Result
We can scale the resulting value
• by reformulating the magic number
    magic = float_as_int(1.0f) * (power - 1.0f);
    a = a - magic;
• Can be rewritten to give us 255 · x^N for no additional work
    magic = float_as_int(255.0f) - float_as_int(1.0f) * power;
    a = a + magic;

Cos16(x) Graph
Cos16(x) Absolute Error
Cos16(x) Relative Error

Another Approach
An interesting effect
• When raising pixel intensities in the range [0..255] to a power, for most of the range the result is less than one intensity level
• As noted by Beaudoin & Guardado in "Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks"

Low Power Approximation
Higher powers look a bit like lower powers scaled and offset
• Example: Using x⁴ to approximate x¹⁶

Low Power Approximation
Our function is zero until it is greater than one part in 255

Low Power Approximation
The new power approximation is:

  x^n ≈ (max(a·x + b, 0))^m

• Where m << n and a and b are:

  a = 1 / (1 - 256^(-1/n))
  b = 1 - a
Low Power Approx Error

Improved Error
Using a binary search, improved values of a and b can minimize the maximal error

Low Power Approx Problem…
There is however a problem
• The paper doesn't seem to have obvious "best" values for a and b.
  - The results are just presented as a table of values with errors. The choice is already made for you.
• Closer powers don't seem to reduce the error
  - So there is no obvious way to control error, relative or absolute.
What is going on?
• Let's plot the maximal error for every power approximating every other as a surface…

Low Power Approx Problem…
x⁴ approx of x¹⁶
The original example was a global minimum!

Conclusions

Conclusions
Get to know IEEE754
• It's being updated as we speak
• If your implementation differs, you will understand its weaknesses
Always test both absolute and relative error
• Functions may look accurate on a graph
• Relative error will expose the tough weaknesses

Conclusions
Incremental Algorithms
• Used for generating sequences of trig operations
• Some of the fastest techniques around
Table Based Algorithms
• Don't discount them just yet
• Tables are usually far too big – check the size
• Use range reduction and reconstruction on tables

Conclusions
Polynomial Approximations
• Some of the most flexible techniques around
• Range Reduce, Approximate, Reconstruct
• Choose between
  - Aggressive or simple range reduction
  - Polynomial order
  - Parallel polynomial evaluation techniques
  - Expensive or simple reconstruction
  - Semi table-based reconstruction
• Use Maple or Mathematica to your advantage

Conclusions
New techniques
• Floating Point Hacks
  - An exciting new set of techniques
• Gaining support from hardware designers
  - IA64 architecture uses 8-bit mantissa lookup for 1/x
• Combine with semi table-based reconstruction
  - Accuracy over a limited range
  - Research to be done on accurate approximations

References
Many of the original books are out of print
- Researchers work with piles of photocopies!
• Hart, J.F., Computer Approximations, John Wiley & Sons, 1968
• Cody, W.J. & Waite, W., Software Manual for the Elementary Functions, Prentice Hall, 1980
• Goldberg, David, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, March 1991
• Muller, J.M., Elementary Functions: Algorithms and Implementation, Birkhäuser, 1997
• Tang, Ping Tak Peter, Table Lookup Algorithms for Elementary Functions and Their Error Analysis, Proceedings of the 10th Symposium on Computer Arithmetic, 1991

Website
Notes available from: http://research.scea.com/
robin_green@playstation.sony.com

Questions?