# P573 Scientific Computing Lecture 3 Floating Point Arithmetic

Document Sample

```					                                                                                        Outline

P573                                                         ° A little history
Scientific Computing                                                  ° IEEE floating point formats
° Error analysis
Lecture 3:
° Exception handling
Floating Point Arithmetic                                                   • Using exception handling to go faster

° How to get extra precision cheaply
Peter Gottschling                                      ° Dangers of Parallel and Heterogeneous Computing
pgottsch@cs.indiana.edu
www.osl.iu.edu/~pgottsch/courses/p573-06

Based on slides from UC Berkeley
www.cs.berkeley.edu/~demmel/cs267_Spr05
02/28/2005                          CS267 Lecture 11                                  02/28/2005                             CS267 Lecture 11

A little history                                                                      Defining Floating Point Arithmetic
° Von Neumann and Goldstine - 1947                                                    ° Representable numbers
• “Can’t expect to solve most big [n>15] linear systems without carrying              • Scientific notation:      +/- d.d…d   x   r exp
many decimal digits [d>8], otherwise the computed answer would be                   • sign bit +/-
completely inaccurate.” - WRONG!
• radix r (usually 2 or 10, sometimes 16)
° Turing - 1949                                                                           • significand d.d…d (how many base-r digits d?)
• “Carrying d digits is equivalent to changing the input data in the d-th place       • exponent exp (range?)
and then solving Ax=b. So if A is only known to d digits, the answer is as          • others?
accurate as the data deserves.”
• Backward Error Analysis                                                         ° Operations:
• arithmetic: +,-,x,/,...
° Rediscovered in 1961 by Wilkinson and publicized (Turing Award 1970)                          -   how to round result to fit in format
° Starting in the 1960s- many papers doing backward error analysis of                     • comparison (<, =, >)
various algorithms                                                                      • conversion between different formats
-   short to long FP numbers, FP to integer
° Many years where each machine did FP arithmetic slightly differently
• exception handling
• Both rounding and exception handling differed
-   what to do for 0/0, 2*largest_number, etc.
• Hard to write portable and reliable software
• binary/decimal conversion
• Motivated search for industry-wide standard, beginning late 1970s
-   for I/O, when radix not 10
• First implementation: Intel 8087

° Turing Award 1989 to W. Kahan for design of the IEEE Floating Point                 ° Language/library support for these operations
Standards 754 (binary) and 854 (decimal)
• Nearly universally implemented in general purpose machines
02/28/2005                          CS267 Lecture 11                                  02/28/2005                             CS267 Lecture 11

1
IEEE Floating Point Arithmetic Standard 754 - Normalized Numbers                                                    Rules for performing arithmetic
° Normalized Nonzero Representable Numbers: +- 1.d…d x 2exp                                                      ° As simple as possible:
• Macheps = Machine epsilon = 2-#significand bits = relative error in each operation                            • Take the exact value, and round it to the nearest floating point
• OV = overflow threshold = largest number                                                                        number (correct rounding)
• UN = underflow threshold = smallest number                                                                    • Break ties by rounding to nearest floating point number whose
bottom bit is zero (rounding to nearest even)
Format # bits #significand bits                 macheps #exponent bits exponent range
---------- -------- -----------------------     ------------ -------------------- ----------------------
• Other rounding options too (up, down, towards 0)
Single       32          23+1                 2-24 (~10 -7)         8             2-126 - 2127 (~10+-38)
Double        64          52+1                2-53 (~10-16)       11              2-1022 - 21023 (~10+-308)       ° Don’t need exact value to do this!
Double >=80           >=64                  <=2-64(~10-19)      >=15             2-16382 - 216383 (~10+-4932)          • Early implementors worried it might be too expensive, but it isn’t
Extended (80 bits on Intel machines)
° Applies to
° +- Zero: +-, significand and exponent all zero
• +,-,*,/
• Why bother with -0 later
• sqrt
• conversion between formats
• rem(a,b) = remainder of a after dividing by b
-   a = q*b + rem, q = floor(a/b)
-   cos(x) = cos(rem(x,2*pi))       for |x| >= 2*pi
-   cos(x) is exactly periodic, with period = rounded(2*pi)
02/28/2005                               CS267 Lecture 11                                                        02/28/2005                          CS267 Lecture 11

Error Analysis                                                                                                   Example: polynomial evaluation using Horner’s rule
n
° Basic error formula                                                                                          ° Horner’s rule to evaluate p = Σ ck * xk
k=0
• fl(a op b) = (a op b)*(1 + d) where                                                                        • p = c n, for k=n-1 downto 0, p = x*p + ck
-   op one of +,-,*,/                                                                            ° Numerically Stable:
-   |d| <= ε = machine epsilon = macheps                                                             • Get ^ = Σ ^ k * x k where ^ k = ck (1+ε k ) and | εk | ≤ (n+1) ε
p     c               c
-   assuming no overflow, underflow, or divide by zero
° Apply to (x-2)9 = x9 - 18*x8 + … - 512
• fl(x1+x2+x3+x4 ) = {[(x1+x2)*(1+d1) + x3 ]*(1+d2) + x4}*(1+d3)
= x1 *(1+d1)*(1+d2)*(1+d3 ) + x2*(1+d1)*(1+d2 )*(1+d3 )
+ x3*(1+d2)*(1+d3 ) + x4*(1+d3)
= x1 *(1+e1 ) + x2*(1+e2 ) + x3*(1+e3) + x4 *(1+e4)
where each |ei| <~ 3*macheps
• get exact sum of slightly changed summands xi*(1+ei)
• Backward Error Analysis - algorithm called numerically stable if it
gives the exact result for slightly changed inputs
• Numerical Stability is an algorithm design goal

02/28/2005                               CS267 Lecture 11                                                        02/28/2005                          CS267 Lecture 11

2
Example: polynomial evaluation (continued)                                           Cray Arithmetic
° (x-2)9 = x9 - 18*x8 + … - 512                                                    ° Historically very important
• Crays among the fastest machines
° We can compute error bounds using                                                    • Other fast machines emulated it (Fujitsu, Hitachi, NEC)

• fl(a op b)=(a op b)*(1+d)                                                   ° Sloppy rounding
• fl(a + b) not necessarily (a + b)(1+d) but instead
fl(a + b) = a*(1+da) + b*(1+db) where |da|,|db| <= macheps
• Means that fl(a+b) could be either 0 when should be nonzero, or twice too large when a+b
“cancels”
• Sloppy division too

° Some impacts:
• arccos(x/sqrt(x2 + y2)) can yield exception, because x/sqrt(x2 + y2) >1
-     Not with IEEE arithmetic
• Fastest (at one time) eigenvalue algorithm in LAPACK fails
-   Need Π k (ak - bk) accurately
-     Need to preprocess by setting each ak = 2*ak - ak (kills bottom bit)

° Latest Crays do IEEE arithmetic
° More cautionary tales:
• www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf

02/28/2005                         CS267 Lecture 11                                  02/28/2005                              CS267 Lecture 11

General approach to error analysis
° Suppose we want to evaluate x = f(z)
• Ex: z = (A,b), x = solution of linear system Ax=b

° Suppose we use a backward stable algorithm alg(z)
• Ex: Gaussian elimination with pivoting
What happens when the “exact value” is
° Then alg(z) = f(z + e) where backward error e is “small”                                            not a real number, or is too small or too
° Error bound from Taylor expansion (scalar case):                                                         large to represent accurately?
• alg(z) = f(z+e) ≈ f(z) + f’(z)*e
• Absolute error = |alg(z) – f(z)| ≈ |f’(z)|* |e|
• Relative error = |alg(z) – f(z)| / |f(z)| ≈ |f’(z)*z/f(z)|* |e/z|
You get an “exception”
• Condition number = |f’(z)*z/f(z)|
• Relative error (of output) = condition number * relative error of input |e/z|

° Applies to multivariate case too
• Ex: Gaussian elimination with pivoting for solving Ax=b

02/28/2005                         CS267 Lecture 11                                  02/28/2005                              CS267 Lecture 11

3
Exception Handling                                                        IEEE Floating Point Arithmetic Standard 754 - “Denorms”

° What happens when the “exact value” is not a real                       ° Denormalized Numbers: +-0.d…d x 2min_exp
number, or too small or too large to represent                              • sign bit, nonzero significand, minimum exponent
accurately?                                                                 • Fills in gap between UN and 0

° 5 Exceptions:                                                           ° Underflow Exception
• Overflow - exact result > OV, too large to represent                    • occurs when exact nonzero result is less than underflow threshold UN
• Ex: UN/3
• Underflow - exact result nonzero and < UN, too small to represent
• return a denorm, or zero
• Divide-by-zero - nonzero/0
• Invalid - 0/0, sqrt(-1), …                                          ° Why bother?
• Necessary so that following code never divides by zero
• Inexact - you made a rounding error (very common!)
if (a != b) then x = a/(a-b)
° Possible responses
• Stop with error message (unfriendly, not default)
• Keep computing (default, but how?)

02/28/2005                           CS267 Lecture 11                     02/28/2005                                 CS267 Lecture 11

IEEE Floating Point Arithmetic Standard 754 - +- Infinity                IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

° +- Infinity: Sign bit, zero significand, maximum exponent               ° NAN: Sign bit, nonzero significand, maximum exponent
° Overflow Exception                                                      ° Invalid Exception
• occurs when exact finite result too large to represent accurately       • occurs when exact result not a well-defined real number
• Ex: 2*OV                                                                • 0/0
• return +- infinity                                                      • sqrt(-1)
• infinity-infinity, infinity/infinity, 0*infinity
° Divide by zero Exception
• NAN + 3
• return +- infinity = 1/+-0
• NAN > 3?
• sign of zero important! Example later…
• Return a NAN in all these cases
° Also return +- infinity for
• 3+infinity, 2*infinity, infinity*infinity
° Two kinds of NANs
• Quiet - propagates without raising an exception
• Result is exact, not an exception!
-     good for indicating missing data
-     Ex: max(3,NAN) = 3
• Signaling - generate an exception when touched
- good for detecting uninitialized data

02/28/2005                           CS267 Lecture 11                     02/28/2005                                 CS267 Lecture 11

4
Exception Handling User Interface                                                          Exploiting Exception Handling to Design Faster Algorithms

° Each of the 5 exceptions has the following features                                     ° Paradigm:
• A sticky flag, which is set as soon as an exception occurs                               1) Try fast, but possibly “risky” algorithm
2) Quickly test for accuracy of answer (use exception handling)
• The sticky flag can be reset and read by the user                                        3) In rare case of inaccuracy, rerun using slower “low risk” algorithm
reset overflow_flag and invalid_flag
perform a computation                                                    ° Quick with high probability
test overflow_flag and invalid_flag to see if any exception occurred          • Assumes exception handling done quickly! (Will manufacturers do this?)

• An exception flag, which indicate whether a trap should occur                       ° Ex 1: Solving triangular system Tx=b
-    Not trapping is the default                                                  • Part of BLAS2 - highly optimized, but risky
-    Instead, continue computing returning a NAN, infinity or denorm              • If T “nearly singular”, as when computing condition numbers, expect
-    On a trap, there should be a user-writable exception handler with              very large x, so scale inside inner loop: slow but low risk
access to the parameters of the exceptional operation                        • Use paradigm with sticky flags to detect nearly singular T
-    Trapping or “precise interrupts” like this are rarely implemented for        • Up to 9x faster on Dec Alpha
performance reasons.
° Ex 2: Computing eigenvalues (part of next LAPACK release)
For k= 1 to n                                        For k= 1 to n
d = ak - s - bk 2 /d                vs.              d = ak - s - bk 2 /d … ok to divide by 0
if |d| < tol, d = -tol                               count += signbit(d)
if d < 0, count++
° Demmel/Li (www.cs.berkeley.edu/~xiaoye)
02/28/2005                             CS267 Lecture 11                                    02/28/2005                               CS267 Lecture 11

Summary of Values Representable in IEEE FP                                                 Simulating extra precision
° What if 64 or 80 bits is not enough?
° +- Zero                                                 +-   0…0      0……………………0             • Very large problems on very large machines may need more
• Sometimes only known way to get right answer (mesh generation)
° Normalized nonzero numbers                              +- Not 0 or
all 1s
anything               • Sometimes you can trade communication for extra precision

° Denormalized numbers                                    +-   0…0      nonzero           ° Can simulate high precision efficiently just using floating point
° Each extended precision number s is represented by an array
° +-Infinity                                              +-   1….1 0……………………0              (s1,s2,…,s n) where
• each sk is a FP number
° NANs                                                    +-   1….1     nonzero                • s = s1 + s2 + … + sn in exact arithmetic
• Signaling and quiet                                                                      • s1 >> s2 >> … >> sn

• Many systems have only quiet                                                        ° Ex: Computing (s1,s2) = a + b
if |a|<|b|, swap them
s1 = a+b            … roundoff may occur
s2 = (a - s1) + b   … no roundoff!
• s1 contains leading bits of a+b, s2 contains trailing bits

° Systematic algorithms for higher precision
• Priest / Shewchuk (www.cs.berkeley.edu/~jrs)
• Bailey / Li / Hida (crd.lbl.gov/~dhbailey/mpdist/index.html)
• Demmel / Li et al (crd.lbl.gov/~xiaoye/XBLAS)
• Demmel / Hida / Riedy / Li (www.cs.berkeley.edu/~yozo)
02/28/2005                             CS267 Lecture 11                                    02/28/2005                           CS267 Lecture 11

5
Further References on Floating Point Arithmetic
° Notes for Prof. Kahan’s CS267 lecture from 1996
• www.cs.berkeley.edu/~wkahan/ieee754status/cs267fp.ps
• Note for Kahan 1996 cs267 Lecture

° Prof. Kahan’s “Lecture Notes on IEEE 754”
• www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps

° Prof. Kahan’s “The Baleful Effects of Computer
Benchmarks on Applied Math, Physics and
Chemistry
• www.cs.berkeley.edu/~wkahan/ieee754status/baleful.ps

° Notes for Demmel’s CS267 lecture from 1995
• www.cs.berkeley.edu/~demmel/cs267/lecture21/lecture21.html

02/28/2005                  CS267 Lecture 11

6

```
DOCUMENT INFO
Shared By:
Categories:
Stats:
 views: 23 posted: 1/19/2010 language: English pages: 6