Non-Analytic Derivatives — Finite Differencing
Finite Differencing — Sparsity and Symmetry
#include
Numerical Optimization
Lecture Notes #16
Calculating Derivatives — Finite Differencing
Peter Blomgren,
blomgren.peter@gmail.com
Department of Mathematics and Statistics
Dynamical Systems Group
Computational Sciences Research Center
San Diego State University
San Diego, CA 92182-7720
http://terminus.sdsu.edu/
Fall 2011
Peter Blomgren, blomgren.peter@gmail.com Calculating Derivatives — Finite Differencing — (1/19)
Outline
1 Non-Analytic Derivatives — Finite Differencing
Taylor’s Theorem ⇒ Finite Differencing
Finite Difference Gradient
Finite Difference Hessian
2 Finite Differencing — Sparsity and Symmetry
3 #include
Project Milestone #2
Derivatives Needed!!!
As we have seen (and will see), algorithms for nonlinear
optimization (and nonlinear equations) require knowledge of
derivatives:
Nonlinear Optimization: Gradient (vector, 1st order); Hessian (matrix, 2nd order).
Nonlinear Equations: Jacobian (matrix, 1st order).
Often it is straightforward to provide code which computes those
derivatives, but in some cases analytic expressions for the
derivatives are not available and/or not practical to evaluate.
In those cases we need some other way to compute or
approximate the derivatives.
Finite Differences — The Return of Taylor’s Theorem
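We can get an approximation of the gradient $\nabla f(\bar{x})$ by evaluating
the objective $f$ at $(n + 1)$ points, using the forward difference
formula
\[
\frac{\partial f(\bar{x})}{\partial x_i} \approx \frac{f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x})}{\epsilon}, \quad i = 1, 2, \dots, n,
\]
where $\bar{e}_i$ is the $i$th unit vector, and $\epsilon > 0$ is small.
If $f$ is twice continuously differentiable, then by Taylor's Theorem
\[
f(\bar{x} + \bar{p}) = f(\bar{x}) + \nabla f(\bar{x})^T \bar{p} + \frac{1}{2}\bar{p}^T \nabla^2 f(\bar{x} + t\bar{p})\,\bar{p}, \quad t \in (0, 1),
\]
with $\bar{p} = \epsilon\bar{e}_i$, i.e.
\[
f(\bar{x} + \epsilon\bar{e}_i) = f(\bar{x}) + \epsilon\nabla f(\bar{x})^T \bar{e}_i + \frac{1}{2}\epsilon^2\, \bar{e}_i^T \nabla^2 f(\bar{x} + t\epsilon\bar{e}_i)\,\bar{e}_i, \quad i = 1, 2, \dots, n.
\]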
Forward Differences Building the Gradient
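With a bit of re-arrangement we see
\[
\underbrace{\nabla f(\bar{x})^T \bar{e}_i}_{\partial f(\bar{x})/\partial x_i}
= \underbrace{\frac{f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x})}{\epsilon}}_{\text{Finite Difference Approximation}}
- \underbrace{\frac{1}{2}\epsilon\, \bar{e}_i^T \nabla^2 f(\bar{x} + t\epsilon\bar{e}_i)\,\bar{e}_i}_{\text{Approximation Error}}
\]
If the Hessian $\nabla^2 f(\bar{x})$ is bounded, i.e. $\|\nabla^2 f(\bar{x})\| \le L_c$, then we
have
\[
\frac{\partial f(\bar{x})}{\partial x_i} \approx \frac{f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x})}{\epsilon},
\]
where the approximation error is bounded by
\[
\frac{\epsilon L_c}{2}.
\]
Since the error is proportional to $\epsilon$, this is a first-order
approximation.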
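The forward difference formula translates almost line by line into code. The following is a minimal sketch, not a reference implementation: the names `objective_fn`, `fd_grad_forward` and `example_f` are hypothetical, and the objective is assumed to have the signature `f(n, x)`.

```c
#include <assert.h>

/* Assumed objective signature for this sketch: f(n, x) returns a double. */
typedef double (*objective_fn)(int n, const double *x);

/* Forward-difference gradient: g[i] ~ (f(x + eps*e_i) - f(x)) / eps.
 * Uses n + 1 evaluations of f; x is perturbed in place and restored. */
void fd_grad_forward(objective_fn f, int n, double *x, double eps, double *g)
{
    assert(n > 0 && eps > 0.0);
    const double f0 = f(n, x);      /* base value f(x), shared by all i */
    for (int i = 0; i < n; i++) {
        const double xi = x[i];
        x[i] = xi + eps;            /* step along the i-th unit vector */
        g[i] = (f(n, x) - f0) / eps;
        x[i] = xi;                  /* restore the coordinate */
    }
}

/* Illustrative objective (not from the notes): f(x) = x0^2 + 3*x0*x1,
 * with exact gradient (2*x0 + 3*x1, 3*x0). */
double example_f(int n, const double *x)
{
    (void)n;
    return x[0] * x[0] + 3.0 * x[0] * x[1];
}
```

Calling `fd_grad_forward(example_f, 2, x, eps, g)` at $\bar{x} = (1, 2)$ should return values close to the exact gradient $(8, 3)$, with an error of order $\epsilon$ as long as $\epsilon$ is not too small.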
Selecting ε Machine Epsilon / Unit Roundoff 1 of 3
The error bound suggests that the smaller the $\epsilon$, the smaller the error. How small can we
set $\epsilon$ in finite precision?
Let $\epsilon_{mach}$ denote the value of machine epsilon (a.k.a. unit roundoff); it
is essentially the largest value for which
\[
((1.0 + \epsilon_{mach}) - 1.0) = 0 \quad \text{in finite precision.}
\]
$\epsilon_{mach} \approx 10^{-16}$ in double-precision arithmetic (IEEE 64-bit floating
point: "C" double, and Matlab internals on typical Intel-based
systems).
$\epsilon_{mach}$ is a measure of how well (or badly) we can represent any
number in finite precision, and by extension a measure of the (best
case) quality of every computation.
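Machine epsilon can be probed directly. The sketch below (the function name `machine_epsilon` is hypothetical) halves a candidate until adding half of it to 1.0 no longer changes the result. Note the convention: this returns the smallest power of two that still changes 1.0, which on IEEE 754 doubles with round-to-nearest is $2^{-52} \approx 2.2 \cdot 10^{-16}$, the value C exposes as `DBL_EPSILON`. The "largest value that vanishes" described above is a factor of two smaller.

```c
#include <assert.h>
#include <float.h>

/* Smallest power of two eps such that 1.0 + eps != 1.0 in double precision. */
double machine_epsilon(void)
{
    assert(FLT_RADIX == 2);         /* the halving loop assumes binary floating point */
    double eps = 1.0;
    /* volatile forces the sum to be rounded to double before the comparison */
    volatile double sum = 1.0 + 0.5 * eps;
    while (sum != 1.0) {
        eps *= 0.5;
        sum = 1.0 + 0.5 * eps;
    }
    return eps;
}
```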
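Selecting ε 2 of 3
If $L_f$ is a bound on the value of $f(\bar{x})$, i.e. $|f(\bar{x})| \le L_f$, then in finite
precision we have
\[
|\mathrm{computed}(f(\bar{x})) - f(\bar{x})| \le \epsilon_{mach} L_f
\]
\[
|\mathrm{computed}(f(\bar{x} + \epsilon\bar{e}_i)) - f(\bar{x} + \epsilon\bar{e}_i)| \le \epsilon_{mach} L_f.
\]
Now, if we recall our finite difference approximation (with a slight
abuse of notation)
\[
\frac{\partial f(\bar{x})}{\partial x_i} \approx \frac{f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x})}{\epsilon} + \mathrm{error}\!\left(\frac{\epsilon L_c}{2}\right).
\]
Each of the two computed function values contributes a roundoff error of at most $\epsilon_{mach} L_f$, and the difference is divided by $\epsilon$. We find that the total error is
\[
\mathrm{error} \sim \frac{2\epsilon_{mach} L_f}{\epsilon} + \frac{\epsilon L_c}{2}.
\]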
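Selecting ε 3 of 3
Now,
\[
\frac{d\,\mathrm{error}}{d\epsilon} \sim -\frac{2\epsilon_{mach} L_f}{\epsilon^2} + \frac{L_c}{2} = 0 \quad\Rightarrow\quad \epsilon^2 = \frac{4\epsilon_{mach} L_f}{L_c},
\]
gives us the optimal value for $\epsilon$. Since $L_f$ and $L_c$ are unknown
in general, most software packages tend to select
\[
\epsilon = \sqrt{\epsilon_{mach}},
\]
which is close to optimal in most cases.
Hence, the error in the approximated gradient is
\[
\mathrm{error} \sim 2 L_f \sqrt{\epsilon_{mach}} + \frac{L_c}{2}\sqrt{\epsilon_{mach}}.
\]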
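The trade-off between truncation error and roundoff error is easy to observe numerically. The sketch below is illustrative only: the test function $f(x) = x^3$ at $x = 1$ (exact derivative 3) and the names `cube` and `forward_diff_error` are assumptions, not part of the notes.

```c
#include <assert.h>

/* Illustrative scalar test function: f(x) = x^3, so f'(1) = 3. */
static double cube(double x) { return x * x * x; }

/* Absolute error of the forward difference approximation to f'(1). */
double forward_diff_error(double eps)
{
    assert(eps > 0.0);
    const double x = 1.0;
    const double approx = (cube(x + eps) - cube(x)) / eps;
    const double d = approx - 3.0;
    return d < 0.0 ? -d : d;        /* |approx - exact| without libm */
}
```

With a large step such as `1e-2` the truncation term dominates, with a tiny step such as `1e-15` the roundoff term dominates, and a step near $\sqrt{\epsilon_{mach}} \approx 1.5 \cdot 10^{-8}$ sits near the minimum of the two.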
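Central Differences $O(\epsilon^2)$ Accuracy
At twice the cost, we can get about 2.67 extra digits of precision in the
finite difference approximation, by using central differences.
More Taylor expansions...
\[
f(\bar{x} + \epsilon\bar{e}_i) = f(\bar{x}) + \epsilon\frac{\partial f}{\partial x_i} + \frac{1}{2}\epsilon^2\frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3)
\]
\[
f(\bar{x} - \epsilon\bar{e}_i) = f(\bar{x}) - \epsilon\frac{\partial f}{\partial x_i} + \frac{1}{2}\epsilon^2\frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3)
\]
\[
f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x} - \epsilon\bar{e}_i) = 2\epsilon\frac{\partial f}{\partial x_i} + O(\epsilon^3)
\]
We get
\[
\frac{\partial f(\bar{x})}{\partial x_i} = \frac{f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x} - \epsilon\bar{e}_i)}{2\epsilon} + O(\epsilon^2).
\]
By arguments similar to the ones for the forward difference formula, we
can show that the optimal $\epsilon$ and overall error are
\[
\epsilon = \sqrt[3]{\epsilon_{mach}} \quad\Rightarrow\quad \mathrm{error} \sim O\!\left(\epsilon_{mach}^{2/3}\right).
\]
With $\epsilon_{mach} \approx 10^{-16}$ this is an error of about $10^{-10.67}$, compared with about $10^{-8}$ for forward differences.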
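A central-difference gradient differs from the forward version only in the two evaluations per coordinate. This is a sketch under the same assumptions as before: the names `objective_fn`, `fd_grad_central` and `example_f` are hypothetical, and the step is left to the caller.

```c
#include <assert.h>

/* Assumed objective signature for this sketch: f(n, x) returns a double. */
typedef double (*objective_fn)(int n, const double *x);

/* Central-difference gradient:
 * g[i] ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps), using 2n evaluations. */
void fd_grad_central(objective_fn f, int n, double *x, double eps, double *g)
{
    assert(n > 0 && eps > 0.0);
    for (int i = 0; i < n; i++) {
        const double xi = x[i];
        x[i] = xi + eps;
        const double f_plus = f(n, x);
        x[i] = xi - eps;
        const double f_minus = f(n, x);
        x[i] = xi;                  /* restore the coordinate */
        g[i] = (f_plus - f_minus) / (2.0 * eps);
    }
}

/* Illustrative objective (not from the notes): f(x) = x0^3 + x0*x1,
 * with exact gradient (3*x0^2 + x1, x0). */
double example_f(int n, const double *x)
{
    (void)n;
    return x[0] * x[0] * x[0] + x[0] * x[1];
}
```

A step near $\sqrt[3]{\epsilon_{mach}} \approx 6 \cdot 10^{-6}$ is the natural choice here; at $\bar{x} = (1, 2)$ the result should agree with the exact gradient $(5, 1)$ to many more digits than the forward formula gives.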
Approximating the Hessian The Easy Case 1 of 5
The easy case: Analytic Gradient given
If the analytic gradient is known, then we can get an
approximation of the Hessian by applying forward or central
differencing to each element of the gradient vector in turn.
When the second derivatives exist and are Lipschitz continuous,
Taylor’s theorem says
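\[
\nabla f(\bar{x} + \bar{p}) = \nabla f(\bar{x}) + \nabla^2 f(\bar{x})\,\bar{p} + O(\|\bar{p}\|^2).
\]
Again, we let $\bar{p} = \epsilon\bar{e}_i$, $i = 1, 2, \dots, n$ and get
\[
\nabla^2 f(\bar{x})\,\bar{e}_i = \frac{\nabla f(\bar{x} + \epsilon\bar{e}_i) - \nabla f(\bar{x})}{\epsilon} + O(\epsilon), \quad \text{or}
\]
\[
\nabla^2 f(\bar{x})\,\bar{e}_i = \frac{\nabla f(\bar{x} + \epsilon\bar{e}_i) - \nabla f(\bar{x} - \epsilon\bar{e}_i)}{2\epsilon} + O(\epsilon^2).
\]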
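Since $\nabla^2 f(\bar{x})\,\bar{e}_j$ is the $j$th column of the Hessian, the forward formula fills the matrix one column per gradient evaluation. The sketch below assumes a gradient callback of the form `grad(n, x, g)` and row-major storage `H[i*n + j]`; the names `gradient_fn`, `fd_hess_from_grad` and `example_grad` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed gradient signature for this sketch: writes grad f(x) into g. */
typedef void (*gradient_fn)(int n, const double *x, double *g);

/* Column j of H ~ (grad(x + eps*e_j) - grad(x)) / eps; H is row-major n-by-n.
 * Uses n + 1 gradient evaluations. Returns 0 on success, -1 if malloc fails. */
int fd_hess_from_grad(gradient_fn grad, int n, double *x, double eps, double *H)
{
    assert(n > 0 && eps > 0.0);
    double *g0 = malloc((size_t)n * sizeof *g0);
    double *g1 = malloc((size_t)n * sizeof *g1);
    if (g0 == NULL || g1 == NULL) {
        free(g0);
        free(g1);
        return -1;
    }
    grad(n, x, g0);                 /* base gradient at x */
    for (int j = 0; j < n; j++) {
        const double xj = x[j];
        x[j] = xj + eps;
        grad(n, x, g1);             /* gradient at x + eps*e_j */
        x[j] = xj;
        for (int i = 0; i < n; i++)
            H[i * n + j] = (g1[i] - g0[i]) / eps;
    }
    free(g0);
    free(g1);
    return 0;
}

/* Illustrative gradient of f(x) = x0^2*x1 + x1^3 (not from the notes);
 * the exact Hessian is [[2*x1, 2*x0], [2*x0, 6*x1]]. */
void example_grad(int n, const double *x, double *g)
{
    (void)n;
    g[0] = 2.0 * x[0] * x[1];
    g[1] = x[0] * x[0] + 3.0 * x[1] * x[1];
}
```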
Approximating the Hessian Symmetrize 2 of 5
It is worth noting that this is a column-at-a-time process, which
does not — due to numerical roundoff and approximation errors —
necessarily give a symmetric Hessian.
It is often necessary to symmetrize the result
\[
H_{num}^{sym} = \frac{1}{2}\left(H_{num} + H_{num}^T\right).
\]
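The symmetrization step is a simple in-place average of each off-diagonal pair. A minimal sketch, assuming row-major storage `H[i*n + j]` (the name `symmetrize` is hypothetical):

```c
#include <assert.h>

/* In-place symmetrization: H <- (H + H^T) / 2 for a row-major n-by-n matrix. */
void symmetrize(int n, double *H)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            const double avg = 0.5 * (H[i * n + j] + H[j * n + i]);
            H[i * n + j] = avg;
            H[j * n + i] = avg;
        }
    }
}
```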
Approximating the Hessian Special Case 3 of 5
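Special Case: In Newton-CG methods we do not require full
knowledge of the Hessian. Each iteration requires only the
Hessian-vector product $\nabla^2 f(\bar{x})\,\bar{p}$, where $\bar{p}$ is the given search
direction. This product can be approximated by a forward difference,
\[
\nabla^2 f(\bar{x})\,\bar{p} = \frac{\nabla f(\bar{x} + \epsilon\bar{p}) - \nabla f(\bar{x})}{\epsilon} + O(\epsilon),
\]
or by a central difference,
\[
\nabla^2 f(\bar{x})\,\bar{p} = \frac{\nabla f(\bar{x} + \epsilon\bar{p}) - \nabla f(\bar{x} - \epsilon\bar{p})}{2\epsilon} + O(\epsilon^2).
\]
This approximation is very cheap: only one (forward) or two (central) extra gradient
evaluations are needed.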
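The forward variant of the Hessian-vector product can be sketched as follows. The gradient at $\bar{x}$ is passed in as `g0`, on the assumption that the optimizer has already computed it, so only one new gradient evaluation is made. The names `gradient_fn`, `fd_hess_vec` and `example_grad` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed gradient signature for this sketch: writes grad f(x) into g. */
typedef void (*gradient_fn)(int n, const double *x, double *g);

/* Hp ~ (grad(x + eps*p) - g0) / eps, where g0 = grad(x) is supplied by the
 * caller. One extra gradient evaluation. Returns 0 on success, -1 on failure. */
int fd_hess_vec(gradient_fn grad, int n, const double *x, const double *g0,
                const double *p, double eps, double *Hp)
{
    assert(n > 0 && eps > 0.0);
    double *xp = malloc((size_t)n * sizeof *xp);
    if (xp == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        xp[i] = x[i] + eps * p[i];  /* shifted point x + eps*p */
    grad(n, xp, Hp);                /* Hp temporarily holds grad(x + eps*p) */
    for (int i = 0; i < n; i++)
        Hp[i] = (Hp[i] - g0[i]) / eps;
    free(xp);
    return 0;
}

/* Illustrative gradient of f(x) = x0^2*x1 + x1^3 (not from the notes);
 * the exact Hessian is [[2*x1, 2*x0], [2*x0, 6*x1]]. */
void example_grad(int n, const double *x, double *g)
{
    (void)n;
    g[0] = 2.0 * x[0] * x[1];
    g[1] = x[0] * x[0] + 3.0 * x[1] * x[1];
}
```

At $\bar{x} = (1, 2)$ with $\bar{p} = (1, 1)$ the exact product is $(6, 14)$, which the sketch should reproduce up to an error of order $\epsilon$.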
Approximating the Hessian Hard (Realistic) Case 4 of 5
The harder case: Analytic Gradient not given
When the analytic gradient is not given we must use a finite
difference formula using only function values to approximate the
Hessian.
The first order forward difference approximation is given by
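\[
\frac{\partial^2 f(\bar{x})}{\partial x_i \partial x_j} \approx \frac{f(\bar{x} + \epsilon\bar{e}_i + \epsilon\bar{e}_j) - f(\bar{x} + \epsilon\bar{e}_i) - f(\bar{x} + \epsilon\bar{e}_j) + f(\bar{x})}{\epsilon^2}
\]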
Figure: The 4-point stencil in the $(i, j)$ plane, with weights −1, +1 (first row) and +1, −1 (second row).
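A function-values-only Hessian along these lines might look as follows. This is a sketch, not the project's `fdhessf`: the names `objective_fn`, `fd_hess_forward` and `example_f` are hypothetical, the values $f(\bar{x} + \epsilon\bar{e}_i)$ are cached so each entry costs one new evaluation, and only the lower triangle is computed and mirrored. Because roundoff now scales like $\epsilon_{mach}/\epsilon^2$, the step should be considerably larger than $\sqrt{\epsilon_{mach}}$; the value `1e-4` used for checking is an assumption, not a recommendation from the notes.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed objective signature for this sketch: f(n, x) returns a double. */
typedef double (*objective_fn)(int n, const double *x);

/* H[i][j] ~ (f(x+eps*e_i+eps*e_j) - f(x+eps*e_i) - f(x+eps*e_j) + f(x)) / eps^2.
 * H is row-major n-by-n. Returns 0 on success, -1 if malloc fails. */
int fd_hess_forward(objective_fn f, int n, double *x, double eps, double *H)
{
    assert(n > 0 && eps > 0.0);
    double *fi = malloc((size_t)n * sizeof *fi);   /* fi[i] = f(x + eps*e_i) */
    if (fi == NULL)
        return -1;
    const double f0 = f(n, x);
    for (int i = 0; i < n; i++) {
        const double xi = x[i];
        x[i] = xi + eps;
        fi[i] = f(n, x);
        x[i] = xi;
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            const double xi = x[i], xj = x[j];
            x[i] += eps;
            x[j] += eps;            /* for i == j this steps 2*eps along e_i */
            const double fij = f(n, x);
            x[i] = xi;
            x[j] = xj;
            const double h = (fij - fi[i] - fi[j] + f0) / (eps * eps);
            H[i * n + j] = h;
            H[j * n + i] = h;       /* mirror into the upper triangle */
        }
    }
    free(fi);
    return 0;
}

/* Illustrative objective (not from the notes): f(x) = x0^2*x1 + x1^3,
 * with exact Hessian [[2*x1, 2*x0], [2*x0, 6*x1]]. */
double example_f(int n, const double *x)
{
    (void)n;
    return x[0] * x[0] * x[1] + x[1] * x[1] * x[1];
}
```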
Approximating the Hessian 5 of 5
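At a price of $\sim n^2$ additional function evaluations (an increase of
33%) we can use the second order central difference
approximation
\[
\frac{\partial^2 f(\bar{x})}{\partial x_i \partial x_j} \approx \frac{f(\bar{x} + \epsilon\bar{e}_i + \epsilon\bar{e}_j) - f(\bar{x} + \epsilon\bar{e}_i - \epsilon\bar{e}_j) - f(\bar{x} - \epsilon\bar{e}_i + \epsilon\bar{e}_j) + f(\bar{x} - \epsilon\bar{e}_i - \epsilon\bar{e}_j)}{4\epsilon^2}
\]
Figure: The second order 4-point central difference approximation stencil for $\partial^2 f(\bar{x}) / \partial x_i \partial x_j$
at the central point in the stencil, with weights −1, +1 (first row) and +1, −1 (second row). Note that the value at the central point is not part
of the evaluation!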
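The central formula needs four shifted evaluations per entry, which a small helper makes readable. As before this is a sketch under stated assumptions: the names `objective_fn`, `f_shifted`, `fd_hess_central` and `example_f` are hypothetical, storage is row-major, and the step is left to the caller.

```c
#include <assert.h>

/* Assumed objective signature for this sketch: f(n, x) returns a double. */
typedef double (*objective_fn)(int n, const double *x);

/* Evaluate f at x + di*e_i + dj*e_j, restoring x afterwards. */
static double f_shifted(objective_fn f, int n, double *x,
                        int i, double di, int j, double dj)
{
    const double xi = x[i], xj = x[j];
    x[i] += di;
    x[j] += dj;
    const double v = f(n, x);
    x[i] = xi;
    x[j] = xj;
    return v;
}

/* H[i][j] ~ (f(++) - f(+-) - f(-+) + f(--)) / (4*eps^2), four evaluations
 * per entry; the lower triangle is computed and mirrored. H is row-major. */
void fd_hess_central(objective_fn f, int n, double *x, double eps, double *H)
{
    assert(n > 0 && eps > 0.0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            const double fpp = f_shifted(f, n, x, i, +eps, j, +eps);
            const double fpm = f_shifted(f, n, x, i, +eps, j, -eps);
            const double fmp = f_shifted(f, n, x, i, -eps, j, +eps);
            const double fmm = f_shifted(f, n, x, i, -eps, j, -eps);
            const double h = (fpp - fpm - fmp + fmm) / (4.0 * eps * eps);
            H[i * n + j] = h;
            H[j * n + i] = h;
        }
    }
}

/* Illustrative objective (not from the notes): f(x) = x0^2*x1 + x1^3,
 * with exact Hessian [[2*x1, 2*x0], [2*x0, 6*x1]]. */
double example_f(int n, const double *x)
{
    (void)n;
    return x[0] * x[0] * x[1] + x[1] * x[1] * x[1];
}
```

For a diagonal entry ($i = j$) the two mixed terms both reduce to $f(\bar{x})$, so the formula becomes the familiar second difference with step $2\epsilon$.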
Sparsity and Symmetry 1 of 3
Now that we are paying ∼ 4 function evaluations per entry in the
Hessian matrix, it is worth taking sparsity and symmetry into
account.
Ponder the extended Rosenbrock function:
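double rosenbrock( int n, const double *x )
{
  double f = 0.0;
  int    i;
  /* Standard extended Rosenbrock form; the coefficient 100 is assumed. */
  for( i=0; i<n/2; i++ ) {
    double t = x[2*i+1] - x[2*i]*x[2*i];
    f += 100.0*t*t + (1.0 - x[2*i])*(1.0 - x[2*i]);
  }
  return f;
}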
Sparsity and Symmetry 2 of 3
The fill-pattern of the Hessian of the extended Rosenbrock
function consists of 2×2-diagonal blocks:
Figure: Sparsity pattern (spy plot) of the Hessian of the extended Rosenbrock function, nz = 16.
There are a lot of zero-entries in this Hessian. If somehow we have
knowledge of the sparsity pattern, then we can exploit this by not
computing/touching the zeros.
Sparsity and Symmetry 3 of 3
By using the fact that the Hessian is symmetric, we can save about
half of the work,
Figure: Two sparsity plots (left: nz = 36, right: nz = 28). The entries to the left, $H_{ij}$, $j \le i$, must be computed, but using symmetry
we can fill in the missing ones, $H_{ij} = H_{ji}$, $j > i$.
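Sparsity and symmetry combine naturally: loop over the lower triangle only, skip entries known to be zero, and mirror each computed value. The sketch below assumes the sparsity pattern is supplied as a dense 0/1 mask (a hypothetical convention chosen for brevity), uses the central difference formula, and returns the number of entries that were actually computed. The names `f_shifted`, `fd_hess_sparse_sym` and the coefficient 100 in `rosenbrock` are assumptions.

```c
#include <assert.h>

/* Assumed objective signature for this sketch: f(n, x) returns a double. */
typedef double (*objective_fn)(int n, const double *x);

/* Evaluate f at x + di*e_i + dj*e_j, restoring x afterwards. */
static double f_shifted(objective_fn f, int n, double *x,
                        int i, double di, int j, double dj)
{
    const double xi = x[i], xj = x[j];
    x[i] += di;
    x[j] += dj;
    const double v = f(n, x);
    x[i] = xi;
    x[j] = xj;
    return v;
}

/* Central-difference Hessian restricted to the lower triangle (j <= i) and to
 * entries with mask[i*n + j] != 0; all other entries are set to zero.
 * Returns the number of entries computed (4 function evaluations each). */
int fd_hess_sparse_sym(objective_fn f, int n, double *x, double eps,
                       const unsigned char *mask, double *H)
{
    assert(n > 0 && eps > 0.0);
    int computed = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            double h = 0.0;
            if (mask[i * n + j]) {
                h = (f_shifted(f, n, x, i, +eps, j, +eps)
                   - f_shifted(f, n, x, i, +eps, j, -eps)
                   - f_shifted(f, n, x, i, -eps, j, +eps)
                   + f_shifted(f, n, x, i, -eps, j, -eps)) / (4.0 * eps * eps);
                computed++;
            }
            H[i * n + j] = h;
            H[j * n + i] = h;       /* symmetry: H_ij = H_ji */
        }
    }
    return computed;
}

/* Extended Rosenbrock function in its standard form; coefficient 100 assumed. */
double rosenbrock(int n, const double *x)
{
    double f = 0.0;
    for (int i = 0; i < n / 2; i++) {
        const double t = x[2 * i + 1] - x[2 * i] * x[2 * i];
        f += 100.0 * t * t + (1.0 - x[2 * i]) * (1.0 - x[2 * i]);
    }
    return f;
}
```

With a 2×2 block-diagonal mask, only 3 entries per block are computed, i.e. $3n/2$ entries instead of $n^2$.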
Project Extensions Milestone #2: with linesearch+1
Add the following to your codebase:
fdhessg: Finite difference approximation to the Hessian using the analytic gradient.
(Executed when analgrad=TRUE, analhess=FALSE, cheapf=TRUE.)
fdjac: The core call from fdhessg; note that fvec in the pseudo-code
corresponds to your analytic gradient.
fdgrad: Finite difference (forward) approximation to the gradient. (Executed
when analgrad=FALSE.)
fdhessf: Finite difference approximation to the Hessian using only function
values. (Executed when analgrad=FALSE, analhess=FALSE,
cheapf=TRUE.)
Compare: Performance of analytic everything (from before) / analytic gradient
(fdhessg+fdjac) / finite difference everything (fdhessf+fdgrad). Try optimal and
non-optimal ε. Use 2 test problems from Dennis-Schnabel, Appendix B.
Add-on: Central differencing strategies.
Project: Milestone #0
Please let me know in the very near future what you are working
on!!!