# Automatic Differentiation Tutorial

Document Sample

```					Automatic Differentiation
Tutorial

Paul Hovland
Andrew Lyons
Boyana Norris
Jean Utke
Mathematics & Computer Science Division
Argonne National Laboratory

 Forward Mode
– Propagates derivative vectors, often denoted u or g_u
– Derivative vector u contains derivatives of u with respect to independent variables
– Time and storage proportional to vector length (# indeps)
 Reverse Mode
– Propagates adjoints, denoted ū or u_bar
– Adjoint ū contains derivatives of dependent variables with respect to u
– Propagation starts with dependent variables—must reverse flow of computation
– Time proportional to adjoint vector length (# dependents)
– Storage proportional to number of operations
– Because of this limitation, often applied to subprograms
Which mode to use?

 Use forward mode when
– # independents is very small
– Only a directional derivative (Jv) is needed
– Reverse mode is not tractable
 Use reverse mode when
– # dependents is very small
– Only JTv is needed

– Use language features to implement differentiation rules or to generate trace
(―tape‖) of computation
– Implementation can be very simple
– Difficult to go beyond one operation/statement at a time in doing AD
– Potential overhead due to compiler-generated temporaries,
e.g. w = x*y*z  tmp = x*y; w = tmp*z.
 Source Transformation
– Requires significant (compiler) infrastructure
– More flexibility in exploiting chain rule associativity

class a_double{
private:
public:
/* constructors */
a_double(double v=0.0, double g=0.0){value=v; grad = g;}

/* operators */
friend a_double operator+(const a_double &g1, const a_double &g2) {
}
friend a_double operator-(const a_double &g1, const a_double &g2) {
}
friend a_double operator*(const a_double &g1, const a_double &g2) {
}
friend a_double sin(const a_double &g1) {
}
friend a_double cos(const a_double &g1) {
}
// ...

#include <math.h>
#include <stream.h>

void func(a_double *f, a_double x, a_double y){
a_double a,b;

if (x > y) {
a = cos(x);
b = sin(y)*y*y;
} else {
a = x*sin(x)/y;
b = exp(y);
}
*f = exp(a*b);
}
Source Transformation: simple example

C           Generated by TAPENADE     (INRIA, Tropics team)
C   ...
SUBROUTINE FUNC_D(f, fd, x, xd, y, yd)
DOUBLE PRECISION f, fd, x, xd, y, yd
DOUBLE PRECISION a, ad, arg1, arg1d, b, bd
INTRINSIC COS, EXP, SIN
C
IF (x .GT. y) THEN
a = COS(x)
bd = yd*COS(y)*y*y + SIN(y)*(yd*y+y*yd)
b = SIN(y)*y*y
ELSE
a = x*SIN(x)/y
bd = yd*EXP(y)
b = EXP(y)
END IF
arg1 = a*b
fd = arg1d*EXP(arg1)
f = EXP(arg1)
RETURN
END
Tools

   Fortran 95
   C/C++
   Fortran 77
   MATLAB
   Other languages: Ada, Python, ...
   More tools at http://www.autodiff.org/
Tools: Fortran 95

 TAF (FastOpt)
– Commercial tool
– Support for (almost) all of Fortran 95
– Used extensively in geophysical sciences applications
– Support for many Fortran 95 features
– Developed by a team with extensive compiler experience
– Support for many Fortran 95 features
– Developed by a team with expertise in combintorial algorithms,
compilers, software engineering, and numerical analysis
– Development driven by climate model & astrophysics code
 All three: forward and reverse; source transformation
Tools: C/C++

– Mature tool
– Support for all of C++
– Support for all of C, some C++
– Source transformation; forward mode (reverse under development)
– New version (2.0) based on industrial strength compiler infrastructure
– Shares some infrastructure with OpenAD/F
– See Phipps presentation
 TAC++ (FastOpt)
– Commercial tool (under development)
– Support for much of C/C++
– Source transformation; forward and reverse modes
– Shares some infrastructure with TAF
Tools: Fortran 77

– Mature and very robust tool
– Support for all of Fortran 77
– Forward and (adequate) reverse modes
– Hundreds of users; ~150 citations

Tools: MATLAB

 Various research prototypes
ADIFOR 2.0: simple example (x only)
C   This file was generated on 09/15/03 by the version of
C   ADIFOR compiled on June, 1998.
C ...
subroutine g_func(f, g_f, x, g_x, y)
double precision f, x, y, a, b
double precision d1_w, d2_v, d1_p, d2_b, g_a, g_x, g_d1_w, g_f
save g_a, g_d1_w

if (x .gt. y) then
d2_v = cos(x)
d1_p = -sin(x)
g_a = d1_p * g_x
a = d2_v
C--------
b = sin(y) * y * y
else
d2_v = sin(x)
d1_p = cos(x)
d2_b = 1.0d0 / y
g_a = (d2_b * d2_v + d2_b * x * d1_p) * g_x
a = x * d2_v / y
C--------
b = exp(y)
endif
g_d1_w = b * g_a
d1_w = a * b
d2_v = exp(d1_w)
d1_p = d2_v
g_f = d1_p * g_d1_w
f = d2_v
C--------
C
return
end
subroutine g_func_proc(ad_p_, f, g_f, x, g_x, y, g_y)
include 'g_maxs.h'
c+declarations here
double precision d_tmp_s_0_a
double precision d_tmp_ipar_0
double precision d_tmp_val_0
double precision d_tmp_val_1
double precision f, x, y, a, b
double precision X2_dtmp1
d_tmp_ipar_0 = (-sin(x))
a = cos(x)
d_tmp_val_0 = sin(y)
d_tmp_val_1 = d_tmp_val_0 * y
d_tmp_ipar_0 = cos(y)
d_tmp_s_0_a = d_tmp_val_1 + y * d_tmp_val_0 + y * y * d_tmp_ipar
+_0
b = d_tmp_val_1 * y
call g_accum_d_4(ad_p_, g_X2_dtmp1, a, g_b, b, g_a)
X2_dtmp1 = a * b
d_tmp_ipar_0 = exp(X2_dtmp1)
f = exp(X2_dtmp1)
C
return
end

C         Generated by TAPENADE    (INRIA, Tropics team)
C Version 2.0.6 - (Id: 1.14 vmp Stable - Fri Sep 5 14:08:23 MEST 2003)
C ...
SUBROUTINE FUNC_B(f, fb, x, xb, y, yb)
DOUBLE PRECISION f, fb, x, xb, y, yb
DOUBLE PRECISION a, ab, arg1, arg1b, b, bb
INTEGER branch
INTRINSIC COS, EXP, SIN
C
IF (x .GT. y) THEN
a = COS(x)
b = SIN(y)*y*y
CALL PUSHINTEGER4(0)
ELSE
a = x*SIN(x)/y
b = EXP(y)
CALL PUSHINTEGER4(1)
END IF
arg1 = a*b
f = EXP(arg1)
arg1b = EXP(arg1)*fb
ab = b*arg1b
bb = a*arg1b
CALL POPINTEGER4(branch)
IF (branch .LT. 1) THEN
yb = yb + (SIN(y)*y+y*SIN(y)+y*y*COS(y))*bb
xb = xb - SIN(x)*ab
ELSE
yb = yb + EXP(y)*bb - x*SIN(x)*ab/y**2
xb = xb + (x*COS(x)/y+SIN(x)/y)*ab
END IF
fb = 0.D0
END
SUBROUTINE func(F, X, Y)
use w2f__types
use active_module
C Declarations
C ...
IF (X%v .GT. Y%v) THEN
ELSE
C ...
#include <math.h>
void   ad_func(DERIV_TYPE *f,DERIV_TYPE x,DERIV_TYPE y) {

if (DERIV_val(x) > DERIV_val(y))      {
DERIV_val(a) = cos( DERIV_val(x)); /*cos*/
{
}
{
}
{
}
}
else      {
// ...

Original code: y = x1*x2*x3*x4;
typedef struct {                        DERIV_val(y): value of program
double value;                                 variable y
} DERIV_TYPE;                                         associated with y

for(i=0;i<nindeps;i++)

Matrix Coloring

 Jacobian matrices are often sparse
 The forward mode of AD computes J × S, where S is usually an identity
matrix or a vector
 Can ―compress‖ Jacobian by choosing S such that structurally orthogonal
columns are combined
 A set of columns are structurally orthogonal if no two of them have
nonzeros in the same row
 Equivalent problem: color the graph whose adjacency matrix is JTJ
 Equivalent problem: distance-2 color the bipartite graph of J
Compressed Jacobian
What is feasible & practical

 Key point: forward mode computes JS at cost proportional to number of
columns in S; reverse mode computes JTW at cost proportional to
number of columns in W
 Jacobians of functions with small number (1—1000) of independent
variables (forward mode, S=I)
 Jacobians of functions with small number (1—100) of dependent
 Very (extremely) large, but (very) sparse Jacobians and Hessians
(forward mode plus coloring)
 Jacobian-vector products (forward mode)
 Hessian-vector products (forward + adjoint modes)
 Large, dense Jacobian matrices that are effectively sparse or effectively
low rank (e.g., see Abdel-Khalik et al., AD2008)
Scenarios

p                  k
n           A                    B                 C          m

   N small: use forward mode on full computation
   M small: use reverse mode on full computation
   M & N large, P small: use reverse mode on A, forward mode on B&C
   M & N large, K small: use reverse mode on A&B, forward mode on C
   N, P, K, M large, Jacobians of A, B, C sparse: compressed forward mode
   N, P, K, M large, Jacobians of A, B, C low rank: scarce forward mode
Scenarios

p                  k
n           A                    B                 C          m

   N small: use forward mode on full computation
   M small: use reverse mode on full computation
   M & N large, P small: use reverse mode on A, forward mode on B&C
   M & N large, K small: use reverse mode on A&B, forward mode on C
   N, P, K, M large, Jacobians of A, B, C sparse: compressed forward mode
   N, P, K, M large, Jacobians of A, B, C low rank: scarce forward mode
   N, P, K, M large: Jacobians of A, B, C dense: what to do?
Automatic Differentiation: What Can Go
Wrong (and what to do about it)
Issues with Black Box Differentiation

 Source code may not be available or may be difficult to work with
 Simulation may not be (chain rule) differentiable
– Feedback due to adaptive algorithms
– Nondifferentiable functions
– Noisy functions
– Convergence rates
– Etc.
 Accurate derivatives may not be needed (FD might be cheaper)
 Differentiation and discretization do not commute
Difficulties in ADIFOR 2.0 processing of FLUENT

 Dubious programming techniques:
– Type mismatches in actual & declared parameters
 Bugs:
– inconsistent number of arguments in subroutine calls
 Not conforming to Fortran77 standard
– while statement in one subroutine
– I/O statements containing function invocations

 Dynamic range of derivative code often is larger than that of original code.
 This may lead to overflows in the derivative code, in particular in 32-bit
arithmetic.

if (cendiv.eq.0.0)                    r4_v = ap / cendiv
cendiv = zero                       r4_b = 1.0 / cendiv
endif                                 r5_b = (-r4_v) / cendiv
axp = axp+ap/cendiv                   do g_i_ = 1, g_pmax_
g_axp(g_i_) = g_axp(g_i_)
+ r4_b * g_ap(g_i_)
Note: The value of „zero―              + r5_b * g_cendiv(g_i_)
is a small number, not 0.0           enddo
axp = axp + r4_v
Wisconsin Sea Ice Model
Chain rule (non-)differentiability

if (x .eq. 1.0) then
a = y
else if ((x .eq. 0.0) then
a = 0
else
a = x*y
endif

b = sqrt(x**4 + y**4)
Mathematical Challenges

 Derivatives of intrinsic functions at points of non-differentiability
 Derivatives of implicitly defined functions
 Derivatives of functions computed using numerical methods
Points of Nondifferentiability

 Due to intrinsic functions
– Several intrinsic functions are defined at points where their derivatives are not,
e.g.:
• abs(x), sqrt(x) at x=0
• max(x,y) at x=y
– Requirements:
• Record/report exceptions
• Optionally, continue computation using some generalized gradient
• User-selected reporting mechanism
– [1.0,0.0] for max(x,0)
– [0.5,0.5] for max(x,y)
• Various ways of handling
– Verbose reports (file, line, type of exception)
– Terse summary (like IEEE flags)
– Ignore
 Due to conditional branches
– May be able to handle using trust regions
Implicitly Defined Functions

– Implicitly defined functions often computed using iterative methods
– Function and derivatives may converge at different rates
– Derivative may not be ―accurate‖ if iteration halted upon function
convergence
– Solutions:
• Tighten function convergence criteria
• Add derivative convergence to stopping criteria
• Compute derivatives directly, e.g. A x = b
Derivatives of Functions Computed Using Numerical Methods

 Differentiation and approximation may not commute
 Need to be careful about how derivatives of numerical approximations are
used
 E.g., differentiating through an ODE integrator can provide unexpected
results due to feedback induced by adaptive stepsize control:

z      z 1               1
z  1 t 
1   1

t      p
CVODE_S: Objective

F(y,p)
p
ODE Solver            y|t=t1, t2, ...
y|t=0           (CVODE)

automatically

p                                 y|t=t1, t2, ...
Sensitivity
Solver              dy/dp|t=t1, t2, ...
y|t=0
Possible Approaches

Solve sensitivity eqns:
p
CVODE_S           y, dy/dp |t=t1, t2, ...
y|t=0
CVODE as ODE + sensitivity solver

 Augmented ODE initial-value problem:

Y   F (t , Y , p) with Y (t0 )  Y0 .

     f (t, y, p) 
 y                   y                                           
                                               f
                                          f 
    

 w1 


w1  p1        
 w                                                 y            p1 
Y     
1
,       Y         ,     F (t,Y , p)    



                                                       
                                              
                                                 f             f 
    
 wm 


 wm                           wm  pm         
                                               y             pm 
                    
CVODE_S: Test Problem

equation
   100x100 structured grid
   16 Pentium III nodes
SensPVODE: Number of Timesteps
SensPVODE: Time/Timestep

 Detect points of nondifferentiability
– currently supported for intrinsic functions, but not conditional
statements
 Exploit mathematics to avoid differentiating through an adaptive algorithm
 Modify termination criterion for implicitly defined functions
– Tighten tolerance
– Add derivatives to termination test (preferred)
Conclusions

 There are some potential ―gotchas‖ when applying AD in a black box
fashion
 Some care should be taken to ensure that the desired quantity is
computed
 There are usually workarounds
Practical Matters: constructing computational graphs

 At compile time (source transformation)
– Structure of graph is known, but edge weights are not: in effect,
implement inspector (symbolic) phase at compile time (offline),
executor (numeric) phase at run time (online)
– In order to assemble graph from individual statements, must be able
to resolve aliases, be able to match variable definitions and uses
– Scope of computational graph construction is usually limited to
statements or basic blocks
– Computational graph usually has O(10)—O(100) vertices
– Structure and weights both discovered at runtime
– Completely online—cannot afford polynomial time algorithms to
analyze graph
– Computational graph may have O(10,000) vertices
Checkpointing real applications

 In practice, need a combination of all of these techniques
 At the timestep level, 2- or 3-level checkpointing is typical: too many
timesteps to checkpoint every timestep
 At the call tree level, some mixture of joint and split mode is desirable
– Pure split mode consumes too much memory
– Pure joint mode wastes time recomputing at the lowest levels of the
call tree
 Currently, OpenAD provides a templating mechanism to simplify the use
of mixed checkpointing strategies
 Future research will attempt to automate some of the checkpointing
What to do for very large, dense Jacobian matrices?

 Jacobian matrix might be large and dense, but ―effectively sparse‖
– Many entries below some threshold ε (―almost zero‖)
– Can tolerate errors up to δ in entries greater than ε
– Example: advection-diffusion for finite time step: nonlocal terms fall
off exponentially
– Solution: do a partial coloring to compress this dense Jacobian:
requires solving a modified graph coloring problem
 Jacobian might be large and dense, but ―effectively low rank‖
– Can be well approximated by a low rank matrix
– Jacobian-vector products (and JTv) are cheap
– Adaptation of efficient subspace method uses Jv and JTv in random
directions to build low rank approximation (Abdel-Khalik et al., AD08)
Application highlights

   Atmospheric chemistry
   Breast cancer biostatistical analysis
   CFD: CFL3D, NSC2KE, (Fluent 4.52: Aachen) ...
   Chemical kinetics
   Climate and weather: MITGCM, MM5, CICE
   Semiconductor device simulation
   Water reservoir simulation
Parameter tuning: sea ice model

- Simulated (yellow) and observed (green) March ice thickness (m)
Tuned parameters                 Standard parameters
Differentiated Toolkit: CVODES (nee SensPVODE)

diffusion equation
 100x100 structured grid
 16 Pentium III nodes
Sensitivity Analysis: mesoscale weather model
Conclusions & Future Work

 Automatic differentiation research involves a wide range of combinatorial
problems
 AD is a powerful tool for scientific computing
 Modern automatic differentiation tools are robust and produce efficient
code for complex simulation codes
– Robustness requires an industrial-strength compiler infrastructure
– Efficiency requires sophisticated compiler analysis
 Effective use of automatic differentiation depends on insight into problem
structure
 Future Work
– Further develop and test techniques for computing Jacobians that are
effectively sparse or effectively low rank
– Develop techniques to automatically generate complex and adaptive
checkpointing strategies

 Andreas Griewank, Evaluating Derivatives, SIAM, 2000.
 Griewank, ―On Automatic Differentiation‖; this and other technical reports
available online at: http://www.mcs.anl.gov/autodiff/tech_reports.html
 AD in general: http://www.mcs.anl.gov/autodiff/, http://www.autodiff.org/