16 Automatic Differentiation
One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. - Elbert Hubbard, Thousand and One Epigrams (1911)
16.1 Chapter Overview
Harnessing the chain rule to compute derivatives not just of simple functions, but of complex programs.
16.2 Motivation for (Automatic) Derivatives
Derivatives are one of the most useful analytical tools we have. Determining the rate of change with respect to an input is effectively sensitivity testing. Knowing the derivative lets you optimize things faster (see Chapter 17). You can test properties and implications (monotonicity, maxima/minima). These applications make derivatives foundational for machine learning, generative models, scientific computing, and more.
We will work up concepts on computing derivatives, from the most basic (finite differentiation) to automatic differentiation (forward mode and then reverse mode). These tools can be used in many modeling applications. Automatic Differentiation (“AD” or “autodiff”) refers to innovative techniques to get computers to perform differentiation of complex algorithms (code) that would be intractable for a human to perform manually.
16.3 Finite Differentiation
Finite differentiation is evaluating a function \(f\) at a value \(x_0\) and then at a nearby value \(x_0+\epsilon\). The line drawn through these two points approximates the tangent to \(f\) at \(x_0\); effectively, the derivative has been found by approximation. That is, we are looking to approximate the derivative using the property:
\[ f'(x_0) = \lim_{\epsilon \to 0} \frac{f(x_0 + \epsilon) - f(x_0)}{\epsilon} \]
We can approximate the result by simply choosing a small \(\epsilon\).
There are three common finite-difference formulas to be aware of:
| Method | Notes |
|---|---|
| Forward difference | \(\displaystyle \frac{f(x_0+\epsilon) - f(x_0)}{\epsilon}\), where the step is taken above \(x_0\). |
| Backward difference | \(\displaystyle \frac{f(x_0) - f(x_0-\epsilon)}{\epsilon}\), where the step is taken below \(x_0\). |
| Central difference | \(\displaystyle \frac{f(x_0+\epsilon) - f(x_0-\epsilon)}{2\epsilon}\), which averages the two one-sided slopes. |
The central difference often balances the information from both sides of the evaluation point and avoids some issues in regions where the derivative is changing rapidly. To compare the three, consider the estimates for \(f(x)=x^2\) around \(x_0 = 0\):
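Here is a quick sketch of the three formulas applied to that function (the wide step \(\epsilon = 0.5\) is an illustrative choice, used purely for demonstration):

```julia
f(x) = x^2
x0, ϵ = 0.0, 0.5 # an intentionally wide step for demonstration

forward  = (f(x0 + ϵ) - f(x0)) / ϵ      # 0.5
backward = (f(x0) - f(x0 - ϵ)) / ϵ      # -0.5
central  = (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ # 0.0, which is the exact derivative at x0 = 0
```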
Central differences usually beat forward or backward differences in accuracy, but the extra evaluation can matter when the underlying function is expensive. Take, for example, the process of optimizing a function to find a maximum or minimum.

Optimization algorithms usually involve guessing an initial point, evaluating the function at that point, and determining the derivative of the function at that point. Both pieces of information are used to update the guess to one that's closer to the solution; this approach is used in many optimization algorithms such as Newton's Method. With central differences, each step requires evaluating the function three times: at \(x\), \(x+\epsilon\), and \(x-\epsilon\). With forward (or backward) finite differences, you can often reuse the prior evaluation at \(x\) from the previous iteration, thereby saving an expensive computation.
There are additional challenges with the finite differentiation method. In practice, we are often interested in much more complex functions than \(x^2\). For example, we may be interested in the sum of a series that is many elements long or that contains more complex operations than basic algebra. In the prior example, \(\epsilon\) was set unusually wide for demonstration purposes. As \(\epsilon\) grows smaller, the accuracy of all three finite difference methods generally improves. However, this is not always the case: for sufficiently small \(\epsilon\), the subtraction \(f(x_0+\epsilon) - f(x_0)\) can lose significant digits due to floating-point round-off, corrupting the derivative estimate.
To demonstrate, here is a more complex example using the function
\[ f(x) = e^x \]
For this example we'll show the absolute error of the central-difference estimate calculated at different values of \(\epsilon\):
```julia
using CairoMakie # CairoMakie (or another Makie backend) provides Figure, Axis, and scatter!

f(x) = exp(x)
ϵ = 10 .^ (range(-16, stop=0, length=100))
x0 = 1
estimate = @. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ
actual = f(x0)
fig = Figure()
ax = Axis(fig[1, 1], xscale=log10, yscale=log10, xlabel="ϵ", ylabel="absolute error")
scatter!(ax, ϵ, abs.(estimate .- actual))
fig
```

Note that the derivative of \(f(x) = e^x\) is itself, that is \(f'(x) = f(x)\) in this special case, so `actual = f(x0)` is the exact derivative we compare against.
The `@.` in the code example above is a macro that applies broadcasting to each function call and operator to its right. `@. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ` is the same as `(f.(x0 .+ ϵ) .- f.(x0 .- ϵ)) ./ (2 .* ϵ)`.
A few observations:
- At virtually every value of `ϵ` we observe some error from the true derivative.
- That error is the sum of two parts:
  - Truncation error, which is inherent in using a given non-zero value for `ϵ` rather than the limiting analytic value as \(\epsilon \to 0\). The larger \(\epsilon\) is, the larger the truncation error.
  - Roundoff error, which arises due to the limited precision of floating point math.
The implication is that we often need to be careful about the choice of \(\epsilon\), as the optimal choice will vary depending on the function and the point at which we are evaluating. This presents a number of practical difficulties in various algorithms.
Additionally, when computing the finite difference we must evaluate the function multiple times to determine a single estimate of the derivative. When performing something like optimization, the process typically involves iteratively making many guesses requiring many evaluations of the approximate derivative. Further, the efficiency of the algorithm usually depends on the accuracy of computing the derivative!
Despite the accuracy limitations and computational overhead, finite differences can be very useful in many circumstances. However, a more appealing alternative approach is covered next.
16.4 Automatic Differentiation
Automatic differentiation is essentially the practice of defining algorithmically what the derivative of a function should be. We are able to do this through a creative application of the chain rule. Recall that the chain rule allows us to compute the derivative of a composite function using the derivatives of the component functions:
\[ h(x)=f(g(x)) \] \[ h'(x) = f'(g(x)) g'(x) \]
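For instance, with \(f = \sin\) and \(g(x) = x^2\), we have \(h(x) = \sin(x^2)\) and the chain rule gives \(h'(x) = \cos(x^2) \cdot 2x\).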
Using this rule, we can define how elementary operations behave when differentiated. Combined with the fact that most computer code is built up from a collection of elementary operations, we can get a very long way in differentiating complex functions.
16.4.1 Dual Numbers
To understand where we are going, let's remind ourselves about complex numbers. A complex number has a real part (\(r\)) and an imaginary part (\(iq\)):
\[ r+iq \]
By definition we say that \(i^2 = -1\). This is useful because it allows us to perform certain types of operations (e.g. finding the square root of a negative number) that are otherwise unsolvable with just the real numbers¹. After defining how the normal algebraic operations (addition, multiplication, etc.) work for the imaginary part, we are able to utilize imaginary numbers for a variety of practical mathematical tasks.
What do we mean by extending the algebraic operations to imaginary numbers? For example, we state how addition should work:
\[ (r+iq) + (s+iu) = (r+s) + i(q+u) \]
In a similar fashion to extending the real (\(\mathbb{R}\)) numbers with an imaginary part, for automatic differentiation we will extend them with a dual part. A dual number is one of the form:
\[ a + \epsilon b \]
Where \(\epsilon^2 = 0\) and \(\epsilon \neq 0\) by definition. While \(a\) represents the function value, \(b\) carries its derivative. An example should make this clearer. First let’s define a DualNumber:
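A minimal parametric definition consistent with the notes that follow:

```julia
struct DualNumber{A<:Real,B<:Real}
    a::A # the function value
    b::B # the derivative ("dual") part
end

# convenience constructor: default the dual part to zero of the same type as `a`
DualNumber(a) = DualNumber(a, zero(a))
```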
1. We define this type parametrically to handle all sorts of `<:Real` types and to allow `a` and `b` to have different types in case a mathematical operation causes a type change (e.g. integers becoming a floating point number, as in `10/4 == 2.5`).
2. In the constructor, we set the default value of `b` to be `zero(a)`. `zero(a)` is a generic way to create a value equal to zero with the same type as the argument `a`, e.g. `zero(12.0) == 0.0` and `zero(12) == 0`.
Now let’s define how dual numbers work under addition. The mathematical rule is:
\[ (a+\epsilon b)+(c+\epsilon d)=(a+c)+(b+d)\epsilon \]
We then need to define how it works for the combinations of numbers that we might receive as arguments to our function (this is an example where multiple dispatch greatly simplifies the code compared to object oriented single dispatch!):
```julia
Base.:+(d::DualNumber, e::DualNumber) = DualNumber(d.a + e.a, d.b + e.b)
Base.:+(d::DualNumber, x) = DualNumber(d.a + x, d.b)
Base.:+(x, d::DualNumber) = d + x
```

And here's how we would get the derivative of a very simple function:
```julia
f1(x) = 5 + x
f1(DualNumber(10, 1))
```

```
DualNumber{Int64, Int64}(15, 1)
```
That’s not super interesting though - the derivative of f1 is just 1 and we supplied that in the construction of DualNumber. We did at least prove that we can add the 10 and 5!
Let’s make this more interesting by also defining the multiplication operation on dual numbers. We’ll follow the product rule:
\[ (u v)' = u' v + u v' \]
```julia
Base.:*(d::DualNumber, e::DualNumber) = DualNumber(d.a * e.a, d.b * e.a + d.a * e.b)
Base.:*(x, d::DualNumber) = DualNumber(d.a * x, d.b * x)
Base.:*(d::DualNumber, x) = x * d
```

Now what if we evaluate this function:
```julia
f2(x) = 5 + 3x
f2(DualNumber(10, 1))
```

```
DualNumber{Int64, Int64}(35, 3)
```
We have found that the second component is 3, which is indeed the derivative of \(5+3x\) with respect to \(x\). And in the first part we have the value of f2 evaluated at 10.
When calculating the derivative, why do we start with 1 in the dual part of the number? Because the derivative of a variable with respect to itself is 1. From this unit starting point, each operation applied accumulates its contribution to the derivative in the \(b\) part of \(a + \epsilon b\).
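As a quick check, seeding the dual part with 0 instead treats the input as a constant, so no derivative accumulates:

```julia
f2(DualNumber(10, 0)) # DualNumber{Int64, Int64}(35, 0): same value, but zero derivative
```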
We can also define this for things like transcendental functions:
```julia
Base.exp(d::DualNumber) = DualNumber(exp(d.a), exp(d.a) * d.b)
Base.sin(d::DualNumber) = DualNumber(sin(d.a), cos(d.a) * d.b)
Base.cos(d::DualNumber) = DualNumber(cos(d.a), -sin(d.a) * d.b)

exp(DualNumber(1, 1)) # DualNumber{Float64, Float64}(2.718281828459045, 2.718281828459045)
sin(DualNumber(0, 1)) # DualNumber{Float64, Float64}(0.0, 1.0)
cos(DualNumber(0, 1)) # DualNumber{Float64, Float64}(1.0, -0.0)
```
And finally, to put it all together in a more usable wrapper, we can define a function which will calculate the derivative of another function at a certain point. This function applies f to an initialized DualNumber and then returns the \(b\) component from the result:
```julia
derivative(f, x) = f(DualNumber(x, one(x))).b
```

```
derivative (generic function with 1 method)
```
And then evaluating it on a more complex function like \(f(x) = 5e^{\sin(x)}+3x\) at \(x = 0\): analytically, \(f'(x) = 5e^{\sin(x)}\cos(x) + 3\), which equals \(8\) at \(x=0\) and matches what we calculate next:
```julia
f3(x) = 5 * exp(sin(x)) + 3x
derivative(f3, 0)
```

```
8.0
```
We have demonstrated that, through clever use of dual numbers and the chain rule, complex expressions can be automatically differentiated by a computer exactly, limited only by the same machine precision that applies to our primary function of interest.
The demonstration above re-implements dual numbers for pedagogical purposes. In production code you would typically rely on ForwardDiff.Dual (from ForwardDiff.jl), which already provides these algebraic rules along with extensive coverage of transcendental functions.
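For example, ForwardDiff.jl exposes a `derivative` function analogous to the one defined above:

```julia
import ForwardDiff

ForwardDiff.derivative(x -> 5 * exp(sin(x)) + 3x, 0.0) # 8.0
```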
Libraries exist (such as ChainRules.jl) which define large numbers of predefined rules for many more operations, even beyond basic algebraic functions. This allows complex programs to be differentiated automatically.
16.5 Performance of Automatic Differentiation
Recall that in the finite difference method, we generally had to evaluate the function two or three times to approximate the derivative. Here we have a single function call that provides both the value and the derivative at that value. How does this compare performance-wise to simply evaluating the function a single time? Let’s check how long it takes to compute a Float64 versus a DualNumber:
```julia
using BenchmarkTools

@btime f3(x) setup = (x = rand())
```

```
5.541 ns (0 allocations: 0 bytes)
9.3279708592136
```

```julia
@btime f3(DualNumber(x, 1)) setup = (x = rand())
```

```
9.843 ns (0 allocations: 0 bytes)
DualNumber{Float64, Float64}(10.706822324653386, 10.268448327779748)
```
In performing this computation, the compiler has been able to optimize it such that we are effectively able to compute the function and its derivative at less than two times the cost of the base function evaluation. As the function gets more complex the overhead does increase, but automatic differentiation is still generally preferred over finite differentiation. This advantage becomes more pronounced as we contemplate derivatives with respect to many variables at once or higher-order derivatives.
In fact, it's largely advances in the application of automatic differentiation that have led to the explosion of machine learning and artificial intelligence techniques in the 2010s/2020s. The "learning" process relies on solving for parameter weights and would be too computationally expensive if it used finite differences.
These applications of AD in specialized C++ libraries underpin libraries like PyTorch, TensorFlow, and Keras. Those libraries specialize in allowing AD on a limited subset of operations; Julia's available AD libraries are more general and can be applied to many more scenarios.
16.6 Automatic Differentiation in Practice
We have, of course, not defined an exhaustive list of operations, covering only `+`, `*`, `exp`, `sin`, and `cos`. There are only a few more arithmetic (`-`, `/`) and transcendental (`log`, more trigonometric functions, etc.) operations to define before we would have a very robust set of algebraic operations for our `DualNumber`. In fact, it's possible to go even further and define behavior for conditional expressions and iteration in order to differentiate fairly complex functions, or to extend the mechanism to partial derivatives and higher-order derivatives as well.
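For instance, subtraction and division rules might look like the following (a minimal sketch following the difference and quotient rules; only the dual-dual methods are shown, and mixed dual/real methods would follow the same pattern as for `+` and `*`):

```julia
# difference rule: (u - v)' = u' - v'
Base.:-(d::DualNumber, e::DualNumber) = DualNumber(d.a - e.a, d.b - e.b)

# quotient rule: (u / v)' = (u'v - uv') / v^2
Base.:/(d::DualNumber, e::DualNumber) = DualNumber(d.a / e.a, (d.b * e.a - d.a * e.b) / e.a^2)
```

With a broader set of rules in hand, let's turn to a more realistic example: pricing a European call option with the Black-Scholes formula and computing its sensitivities (the "Greeks") with ForwardDiff.jl: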
```julia
import Distributions
import ForwardDiff

N(x) = Distributions.cdf(Distributions.Normal(), x)

function d1(S, K, τ, r, σ, q)
    return (log(S / K) + (r - q + σ^2 / 2) * τ) / (σ * √(τ))
end

function d2(S, K, τ, r, σ, q)
    return d1(S, K, τ, r, σ, q) - σ * √(τ)
end

"""
    eurocall(parameters)

Calculate the Black-Scholes implied option price
for a european call where `parameters` is a vector
with the following six elements:

- `S` is the current asset price
- `K` is the strike or exercise price
- `τ` is the time remaining to maturity (can be typed with \\tau[tab])
- `r` is the continuously compounded risk free rate
- `σ` is the (implied) volatility (can be typed with \\sigma[tab])
- `q` is the continuously paid dividend rate
"""
function eurocall(parameters)
    S, K, τ, r, σ, q = parameters
    iszero(τ) && return max(zero(S), S - K)
    d₁ = d1(S, K, τ, r, σ, q)
    d₂ = d2(S, K, τ, r, σ, q)
    return S * exp(-q * τ) * N(d₁) - K * exp(-r * τ) * N(d₂)
end
```

1. We put the various variables inside a single `parameters` vector so that we can make a single `gradient` call instead of a separate `derivative` call for each parameter.
```julia
S = 1.0
K = 1.0
τ = 30 / 365
r = 0.05
σ = 0.2
q = 0.0

params = [S, K, τ, r, σ, q]
eurocall(params)
```

```
0.024933768194037365
```
Some terminology in differentiation:
- Scalar-valued function: A function whose output is a single scalar.
- Vector-valued (array-valued) function: A function whose output is a vector or array.
- Derivative (or partial derivative): The (instantaneous) rate of change of the output with respect to one input variable. In a multivariate context, these are partial derivatives, e.g., \(\frac{\partial}{\partial x} f(x,y,z)\).
- Gradient: For a scalar-valued function of several variables, the gradient is the vector of all first partial derivatives, e.g. \(\nabla f(x,y,z) = \bigl[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\bigr]^T\).
- Jacobian: For a vector-valued function, the Jacobian is the matrix of first partial derivatives. If \(f: \mathbb{R}^n \to \mathbb{R}^m\), its Jacobian is an \(m \times n\) matrix.
- Hessian: For a scalar-valued function of several variables, the Hessian is the matrix of all second partial derivatives (i.e., it’s an \(n \times n\) matrix).
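As a small illustration of the Jacobian terminology, here is a made-up two-input, two-output function:

```julia
g(v) = [v[1]^2, v[1] * v[2]] # g: ℝ² → ℝ², so the Jacobian is a 2×2 matrix

ForwardDiff.jacobian(g, [1.0, 2.0])
# 2×2 Matrix{Float64}:
#  2.0  0.0
#  2.0  1.0
```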
With the `eurocall` code above, we can now get the partial derivatives with respect to each parameter in a single call. These correspond to the common "Greeks" (detailed below the output); the second entry, the partial derivative with respect to the strike price, has no standard Greek name:
```julia
ForwardDiff.gradient(eurocall, params)
```

```
6-element Vector{Float64}:
  0.5399635456230847
 -0.5150297774290475
  0.16420676980838977
  0.04233121458320943
  0.11379886104405815
 -0.044380565393678295
```
- Entry 1: delta \(\partial C / \partial S\)
- Entry 2: sensitivity to strike \(\partial C / \partial K\) (no standard Greek)
- Entry 3: \(\partial C / \partial \tau\), which is the negative of the market convention “theta” (\(\Theta = \partial C / \partial t = -\partial C / \partial \tau\))
- Entry 4: rho \(\partial C / \partial r\)
- Entry 5: vega \(\partial C / \partial \sigma\)
- Entry 6: dividend rho (often denoted “phi”) \(\partial C / \partial q\)
We can also get the second-order Greeks with another call. This includes many uncommon second-order partial derivatives, but the popular gamma is in the `[1, 1]` position, for example:
```julia
ForwardDiff.hessian(eurocall, params)
```

```
6×6 Matrix{Float64}:
  6.92276    -6.92276    0.242297   0.568994   -0.0853491  -0.613375
 -6.92276     6.92276   -0.07809   -0.526663    0.199148    0.568994
  0.242297   -0.07809   -0.846846   0.521448    0.685306   -0.559878
  0.568994   -0.526663   0.521448   0.0432874  -0.0163683  -0.0467667
 -0.0853491   0.199148   0.685306  -0.0163683   0.00245525  0.007015
 -0.613375    0.568994  -0.559878  -0.0467667   0.007015    0.0504144
```
Tip from practice: desks that run daily risk charge calculations often rely on AD to refresh Greeks after every market data cut. A forward finite-difference grid of six inputs would require at least seven full revaluations of the pricing stack. With AD you obtain the same sensitivities (and, if needed, the Hessian) in a single pass, freeing up time for scenario analysis and stress testing.
16.6.1 Performance
Earlier we assessed the performance impact of computing derivatives with `DualNumber` on a very basic function. What about a more realistic example like `eurocall`? Based on the timings below, computing all of the first order derivatives is roughly 16x slower than a single function evaluation, which isn't bad considering we are computing six partial derivatives at once!
```julia
@btime eurocall($params)
```

```
19.433 ns (0 allocations: 0 bytes)
0.024933768194037365
```

```julia
let
    g = similar(params)
    @btime ForwardDiff.gradient!($g, eurocall, $params)
end
```

1. To avoid benchmarking the memory allocation for a new result array, we pre-allocate `g` and then call `gradient!` to fill it in with each result.

```
321.892 ns (3 allocations: 704 bytes)
6-element Vector{Float64}:
  0.5399635456230847
 -0.5150297774290475
  0.16420676980838977
  0.04233121458320943
  0.11379886104405815
 -0.044380565393678295
```
16.7 Forward Mode and Reverse Mode
The approach outlined above is forward-mode AD, where derivatives are propagated alongside values. Reverse-mode AD evaluates the function, records a computation graph, and then accumulates sensitivities backwards.
Reverse mode requires more book-keeping because the derivative needs to be carried backwards through the computation graph, rather than being propagated forward as with the DualNumber approach.
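To make the distinction concrete, here is a minimal, hand-written sketch of reverse mode for the earlier function \(f(x) = 5e^{\sin(x)} + 3x\); this is purely illustrative and not how reverse-mode libraries are implemented internally:

```julia
function f3_reverse(x)
    # forward sweep: evaluate the function, recording the intermediate values
    t1 = sin(x)
    t2 = exp(t1)
    y = 5t2 + 3x

    # reverse sweep: accumulate sensitivities from the output back to the input
    ȳ = 1.0                   # seed: ∂y/∂y
    t̄2 = 5 * ȳ                # y = 5t2 + 3x, so ∂y/∂t2 = 5
    t̄1 = exp(t1) * t̄2         # t2 = exp(t1), so ∂t2/∂t1 = exp(t1)
    x̄ = cos(x) * t̄1 + 3 * ȳ   # two paths into x: through sin(x) and through 3x
    return y, x̄
end

f3_reverse(0.0) # (5.0, 8.0): the value and the derivative, matching the forward-mode result
```

Note how the backward sweep needs the recorded primal values (`t1` and `x`), which is the extra book-keeping referred to above.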
16.8 Practical tips for Automatic Differentiation
Here are a few practical tips to keep in mind.
16.8.1 Choosing between Reverse Mode and Forward Mode
Forward mode is more efficient when the number of outputs is much larger than the number of inputs, while reverse mode will generally be more efficient when the number of inputs is much larger than the number of outputs. Examples of inputs outnumbering outputs include a statistical analysis where many features are used to predict a limited number of outcome variables, or a complex model with a lot of parameters.
- Use forward mode when the number of inputs is small relative to the number of outputs.
- Use reverse mode when the number of inputs is large and the number of outputs is small (e.g., loss functions).
16.8.2 Mutation
Auto-differentiation works through most code, but a particularly tricky part to get right is when values within arrays are mutated (changed). It's possible to do so, but it may require a bit more boilerplate to set up. As of 2024, Enzyme.jl has the best support for functions with mutation inside of them.
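As an illustration, this is the kind of in-place pattern (the function here is hypothetical) that some reverse-mode backends cannot differentiate directly, while Enzyme.jl can:

```julia
# weighted sum of squares, written to reuse a preallocated buffer `out`
function weighted_sumsq!(out, w, x)
    for i in eachindex(x)
        out[i] = w[i] * x[i]^2 # mutates `out` in place
    end
    return sum(out)
end
```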
16.8.3 Custom Rules
Custom rules for new or unusual functions can be defined, but this is an area best explored with a bit of calculus and a deeper understanding of both forward mode and reverse mode. ChainRules.jl provides an interface for defining additional rules that hook into the AD infrastructure in Julia, as well as good documentation on how to extend the rules for your own functions.
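For a flavor of what a custom rule looks like, here is a minimal sketch of a reverse-mode rule written against ChainRulesCore (the function `scaledlog` is made up for illustration):

```julia
using ChainRulesCore

scaledlog(x) = 2 * log(x) # a made-up function we want to supply a rule for

function ChainRulesCore.rrule(::typeof(scaledlog), x)
    y = scaledlog(x)
    # the pullback maps the output sensitivity ȳ back to input sensitivities;
    # NoTangent() is the tangent for the function object itself
    scaledlog_pullback(ȳ) = (NoTangent(), ȳ * 2 / x)
    return y, scaledlog_pullback
end
```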
16.8.4 Available Libraries in Julia
- ForwardDiff.jl provides robust forward-mode AD.
- Zygote.jl is a reverse-mode package with the innovation of being able to differentiate `struct`s in addition to arrays and scalars.
- Enzyme.jl is a newer package which allows for both forward and reverse mode and has the advantage of supporting array mutation. Additionally, Enzyme works at the level of LLVM code (an intermediate level between high-level Julia code and machine code), which allows for different, sometimes better, optimizations.
- DifferentiationInterface.jl is a wrapper library providing a consistent API while being able to exchange different backends.
In the authors' experience, DifferentiationInterface.jl is probably the best starting point, with a dive into specific libraries only if certain features are needed.
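For instance, a gradient through DifferentiationInterface might look like the following sketch, assuming the `gradient(f, backend, x)` calling convention with an ADTypes backend selector such as `AutoForwardDiff()`:

```julia
using DifferentiationInterface
import ForwardDiff # loads the backend that AutoForwardDiff() refers to

backend = AutoForwardDiff()
DifferentiationInterface.gradient(eurocall, backend, params) # same 6-element gradient as earlier
```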
16.9 References
1. Richard Feynman has a wonderful, short lecture on algebra here: https://www.feynmanlectures.caltech.edu/I_22.html