16 Automatic Differentiation
One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. - Elbert Hubbard, Thousand and One Epigrams (1911)
16.1 Chapter Overview
Harnessing the chain rule to compute derivatives not just of simple functions, but of complex programs.
16.2 Motivation for (Automatic) Derivatives
Derivatives are one of the most useful analytical tools we have. Determining the rate of change with respect to an input is effectively sensitivity testing. Knowing the derivative lets you optimize things faster (see Chapter 17). You can test properties and implications (monotonicity, maxima/minima). These applications make derivatives foundational for machine learning, generative models, scientific computing, and more.
We will work up concepts on computing derivatives, from the most basic (finite differentiation) to automatic differentiation (forward mode and then reverse mode). These tools can be used in many modeling applications. Automatic Differentiation (“AD” or “autodiff”) refers to innovative techniques to get computers to perform differentiation of complex algorithms (code) that would be intractable for a human to perform manually.
16.3 Finite Differentiation
Finite differentiation is evaluating a function \(f\) at a value \(x_0\) and then at a nearby value \(x_0+\epsilon\). The line drawn through these two points approximates the tangent to \(f\) at \(x_0\); effectively, the derivative has been found by approximation. That is, we are looking to approximate the derivative using the property:
\[ f'(x_0) = \lim_{\epsilon \to 0} \frac{f(x_0 + \epsilon) - f(x_0)}{\epsilon} \]
We can approximate the result by simply choosing a small \(\epsilon\).
There are three common finite-difference formulas to be aware of:
| Method | Notes |
|---|---|
| Forward difference | \(\displaystyle \frac{f(x_0+\epsilon) - f(x_0)}{\epsilon}\), where the step is taken above \(x_0\). |
| Backward difference | \(\displaystyle \frac{f(x_0) - f(x_0-\epsilon)}{\epsilon}\), where the step is taken below \(x_0\). |
| Central difference | \(\displaystyle \frac{f(x_0+\epsilon) - f(x_0-\epsilon)}{2\epsilon}\), which averages the two one-sided slopes. |
The central difference often balances the information from both sides of the evaluation point and avoids some issues in regions where the derivative is changing rapidly. To compare the three, consider the estimates for \(f(x)=x^2\) around \(x_0 = 0\):
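Here is a quick sketch of the three formulas applied to that function (the wide step \(\epsilon = 0.5\) is an illustrative choice, used purely for demonstration):

```julia
f(x) = x^2
x0, ϵ = 0.0, 0.5 # an intentionally wide step for demonstration

forward  = (f(x0 + ϵ) - f(x0)) / ϵ      # 0.5
backward = (f(x0) - f(x0 - ϵ)) / ϵ      # -0.5
central  = (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ # 0.0, which is the exact derivative at x0 = 0
```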
Central differences usually beat forward or backward differences in accuracy, but the extra evaluation can matter when the underlying function is expensive. Take, for example, the process of optimizing a function to find a maximum or minimum.

Optimization algorithms usually involve guessing an initial point, evaluating the function at that point, and determining the derivative of the function at that point. Both pieces of information are used to update the guess to one that's closer to the solution; this approach is used in many optimization algorithms such as Newton's Method. With central differences, each step requires evaluating the function three times: at \(x\), \(x+\epsilon\), and \(x-\epsilon\). With forward (or backward) finite differences, you can often reuse the prior evaluation at \(x\) from the previous iteration, thereby saving an expensive computation.
There are additional challenges with the finite differentiation method. In practice, we are often interested in much more complex functions than \(x^2\). For example, we may be interested in the sum of a series that is many elements long or that contains more complex operations than basic algebra. In the prior example, \(\epsilon\) was set unusually wide for demonstration purposes. As \(\epsilon\) grows smaller, the accuracy of all three finite difference methods generally improves. However, this is not always the case: for sufficiently small \(\epsilon\), the subtraction \(f(x_0+\epsilon) - f(x_0)\) can lose significant digits due to floating-point round-off, corrupting the derivative estimate.
To demonstrate, here is a more complex example using the function
\[ f(x) = e^x \]
For this example we'll show the absolute error of the central-difference estimate calculated at different values of \(\epsilon\):
```julia
using CairoMakie # CairoMakie (or another Makie backend) provides Figure, Axis, and scatter!

f(x) = exp(x)
ϵ = 10 .^ (range(-16, stop=0, length=100))
x0 = 1
estimate = @. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ
actual = f(x0)
fig = Figure()
ax = Axis(fig[1, 1], xscale=log10, yscale=log10, xlabel="ϵ", ylabel="absolute error")
scatter!(ax, ϵ, abs.(estimate .- actual))
fig
```

Note that the derivative of \(f(x) = e^x\) is itself, that is \(f'(x) = f(x)\) in this special case, so `actual = f(x0)` is the exact derivative we compare against.
The `@.` in the code example above is a macro that applies broadcasting to each function call and operator to its right. `@. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ` is the same as `(f.(x0 .+ ϵ) .- f.(x0 .- ϵ)) ./ (2 .* ϵ)`.
A few observations:
- At virtually every value of `ϵ` we observe some error from the true derivative.
- That error is the sum of two parts:
  - Truncation error, which is inherent in using a given non-zero value for `ϵ` rather than the limiting analytic value as \(\epsilon \to 0\). The larger \(\epsilon\) is, the larger the truncation error.
  - Roundoff error, which arises due to the limited precision of floating point math.
The implication is that we often need to be careful about the choice of \(\epsilon\), as the optimal choice will vary depending on the function and the point at which we are evaluating. This presents a number of practical difficulties in various algorithms.
Additionally, when computing the finite difference we must evaluate the function multiple times to determine a single estimate of the derivative. When performing something like optimization, the process typically involves iteratively making many guesses requiring many evaluations of the approximate derivative. Further, the efficiency of the algorithm usually depends on the accuracy of computing the derivative!
Despite the accuracy limitations and computational overhead, finite differences can be very useful in many circumstances. However, a more appealing alternative approach is covered next.
16.4 Automatic Differentiation
Automatic differentiation is essentially the practice of defining algorithmically what the derivative of a function should be. We are able to do this through a creative application of the chain rule. Recall that the chain rule allows us to compute the derivative of a composite function using the derivatives of the component functions:
\[ h(x)=f(g(x)) \] \[ h'(x) = f'(g(x)) g'(x) \]
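For instance, with \(f = \sin\) and \(g(x) = x^2\), we have \(h(x) = \sin(x^2)\) and the chain rule gives \(h'(x) = \cos(x^2) \cdot 2x\).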
Using this rule, we can define how elementary operations behave when differentiated. Combined with the fact that most computer code is built up from a collection of elementary operations, we can get a very long way in differentiating complex functions.
16.4.1 Dual Numbers
To understand where we are going, let's remind ourselves about complex numbers. A complex number has a real part (\(r\)) and an imaginary part (\(iq\)):
\[ r+iq \]
By definition we say that \(i^2 = -1\). This is useful because it allows us to perform certain types of operations (e.g. finding the square root of a negative number) that are otherwise unsolvable with just the real numbers¹. After defining how the normal algebraic operations (addition, multiplication, etc.) work for the imaginary part, we are able to utilize imaginary numbers for a variety of practical mathematical tasks.
What do we mean by extending the algebraic operations to imaginary numbers? For example, we state how addition should work:
\[ (r+iq) + (s+iu) = (r+s) + i(q+u) \]
In a similar fashion to extending the real (\(\mathbb{R}\)) numbers with an imaginary part, for automatic differentiation we will extend them with a dual part. A dual number is one of the form:
\[ a + \epsilon b \]
Where \(\epsilon^2 = 0\) and \(\epsilon \neq 0\) by definition. While \(a\) represents the function value, \(b\) carries its derivative. An example should make this clearer. First let’s define a DualNumber:
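A minimal parametric definition consistent with the notes that follow:

```julia
struct DualNumber{A<:Real,B<:Real}
    a::A # the function value
    b::B # the derivative ("dual") part
end

# convenience constructor: default the dual part to zero of the same type as `a`
DualNumber(a) = DualNumber(a, zero(a))
```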
1. We define this type parametrically to handle all sorts of `<:Real` types and to allow `a` and `b` to have different types in case a mathematical operation causes a type change (e.g. integers becoming a floating point number, as in `10/4 == 2.5`).
2. In the constructor, we set the default value of `b` to be `zero(a)`. `zero(a)` is a generic way to create a value equal to zero with the same type as the argument `a`, e.g. `zero(12.0) == 0.0` and `zero(12) == 0`.
Now let’s define how dual numbers work under addition. The mathematical rule is:
\[ (a+\epsilon b)+(c+\epsilon d)=(a+c)+(b+d)\epsilon \]
We then need to define how it works for the combinations of numbers that we might receive as arguments to our function (this is an example where multiple dispatch greatly simplifies the code compared to object oriented single dispatch!):
```julia
Base.:+(d::DualNumber, e::DualNumber) = DualNumber(d.a + e.a, d.b + e.b)
Base.:+(d::DualNumber, x) = DualNumber(d.a + x, d.b)
Base.:+(x, d::DualNumber) = d + x
```

And here's how we would get the derivative of a very simple function:
```julia
f1(x) = 5 + x
f1(DualNumber(10, 1))
```

```
DualNumber{Int64, Int64}(15, 1)
```
That’s not super interesting though - the derivative of f1 is just 1 and we supplied that in the construction of DualNumber. We did at least prove that we can add the 10 and 5!
Let’s make this more interesting by also defining the multiplication operation on dual numbers. We’ll follow the product rule:
\[ (u v)' = u' v + u v' \]
```julia
Base.:*(d::DualNumber, e::DualNumber) = DualNumber(d.a * e.a, d.b * e.a + d.a * e.b)
Base.:*(x, d::DualNumber) = DualNumber(d.a * x, d.b * x)
Base.:*(d::DualNumber, x) = x * d
```

Now what if we evaluate this function:
```julia
f2(x) = 5 + 3x
f2(DualNumber(10, 1))
```

```
DualNumber{Int64, Int64}(35, 3)
```
We have found that the second component is 3, which is indeed the derivative of \(5+3x\) with respect to \(x\). And in the first part we have the value of f2 evaluated at 10.
When calculating the derivative, why do we start with 1 in the dual part of the number? Because the derivative of a variable with respect to itself is 1. From this unit starting point, each operation applied accumulates its contribution to the derivative in the \(b\) part of \(a + \epsilon b\).
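As a quick check, seeding the dual part with 0 instead treats the input as a constant, so no derivative accumulates:

```julia
f2(DualNumber(10, 0)) # DualNumber{Int64, Int64}(35, 0): same value, but zero derivative
```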
We can also define this for things like transcendental functions:
```julia
Base.exp(d::DualNumber) = DualNumber(exp(d.a), exp(d.a) * d.b)
Base.sin(d::DualNumber) = DualNumber(sin(d.a), cos(d.a) * d.b)
Base.cos(d::DualNumber) = DualNumber(cos(d.a), -sin(d.a) * d.b)

exp(DualNumber(1, 1)) # DualNumber{Float64, Float64}(2.718281828459045, 2.718281828459045)
sin(DualNumber(0, 1)) # DualNumber{Float64, Float64}(0.0, 1.0)
cos(DualNumber(0, 1)) # DualNumber{Float64, Float64}(1.0, -0.0)
```
And finally, to put it all together in a more usable wrapper, we can define a function which will calculate the derivative of another function at a certain point. This function applies f to an initialized DualNumber and then returns the \(b\) component from the result:
```julia
derivative(f, x) = f(DualNumber(x, one(x))).b
```

```
derivative (generic function with 1 method)
```
And then evaluating it on a more complex function like \(f(x) = 5e^{\sin(x)}+3x\) at \(x = 0\): analytically, \(f'(x) = 5e^{\sin(x)}\cos(x) + 3\), which equals \(8\) at \(x=0\) and matches what we calculate next:
```julia
f3(x) = 5 * exp(sin(x)) + 3x
derivative(f3, 0)
```

```
8.0
```
We have demonstrated that, through clever use of dual numbers and the chain rule, complex expressions can be automatically differentiated by a computer exactly, limited only by the same machine precision that applies to our primary function of interest.
The demonstration above re-implements dual numbers for pedagogical purposes. In production code you would typically rely on ForwardDiff.Dual (from ForwardDiff.jl), which already provides these algebraic rules along with extensive coverage of transcendental functions.
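For example, ForwardDiff.jl exposes a `derivative` function analogous to the one defined above:

```julia
import ForwardDiff

ForwardDiff.derivative(x -> 5 * exp(sin(x)) + 3x, 0.0) # 8.0
```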
Libraries exist (such as ChainRules.jl) which define large numbers of predefined rules for many more operations, even beyond basic algebraic functions. This allows complex programs to be differentiated automatically.
16.5 Performance of Automatic Differentiation
Recall that in the finite difference method, we generally had to evaluate the function two or three times to approximate the derivative. Here we have a single function call that provides both the value and the derivative at that value. How does this compare performance-wise to simply evaluating the function a single time? Let’s check how long it takes to compute a Float64 versus a DualNumber:
```julia
using BenchmarkTools

@btime f3(x) setup = (x = rand())
```

```
5.541 ns (0 allocations: 0 bytes)
9.3279708592136
```

```julia
@btime f3(DualNumber(x, 1)) setup = (x = rand())
```

```
9.843 ns (0 allocations: 0 bytes)
DualNumber{Float64, Float64}(10.706822324653386, 10.268448327779748)
```
In performing this computation, the compiler has been able to optimize it such that we are effectively able to compute the function and its derivative at less than two times the cost of the base function evaluation. As the function gets more complex the overhead does increase, but automatic differentiation is still generally preferred over finite differentiation. This advantage becomes more pronounced as we contemplate derivatives with respect to many variables at once or higher-order derivatives.
In fact, it's largely advances in the application of automatic differentiation that have led to the explosion of machine learning and artificial intelligence techniques in the 2010s/2020s. The "learning" process relies on solving for parameter weights and would be too computationally expensive if it used finite differences.
These applications of AD in specialized C++ libraries underpin libraries like PyTorch, TensorFlow, and Keras. Those libraries specialize in allowing AD on a limited subset of operations; Julia's available AD libraries are more general and can be applied to many more scenarios.
16.6 Automatic Differentiation in Practice
We have, of course, not defined an exhaustive list of operations, covering only `+`, `*`, `exp`, `sin`, and `cos`. There are only a few more arithmetic (`-`, `/`) and transcendental (`log`, more trigonometric functions, etc.) operations to define before we would have a very robust set of algebraic operations for our `DualNumber`. In fact, it's possible to go even further and define behavior for conditional expressions and iteration in order to differentiate fairly complex functions, or to extend the mechanism to partial derivatives and higher-order derivatives as well.
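For instance, subtraction and division rules might look like the following (a minimal sketch following the difference and quotient rules; only the dual-dual methods are shown, and mixed dual/real methods would follow the same pattern as for `+` and `*`):

```julia
# difference rule: (u - v)' = u' - v'
Base.:-(d::DualNumber, e::DualNumber) = DualNumber(d.a - e.a, d.b - e.b)

# quotient rule: (u / v)' = (u'v - uv') / v^2
Base.:/(d::DualNumber, e::DualNumber) = DualNumber(d.a / e.a, (d.b * e.a - d.a * e.b) / e.a^2)
```

With a broader set of rules in hand, let's turn to a more realistic example: pricing a European call option with the Black-Scholes formula and computing its sensitivities (the "Greeks") with ForwardDiff.jl: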
```julia
import Distributions
import ForwardDiff

N(x) = Distributions.cdf(Distributions.Normal(), x)

function d1(S, K, τ, r, σ, q)
    return (log(S / K) + (r - q + σ^2 / 2) * τ) / (σ * √(τ))
end

function d2(S, K, τ, r, σ, q)
    return d1(S, K, τ, r, σ, q) - σ * √(τ)
end

"""
    eurocall(parameters)

Calculate the Black-Scholes implied option price
for a european call where `parameters` is a vector
with the following six elements:

- `S` is the current asset price
- `K` is the strike or exercise price
- `τ` is the time remaining to maturity (can be typed with \\tau[tab])
- `r` is the continuously compounded risk free rate
- `σ` is the (implied) volatility (can be typed with \\sigma[tab])
- `q` is the continuously paid dividend rate
"""
function eurocall(parameters)
    S, K, τ, r, σ, q = parameters
    iszero(τ) && return max(zero(S), S - K)
    d₁ = d1(S, K, τ, r, σ, q)
    d₂ = d2(S, K, τ, r, σ, q)
    return S * exp(-q * τ) * N(d₁) - K * exp(-r * τ) * N(d₂)
end
```

1. We put the various variables inside a single `parameters` vector so that we can make a single `gradient` call instead of a separate `derivative` call for each parameter.
```julia
S = 1.0
K = 1.0
τ = 30 / 365
r = 0.05
σ = 0.2
q = 0.0

params = [S, K, τ, r, σ, q]
eurocall(params)
```

```
0.024933768194037365
```
Some terminology in differentiation:
- Scalar-valued function: A function whose output is a single scalar.
- Vector-valued (array-valued) function: A function whose output is a vector or array.
- Derivative (or partial derivative): The (instantaneous) rate of change of the output with respect to one input variable. In a multivariate context, these are partial derivatives, e.g., \(\frac{\partial}{\partial x} f(x,y,z)\).
- Gradient: For a scalar-valued function of several variables, the gradient is the vector of all first partial derivatives, e.g. \(\nabla f(x,y,z) = \bigl[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\bigr]^T\).
- Jacobian: For a vector-valued function, the Jacobian is the matrix of first partial derivatives. If \(f: \mathbb{R}^n \to \mathbb{R}^m\), its Jacobian is an \(m \times n\) matrix.
- Hessian: For a scalar-valued function of several variables, the Hessian is the matrix of all second partial derivatives (i.e., it’s an \(n \times n\) matrix).
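As a small illustration of the Jacobian terminology, here is a made-up two-input, two-output function:

```julia
g(v) = [v[1]^2, v[1] * v[2]] # g: ℝ² → ℝ², so the Jacobian is a 2×2 matrix

ForwardDiff.jacobian(g, [1.0, 2.0])
# 2×2 Matrix{Float64}:
#  2.0  0.0
#  2.0  1.0
```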
With the `eurocall` code above, we can now get the partial derivatives with respect to each parameter in a single call. These correspond to the common "Greeks" (detailed below the output); the second entry, the partial derivative with respect to the strike price, has no standard Greek name:
```julia
ForwardDiff.gradient(eurocall, params)
```

```
6-element Vector{Float64}:
  0.5399635456230847
 -0.5150297774290475
  0.16420676980838977
  0.04233121458320943
  0.11379886104405815
 -0.044380565393678295
```
- Entry 1: delta \(\partial C / \partial S\)
- Entry 2: sensitivity to strike \(\partial C / \partial K\) (no standard Greek)
- Entry 3: \(\partial C / \partial \tau\), which is the negative of the market convention “theta” (\(\Theta = \partial C / \partial t = -\partial C / \partial \tau\))
- Entry 4: rho \(\partial C / \partial r\)
- Entry 5: vega \(\partial C / \partial \sigma\)
- Entry 6: dividend rho (often denoted “phi”) \(\partial C / \partial q\)
We can also get the second-order Greeks with another call. This includes many uncommon second-order partial derivatives, but the popular gamma is in the `[1, 1]` position, for example:
```julia
ForwardDiff.hessian(eurocall, params)
```

```
6×6 Matrix{Float64}:
  6.92276    -6.92276    0.242297   0.568994   -0.0853491  -0.613375
 -6.92276     6.92276   -0.07809   -0.526663    0.199148    0.568994
  0.242297   -0.07809   -0.846846   0.521448    0.685306   -0.559878
  0.568994   -0.526663   0.521448   0.0432874  -0.0163683  -0.0467667
 -0.0853491   0.199148   0.685306  -0.0163683   0.00245525  0.007015
 -0.613375    0.568994  -0.559878  -0.0467667   0.007015    0.0504144
```
Tip from practice: desks that run daily risk charge calculations often rely on AD to refresh Greeks after every market data cut. A forward finite-difference grid of six inputs would require at least seven full revaluations of the pricing stack. With AD you obtain the same sensitivities (and, if needed, the Hessian) in a single pass, freeing up time for scenario analysis and stress testing.
16.6.1 Performance
Earlier we assessed the performance impact of computing derivatives with `DualNumber` on a very basic function. What about a more realistic example like `eurocall`? Based on the timings below, computing all of the first order derivatives is roughly 16x slower than a single function evaluation, which isn't bad considering we are computing six partial derivatives at once!
```julia
@btime eurocall($params)
```

```
19.433 ns (0 allocations: 0 bytes)
0.024933768194037365
```

```julia
let
    g = similar(params)
    @btime ForwardDiff.gradient!($g, eurocall, $params)
end
```

1. To avoid benchmarking the memory allocation for a new result array, we pre-allocate `g` and then call `gradient!` to fill it in with each result.

```
321.892 ns (3 allocations: 704 bytes)
6-element Vector{Float64}:
  0.5399635456230847
 -0.5150297774290475
  0.16420676980838977
  0.04233121458320943
  0.11379886104405815
 -0.044380565393678295
```
16.7 Forward Mode and Reverse Mode
The approach outlined above is forward-mode AD, where derivatives are propagated alongside values. Reverse-mode AD evaluates the function, records a computation graph, and then accumulates sensitivities backwards.
Reverse mode requires more book-keeping because the derivative needs to be carried backwards through the computation graph, rather than being propagated forward as with the DualNumber approach.
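To make the distinction concrete, here is a minimal, hand-written sketch of reverse mode for the earlier function \(f(x) = 5e^{\sin(x)} + 3x\); this is purely illustrative and not how reverse-mode libraries are implemented internally:

```julia
function f3_reverse(x)
    # forward sweep: evaluate the function, recording the intermediate values
    t1 = sin(x)
    t2 = exp(t1)
    y = 5t2 + 3x

    # reverse sweep: accumulate sensitivities from the output back to the input
    ȳ = 1.0                   # seed: ∂y/∂y
    t̄2 = 5 * ȳ                # y = 5t2 + 3x, so ∂y/∂t2 = 5
    t̄1 = exp(t1) * t̄2         # t2 = exp(t1), so ∂t2/∂t1 = exp(t1)
    x̄ = cos(x) * t̄1 + 3 * ȳ   # two paths into x: through sin(x) and through 3x
    return y, x̄
end

f3_reverse(0.0) # (5.0, 8.0): the value and the derivative, matching the forward-mode result
```

Note how the backward sweep needs the recorded primal values (`t1` and `x`), which is the extra book-keeping referred to above.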
16.8 Practical tips for Automatic Differentiation
Here are a few practical tips to keep in mind.
16.8.1 Choosing between Reverse Mode and Forward Mode
Forward mode is more efficient when the number of outputs is much larger than the number of inputs, while reverse mode will generally be more efficient when the number of inputs is much larger than the number of outputs. Examples of inputs outnumbering outputs include a statistical analysis where many features are used to predict a limited number of outcome variables, or a complex model with a lot of parameters.
- Use forward mode when the number of inputs is small relative to the number of outputs.
- Use reverse mode when the number of inputs is large and the number of outputs is small (e.g., loss functions).
16.8.2 Mutation
Auto-differentiation works through most code, but a particularly tricky part to get right is when values within arrays are mutated (changed). It's possible to do so, but it may require a bit more boilerplate to set up. As of 2024, Enzyme.jl has the best support for functions with mutation inside of them.
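As an illustration, this is the kind of in-place pattern (the function here is hypothetical) that some reverse-mode backends cannot differentiate directly, while Enzyme.jl can:

```julia
# weighted sum of squares, written to reuse a preallocated buffer `out`
function weighted_sumsq!(out, w, x)
    for i in eachindex(x)
        out[i] = w[i] * x[i]^2 # mutates `out` in place
    end
    return sum(out)
end
```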
16.8.3 Custom Rules
Custom rules for new or unusual functions can be defined, but this is an area best explored with a bit of calculus and a deeper understanding of both forward mode and reverse mode. ChainRules.jl provides an interface for defining additional rules that hook into the AD infrastructure in Julia, as well as good documentation on how to extend the rules for your own functions.
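For a flavor of what a custom rule looks like, here is a minimal sketch of a reverse-mode rule written against ChainRulesCore (the function `scaledlog` is made up for illustration):

```julia
using ChainRulesCore

scaledlog(x) = 2 * log(x) # a made-up function we want to supply a rule for

function ChainRulesCore.rrule(::typeof(scaledlog), x)
    y = scaledlog(x)
    # the pullback maps the output sensitivity ȳ back to input sensitivities;
    # NoTangent() is the tangent for the function object itself
    scaledlog_pullback(ȳ) = (NoTangent(), ȳ * 2 / x)
    return y, scaledlog_pullback
end
```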
16.8.4 Available Libraries in Julia
- ForwardDiff.jl provides robust forward-mode AD.
- Zygote.jl is a reverse-mode package with the innovation of being able to differentiate `struct`s in addition to arrays and scalars.
- Enzyme.jl is a newer package which allows for both forward and reverse mode and has the advantage of supporting array mutation. Additionally, Enzyme works at the level of LLVM code (an intermediate level between high-level Julia code and machine code), which allows for different, sometimes better, optimizations.
- DifferentiationInterface.jl is a wrapper library providing a consistent API while being able to exchange different backends.
In the authors' experience, DifferentiationInterface.jl is probably the best starting point, with a dive into specific libraries only if certain features are needed.
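For instance, a gradient through DifferentiationInterface might look like the following sketch, assuming the `gradient(f, backend, x)` calling convention with an ADTypes backend selector such as `AutoForwardDiff()`:

```julia
using DifferentiationInterface
import ForwardDiff # loads the backend that AutoForwardDiff() refers to

backend = AutoForwardDiff()
DifferentiationInterface.gradient(eurocall, backend, params) # same 6-element gradient as earlier
```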
16.9 References
1. Richard Feynman has a wonderful, short lecture on algebra here: https://www.feynmanlectures.caltech.edu/I_22.html