10  Writing Performant Single-Threaded Code

Author

Alec Loudenback

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupéry, Airman’s Odyssey

10.1 Chapter Overview

Understand single-threaded performance by focusing on memory access, type stability, and branch prediction; profile and benchmark before reaching for parallelism; learn how to upgrade hot loops and data structures in Julia so that parallel versions start from a solid baseline.

10.2 Introduction

On today’s hardware, the highest-throughput workloads often run on GPUs via massive parallelism. However, writing correct, performant parallel code relies first on understanding efficient sequential patterns. Further, many problems are not “massively parallelizable” and require a sequential architecture. For these reasons, it’s critical to understand sequential patterns before moving on to parallel code.

For those coming from fundamentally slower languages (such as R or Python), the common advice for speeding things up is to parallelize code or use array-based techniques; for many high-level languages, parallelism is the only way to achieve reasonable runtimes. In contrast, fast languages like Julia can produce surprisingly quick single-threaded programs when you keep an eye on allocations, types, and cache behavior.

Further, if maximum performance is not required, a simpler, easier-to-maintain sequential model may be preferable to a more complex parallel one. In the spirit of the quote that opened this chapter, prefer the version with nothing left to take away.

Tip

Developer time (your time) is often more expensive than wall-clock runtime, so be prepared to accept a good-enough but sub-optimal implementation instead of spending days chasing microseconds that the business will never notice.

Note

This section of the book continues to focus on concepts that are more general than just Julia. We will elaborate on Julia-specific techniques in the section starting with Chapter 21.

10.3 Patterns of Performant Sequential Code

Sequential code performance depends on five key patterns:

  1. Efficient memory usage
  2. Optimized memory access
  3. Appropriate data types
  4. Type stability
  5. Branch prediction optimization

Understanding these patterns helps you write faster code by leveraging CPU architecture and Julia’s compiler optimizations. We’ll look at each of these in turn.

10.3.1 Minimize Memory Allocations

Allocating memory on the heap takes far more time than (1) avoiding intermediate storage entirely, or (2) using the stack. Each heap allocation costs time in the memory manager and creates work for the garbage collector, which can significantly impact performance, especially in tight loops or frequently called functions.

Note

Tight loops or hot loops are the performance-critical sections of code that are executed many times during a computation. They are often the innermost loop of a nested-loop algorithm.

A general rule of thumb is that dynamically sizable or mutable objects (arrays, mutable structs) will be heap allocated while small fixed-size objects can be stack allocated. For mutable objects, a common technique is to pre-allocate an array and then re-use that array for subsequent calculations. In the following stylized example we pre-allocate a shared output vector and accumulate bond-level cashflows into it, rather than constructing intermediate arrays and summing them later:

end_time = 10
cashflow_output = zeros(end_time)

par_bonds = map(1:1000) do i
    (tenor=rand((3, 5, 10)), rate=rand() / 10)
end

# Accumulate all of the bond cashflows into cashflow_output.
# (Stylized: each bond pays a coupon every period and principal at
# its tenor; post-maturity periods are not zeroed out, for brevity.)
for asset in par_bonds
    for t in 1:end_time
        if t == asset.tenor
            cashflow_output[t] += 1 + asset.rate  # principal + final coupon
        else
            cashflow_output[t] += asset.rate      # coupon payment
        end
    end
end
cashflow_output
10-element Vector{Float64}:
  50.26350997644971
  50.26350997644971
 344.26350997644994
  50.26350997644971
 400.2635099764497
  50.26350997644971
  50.26350997644971
  50.26350997644971
  50.26350997644971
 406.26350997644965

Julia’s @allocated macro returns the number of bytes allocated by an expression, helping you identify and eliminate unnecessary allocations.

random_sum_alloc() = sum([rand() for _ in 1:10])      # allocates
random_sum_noalloc() = sum(rand() for _ in 1:10)      # generator, no array

@allocated random_sum_alloc()
@allocated random_sum_noalloc()
0

10.3.2 Optimize Memory Access Patterns

Optimizing memory access patterns is essential for leveraging the CPU’s cache hierarchy effectively. Modern CPUs have multiple levels of cache (L1, L2, L3), each with different sizes and access speeds. By structuring your code to access memory in a cache-friendly manner, you can significantly reduce memory latency and improve overall performance. For finance this shows up when you value a matrix of policy projections or simulate thousands of Monte Carlo paths: the order in which you touch memory dominates runtime once the arithmetic becomes trivial.

What is cache-friendly memory access? Essentially it boils down to spatial and temporal locality.

10.3.2.1 Spatial Locality

Spatial locality refers to accessing data that is stored close together in memory (e.g., contiguous blocks of an array).

For example, it is better to access data in a linear order rather than random order. In actuarial projections we often sum along policy-by-policy arrays; keeping the loop aligned with the storage order avoids cache misses. The benchmark below highlights the gap:

using BenchmarkTools, Random

# Create a large array of structs to emphasize memory access patterns
struct DataPoint
    value::Float64
    # Add padding to make each element 64 bytes (a typical cache line size)
    padding::NTuple{7,Float64}
end

function create_large_array(n)
    [DataPoint(rand(), tuple(rand(7)...)) for _ in 1:n]
end

# Create a large array
const N = 1_000_000
large_array = create_large_array(N)

# Function for sequential access
function sequential_sum(arr)
    total = 0.0
    for i in eachindex(arr)
        total += arr[i].value
    end
    total
end

# Function for random access
function random_sum(arr, indices)
    total = 0.0
    for i in indices
        total += arr[i].value
    end
    total
end

# Create shuffled indices
shuffled_indices = shuffle(1:N)

# Benchmark
println("Sequential access:")
@btime sequential_sum($large_array)

println("\nRandom access:")
@btime random_sum($large_array, $shuffled_indices)
Sequential access:
  529.083 μs (0 allocations: 0 bytes)

Random access:
  2.341 ms (0 allocations: 0 bytes)
500217.8478410592

When the data is accessed in a linear order, the computer can load chunks of data into the cache and operate on that cached data for several cycles before new data needs to be loaded. In contrast, when the data is accessed randomly, the cache frequently needs to be repopulated with bits from a completely different part of the array.

10.3.2.1.1 Column vs Row Major Order

All multi-dimensional arrays in computer memory are actually stored linearly, so how to lay out a multi-dimensional array is an architectural decision made at the language level. Julia is column-major, like many performance-oriented languages and libraries (e.g., Fortran, MATLAB, LAPACK): values are stored going down the columns rather than across the rows.

For example, this 2D array would be stored as [1,2,3,...] in memory, which is made clear via vec (which turns a multi-dimensional array into a 1D vector):

let
    array = [
        1 4 7
        2 5 8
        3 6 9
    ]

    vec(array)
end
9-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
 7
 8
 9

When working with arrays, prefer accessing elements in column-major order (the default in Julia) to maximize spatial locality. This allows the CPU to prefetch data more effectively.

You can see how summing up values across the first (column) dimension is much faster than summing across rows, because column-major storage keeps column elements contiguous in memory:

@btime sum(arr, dims=1) setup = (arr = rand(1000, 1000))
@btime sum(arr, dims=2) setup = (arr = rand(1000, 1000))
  73.250 μs (3 allocations: 8.08 KiB)
  109.958 μs (3 allocations: 8.08 KiB)
1000×1 Matrix{Float64}:
 506.9509638329025
 507.15757023381997
 494.5905934248326
 499.08303602462314
 509.46463091668625
 502.5064217978588
 496.3025522836511
 501.99622607677884
 501.6834504230503
 482.6450714059836
   ⋮
 501.2391476370913
 496.6624443082166
 498.3487226467483
 504.22630364555044
 511.46346791677286
 498.71795014698375
 506.3927432812148
 496.95554299557244
 474.48814314642726

Note

In contrast to the prior example, a row-major memory layout would have the associated data stored in memory as:

[1,4,7,2,5,8,3,6,9]

The choice between row and column major reflects the historical development of scientific computing and mathematical conventions. Column-major originated in Fortran and LAPACK, cornerstones of high-performance computing and linear algebra. This aligns with mathematical notation where vectors are typically represented as matrix columns.

Row major languages include:

  • C/C++
  • Python (NumPy arrays)
  • C#
  • Java
  • JavaScript
  • Rust

Column major languages (and libraries):

  • Julia
  • MATLAB/Octave
  • R
  • Fortran
  • LAPACK/BLAS libraries

10.3.2.2 Temporal Locality

Hardware prefetchers (predictors) and the cache hierarchy exploit temporal locality by keeping recently used data “hot” and proactively fetching nearby data based on recent access patterns. This is an example of keeping “hot” data more readily accessible to the CPU than “cold” data.

When working sets exceed available RAM, the operating system may page memory to a “swap file” on disk. Recently accessed pages tend to remain in RAM, while “colder” pages are more likely to be swapped out.
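
To make temporal locality concrete, here is a minimal sketch (the function names three_passes! and one_pass! are illustrative). Applying three cheap operations in separate passes streams the whole array through the cache three times; a single fused pass does all the work on each element while it is still hot:

using BenchmarkTools

# Three separate passes: each pass streams the entire array through
# the cache, so data loaded by one pass is cold again by the next.
function three_passes!(x)
    x .= x .+ 1.0
    x .= x .* 2.0
    x .= x .- 3.0
    x
end

# One fused pass: all three operations happen while each element is
# hot in a register/cache line, touching main memory only once.
function one_pass!(x)
    for i in eachindex(x)
        x[i] = (x[i] + 1.0) * 2.0 - 3.0
    end
    x
end

x = rand(10_000_000)
@btime three_passes!(y) setup = (y = copy($x)) evals = 1
@btime one_pass!(y) setup = (y = copy($x)) evals = 1

On a typical machine the fused pass wins because it generates roughly a third of the memory traffic; benchmark on your own hardware to confirm.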

10.3.3 Use Efficient Data Types

The right data type can lead to more compact memory representations, better cache utilization, and more efficient CPU instructions. This is another case where a smaller memory footprint allows higher CPU utilization, since computation speed tends to be constrained by memory rather than by the CPU itself.

On some CPUs, you may gain performance by using the smallest data type that can accurately represent your data. For example, prefer Int32 over Int64 if your values will never exceed the 32-bit integer range. For floating-point numbers, use Float32 instead of Float64 if the reduced precision is acceptable for your calculations. These smaller types not only save memory but also allow for more efficient vectorized operations (see Chapter 11) on modern CPUs.
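
For instance, halving the element size doubles how many values fit in each cache line. A quick sketch (byte counts assume standard 64-bit Julia):

x64 = rand(Float64, 1_000_000)
x32 = Float32.(x64)   # convert, accepting the reduced precision

sizeof(x64)           # 8_000_000 bytes
sizeof(x32)           # 4_000_000 bytes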

Warning

When choosing data types, consider that on modern 64-bit architectures, using smaller integer or floating-point types does not always improve performance. In fact, many operations are optimized for 64-bit (or even larger SIMD) registers, and using smaller data types may introduce unnecessary overhead due to conversions or misalignment.

As a rough guideline, if your data naturally fits within 64 bits (e.g. Float64 or Int64), starting there is often the best balance of performance and simplicity. You can experiment with smaller types if you know your values never exceed certain ranges and memory footprint is critical. However, always benchmark to confirm any gains - simply using a smaller type does not guarantee improved throughput on modern CPUs.

For collections, choose appropriate container types based on your use case. Arrays are efficient for calculations that loop through all or most elements, while Dictionaries are better for sparse look-ups or outside of the “hot loop” portion of a computation.
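
As an illustrative sketch (the names monthly_rates and rate_overrides are hypothetical), dense per-period data suits a Vector while sparse exceptions suit a Dict:

# Dense data keyed by timestep 1..120: contiguous and cheap to scan in a loop
monthly_rates = rand(120)

# Sparse, irregular keys: a Dict avoids storing many mostly-empty slots
rate_overrides = Dict(36 => 0.02, 84 => 0.03)

# Use the override when present, otherwise fall back to the dense table
rate_at(t) = get(rate_overrides, t, monthly_rates[t])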

Consider using small, statically sized collections when the data is suited for it. Small, fixed-size arrays (such as those provided by StaticArrays.jl in Julia) can be stack-allocated and, in certain scenarios, outperform dynamically sized arrays. The trade-off is that static arrays require more up-front compile time, and beyond a certain length (roughly 50-100 elements) they are usually not worth using.
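
A minimal sketch with StaticArrays.jl (assuming the package is installed):

using StaticArrays, BenchmarkTools

v_static = @SVector rand(3)     # size known to the compiler; stack-eligible
v_dynamic = rand(3)             # dynamically sized; heap-allocated

@btime $v_static + $v_static    # typically reports zero allocations
@btime $v_dynamic + $v_dynamic  # allocates a new Vector for the result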

10.3.4 Avoid Type Instabilities

Type instability occurs when Julia’s compiler cannot infer a concrete result type for a variable or expression; this inhibits specialization and often triggers performance degradation through dynamic dispatch. A separate but related performance trap is using abstractly typed containers (e.g., Vector{Any}) in hot loops, which forces dynamic dispatch on each element even if the function’s overall return type is inferrable.

To avoid type instabilities, ensure that functions have inferrable, concrete types across all code paths. For example:


function unstable_example(array)
    x = []                  # (1)
    for y in array
        push!(x, y)
    end
    sum(x)
end

function stable_example(array)
    x = eltype(array)[]     # (2)
    for y in array
        push!(x, y)
    end
    sum(x)
end

data = rand(1000)
@btime unstable_example($data)
@btime stable_example($data)

(1) Without a type given, [] creates a Vector{Any}, which can hold elements of any type. That flexibility requires runtime dispatch to determine the correct behavior for each element.
(2) eltype(array) returns the type of the array’s elements, which for data is Float64. Thus x is created as a Vector{Float64}, which allows the compiler to generate more efficient code.
  11.583 μs (2008 allocations: 49.91 KiB)
  818.182 ns (10 allocations: 18.69 KiB)
490.29235967228567

The unstable_example function illustrates a common anti-pattern in which an Any-typed array is created and elements are then added to it. Because a Vector{Any} can hold values of any type (we happen to add only floats), Julia cannot know what types to expect inside the container and must determine them at runtime.

Note

Heterogeneous return types are not the same thing as type instability. A function whose return type is a small union of concrete types (e.g., Union{Float64,Int64}) is still inferrable: the compiler can enumerate the possibilities and optimize small unions. Even so, prefer concrete, predictable types in hot code.
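
For instance, in this small sketch, f has an inferrable two-type union return, while g keeps the return type concrete by matching the input:

f(x) = x > 0 ? x : 0         # Float64 input ⇒ Union{Float64, Int64}
g(x) = x > 0 ? x : zero(x)   # always returns the same type as x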

Tip

See Section 24.6.1 and Section 24.6.2 for tools in Julia to troubleshoot type instabilities.

10.3.5 Optimize for Branch Prediction

Modern CPUs use branch prediction to speculatively execute instructions before knowing the outcome of conditional statements. The prior chapter described this in the logistics warehouse analogy as trying to predict where a package is going before you’ve inspected the label. The CPU begins executing one branch before the relevant data (whether program data or upcoming instructions) has arrived, because it can execute faster than memory throughput allows. This is another example of a technique developed for performance in a memory-throughput-constrained world.

Optimizing your code for branch prediction can significantly improve performance, especially in tight loops or frequently executed code paths.

To optimize for branch prediction:

  1. Structure your code to make branching patterns more predictable. For instance, in if-else statements, put the more likely condition first. This allows the CPU to more accurately predict the branch outcome.

  2. Use loop unrolling to reduce the number of branches. This technique involves manually repeating loop body code to reduce the number of loop iterations and associated branch instructions. See Section 11.4 for more on what this means.

  3. Consider using Julia’s @inbounds macro to eliminate bounds checking in array operations when you’re certain the accesses are safe. This reduces the number of conditional checks the CPU needs to perform.

  4. For performance-critical sections with unpredictable branches, consider using branch-free algorithms or bitwise operations instead of conditional statements. This can help avoid the penalties associated with branch mispredictions.

  5. In some cases, it may be beneficial to replace branches with arithmetic operations (e.g., using the ternary operator or multiplication by boolean values) to allow for better vectorization and reduce the impact of branch mispredictions. An example of this would be using a statement like y += max(x,0) instead of if x > 0; y += x; end.
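
Here is a minimal sketch of that last idea (the function names are illustrative):

# Branchy: the CPU must predict the comparison for every element
function sum_positive_branchy(data)
    total = 0.0
    for x in data
        if x > 0
            total += x
        end
    end
    total
end

# Branch-free: max can compile to a conditional move or a vectorized op
function sum_positive_branchless(data)
    total = 0.0
    for x in data
        total += max(x, 0.0)
    end
    total
end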

Here’s an example demonstrating the impact of branch prediction:

function process_data(data, threshold)
    total = 0.0
    for x in data
        if x > threshold
            total += log(x)
        else
            total += x
        end
    end
    total
end

# Random data = unpredictable branches
rand_data = rand(1_000_000)

# Sorted data = predictable branches
sorted_data = sort(rand(1_000_000))

@btime process_data($rand_data, 0.5)
@btime process_data($sorted_data, 0.5);
  5.126 ms (0 allocations: 0 bytes)
  1.596 ms (0 allocations: 0 bytes)

In this example, sorting makes the branch outcome highly predictable for long stretches (first all below-threshold values, then all above), so the CPU learns which branch of the if statement to expect. By speculatively executing the code it expects to need, the CPU processes sorted_data substantially faster.

Remember that optimizing for branch prediction often involves trade-offs, and the benefits vary with the specific hardware and the nature of your data. If performance is critical, profile your code to ensure that your optimizations actually improve performance for your specific use case. Optimizations tuned on one set of hardware (e.g., your local computer) may not translate to another (e.g., a server deployment).

10.3.6 Key Takeaways

  • Measure first: use @btime, @allocated, and profiling tools to find hot loops before redesigning algorithms.
  • Favor cache-friendly access patterns; touching memory in the order it is stored often doubles throughput for actuarial grids or scenario cubes.
  • Keep containers concrete and preallocate buffers in tight loops to avoid hidden heap allocations and dynamic dispatch.
  • Start with Float64/Int64 unless memory pressure forces a smaller type; benchmark any change.
  • Make branches predictable when you can and sort or bucket data if it helps align sequential behavior.