10  Writing Performant Single-Threaded Code

Alec Loudenback

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupéry, Airman’s Odyssey

10.1 Chapter Overview

Understanding single-threaded performance, strategies for efficient sequential code, recognizing when parallelization is needed, practical comparisons of single-threaded and parallel approaches, optimizing code in Julia.

10.2 Introduction

With today’s hardware, the highest-throughput computations utilize GPUs for massive parallelization. However, writing parallel code, let alone performant parallel code, typically relies heavily on understanding patterns of non-parallel code. Further, many problems are not “massively parallelizable” and require a sequential architecture. For these reasons, it’s critical to understand sequential patterns before moving on to parallel code.

For those coming from fundamentally slower languages (such as R or Python), the common advice for speeding things up is to parallelize code or use array-based techniques. In many high-level languages, parallelism is the only way to achieve reasonable runtimes. In contrast, fast languages like Julia can produce surprisingly quick single-threaded programs.

Further, a simpler, easier-to-maintain sequential model may be preferable to a more complex parallel version if maximum performance is not required. As the quote that opened this chapter suggests, you may prefer a simpler sequential version of a model to a more complex parallel one.

Tip

Developer time (your time) is often more expensive than runtime, so be prepared to accept good-enough but sub-optimal code instead of spending a lot of time squeezing every last nanosecond out of it!

Note

This section of the book continues to focus on concepts that are more general than just Julia. We will elaborate on Julia-specific techniques in the section starting with Chapter 21.

10.3 Patterns of Performant Sequential Code

Sequential code performance depends on five key patterns:

  1. Efficient memory usage
  2. Optimized memory access
  3. Appropriate data types
  4. Type stability
  5. Branch prediction optimization

Understanding these patterns helps you write faster code by leveraging CPU architecture and Julia’s compiler optimizations. We’ll look at each of these in turn.

10.3.1 Minimize Memory Allocations

Allocating memory onto the Heap takes a lot more time than (1) not using intermediate memory storage at all, or (2) utilizing the Stack. Each allocation takes time for memory management and creates work for the garbage collector, which can significantly impact performance, especially in tight loops or frequently called functions.

Note

Tight loops or hot loops are the performance-critical sections of code that are executed many times during a computation. They are often the innermost loop of a nested-loop algorithm.

A general rule of thumb is that dynamically sizable or mutable objects (arrays, mutable structs) will be heap-allocated, while small, fixed-size objects can be stack-allocated. For mutable objects, a common technique is to pre-allocate an array and then re-use that array for subsequent calculations. In the following example, note how we pre-allocate the output vector instead of creating vectors for each bond and then summing the vectors together at the end:

end_time = 10
cashflow_output = zeros(end_time)  # pre-allocated once, re-used for every bond

# 1000 hypothetical par bonds with random tenors and coupon rates
par_bonds = map(1:1000) do i
    (tenor=rand((3, 5, 10)), rate=rand() / 10)
end

for asset in par_bonds
    for t in 1:end_time
        if t == asset.tenor
            cashflow_output[t] += 1 + asset.rate  # principal repayment plus coupon
        else
            cashflow_output[t] += asset.rate      # coupon payment
        end
    end
end
cashflow_output
10-element Vector{Float64}:
  51.06221758555309
  51.06221758555309
 371.062217585553
  51.06221758555309
 395.06221758555307
  51.06221758555309
  51.06221758555309
  51.06221758555309
  51.06221758555309
 387.06221758555296

Julia’s @allocated macro will display the number of bytes allocated by an expression, helping you identify and eliminate unnecessary allocations.

random_sum() = sum([rand() for _ in 1:10])
@allocated random_sum()
144
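If the intermediate array is not actually needed, the allocation can be avoided entirely by summing over a generator instead. A minimal sketch (the helper name is illustrative; exact byte counts can vary by Julia version):

random_sum_lazy() = sum(rand() for _ in 1:10)  # generator: no intermediate array is materialized

random_sum_lazy()             # call once so compilation is not included in the measurement
@allocated random_sum_lazy()  # typically reports 0 bytes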

10.3.2 Optimize Memory Access Patterns

Optimizing memory access patterns is essential for leveraging the CPU’s cache hierarchy effectively. Modern CPUs have multiple levels of cache (L1, L2, L3), each with different sizes and access speeds. By structuring your code to access memory in a cache-friendly manner, you can significantly reduce memory latency and improve overall performance.

What is cache-friendly memory access? Essentially it boils down to spatial and temporal locality.

10.3.2.1 Spatial Locality

Spatial locality refers to accessing data that is physically near each other in memory (e.g., contiguous blocks of data in an array).

In practice, this means it is better to access data in linear order rather than random order. For example, summing the elements of an array in order is significantly faster than summing them in a random order:

using BenchmarkTools, Random

# Create a large array of structs to emphasize memory access patterns
struct DataPoint
    value::Float64
    # Add padding to make each element 64 bytes (a typical cache line size)
    padding::NTuple{7,Float64}
end

function create_large_array(n)
    [DataPoint(rand(), tuple(rand(7)...)) for _ in 1:n]
end

# Create a large array
const N = 1_000_000
large_array = create_large_array(N)

# Function for sequential access
function sequential_sum(arr)
    sum = 0.0
    for i in eachindex(arr)
        sum += arr[i].value
    end
    sum
end

# Function for random access
function random_sum(arr, indices)
    sum = 0.0
    for i in indices
        sum += arr[i].value
    end
    sum
end

# Create shuffled indices
shuffled_indices = shuffle(1:N)

# Benchmark
println("Sequential access:")
@btime sequential_sum($large_array)

println("\nRandom access:")
@btime random_sum($large_array, $shuffled_indices)
Sequential access:
  500.834 μs (0 allocations: 0 bytes)

Random access:
  2.173 ms (0 allocations: 0 bytes)
499628.98428300093

When the data is accessed in linear order, the computer can load chunks of data into the cache and operate on that cached data for several cycles before new data needs to be loaded. In contrast, when accessing the data randomly, the cache must frequently be refilled with data from a completely different part of the array.

10.3.2.1.1 Column vs Row Major Order

All multi-dimensional arrays in computer memory are actually stored linearly, so an architectural decision must be made at the language level about how to lay them out. Julia is column-major, like many performance-oriented languages and libraries (e.g., Fortran, MATLAB, LAPACK): values are stored going down the columns rather than across the rows.

For example, this 2D array would be stored as [1,2,3,...] in memory, which is made clear via vec (which turns a multi-dimensional array into a 1D vector):

let
    array = [
        1 4 7
        2 5 8
        3 6 9
    ]

    vec(array)
end
9-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
 7
 8
 9

When working with arrays, prefer accessing elements in column-major order (the default in Julia) to maximize spatial locality. This allows the CPU to prefetch data more effectively.

You can see how summing up values across the first (column) dimension is much faster than summing across rows:

@btime sum(arr, dims=1) setup = arr = rand(1000, 1000)
@btime sum(arr, dims=2) setup = arr = rand(1000, 1000)
  71.666 μs (3 allocations: 8.08 KiB)
  110.125 μs (3 allocations: 8.08 KiB)
1000×1 Matrix{Float64}:
 495.4868492595919
 493.7948581628845
 509.4405501416571
 511.60817829407677
 496.10515330422044
 511.1232758906476
 513.1177880122705
 507.0352269740659
 486.3896533816539
 503.36584818860285
   ⋮
 488.84489302719885
 504.3273082808261
 512.2731022153907
 495.84752488514073
 508.4534243443502
 501.301653497301
 488.94990880743325
 504.08967020209184
 501.14794126420543
Note

In contrast to the prior example, a row-major memory layout would have the associated data stored in memory as:

[1,4,7,2,5,8,3,6,9]

The choice between row and column major reflects the historical development of scientific computing and mathematical conventions. Column-major originated in Fortran and LAPACK, cornerstones of high-performance computing and linear algebra. This aligns with mathematical notation where vectors are typically represented as matrix columns.

Row major languages include:

  • C/C++
  • Python (NumPy arrays)
  • C#
  • Java
  • JavaScript
  • Rust

Column major languages:

  • Julia
  • MATLAB/Octave
  • R
  • Fortran
  • LAPACK/BLAS libraries
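When writing your own nested loops over matrices, the same principle applies: keep the first (row) index in the innermost loop so that consecutive iterations touch adjacent memory. A minimal sketch (the function name is illustrative):

function col_major_sum(A)
    s = zero(eltype(A))
    for j in axes(A, 2)      # columns in the outer loop
        for i in axes(A, 1)  # rows in the inner loop: A[i, j] and A[i+1, j] are adjacent in memory
            s += A[i, j]
        end
    end
    s
end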

10.3.2.2 Temporal Locality

Temporal locality refers to accessing the same data repeatedly within a short window of time. The CPU keeps recently used (“hot”) data more readily accessible than “cold” data, and the hardware prefetcher will recognize access patterns and fetch relevant blocks of data ahead of time.

Sometimes this happens at the operating system level: if you load a dataset that exceeds the available RAM, some of the “active” memory will actually be kept in a space on the persistent disk (e.g., your SSD) to avoid the computer crashing from running out of memory. Segments of the data that have been accessed recently will be in RAM, while sections not recently accessed are likely to be in the portion stored on the persistent disk.
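One way to exploit temporal locality in your own code is to fuse multiple passes over the same data into a single pass, so that each element is reused while it is still in cache. A sketch under the assumption that the data is much larger than the cache (function names are illustrative):

# Two separate passes: by the time the second loop runs,
# the early elements may have been evicted from cache
function two_passes(data)
    total = 0.0
    for x in data
        total += x
    end
    sumsq = 0.0
    for x in data
        sumsq += x^2
    end
    return total, sumsq
end

# One fused pass: each element is reused immediately, while still "hot"
function one_pass(data)
    total = 0.0
    sumsq = 0.0
    for x in data
        total += x
        sumsq += x^2
    end
    return total, sumsq
end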

10.3.3 Use Efficient Data Types

The right data type can lead to more compact memory representations, better cache utilization, and more efficient CPU instructions. This is another case where a smaller memory footprint allows for higher utilization of the CPU, since modern computers tend to be constrained by memory bandwidth more than by compute speed.

On some CPUs, you may find better performance by using the smallest data type that can accurately represent your data. For example, prefer Int32 over Int64 if your values will never exceed the 32-bit integer range. For floating-point numbers, use Float32 instead of Float64 if the reduced precision is acceptable for your calculations. These smaller types not only save memory but also allow for more efficient vectorized operations (see Chapter 11) on modern CPUs.

Warning

When choosing data types, consider that on modern 64-bit architectures, using smaller integer or floating-point types doesn’t always improve performance. In fact, many operations are optimized for 64-bit (or even larger SIMD) registers, and using smaller data types may introduce unnecessary overhead due to conversions or misalignment.

As a rough guideline, if your data naturally fits within 64 bits (e.g. Float64 or Int64), starting there is often the best balance of performance and simplicity. You can experiment with smaller types if you know your values never exceed certain ranges and memory footprint is critical. However, always benchmark to confirm any gains—simply using a smaller type does not guarantee improved throughput on modern CPUs.
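To see the memory-footprint side of this trade-off directly, you can compare the bytes consumed by the same data at different precisions. A minimal sketch (sizes are approximate; exact numbers include a small array header):

a64 = rand(Float64, 1_000_000)
a32 = Float32.(a64)  # single-precision copy of the same values

Base.summarysize(a64)  # ≈ 8_000_000 bytes (8 per element)
Base.summarysize(a32)  # ≈ 4_000_000 bytes (4 per element), so twice as many values fit in cache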

For collections, choose appropriate container types based on your use case. Arrays are efficient for calculations that loop through all or most elements, while Dictionaries are better for sparse look-ups or outside of the “hot loop” portion of a computation.
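As a rough illustration of that trade-off, compare a dense sum over an array with the same sum performed through dictionary lookups; the Dict pays hashing overhead on every access (timings are illustrative and will vary):

using BenchmarkTools

vals = rand(1_000)
lookup = Dict(i => v for (i, v) in enumerate(vals))

@btime sum($vals)                       # contiguous, cache-friendly loop
@btime sum($lookup[i] for i in 1:1_000) # one hash lookup per element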

Consider using small, statically sized collections when the data is suited for it. Small, fixed-size arrays (such as those provided by StaticArrays.jl in Julia) can be allocated on the stack and can outperform dynamically sized arrays in certain scenarios. The trade-off is that static arrays require more up-front compile time, and beyond a certain length (roughly 50-100 elements) they usually aren’t worth using.
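For instance, a small fixed-size vector from StaticArrays.jl (assuming the package is installed) avoids heap allocation where the equivalent Vector does not:

using StaticArrays, BenchmarkTools

v_static = @SVector rand(3)  # length is part of the type, so it is eligible for the stack
v_dynamic = rand(3)          # heap-allocated, dynamically sized

@btime $v_static + $v_static    # no allocations
@btime $v_dynamic + $v_dynamic  # allocates a new Vector for the result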

10.3.4 Avoid Type Instabilities

Type instabilities occur when the compiler cannot infer a single concrete type for a variable or function return value. These instabilities can significantly hinder Julia’s ability to generate optimized machine code, leading to performance degradation. When the compiler can infer the types at compile time (compile-time dispatch), Julia creates efficient machine code that points directly to the desired function. When it cannot, a lookup must be performed while the program runs to find the most appropriate method for the given types (runtime dispatch).

To avoid type instabilities, ensure that functions have inferrable, concrete types across all code paths. For example:


function unstable_example(array)
    x = []                 # (1)
    for y in array
        push!(x, y)
    end
    sum(x)
end

function stable_example(array)
    x = eltype(array)[]    # (2)
    for y in array
        push!(x, y)
    end
    sum(x)
end

data = rand(1000)
@btime unstable_example(data)
@btime stable_example(data)
(1) Without a type given, [] will create a Vector{Any}, which can contain elements of any type. This is flexible but requires runtime dispatch to determine the correct behavior.
(2) eltype(array) returns the type contained within the array, which for data is Float64. Thus x is created as a Vector{Float64}, which leads to more efficient code.
  10.042 μs (2007 allocations: 53.73 KiB)
  804.462 ns (9 allocations: 22.52 KiB)
501.8372838449958

The unstable_example function illustrates a common anti-pattern wherein an Any-typed array is created and then elements are added to it. Because values of any type could be added to such an array (we happen to add only floats), Julia cannot be sure what types to expect inside the container and must determine them at runtime.

Note that having heterogeneous types as above is not the same thing as type instability, which occurs when Julia cannot determine in advance what the data types will be. A function whose return type is inferred as a small union such as Union{Float64,Int64} is not unstable: even though two different types can be returned, the compiler recognizes both possibilities in advance and can generate efficient code for them. It is when the types cannot be determined by the compiler that runtime dispatch occurs.

Tip

See Section 24.6.1 and Section 24.6.2 for tools in Julia to troubleshoot type instabilities.
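As a quick preview of those tools, the @code_warntype macro prints the compiler’s inferred types for a call, flagging non-concrete ones (such as Any). A minimal usage sketch:

using InteractiveUtils  # provides @code_warntype outside the REPL

@code_warntype unstable_example(data)  # look for Any or highlighted non-concrete types in the output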

10.3.5 Optimize for Branch Prediction

Modern CPUs use branch prediction to speculatively execute instructions before knowing the outcome of conditional statements. The prior chapter described this in the logistics warehouse analogy as predicting where a package is going before you have inspected its label. The CPU begins executing one branch of instructions before the data (whether actual values or further instructions) has arrived, because the CPU can process work faster than memory throughput can supply it. This is another example of a technique developed for performance in a memory-throughput-constrained world.

Optimizing your code for branch prediction can significantly improve performance, especially in tight loops or frequently executed code paths.

To optimize for branch prediction:

  1. Structure your code to make branching patterns more predictable. For instance, in if-else statements, put the more likely condition first. This allows the CPU to more accurately predict the branch outcome.

  2. Use loop unrolling to reduce the number of branches. This technique involves manually repeating loop body code to reduce the number of loop iterations and associated branch instructions. See Section 11.4 for more on what this means.

  3. Consider using Julia’s @inbounds macro to eliminate bounds checking in array operations when you’re certain the accesses are safe. This reduces the number of conditional checks the CPU needs to perform.

  4. For performance-critical sections with unpredictable branches, consider using branch-free algorithms or bitwise operations instead of conditional statements. This can help avoid the penalties associated with branch mispredictions.

  5. In some cases, it may be beneficial to replace branches with arithmetic operations (e.g., using the ternary operator or multiplication by boolean values) to allow for better vectorization and reduce the impact of branch mispredictions. An example would be using a statement like y += max(x, 0) instead of if x > 0; y += x; end, as sketched below.
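A minimal sketch of that last technique (function names are illustrative); both compute the sum of positive elements, but the second avoids a conditional jump inside the loop:

function sum_positive_branchy(data)
    s = 0.0
    for x in data
        if x > 0
            s += x
        end
    end
    s
end

function sum_positive_branchless(data)
    s = 0.0
    for x in data
        s += max(x, 0.0)  # arithmetic instead of a branch; also friendlier to SIMD
    end
    s
end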

Here’s an example demonstrating the impact of branch prediction:

function process_data(data, threshold)
    sum = 0.0
    for x in data
        if x > threshold
            sum += log(x)
        else
            sum += x
        end
    end
    sum
end

# Random data = unpredictable branches
rand_data = rand(1_000_000)

# Sorted data = predictable branches
sorted_data = sort(rand(1_000_000))

@btime process_data($rand_data, 0.5)
@btime process_data($sorted_data, 0.5);
  4.714 ms (0 allocations: 0 bytes)
  1.598 ms (0 allocations: 0 bytes)

In this example, sorted data means the CPU can reliably predict which branch of the if statement will be taken. By speculatively executing the code it expects to need, the program processes sorted_data faster.

Remember that optimizing for branch prediction often involves trade-offs, and the benefits vary with the specific hardware and the nature of your data. If performance is critical, profile your code to ensure that your optimizations actually improve performance in your specific use case. Optimizations tuned to one set of hardware (e.g., your local computer) may not translate to another (e.g., a server deployment).

10.3.6 Further Reading