Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupéry, Airman’s Odyssey
10.1 Chapter Overview
Understand single-threaded performance by focusing on memory access, type stability, and branch prediction; profile and benchmark before reaching for parallelism; learn how to upgrade hot loops and data structures in Julia so that parallel versions start from a solid baseline.
10.2 Introduction
On today’s hardware, the highest-throughput workloads often run on GPUs via massive parallelism. However, writing correct, performant parallel code relies first on understanding efficient sequential patterns. Moreover, many problems are not “massively parallelizable” and require a fundamentally sequential design. For these reasons, it’s critical to understand sequential patterns before moving on to parallel code.
For those coming from fundamentally slower languages (such as R or Python), the common advice for speeding things up is to parallelize code or use array-based (vectorized) techniques. With many high-level languages, that is often the only way to achieve reasonable runtime. In contrast, fast languages like Julia can produce surprisingly quick single-threaded programs when you keep an eye on allocations, types, and cache behavior.
Further, in the spirit of the quote that opened this chapter, a simpler, easier-to-maintain sequential model may be preferable to a more complex parallel version when maximum performance is not required.
Tip
Developer time (your time) is often more expensive than wall-clock runtime, so be prepared to accept a good-enough but sub-optimal implementation instead of spending days chasing microseconds that the business will never notice.
Note
This section of the book continues to focus on concepts that are more general than just Julia. We will elaborate on Julia-specific techniques in the section starting with Chapter 21.
10.3 Patterns of Performant Sequential Code
Sequential code performance depends on five key patterns:
Efficient memory usage
Optimized memory access
Appropriate data types
Type stability
Branch prediction optimization
Understanding these patterns helps you write faster code by leveraging CPU architecture and Julia’s compiler optimizations. We’ll look at each of these in turn.
10.3.1 Minimize Memory Allocations
Allocating memory on the heap takes much more time than (1) not using intermediate memory storage at all, or (2) using the stack. Each heap allocation requires bookkeeping by the memory manager and creates work for the garbage collector, which can significantly impact performance, especially in tight loops or frequently called functions.
Note
Tight loops or hot loops are the performance-critical sections of code that are executed many times during a computation. They are often the innermost loop of a nested-loop algorithm.
A general rule of thumb is that dynamically sizable or mutable objects (arrays, mutable structs) will be heap-allocated, while small fixed-size objects can be stack-allocated. For mutable objects, a common technique is to pre-allocate an array and then reuse that array for subsequent calculations. In the following stylized example we accumulate bond-level cashflows into a single pre-allocated vector rather than constructing intermediate arrays and summing them later:
```julia
end_time = 10
cashflow_output = zeros(end_time)
par_bonds = map(1:1000) do i
    (tenor=rand((3, 5, 10)), rate=rand() / 10)
end

# sum up all of the bond cashflows into cashflow_output
for asset in par_bonds
    for t in 1:end_time
        if t == asset.tenor
            cashflow_output[t] += 1 + asset.rate
        else
            cashflow_output[t] += asset.rate
        end
    end
end
cashflow_output
```
Julia’s @allocated macro will display the number of bytes allocated by an expression, helping you identify and eliminate unnecessary allocations.
```julia
random_sum_alloc() = sum([rand() for _ in 1:10])  # allocates an intermediate array
random_sum_noalloc() = sum(rand() for _ in 1:10)  # generator, no array

@allocated random_sum_alloc()
@allocated random_sum_noalloc()
```
0
10.3.2 Optimize Memory Access Patterns
Optimizing memory access patterns is essential for leveraging the CPU’s cache hierarchy effectively. Modern CPUs have multiple levels of cache (L1, L2, L3), each with different sizes and access speeds. By structuring your code to access memory in a cache-friendly manner, you can significantly reduce memory latency and improve overall performance. For finance this shows up when you value a matrix of policy projections or simulate thousands of Monte Carlo paths: the order in which you touch memory dominates runtime once the arithmetic becomes trivial.
What is cache-friendly memory access? Essentially it boils down to spatial and temporal locality.
10.3.2.1 Spatial Locality
Spatial locality refers to accessing data that is physically near each other in memory (e.g., contiguous blocks of data in an array).
For example, it is better to access data in a linear order rather than random order. In actuarial projections we often sum along policy-by-policy arrays; keeping the loop aligned with the storage order avoids cache misses. The benchmark below highlights the gap:
```julia
using BenchmarkTools, Random

# Create a large array of structs to emphasize memory access patterns
struct DataPoint
    value::Float64
    # Add padding to make each element 64 bytes (a typical cache line size)
    padding::NTuple{7,Float64}
end

function create_large_array(n)
    [DataPoint(rand(), tuple(rand(7)...)) for _ in 1:n]
end

# Create a large array
const N = 1_000_000
large_array = create_large_array(N)

# Function for sequential access
function sequential_sum(arr)
    total = 0.0
    for i in eachindex(arr)
        total += arr[i].value
    end
    total
end

# Function for random access
function random_sum(arr, indices)
    total = 0.0
    for i in indices
        total += arr[i].value
    end
    total
end

# Create shuffled indices
shuffled_indices = shuffle(1:N)

# Benchmark
println("Sequential access:")
@btime sequential_sum($large_array)

println("\nRandom access:")
@btime random_sum($large_array, $shuffled_indices)
```
Sequential access:
529.083 μs (0 allocations: 0 bytes)
Random access:
2.341 ms (0 allocations: 0 bytes)
500217.8478410592
When data is accessed in linear order, the computer can load chunks of it into the cache and operate on that cached data for many cycles before new data needs to be loaded. In contrast, when data is accessed randomly, the cache frequently needs to be refilled with bytes from a completely different part of the array.
10.3.2.1.1 Column vs Row Major Order
All multi-dimensional arrays are actually stored linearly in computer memory. How to lay out a multi-dimensional array is an architectural decision made at the language level; Julia is column-major, like many performance-oriented languages and libraries (e.g., Fortran, MATLAB, LAPACK). Values are stored going down the columns instead of across the rows.
For example, this 2D array would be stored as [1,2,3,...] in memory, which is made clear via vec (which turns a multi-dimensional array into a 1D vector):
```julia
let
    array = [
        1 4 7
        2 5 8
        3 6 9
    ]
    vec(array)
end
```
9-element Vector{Int64}:
1
2
3
4
5
6
7
8
9
When working with arrays, prefer accessing elements in column-major order (the default in Julia) to maximize spatial locality. This allows the CPU to prefetch data more effectively.
Summing down the first (column) dimension is much faster than summing across rows, because column-major storage keeps column elements contiguous in memory.
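The sketch below illustrates the gap (the matrix size and function names are illustrative, and the exact timings depend on your cache sizes):

```julia
using BenchmarkTools

A = rand(2_000, 2_000)

# Inner loop walks down a column: consecutive elements are adjacent in memory
function sum_down_columns(A)
    total = 0.0
    for j in axes(A, 2)        # outer loop over columns
        for i in axes(A, 1)    # inner loop moves down a column
            total += A[i, j]
        end
    end
    total
end

# Inner loop walks across a row: every access jumps a whole column ahead
function sum_across_rows(A)
    total = 0.0
    for i in axes(A, 1)        # outer loop over rows
        for j in axes(A, 2)    # inner loop strides through memory
            total += A[i, j]
        end
    end
    total
end

@btime sum_down_columns($A)
@btime sum_across_rows($A)
```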
In contrast to the column-major layout shown above, a row-major memory layout would store the same 2D array in memory as:
[1,4,7,2,5,8,3,6,9]
The choice between row and column major reflects the historical development of scientific computing and mathematical conventions. Column-major originated in Fortran and LAPACK, cornerstones of high-performance computing and linear algebra. This aligns with mathematical notation where vectors are typically represented as matrix columns.
Row major languages include:
C/C++
Python (NumPy arrays)
C#
Java
JavaScript
Rust
Column major languages (and libraries):
Julia
MATLAB/Octave
R
Fortran
LAPACK/BLAS libraries
10.3.2.2 Temporal Locality
Hardware prefetchers (predictors) and the cache hierarchy exploit temporal locality by keeping recently used data “hot” and proactively fetching nearby data based on recent access patterns. This is an example of keeping “hot” data more readily accessible to the CPU than “cold” data.
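As a hedged sketch of exploiting temporal locality (the transformation, block size, and function names here are illustrative), running several passes over one cache-sized chunk at a time reuses data while it is still hot, instead of streaming the whole array through the cache on every pass:

```julia
using BenchmarkTools

# Several full passes over an ~80 MB array: by the time the next pass
# starts, the early elements have already been evicted from cache
function repeated_passes!(xs, npasses)
    for _ in 1:npasses
        @. xs = xs * 0.5 + 0.25
    end
    xs
end

# Blocked version: run every pass on one cache-sized chunk while it is
# still hot, then move on to the next chunk (better temporal locality)
function blocked_passes!(xs, npasses; blocksize=32_768)
    for start in 1:blocksize:length(xs)
        stop = min(start + blocksize - 1, length(xs))
        chunk = view(xs, start:stop)
        for _ in 1:npasses
            @. chunk = chunk * 0.5 + 0.25
        end
    end
    xs
end

@btime repeated_passes!(x, 8) setup = (x = rand(10_000_000)) evals = 1
@btime blocked_passes!(x, 8) setup = (x = rand(10_000_000)) evals = 1
```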
When working sets exceed available RAM, the operating system may page memory to a “swap file” on disk. Recently accessed pages tend to remain in RAM, while “colder” pages are more likely to be swapped out.
10.3.3 Use Efficient Data Types
The right data type can lead to more compact memory representations, better cache utilization, and more efficient CPU instructions. This is another case where a smaller memory footprint allows higher utilization of the CPU, since computers tend to be constrained by memory speed more than by compute.
On some CPUs, you may gain performance by using the smallest data type that can accurately represent your data. For example, prefer Int32 over Int64 if your values will never exceed the 32-bit integer range. For floating-point numbers, use Float32 instead of Float64 if the reduced precision is acceptable for your calculations. These smaller types not only save memory but also allow for more efficient vectorized operations (see Chapter 11) on modern CPUs.
Warning
When choosing data types, consider that on modern 64-bit architectures, using smaller integer or floating-point types does not always improve performance. In fact, many operations are optimized for 64-bit (or even larger SIMD) registers, and using smaller data types may introduce unnecessary overhead due to conversions or misalignment.
As a rough guideline, if your data naturally fits within 64 bits (e.g. Float64 or Int64), starting there is often the best balance of performance and simplicity. You can experiment with smaller types if you know your values never exceed certain ranges and memory footprint is critical. However, always benchmark to confirm any gains - simply using a smaller type does not guarantee improved throughput on modern CPUs.
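A quick, hedged comparison along these lines (the variable names are illustrative; whether Float32 actually wins depends on your CPU and on whether the reduced precision is acceptable):

```julia
using BenchmarkTools

v64 = rand(Float64, 1_000_000)
v32 = Float32.(v64)

# Same reduction; the Float32 vector is half the size, so twice as many
# elements fit in each cache line and SIMD register
@btime sum($v64)
@btime sum($v32)

# The memory footprint halves regardless of any speed difference
sizeof(v64), sizeof(v32)
```

Even when the timings are similar, the Float32 vector occupies half the memory, which can matter more than raw arithmetic speed for large, memory-bound workloads.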
For collections, choose appropriate container types based on your use case. Arrays are efficient for calculations that loop through all or most elements, while dictionaries are better suited for sparse lookups or for work outside the hot loop of a computation.
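As a rough sketch of that trade-off (the container sizes and function names are made up for illustration), a loop touching every element strongly favors the contiguous array, while a dictionary remains a fine choice for occasional keyed lookups outside the hot loop:

```julia
using BenchmarkTools

n = 10_000
rates_vec = rand(n)
rates_dict = Dict(i => rates_vec[i] for i in 1:n)

# Hot loop touching every element: the contiguous array is the better fit
sum_vec(v) = sum(v[i] for i in eachindex(v))
sum_dict(d) = sum(d[i] for i in 1:length(d))

@btime sum_vec($rates_vec)
@btime sum_dict($rates_dict)

# A single keyed lookup, where a Dict shines, is cheap either way and
# typically sits outside the hot loop
@btime $rates_dict[1234]
```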
Consider using small, statically sized collections when the data suits it. Small, fixed-size arrays (such as those provided by StaticArrays.jl in Julia) can be allocated on the stack and can outperform dynamically sized arrays in certain scenarios. The trade-off is that static arrays require more up-front compile time, and beyond a certain length (roughly 50-100 elements) they are usually not worth using.
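A minimal sketch using StaticArrays.jl (the function names and sizes are illustrative): fixed-length static vectors avoid the per-iteration heap allocations that an ordinary Vector incurs:

```julia
using StaticArrays, BenchmarkTools

# Fixed-length static vectors are stack-allocated and their length is
# known to the compiler, so the loop below does no heap allocation
function static_sum(n)
    total = 0.0
    for _ in 1:n
        v = @SVector rand(3)
        total += sum(v)
    end
    total
end

# The dynamically sized equivalent heap-allocates a new Vector each pass
function dynamic_sum(n)
    total = 0.0
    for _ in 1:n
        v = rand(3)
        total += sum(v)
    end
    total
end

@btime static_sum(10_000)
@btime dynamic_sum(10_000)
```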
10.3.4 Avoid Type Instabilities
Type instability occurs when Julia’s compiler cannot infer a concrete result type for a variable or expression; this inhibits specialization and often triggers performance degradation through dynamic dispatch. A separate but related performance trap is using abstractly typed containers (e.g., Vector{Any}) in hot loops, which forces dynamic dispatch on each element even if the function’s overall return type is inferrable.
To avoid type instabilities, ensure that functions have inferrable, concrete types across all code paths. For example:
```julia
function unstable_example(array)
    x = []                  # <1>
    for y in array
        push!(x, y)
    end
    sum(x)
end

function stable_example(array)
    x = eltype(array)[]     # <2>
    for y in array
        push!(x, y)
    end
    sum(x)
end

data = rand(1000)
@btime unstable_example(data)
@btime stable_example(data)
```
1. Without a type given, [] creates a Vector{Any}, which can contain elements of any type; this is flexible but requires runtime dispatch to determine the correct behavior.
2. eltype(array) returns the element type of the array, which for data is Float64. Thus x is created as a Vector{Float64}, which produces more efficient code.
The unstable_example function illustrates a common anti-pattern wherein an Any-typed array is created and then elements are added to it. Because any type can be added to an Any array (we happen to add only floats), Julia cannot know what types to expect inside the container and therefore has to determine them at runtime.
Note
Heterogeneous return types are not the same thing as type instability. If a function can return, say, either a Float64 or an Int64 depending on a branch, the compiler can still infer the small union Union{Float64, Int64} and optimize for it. While Julia handles unions with a small number of member types well, you should still prefer concrete, predictable types in hot code.
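For instance, a small hypothetical helper (not part of the benchmarks above) whose branches return different concrete types still has an inferrable, small-union return type:

```julia
# Hypothetical helper: even inputs return an Int64 and odd inputs return
# a Float64, so the inferred return type is the small union
# Union{Float64, Int64} rather than an unstable Any
halve(x::Int) = iseven(x) ? x ÷ 2 : x / 2

Base.return_types(halve, (Int,))
```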
10.3.5 Optimize for Branch Prediction
Modern CPUs use branch prediction to speculatively execute instructions before knowing the outcome of conditional statements. This was described in the prior chapter’s logistics warehouse analogy as trying to predict where a package is going before you’ve inspected the label. The CPU executes one branch of instructions before seeing the data (either true data, or data in the form of CPU instructions) because it can process faster than memory throughput allows. This is another example of a technique developed for performance in a memory-throughput-constrained world.
Optimizing your code for branch prediction can significantly improve performance, especially in tight loops or frequently executed code paths.
To optimize for branch prediction:
Structure your code to make branching patterns more predictable. For instance, in if-else statements, put the more likely condition first. This allows the CPU to more accurately predict the branch outcome.
Use loop unrolling to reduce the number of branches. This technique involves manually repeating loop body code to reduce the number of loop iterations and associated branch instructions. See Section 11.4 for more on what this means.
Consider using Julia’s @inbounds macro to eliminate bounds checking in array operations when you’re certain the accesses are safe. This reduces the number of conditional checks the CPU needs to perform.
For performance-critical sections with unpredictable branches, consider using branch-free algorithms or bitwise operations instead of conditional statements. This can help avoid the penalties associated with branch mispredictions.
In some cases, it may be beneficial to replace branches with arithmetic operations (e.g., using the ternary operator or multiplication by boolean values) to allow for better vectorization and reduce the impact of branch mispredictions. An example of this would be using a statement like y += max(x,0) instead of if x > 0; y += x; end.
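Below is a minimal sketch of that last idiom (the function names and data are illustrative; the payoff depends on how unpredictable the branch is and on whether the compiler already removes the branch for you):

```julia
using BenchmarkTools

data = rand(1_000_000) .- 0.5   # roughly half positive, half negative

# Branchy: the if statement's outcome is essentially a coin flip per element
function sum_positive_branchy(xs)
    total = 0.0
    for x in xs
        if x > 0
            total += x
        end
    end
    total
end

# Branch-free: max(x, 0.0) replaces the branch with arithmetic the CPU can
# execute (and often vectorize) without predicting anything
function sum_positive_branchless(xs)
    total = 0.0
    for x in xs
        total += max(x, 0.0)
    end
    total
end

@btime sum_positive_branchy($data)
@btime sum_positive_branchless($data)
```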
Here’s an example demonstrating the impact of branch prediction:
```julia
function process_data(data, threshold)
    total = 0.0
    for x in data
        if x > threshold
            total += log(x)
        else
            total += x
        end
    end
    total
end

# Random data = unpredictable branches
rand_data = rand(1_000_000)

# Sorted data = predictable branches
sorted_data = sort(rand(1_000_000))

@btime process_data($rand_data, 0.5)
@btime process_data($sorted_data, 0.5);
```
5.126 ms (0 allocations: 0 bytes)
1.596 ms (0 allocations: 0 bytes)
In this example, sorting the data makes the branch outcome highly predictable for long stretches (first all below-threshold values, then all above), so the CPU’s branch predictor guesses correctly almost every time. Because the speculatively executed code is nearly always the code that is actually needed, processing sorted_data is considerably faster.
Remember that optimizing for branch prediction often involves trade-offs, and the benefits vary with the specific hardware and the nature of your data. If performance is critical, profile your code to ensure that your optimizations actually improve performance in your specific use case. Over-optimizing for one set of hardware (e.g., a local computer) may not translate to another (e.g., a server deployment).