Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupéry, Airman’s Odyssey
10.1 Chapter Overview
Understanding single-threaded performance, strategies for efficient sequential code, recognizing when parallelization is needed, practical comparisons of single-threaded and parallel approaches, optimizing code in Julia.
10.2 Introduction
With today’s hardware, the highest-throughput computations utilize GPUs for massive parallelization. However, writing parallel code, let alone performant parallel code, relies heavily on understanding the patterns of non-parallel code. Additionally, many problems are not “massively parallelizable” and require a sequential architecture. For these reasons, it’s critical to understand sequential patterns before moving on to parallel code.
For those coming from fundamentally slower languages (such as R or Python), the common advice for speeding things up is to parallelize code or use array-based techniques. With many high-level languages, that is the only way to achieve reasonable runtime. In contrast, fast languages like Julia can produce surprisingly quick single-threaded programs.
Further, if maximum performance is not required, a simpler, easier-to-maintain sequential model may be preferable to a more complex parallel one. As the quote that opens this chapter suggests, there is value in having nothing left to take away.
Tip
Developer time (your time) is often more expensive than runtime, so be prepared to accept good-enough but sub-optimal code instead of spending a lot of time squeezing every last nanosecond out of it!
Note
This section of the book continues to focus on concepts that are more general than just Julia. We will elaborate on Julia-specific techniques in the section starting with Chapter 21.
10.3 Patterns of Performant Sequential Code
Sequential code performance depends on five key patterns:
Efficient memory usage
Optimized memory access
Appropriate data types
Type stability
Branch prediction optimization
Understanding these patterns helps you write faster code by leveraging CPU architecture and Julia’s compiler optimizations. We’ll look at each of these in turn.
10.3.1 Minimize Memory Allocations
Allocating memory on the heap takes much more time than (1) not using intermediate memory storage at all, or (2) utilizing the stack. Each heap allocation takes time for memory management and creates work for the garbage collector, which can significantly impact performance, especially in tight loops or frequently called functions.
Note
Tight loops or hot loops are the performance-critical sections of code that are executed many times during a computation. They are often the innermost loop of a nested-loop algorithm.
A general rule of thumb is that dynamically sized or mutable objects (arrays, mutable structs) will be heap-allocated, while small, fixed-size objects can be stack-allocated. For mutable objects, a common technique is to pre-allocate an array and then re-use that array for subsequent calculations. In the following example, note how we pre-allocate the output vector instead of creating vectors for each bond and then summing the vectors together at the end:
end_time = 10
cashflow_output = zeros(end_time)
par_bonds = map(1:1000) do i
    (tenor=rand((3, 5, 10)), rate=rand() / 10)
end

for asset in par_bonds
    for t in 1:end_time
        if t == asset.tenor
            cashflow_output[t] += 1 + asset.rate
        else
            cashflow_output[t] += asset.rate
        end
    end
end
cashflow_output
Julia’s @allocated macro returns the number of bytes allocated by an expression, helping you identify and eliminate unnecessary allocations.
random_sum() = sum([rand() for _ in 1:10])
@allocated random_sum()
144
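As a sketch of one possible fix (random_sum_lazy is a name invented for this illustration), replacing the intermediate array with a lazy generator avoids the heap allocation entirely:

# Summing a generator avoids materializing an intermediate array on the heap
random_sum_lazy() = sum(rand() for _ in 1:10)
@allocated random_sum_lazy()   # 0 bytes once the function is compiled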
10.3.2 Optimize Memory Access Patterns
Optimizing memory access patterns is essential for leveraging the CPU’s cache hierarchy effectively. Modern CPUs have multiple levels of cache (L1, L2, L3), each with different sizes and access speeds. By structuring your code to access memory in a cache-friendly manner, you can significantly reduce memory latency and improve overall performance.
What is cache-friendly memory access? Essentially it boils down to spatial and temporal locality.
10.3.2.1 Spatial Locality
Spatial locality refers to accessing data that is physically near other recently accessed data in memory (e.g. contiguous blocks of data in an array).
In practice, this means it’s better to access data in linear order rather than random order. For example, summing the elements of an array in order is significantly faster than summing them in random order:
using BenchmarkTools, Random

# Create a large array of structs to emphasize memory access patterns
struct DataPoint
    value::Float64
    # Add padding to make each element 64 bytes (a typical cache line size)
    padding::NTuple{7,Float64}
end

function create_large_array(n)
    [DataPoint(rand(), tuple(rand(7)...)) for _ in 1:n]
end

# Create a large array
const N = 1_000_000
large_array = create_large_array(N)

# Function for sequential access
function sequential_sum(arr)
    sum = 0.0
    for i in eachindex(arr)
        sum += arr[i].value
    end
    sum
end

# Function for random access
function random_sum(arr, indices)
    sum = 0.0
    for i in indices
        sum += arr[i].value
    end
    sum
end

# Create shuffled indices
shuffled_indices = shuffle(1:N)

# Benchmark
println("Sequential access:")
@btime sequential_sum($large_array)

println("\nRandom access:")
@btime random_sum($large_array, $shuffled_indices)
Sequential access:
500.834 μs (0 allocations: 0 bytes)
Random access:
2.173 ms (0 allocations: 0 bytes)
499628.98428300093
When the data is accessed in linear order, the computer can load chunks of data into the cache and operate on that cached data for several cycles before new data needs to be loaded. In contrast, when accessing the data randomly, the cache frequently needs to be repopulated with bits from a completely different part of the array.
10.3.2.1.1 Column vs Row Major Order
All multi-dimensional arrays in computer memory are actually stored linearly. When storing a multi-dimensional array, an architectural decision needs to be made at the language level. Julia is column-major, similar to many performance-oriented languages and libraries (e.g. LAPACK, Fortran, MATLAB): values are stored going down the columns instead of across the rows.
For example, this 2D array would be stored as [1,2,3,...] in memory, which is made clear via vec (which turns a multi-dimensional array into a 1D vector):
let array = [
        1 4 7
        2 5 8
        3 6 9
    ]
    vec(array)
end
9-element Vector{Int64}:
1
2
3
4
5
6
7
8
9
When working with arrays, prefer accessing elements in column-major order (the default in Julia) to maximize spatial locality. This allows the CPU to prefetch data more effectively.
You can see how summing values along the first (column) dimension is much faster than summing across rows, as the sketch below demonstrates.
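Here is a minimal sketch of that comparison (the function names are ours, and exact timings will vary by machine). The inner loop of col_sum walks down contiguous memory, while the inner loop of row_sum jumps by the column stride:

using BenchmarkTools

# Inner loop over rows (i): contiguous, column-major-friendly access
function col_sum(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

# Inner loop over columns (j): strided access that defeats the cache
function row_sum(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    s
end

A = rand(5_000, 5_000)
@btime col_sum($A)
@btime row_sum($A)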
In contrast to the prior example, a row-major memory layout would have the associated data stored in memory as:
[1,4,7,2,5,8,3,6,9]
The choice between row and column major reflects the historical development of scientific computing and mathematical conventions. Column-major originated in Fortran and LAPACK, cornerstones of high-performance computing and linear algebra. This aligns with mathematical notation where vectors are typically represented as matrix columns.
Row-major languages include:
C/C++
Python (NumPy arrays)
C#
Java
JavaScript
Rust
Column-major languages include:
Julia
MATLAB/Octave
R
Fortran
LAPACK/BLAS libraries
10.3.2.2 Temporal Locality
Temporal locality refers to accessing the same data repeatedly within a short window of time. The cache hierarchy and hardware prefetchers recognize data that is accessed close together in time and keep the relevant blocks ready. This is an example of keeping “hot” data more readily accessible to the CPU than “cold” data.
Sometimes this happens at the operating system level: if you load a dataset that exceeds the available RAM, some of the “active” memory will actually be kept in a space on the persistent disk (e.g. your SSD) to avoid the computer crashing from running out of memory. Segments of the data that have been accessed recently will be in RAM, while sections of the data not recently accessed are likely to be in the portion stored on the persistent disk.
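As a small illustrative sketch (these functions are hypothetical, not from earlier in the chapter), computing two statistics in a single pass reuses each element while it is still hot in the cache, whereas two separate passes stream the whole array through the cache twice:

# One pass: each element is loaded once and reused for both accumulators
function stats_one_pass(data)
    s, ss = 0.0, 0.0
    for x in data
        s += x
        ss += x^2
    end
    s, ss
end

# Two passes: every element travels through the cache hierarchy twice
stats_two_pass(data) = (sum(data), sum(x -> x^2, data))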
10.3.3 Use Efficient Data Types
The right data type can lead to more compact memory representations, better cache utilization, and more efficient CPU instructions. This is another case where a smaller memory footprint allows for higher utilization of the CPU, since computations tend to be constrained by memory speed more than processor speed.
On some CPUs, you may find better performance by using the smallest data type that can accurately represent your data. For example, prefer Int32 over Int64 if your values will never exceed the 32-bit integer range. For floating-point numbers, use Float32 instead of Float64 if the reduced precision is acceptable for your calculations. These smaller types not only save memory but also allow for more efficient vectorized operations (see Chapter 11) on modern CPUs.
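For instance, halving the element size halves the memory footprint, which effectively doubles how much of your data fits in each cache level. A quick check using sizeof:

a64 = rand(Float64, 1_000)
a32 = rand(Float32, 1_000)

sizeof(a64)   # 8000 bytes
sizeof(a32)   # 4000 bytes: the same element count in half the space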
Warning
When choosing data types, consider that on modern 64-bit architectures, using smaller integer or floating-point types doesn’t always improve performance. In fact, many operations are optimized for 64-bit (or even larger SIMD) registers, and using smaller data types may introduce unnecessary overhead due to conversions or misalignment.
As a rough guideline, if your data naturally fits within 64 bits (e.g. Float64 or Int64), starting there is often the best balance of performance and simplicity. You can experiment with smaller types if you know your values never exceed certain ranges and memory footprint is critical. However, always benchmark to confirm any gains; simply using a smaller type does not guarantee improved throughput on modern CPUs.
For collections, choose appropriate container types based on your use case. Arrays are efficient for calculations that loop through all or most elements, while Dictionaries are better for sparse look-ups or outside of the “hot loop” portion of a computation.
Consider using small, statically sized collections when the data is suited to it. Small, fixed-size arrays (such as those from StaticArrays.jl in Julia) can be allocated on the stack and lead to better performance in certain scenarios than dynamically sized arrays. The trade-off is that static arrays require more up-front compile time, and past a certain size (lengths in the 50-100 element range) it usually isn’t worth using them.
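A minimal sketch of the difference, assuming the StaticArrays.jl package is installed:

using StaticArrays, BenchmarkTools

sv = SVector(1.0, 2.0, 3.0)   # fixed-size, eligible for the stack
dv = [1.0, 2.0, 3.0]          # dynamically sized, heap-allocated

# Adding two SVectors produces a new stack value with no heap allocation;
# adding two Vectors allocates a fresh array on the heap each time.
@btime $sv + $sv
@btime $dv + $dv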
10.3.4 Avoid Type Instabilities
Type instabilities occur when the compiler cannot infer a single concrete type for a variable or function return value. These instabilities can significantly hinder Julia’s ability to generate optimized machine code, leading to performance degradation. When the types are known at compile time, Julia can generate efficient machine code that points directly to the desired method (compile-time dispatch). When the compiler cannot infer the types, a lookup must be performed while the program is running to find the most appropriate method for the given type (runtime dispatch).
To avoid type instabilities, ensure that functions have inferrable, concrete types across all code paths. For example:
function unstable_example(array)
    x = []                      # (1)
    for y in array
        push!(x, y)
    end
    sum(x)
end

function stable_example(array)
    x = eltype(array)[]         # (2)
    for y in array
        push!(x, y)
    end
    sum(x)
end

data = rand(1000)
@btime unstable_example(data)
@btime stable_example(data)
(1) Without a type given, [] creates a Vector{Any}, which can contain elements of any type. This is flexible but requires runtime dispatch to determine correct behavior.
(2) eltype(array) returns the element type of the array, which for data is Float64. Thus x is created as a Vector{Float64}, allowing more efficient code.
The unstable_example function illustrates a common anti-pattern wherein an Any-typed array is created and then elements are added to it. Because any type can be added to an Any array (we happen to add only floats to it), Julia cannot be sure what types to expect inside the container and therefore has to determine them at runtime.
Note that having heterogeneous types is not the same thing as type instability, which is when Julia cannot determine in advance what the data types will be. A function that can return two different types may still be inferable: if the possibilities are limited to a small union such as Union{Float64,Int64}, the compiler recognizes that and can still generate efficient code. It is when the types cannot be determined by the compiler at all that runtime dispatch results.
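As a minimal illustration (f is a hypothetical function invented for this example), a function returning either an Int64 or a Float64 is still inferable as a small union, which you can verify with @code_warntype:

# Returns an Int64 on one branch and a Float64 on the other; the compiler
# still infers the small union Union{Float64, Int64} rather than Any.
f(x) = x > 0 ? 1 : 0.0

# @code_warntype f(1.0)   # reports Body::Union{Float64, Int64}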
10.3.5 Optimize for Branch Prediction
Modern CPUs use branch prediction to speculatively execute instructions before knowing the outcome of conditional statements. This was described in the prior chapter in the logistics warehouse analogy as trying to predict where a package will be going before you’ve inspected the label. CPUs speculatively execute one branch of the instructions without waiting to see the data (either true data, or data in the form of CPU instructions) because the CPU can process instructions faster than memory throughput allows. This is another example of a technique developed for performance in a memory-throughput-constrained world.
Optimizing your code for branch prediction can significantly improve performance, especially in tight loops or frequently executed code paths.
To optimize for branch prediction:
Structure your code to make branching patterns more predictable. For instance, in if-else statements, put the more likely condition first. This allows the CPU to more accurately predict the branch outcome.
Use loop unrolling to reduce the number of branches. This technique involves manually repeating loop body code to reduce the number of loop iterations and associated branch instructions. See Section 11.4 for more on what this means.
Consider using Julia’s @inbounds macro to eliminate bounds checking in array operations when you’re certain the accesses are safe. This reduces the number of conditional checks the CPU needs to perform.
For performance-critical sections with unpredictable branches, consider using branch-free algorithms or bitwise operations instead of conditional statements. This can help avoid the penalties associated with branch mispredictions.
In some cases, it may be beneficial to replace branches with arithmetic operations (e.g., using the ternary operator or multiplication by boolean values) to allow for better vectorization and reduce the impact of branch mispredictions. An example of this would be using a statement like y += max(x,0) instead of if x > 0; y += x; end.
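Here is a rough sketch of that last point (timings will vary by hardware, and the function names are invented for this illustration):

# Branchy: the CPU must predict the outcome of x > 0 for every element
function sum_positive_branchy(data)
    s = 0.0
    for x in data
        if x > 0
            s += x
        end
    end
    s
end

# Branch-free: max(x, 0.0) can compile to a conditional move or SIMD
# instructions, avoiding branch mispredictions entirely
function sum_positive_branchless(data)
    s = 0.0
    for x in data
        s += max(x, 0.0)
    end
    s
end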
Here’s an example demonstrating the impact of branch prediction:
function process_data(data, threshold)
    sum = 0.0
    for x in data
        if x > threshold
            sum += log(x)
        else
            sum += x
        end
    end
    sum
end

# Random data = unpredictable branches
rand_data = rand(1_000_000)

# Sorted data = predictable branches
sorted_data = sort(rand(1_000_000))

@btime process_data($rand_data, 0.5)
@btime process_data($sorted_data, 0.5);
4.714 ms (0 allocations: 0 bytes)
1.598 ms (0 allocations: 0 bytes)
In this example, sorted data means that the CPU can reliably predict which branch of the if statement will be taken. By speculatively executing the code it expects to need, the program runs faster when processing sorted_data.
Remember that optimizing for branch prediction often involves trade-offs. The benefits can vary depending on the specific hardware and the nature of your data. If performance is critical, profile your code to ensure that your optimizations are actually improving performance in your specific use case. Over-optimizing on one set of hardware (e.g. a local computer) may not translate the same way to another (e.g. a server deployment).