11  Parallelization

Alec Loudenback

Quantity has a quality all its own. - Attributed to Vladimir Lenin

11.1 Chapter Overview

Fundamentals of parallel workloads and the different mechanisms to distribute work: vectorization, multi-threading, GPUs, and multi-device workflows. Different programming models: map-reduce, arrays, loops, and tasks.

11.2 Amdahl’s Law and the Limits of Parallel Computing

An important ground truth in computing is that there is an upper limit to how much a workload can be sped up by distributing it among multiple processor units. For example, if a modeling workload is 90% independent work (say, policy or asset level calculations) and the remaining 10% is an aggregation step (say, company or portfolio level), then the theoretical maximum speedup is 10x (1 / (1 - 0.90), the reciprocal of the serial portion). This is captured in Amdahl’s Law, which gives the theoretical maximum speedup a workload could see. In practice, the speedup is worse than this due to the overhead of moving data around, scheduling tasks, and aggregating results. This is why a good effort at optimizing sequential workloads (see Chapter 10) is often more fruitful than trying to parallelize them.

That said, there are still many modeling use-cases for parallelization. Modern investment and insurance portfolios can easily contain hundreds of thousands or millions of seriatim holdings. In many cases these can be evaluated independently, though oftentimes there is interaction with the total portfolio (contract dividends, non-guaranteed elements, profit sharing, etc.). Further, even if the workload is not parallelizable across the holdings dimension, we are often interested in independent evaluations across economic scenarios, which is amenable to parallelization. Formally, Amdahl’s Law states: \[ S(n) = \frac{1}{(1-p) + \frac{p}{n}} \]

Where:

  • \(S(n)\) is the theoretical speedup of the execution of the whole task
  • \(n\) is the number of processors
  • \(p\) is the proportion of the execution time that benefits from improved resources
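For example, taking the 90% parallelizable workload from above and eight processors:

\[ S(8) = \frac{1}{(1-0.9) + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7 \]

Even eight processors deliver less than half of the 10x ceiling, and the returns continue to diminish as \(n\) grows.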

We can visualize this for different combinations of \(p\) and \(n\) in Figure 11.1.

using CairoMakie

function amdahl_speedup(p, n)
    return 1 / ((1 - p) + p / n)
end

function main()
    fig = Figure(size=(800, 600))
    ax = Axis(fig[1, 1],
        title="Amdahl's Law",
        xlabel=L"Number of processors ($n$)",
        ylabel="Speedup",
        xscale=log2,
        xticks=2 .^ (0:16),
        xtickformat=x -> "2^" .* string.(Int.(log.(2, x))),
        yticks=0:2:20
    )

    n = 2 .^ (0:16)
    parallel_portions = [0.5, 0.75, 0.9, 0.95]
    linestyles = [:solid, :dash, :dashdot, :solid]

    for (i, p) in enumerate(parallel_portions)
        speedup = [amdahl_speedup(p, ni) for ni in n]
        lines!(ax, n, speedup, label="$(Int(p*100))%", linestyle=linestyles[i])
    end

    xlims!(ax, 1, 2^16)
    ylims!(ax, 0, 20)

    axislegend(ax, L"Parallel portion ($p$)", position=:lt)
    fig
end

main()
Figure 11.1: Theoretical upper bound for speedup of a workload given the parallelizable portion \(p\) and number of processors \(n\).

A real-world analogy: imagine building a house. You can hire many workers to frame the walls, install plumbing, and run electrical wires simultaneously (the parallel part). However, all work must stop to wait for the single foundation to be poured and cured (the serial part). No matter how many workers you add, the total project time can never be shorter than the time it takes to cure the foundation. This fundamental bottleneck is what Amdahl’s Law describes.

With this understanding, we will be able to set expectations and analyze the benefit of parallelization.

11.3 Types of Parallelism

Parallel processing comes in different flavors and is related to the details of hardware as discussed in Chapter 9. We will necessarily extend the discussion of hardware here, as parallelization is (mostly) inextricably tied to hardware details (we will revisit this in Section 11.8).

Major types of computational parallelism, highlighting their key characteristics, advantages, and potential drawbacks.

| Type | Description | Strengths | Weaknesses |
|---|---|---|---|
| Vectorization (SIMD) | Performs the same operation on multiple data points simultaneously | Efficient for data-parallel tasks, uses specialized CPU instructions | Limited to certain types of operations, data must be contiguous |
| Multi-Threading | Executes multiple threads concurrently on a single CPU | Good for task parallelism, utilizes multi-core processors effectively | Overhead from thread management, potential race conditions |
| GPU | Uses graphics processing units (GPUs) for parallel computations | Excellent for massively parallel tasks, high throughput | Specialized programming required, data transfer overhead |
| Multi-Device / Distributed | Spreads computation across multiple machines or devices | Scales to very large problems, can use heterogeneous hardware | Complex to implement and manage, network latency issues |

11.4 Vectorization

Vectorization in the context of parallel processing refers to special circuits within the CPU that load multiple data units (e.g. 4 or 8 floating point numbers) from a contiguous block and perform the same instruction on all of them at the same time. This is also known as SIMD, or Single-Instruction Multiple Data. Think of it like a for loop where, instead of each iteration doing one operation, it does four or eight at a time.

The requirements for SIMD-able code are that:

  • The intended section for SIMD is inside the inner-most loop.
  • There are no branches (if-statements) inside the loop body.

Tip

Note that indexing into an array is actually a branch in the code, as two cases could arise: the index is either in-bounds or out-of-bounds. To avoid this, either iterate over values with for x in collection, iterate over indices with for i in eachindex(collection), or use for i in 1:n with @inbounds collection[i], though the last of these is discouraged in favor of the prior, safer options.

using BenchmarkTools

function prevent_simd(arr)
    sum = 0
    for x in arr
        if x > 0
            sum += x
        end
    end
    return sum
end

function allow_simd(arr)
    sum = 0
    for x in arr
        sum += max(x, 0)
    end
    return sum
end

let
    x = rand(10000)

    @btime prevent_simd($x)
    @btime allow_simd($x)
end
  39.667 μs (0 allocations: 0 bytes)
  4.821 μs (0 allocations: 0 bytes)
4989.035321772459

In testing the above code, the allow_simd version should be several times faster than the prevent_simd example. The reason is that prevent_simd has a branch (if x > 0) where the behavior of the code may change depending on the value in arr. Conversely, the behavior of allow_simd is always the same in each iteration, no matter the value of x. This allows the compiler to generate vectorized code automatically.

Tip

Note that Julia’s compiler is able to identify vectorizable code in many cases, though some cases may benefit from a more explicit hint to the compiler through macro annotations (see ?@simd for details). See Section 24.9.4 for more.
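As a minimal sketch of what such a hint can look like (sum_with_hint is our own illustrative name, not a library function):

function sum_with_hint(arr)
    s = zero(eltype(arr))
    @simd for i in eachindex(arr)
        # @simd asserts that iterations are independent and may be reordered;
        # @inbounds removes the bounds-checking branch that would block vectorization
        @inbounds s += arr[i]
    end
    return s
end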

Other types of parallelism that we will discuss in this chapter have some risk of errors or data corruption if not used correctly. SIMD isn’t prone to issues like this because if the code is not SIMD-able then the compiler will not auto-vectorize the code block.

11.4.1 Hardware

Vectorization is hardware dependent: if the CPU does not support it, you will not see speedups from it. Many consumer and professional chips have AVX2 (Advanced Vector Extensions, with the 2 signifying the second generation’s 256-bit width, allowing four simultaneous 64-bit operations). The next generation is AVX-512, with twice the SIMD width of AVX2. However, as of 2025 most consumer chips do not yet have it, and commercial chips that do may not actually be faster than AVX2 in practice due to thermal restrictions (SIMD uses more power and generates more heat).
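One way to see what your own machine does is to inspect the native code the compiler emits for a function such as allow_simd from above; on an AVX2-capable CPU, the hot loop should contain packed instructions operating on 256-bit ymm registers (or zmm registers with AVX-512):

# dump the generated machine code and look for packed ymm/zmm instructions
@code_native allow_simd(rand(10^4))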

11.5 Multi-Threading

This subsection starts by introducing tasks — lightweight units of computation. Next, we’ll see how tasks can communicate using channels, and then how multiple tasks (and channels) can be leveraged to achieve true parallelism through multi-threading. Think of it as layers building on one another: tasks define work units, channels allow them to share data, and multi-threading enables tasks to run simultaneously on multiple CPU cores. Exact details regarding tasks, channels, and multi-threading vary by language but the general ideas remain the same.

11.5.1 Tasks

To understand multi-threading examples, we first need to discuss Tasks, which are chunks of computation that get performed together, but after which the computer is free to switch to a new task. Technically, there are some instructions within a task that will let the computer pause and come back to that task later (such as sleep).

Tasks do not, by themselves, allow for multiple computations to be performed in parallel. For example, one task might be loading a data file from persistent storage into RAM. After that task is complete, the computer continues on with another task in the queue (rendering a web page, playing a song, etc.). In this way even a single processor computer could be “doing multiple things at once” (or “multi-tasking”) even though nothing is running in parallel. The scheduling of the tasks is handled automatically by the program’s compiler or the operating system.

Here’s an example of a couple of tasks where we write to an array. Despite being called last, the second task should actually write to the array before the first task. This is because we asked the first task to sleep (pause, while allowing the computer to yield to other tasks in the queue)1.

let
    shared_array = zeros(5)

    task1 = @task begin
        sleep(1)
        shared_array[1] = 1
        println("Task 1: ", shared_array)
    end

    task2 = @task begin
        shared_array[2] = 2
        println("Task 2: ", shared_array)
    end

    schedule(task1); schedule(task2)
    wait(task1)
    wait(task2)

    println("Main: ", shared_array)
end
Task 2: [0.0, 2.0, 0.0, 0.0, 0.0]
Task 1: [1.0, 2.0, 0.0, 0.0, 0.0]
Main: [1.0, 2.0, 0.0, 0.0, 0.0]

11.5.1.1 Channels

Channels are a way to communicate data in a managed way between tasks. You specify a type of data that the buffer (a chunk of assigned memory) will contain and how many elements it can hold. It then stores items (via put!) in a first-in-first-out (FIFO) queue, which can be popped off the queue (via take!) by other tasks.
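A minimal illustration of the channel API before the fuller example below:

let
    ch = Channel{Int}(2)    # a buffered channel that can hold up to 2 Ints
    put!(ch, 1)             # producers push values in...
    put!(ch, 2)
    println(take!(ch))      # ...and consumers pop them off in FIFO order, printing 1
    close(ch)
end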

Here’s an example of a system which generates trades in the financial markets at random time intervals, while a monitoring task takes the results and tabulates running statistics:

let

    # simulate random trades and the associated profits 
    function trade_producer(channel,i)
1            sleep(rand())
2            profit = randn()
            put!(channel, profit)
            println("Producer: Trade Result #$i $(round(profit, digits=3))")
    end

    # intake trades via the communication channel
    function portfolio_monitor(channel,n)
        sum = 0.0
3        for _ in 1:n
            profit = take!(channel)
            sum += profit
            println("Monitor: Received $(round(profit, digits=3)), Cumulative profit: $(round(sum, digits=3))")
        end
    end

4    channel = Channel{Float64}(32)
    
    # Start producer and consumer tasks
5    @sync begin
6        for i in 1:5; @async trade_producer(channel,i); end
        @async portfolio_monitor(channel,5)
    end
    
    
    # Close the channel and wait for tasks to finish
    close(channel)
end
1
Random sleep between 0 and 1 seconds to simulate real trading activity and latency.
2
Generate a random number from standard normal distribution to simulate profit or loss from a trade.
3
In this teaching example, we’ve limited the system to produce just five “trades”. In practice, this could be kept running indefinitely via, e.g., while true.
4
Create a channel with a buffer size of 32 floats (in this limited example, we could have gotten away with just 5 since that’s how many the demonstration produces). In practice, you want this to be long enough that the consumer of the channel never gets so far behind that the channel fills up. The channel is created outside of the @sync block so that channel is in scope when we close it.
5
@sync waits (like wait(task)) for all of the scheduled tasks within the block to complete before proceeding with the rest of the program.
6
@async does the combination of creating a task via @task and schedule-ing in one, simpler call.
Producer: Trade Result #3 -0.036
Monitor: Received -0.036, Cumulative profit: -0.036
Producer: Trade Result #2 1.432
Monitor: Received 1.432, Cumulative profit: 1.396
Producer: Trade Result #4 0.418
Monitor: Received 0.418, Cumulative profit: 1.814
Producer: Trade Result #1 1.15
Monitor: Received 1.15, Cumulative profit: 2.964
Producer: Trade Result #5 -1.088
Monitor: Received -1.088, Cumulative profit: 1.876

This is really useful for handling events that are “external” to our program. If we were just doing a modeling exercise using static data, then we could control the order of processing and not need to worry about monitoring a volatile source of data. Nonetheless, tasks can still be useful in some cases even if a model is not using “live” data: for example if one of the steps in a model is to load a very large dataset, it may be possible to perform some computations while chunked task requests are queued to load more data from the disk.

A garbage collector will usually clean up unused channels that are still open. However, it’s a good practice to explicitly close them to ensure proper resource management, clear signaling of completion, and to avoid potential blocking or termination issues in your programs.

Caution

If a task never finishes properly inside the @sync block, then your program may hang indefinitely; for example, if one of the tasks has no termination condition, such as an upper bound on a loop or a clear way to break out of a while true loop. While no different from an ordinary infinite loop, such issues become less obvious underneath the layer of task abstractions.

The key takeaway for tasks is that it’s a way to chunk work into bundles that can be run in a concurrent fashion, even if nothing is technically being processed in parallel. The multi-threading and parallel programming paradigms sections build off of tasks so an understanding of tasks is helpful. However, some of the higher level libraries hide the task-based building blocks from you as the user/developer and so an intricate understanding of tasks is not required to be successful in parallelizing your Julia code.

11.5.2 Multi-Threading Overview

When a program starts on your computer, a process is created: the operating system allocates some overhead items (keeping track of the code, memory allocations, and layout) and a block of memory in RAM that can be utilized by that process. Different processes do not have access to each other’s allocated memory.

Note

Readers may be familiar with starting Excel in different processes. When workbooks are opened within the same process (e.g. when creating a new workbook from Excel’s File menu), the workbooks may seamlessly talk to each other (copy and paste from one to another). However, when Microsoft Excel is opened in different processes, then the workbooks in each respective process do not share memory and cannot create links or use full copy/paste functionality between them (this is what happens when you hold the control button and open Excel multiple times).

Within each process, a main thread is created. That thread is where the running of the code occurs. For the level of the discussion here, you can mainly think of a process as a container with shared memory for threads, which do the real work (as illustrated in Figure 11.2). Besides the main thread, other threads can be created within the process and access the same shared memory.

[Diagram: an operating system containing two processes (Process ABC and process XYZ), each with its own thread(s) sharing that process’s memory.]
Figure 11.2: When a program starts, the operating system creates a process for which multiple threads (a main thread plus optional additional threads) share memory.

The advantage of threads is that within a single physical processor, there may be multiple cores. Those cores can access the shared process memory and run tasks from different threads simultaneously. This is a technique that takes advantage of modern processor architecture wherein several (sometimes as many as 32 or more) cores exist on the same chip.
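Note that Julia fixes the number of threads when the process starts (e.g. julia --threads=4 or julia --threads=auto); from within a session you can check what is available:

Threads.nthreads()   # how many threads this Julia session was started with
Sys.CPU_THREADS      # how many logical cores the hardware exposes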

Note

Technically, there are different flavors of threading. While not critical for the understanding and modeling-focused discussion here, here is a bit more detail on different thread types for completeness:

  • Multi-Tasking. Recall that tasks are chunks of computation that get performed together, but after which the computer is free to switch to a new task. For example, one task might be loading a data file from persistent storage into RAM. After that task is complete, the computer continues on with another task in the queue (rendering a web page, playing a song, etc.). In this way even with a single processor and core, a computer could be “doing multiple things at once” (or “multi-tasking”) even though nothing is running in parallel.
  • Operating System Threads, or just Threads, are managed (as the name implies) at the operating system level. The benefit is that operating system threads have more power: the operating system can pause or limit throughput on running programs if it needs the resources for something it deems higher priority. It’s technically possible to use this power to force a higher priority for your own code, but Julia and many other languages do not offer creation of these types of threads in favor of the next type. Operating system threads also carry a higher amount of overhead (time and memory) in their creation and destruction.
  • Green threads, cooperative threads, fibers, or user-threads are the type of threads that Julia provides. They are managed at the process (Julia) level and don’t have as much creation overhead as operating system threads. In Julia, the unit of work scheduled onto these threads is the Task.

Parallelism in modern computing comes in many flavors, occurs at many different levels (hardware, OS, software, network), and has many different implementations of similar concepts. The terminology of threading in practice and online documentation is prone to confusing even seasoned developers. If you are having a discussion or asking a question, feel free to take the time to ask for clarification on the terminology being used at a given point in time.

11.5.2.1 Multi-Threading Pitfalls

Different threads being able to access the same memory is a double-edged sword. It is useful because we do not need to create multiple copies of the data in RAM or in the cache2, and it can improve the overall throughput of our usually memory-bandwidth-limited machines. The downside is that if we carelessly mutate shared data that our program relies upon, then our program may produce unintended results. There are a couple of related issues to be aware of:

11.5.2.1.1 Race Conditions

The first issue is known as a race condition, which occurs when a block of memory is read from or written to in an unintended order. For example, if we have two threads accumulating a sub-total, each thread may read the running sub-total before the other thread has finished its update.

In the following example, we use the Threads.@threads macro to tell Julia to automatically distribute the work of the loop across threads.

function sum_bad(n)
    subtotal = 0
    Threads.@threads for i in 1:n
        subtotal += i
    end
    subtotal
end

sum_bad(100_000)
412673138

The result will be less than the expected sum (5000050000) due to a race condition. Here’s what happens:

  1. Multiple threads read the current value of subtotal simultaneously.

  2. Each thread adds its own, local value to that reading.

  3. One thread writes its result back to subtotal first.

  4. Another thread then overwrites subtotal with its own calculation, which was based on the now-outdated starting value of subtotal.

This means some thread contributions are lost when the threads overwrite each other’s results. The threads may not see each other’s updates, leading to missing values in the final sum.

11.5.2.2 Avoiding Multi-threading Pitfalls

We will cover several ways to manage multi-threading race conditions, but it is the recommendation of the authors to primarily utilize higher level library code, which will be demonstrated after covering some of the more basic, manual techniques.

11.5.2.2.1 Chunking up work into single-threaded pieces

First, let’s level-set with a single-threaded result:

function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    s
end
@btime sum_single(1:100_000)
  1.500 ns (0 allocations: 0 bytes)
5000050000

Note that in the single-threaded case, Julia is able to identify this common pattern and use a shortcut, calculating the sum of the integers \(1\) through \(n\) as \(\frac{n(n+1)}{2}\) through a compiler optimization, essentially avoiding the loop entirely.
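We can check the closed form directly:

100_000 * (100_000 + 1) ÷ 2   # 5000050000, matching the loop’s result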

We can implement a correct threaded version by splitting the work into different threads, each of which is independent. Then, we can aggregate the results of each of the chunks.

function sum_chunker(a)

1    chunks = Iterators.partition(1:a, a ÷ Threads.nthreads())
   
2    tasks = map(chunks) do chunk
       Threads.@spawn sum_single(chunk)
    end
   
    chunk_sums = fetch.(tasks)
   
    return sum_single(chunk_sums)

end

@btime sum_chunker(100_000)
1
Create chunks of the integer range from 1 to a. Iterators.partition(1:a, a ÷ Threads.nthreads()) splits the range 1:a into contiguous subranges (chunks), each of size a ÷ Threads.nthreads(). For example, if a = 100_000 and nthreads() = 4, you’ll get four chunks of size 25_000.
2
Create a set of tasks (futures) using Threads.@spawn that call sum_single(chunk) on each of the chunks. This initiates parallel computation.
  4.417 μs (72 allocations: 5.72 KiB)
5000050000
11.5.2.2.2 Using Locks

Locks prevent a region of memory from being accessed by more than one thread at a time.

function sum_with_lock(n)
1    subtotal = 0
    
2    lock = ReentrantLock()

3    Threads.@threads for i in 1:n
4        Base.@lock lock begin
            subtotal += i
        end
    end

    subtotal
end
@btime sum_with_lock(100_000)
1
Initialize a running total to zero.
2
Create a reentrant lock to ensure only one thread updates subtotal at a time.
3
Parallelize the loop over 1 to n using Threads.@threads.
4
Acquire the lock before updating the shared variable subtotal. This ensures that only one thread updates subtotal at a time, preventing race conditions. The lock is automatically released at the end of the block.
  18.854 ms (199514 allocations: 3.05 MiB)
5000050000
11.5.2.2.3 Using Atomics

Atomics are certain primitive values with a reduced set of operations for which Julia and the compiler can automatically generate thread-safe code. This is often significantly faster than the locking approach, which incurs overhead from acquiring, releasing, and waiting on locks. Compared with locks, atomics are simpler to implement and easier to reason about. The downside is that atomics are limited to the available primitive atomic types and operations.

function sum_with_atomic(n)
1    subtotal = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n 
2        Threads.atomic_add!(subtotal, i)
    end
    subtotal[]
end

@btime sum_with_atomic(100_000)
1
Initialize an atomic integer (Threads.Atomic{Int}) to store the subtotal. An atomic variable ensures that increments are performed atomically, preventing race conditions without needing explicit locks.
2
Atomically add i to subtotal. The Threads.atomic_add! function ensures that the addition and update of subtotal happens as one atomic step, preventing multiple threads from interfering with each other’s updates.
  1.351 ms (53 allocations: 5.31 KiB)
5000050000

11.6 GPU and TPUs

11.6.1 Hardware

Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) are hardware accelerators for massively parallel computations. A TPU is very similar to a GPU but has data types and instructions that are more specialized for linear algebra operations; going forward we will simply refer to both types of accelerators as GPUs.

GPUs have similar components to a CPU, as discussed in Chapter 9. They have RAM, caches for the cores, and cores that run the coded instructions on the data. The primary differences from a CPU are:

  • A GPU typically has thousands of cores while a CPU generally has single or double digit cores.
    • The cores typically operate at a slower clock speed than CPUs, relying on the sheer number of cores to perform computations faster.
  • The GPU cores essentially have to be running the same set of instructions on all of the data, not unlike vectorization (Section 11.4).
    • GPU code is not suited for code with branching conditions (e.g. if statements) and so is more limited in the kinds of computations it can handle compared to the CPU.
  • The RAM is typically much more constrained, often less than a quarter of what primary system RAM might be.
    • As a result, GPUs may need strategies to move chunks of data to and from GPU memory for moderately large datasets. Further, it’s fairly common to use a lower-precision datatype (e.g. Float16 or Float32) to improve overall program throughput at the cost of some precision (a brief illustration follows this list).
  • The caches are similar in concept to CPU caches, but unlike most CPU caches there is locality to the data (core #1 will have much quicker access to its nearby subset of data than, say, core #1024 would).
  • A GPU is usually a secondary device of sorts: it is physically and architecturally separate from the CPU, and the CPU remains in charge of overall execution.
    • The implication of this (as with any movement of memory) is that there is overhead to moving data to and from the GPU. Your calculations will generally need to take at least single-digit milliseconds before you start to see a benefit from utilizing a GPU.
    • To some extent, the separation of CPU, RAM, and GPU is changing with some of the latest computer hardware. For example, the M series of Apple processors have the CPU, GPU, and RAM in a single tightly integrated package for efficiency and computational power.
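As a quick illustration of the memory point, converting data to single precision halves the number of bytes that must be stored on and moved to the GPU:

x64 = rand(10^6)            # one million Float64 values
x32 = Float32.(x64)         # the same values in single precision
sizeof(x64), sizeof(x32)    # (8000000, 4000000) bytes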

11.6.1.1 Notable Vendors and Libraries

Like the difference between x86 and ARM architectures, GPUs also have specific architectures which vary by vendor. To make full use of the hardware, the vendors need to (1) provide device drivers which allow the CPU to talk to the GPU, and (2) provide the libraries (lower-level application programming interfaces, or APIs) which allow developers to utilize different hardware features without needing to write machine code.

As of the mid 2020s, here are the most important GPU vendors and the associated programming library for utilizing their specific hardware:

Important GPU and TPU vendors and the associated library/interface.

| Vendor | Hardware | API Library/Package |
|---|---|---|
| NVIDIA | GeForce, GTX/RTX, various data-center-focused hardware | CUDA |
| AMD | Radeon, various data-center-focused hardware | ROCm |
| Intel | Core, Xeon, Arc processors | oneAPI |
| Apple | M series processors | Metal |
| Google | Tensor processors | TensorFlow |

11.6.2 Utilizing a GPU

With some of the key conceptual differences between CPUs and GPUs explained, let’s explore how to incorporate these powerful hardware accelerators. We will use Julia libraries to illustrate GPU programming, though the concepts are generally applicable to other high-level languages that offer GPU interface libraries.

11.6.2.1 Julia GPU Libraries

There’s essentially two types of GPU programming we will discuss here:

  1. Array-based programming, where arrays are stored and operated on directly on the GPU memory. This approach abstracts away the low-level details, allowing you to work with familiar array operations that are automatically executed on the GPU.
  2. Custom kernels, which are specialized functions that define exactly how each GPU thread should process data in parallel. A kernel explicitly specifies the computation that each GPU thread will perform on its portion of the data.
Note

Kernels in this context are specialized functions that run directly on the GPU. Rather than relying solely on high-level array operations, kernels explicitly define the sequence of low-level, parallel instructions executed across many GPU threads. In other words, a kernel directly expresses the computation you want to perform on the data, enabling fine-grained control over GPU execution.

A third approach would be to implement GPU code in a low-level vendor toolkit (such as C++ and associate CUDA libraries), but this approach will not be illustrated here.

Julia has wonderful support for several of the primary vendors (at the time of writing, CUDA, Metal, oneAPI, and ROCm) via the JuliaGPU organization. Installation of the required dependencies is also very straightforward and the interfaces at the array and generated-kernel levels are very similar. The differences become apparent at the lower-level vendor-API wrappers (the lower-level technique that will not be covered here).

The benefit of the consistency of the higher level libraries we will use here is that examples written for one of the types of accelerators will be largely directly translatable to another. This is especially true for array programming, though less so for the kernel style as architecture-specific considerations often creep in3.

Note

This book is rendered on a Mac and therefore the examples use Metal in order to run the computational cells; however, we’ll show a CUDA translation for some of the examples in order to show how straightforward translating higher-level GPU code in Julia is.

| GPU API | GPU Array Type | Kernel Macro |
|---|---|---|
| CUDA | CuArray | @cuda |
| Metal | MtlArray | @metal |
| oneAPI | oneArray | @oneapi |
| ROCm | ROCArray | @roc |

11.6.2.2 Array Programming on the GPU

First described in Section 6.5, array programming eschews writing loops and instead favors initializing blocks of heap-allocated memory and filling them with data to be operated on at a single point in time. While this is often not the most efficient way to utilize CPUs, it’s essentially the required style of code for utilizing GPUs.

For the example below, we will calculate the present value of a series of cashflows across a number of different scenarios. An explanation of the code is given below the example.

using Metal

1function calculate_present_values!(present_values,cashflows, discount_matrices)
    # Perform element-wise multiplication and sum along the time dimension
2    present_values .= sum(cashflows .* discount_matrices, dims=1)
end

# Example usage using 100 time periods, 100k scenarios
num_scenarios = 10^5
pvs = zeros(Float32,1,num_scenarios)
3cashflows = rand(Float32, 100)
4discount_matrices = rand(Float32, 100, num_scenarios)

# copy the data to the GPU
pvs_GPU = MtlArray(pvs)
5cashflows_GPU = MtlArray(cashflows)
discount_matrices_GPU = MtlArray(discount_matrices)

@btime calculate_present_values!($pvs,$cashflows, $discount_matrices)
6@btime calculate_present_values!($pvs_GPU,$cashflows_GPU, $discount_matrices_GPU)
1
The function calculate_present_values! is written the same way as if we were just writing CPU code. Note that we are also passing a pre-allocated vector, present_values to store the result. This will allow us to isolate the performance of the computation, rather than including any overhead of allocating the array for the result.
2
The cashflow vector is broadcast against the discount matrix so that the single set of cashflows is discounted under each scenario’s discount vector; the result is then summed along the time dimension (dims=1).
3
Metal only supports 32 bit floating point (some CUDA hardware will support 64 bit floating point)
4
Using 100 thousand scenarios for this example.
5
MtlArray(array) will copy the array values to the GPU.
6
Note that the data still lives on the GPU and is of the MtlMatrix (a type alias for a 2-D MtlArray).
  2.755 ms (6 allocations: 38.56 MiB)
  67.959 μs (515 allocations: 13.93 KiB)
1×100000 MtlMatrix{Float32, Metal.PrivateStorage}:
 28.5894  26.293  28.624  27.9553  …  23.4252  25.6425  27.2136  26.9605

The testing suggests roughly 40 times faster computation when performed on the GPU. Note however, that does not include the overhead of (1) moving the data to the GPU (in the initial MtlArray(cashflows) call), or (2) returning the data to the CPU (since the return type for the GPU version is an MtlArray). We can measure this overhead by wrapping the data transfer inside another function and benchmarking it:

function GPU_overhead_test(present_values, cashflows,discount_matrices)
    pvs_GPU = MtlArray(present_values)
    cashflows_GPU = MtlArray(cashflows)
    discount_matrices_GPU = MtlArray(discount_matrices)

    calculate_present_values!(pvs_GPU,cashflows_GPU, discount_matrices_GPU)

    Array(pvs_GPU) # convert to CPU array
end

@btime GPU_overhead_test($pvs,$cashflows,$discount_matrices)
  6.045 ms (993 allocations: 443.91 KiB)
1×100000 Matrix{Float32}:
 28.5894  26.293  28.624  27.9553  …  23.4252  25.6425  27.2136  26.9605

With the additional overhead, the computation on the GPU takes more total time than if the work were done just on the CPU. This is a very simple example, and the balance tips heavily in favor of the GPU when:

  1. The computational demands are significantly higher (e.g. we were to do more calculations than just a simple multiply/divide/sum).
  2. The data size grows bigger.
Note

The previous example can be translated to CUDA by simply exchanging MtlArray for CuArray.

Warning

This example again underscores that hardware parallelization is not an automatic “win” for performance. Much uninformed discussion around modeling performance amounts to simply trying to get things running on the GPU, and it is often not the case that the models will run faster. Further, as the modeling logic gets more complex, it requires greater care to keep GPU constraints in mind (acceptable data types, memory limitations, avoiding scalar operations, data transfer between CPU and GPU, etc.). A best practice is to address sequential performance and memory usage before leveraging GPU accelerators.

11.6.2.3 Kernel Programming on the GPU

Another approach to GPU programming is often referred to as kernel programming, or being much more explicit about how a computation is performed. This is as opposed to the declarative approach in the array-oriented style (Section 11.6.2.2) wherein we specified what we wanted the computation to be.

The key ideas here are that we need to manually specify several aspects which came ‘free’ in the array-oriented style. The tradeoff is that we can be more fine-tuned about how the computation leverages our hardware, potentially increasing performance.

The GPU libraries in Julia abstract much of the low level programming typically necessary for this style of programming, but we still need to explicitly look at:

  1. How the GPU will iterate across different cores/threads.
  2. How many threads to utilize; the optimal number depends on the shape of the computation (long vectors, multi-dimensional arrays), memory constraints, and hardware specifics.
    • GPU threads: individual units of execution within a kernel. Each thread runs the same kernel code but operates on a different portion of the data.
  3. How to chunk (group) the data in order to distribute it to the different GPU threads.

Our strategy for the present values example will be to distribute the work such that different GPU threads are working on different scenarios. Within a scenario, the loop is a very familiar approach: initialize a subtotal to zero and then accumulate the calculated present values.

function calculate_present_values_kernel!(present_values,cashflows, discount_matrices)
1    idx = thread_position_in_grid_1d()

2    pv = 0.0f0
    for t in 1:size(cashflows, 1)
3        pv += cashflows[t] * discount_matrices[t, idx]
    end

4    present_values[idx] = pv
5    return nothing
end
1
As the work is distributed across threads, thread_position_in_grid_1d() will give the index of the current thread so that we can index data appropriately for the work as we decide to split it up (we’ve split up the work by scenario in this example).
2
Recall that we are working with Float32 on the GPU here, so the zero value is set via the f0 notation indicating a 32-bit floating point number.
3
The loop is across timesteps within each thread, while the thread index is tracked with idx.
4
The result is written to the pre-allocated array of present values, and we avoid race conditions because the different threads are working on different scenarios.
5
We don’t explicitly have to return nothing here, but it makes it extra clear that the intention of the function is to mutate the present_values array given to it. This mutation intention is also signaled by the ! convention in the function name.
calculate_present_values_kernel! (generic function with 1 method)

The kernel above was fairly similar to how we might write code for CPU-threaded approaches, but we now need to specify the technicals of launching it on the GPU. The threads argument defines how many independent calculations to run at a given time; the maximum is dependent on the hardware used. The groups argument defines the number of threads that share memory and synchronize results together (meaning the group will wait for all of its threads to finish before moving on to the next chunk of data). The push-pull here is that threads that can share data avoid needing to create duplicate copies of that data in memory. However, if there is variability in how long each calculation takes, then the waiting time for synchronizing results may slow down the overall computation.

Our task utilizes shared memory of the cashflows for each thread, so through some experimentation in advance, we find that a relatively large group size of ~512 is optimal.

We bring this all together through the use of the kernel macro @metal:

threads = 1024
groups = cld(num_scenarios, 512)

@btime @metal threads=$threads groups=$groups calculate_present_values_kernel!(
    $pvs_GPU,
    $cashflows_GPU, 
    $discount_matrices_GPU
    )
  17.250 μs (149 allocations: 3.33 KiB)
Metal.HostKernel{typeof(calculate_present_values_kernel!), Tuple{MtlDeviceMatrix{Float32, 1}, MtlDeviceVector{Float32, 1}, MtlDeviceMatrix{Float32, 1}}}(Main.calculate_present_values_kernel!, Metal.MTL.MTLComputePipelineStateInstance (object of type AGXG16XFamilyComputePipeline))

This is roughly four times faster than the array-oriented style above, meaning that the GPU kernel version’s computation is more than 150 times faster than the CPU version. However, we saw previously that the cost of moving the data to GPU memory and then back to CPU memory was the biggest time sink of all - again, we’d need more scale in the problem to make offloading to the GPU beneficial overall.

Note

The Metal GPUs are able to iterate threads across three different dimensions. In the prior example, we only used one dimension and thus used thread_position_in_grid_1d(). If we were distributing the threads across, say, two dimensions then we would use thread_position_in_grid_2d().

How do you determine how many dimensions to use? A good approach is to mimic the data you are trying to parallelize. In the example of calculating a vector of present values across 100k scenarios, that was the primary ‘axis of parallelization’. If instead of a one-dimensional set of cashflows (e.g. a single asset with fixed cashflows), we had a two-dimensional set of cashflows (e.g. a portfolio of many assets), then we may find the best balance of code simplicity and performance to iterate across two dimensions of threads (but we are still limited by the same number of total available threads).

Note

The above example would be translated to CUDA by changing just a few things:

  • The thread indexing would be idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x instead of idx = thread_position_in_grid_1d(), since CUDA exposes the position within a block (threadIdx) and the block index (blockIdx) separately rather than a single global index.
  • The GPU arrays should be created with CuArray instead of MtlArray.
  • The kernel macro would be @cuda threads=1024 blocks=cld(num_scenarios, 1024) calculate_present_values_kernel!(...) instead of @metal threads=threads groups=groups calculate_present_values_kernel!(...). The memory sharing and synchronizing between threads is more manual than in Metal.jl, but that is not strictly necessary for our example.
function calculate_present_values_kernel!(present_values, cashflows, discount_matrices)
    # global thread index across all blocks
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # guard against the final block running past the number of scenarios
    if idx <= size(discount_matrices, 2)
        pv = 0.0f0
        for t in 1:size(cashflows, 1)
            pv += cashflows[t] * discount_matrices[t, idx]
        end
        present_values[idx] = pv
    end
    return nothing
end

blocks = cld(num_scenarios, 1024)

@cuda threads=1024 blocks=blocks calculate_present_values_kernel!(
    pvs_GPU,
    cashflows_GPU, 
    discount_matrices_GPU
    )

11.7 Multi-Processing / Multi-Device

Multi-device computing refers to using separate groups of memory/processor combinations to accomplish tasks in parallel. This can be as simple as multiple distinct cores within a single desktop computer, many separate computers networked across the internet, or many processors within a high-performance cluster or a computing-as-a-service provider like Amazon Web Services or JuliaHub.

Everything discussed previously related to hardware (Chapter 9, Section 11.5, Section 11.6) continues to apply. The additional complexity is attempting to synchronize the computation across multiple sets of the same (homogeneous) or different (heterogeneous) hardware.

As you might imagine, approaches to multi-device computing can vary widely. Julia’s approach tries to strike the balance between capability and user-friendliness and uses a primary/worker model wherein one of the processors is the main coordinator while other processors are “workers”. If only one processor is started, then the main processor is also a worker processor. This main/worker approach uses a “one-sided” approach to coordination. The main worker utilizes high level calls and the workers respond, with some of the communication and hand-off handled by Julia transparently from the user’s perspective.

A useful mental model is the asynchronous task-based concepts discussed in Section 11.5.1, as the main worker will effectively queue work with the worker processors. Because there may be a delay associated with the computation or the communication between the processors, the worker runs asynchronously.

| Description | Task API | Distributed Analogue |
|---|---|---|
| Create a new task | Task() | @spawnat |
| Run task asynchronously | @async | @spawnat |
| Retrieve task result | fetch | fetch |
| Wait for task completion | @sync | @sync |
| Communication between tasks | Channel | RemoteChannel |
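As a minimal sketch of the primary/worker pattern (assuming worker processes have already been added via addprocs):

using Distributed

# queue a small computation on any available worker; a Future is returned immediately
future = @spawnat :any (myid(), sum(rand(10^6)))

# fetch blocks until the worker finishes and then copies the result back
fetch(future)   # e.g. (2, 500123.4): the id of the worker that ran it and its computed sum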

Adapting the trade producer and monitor example from above to run on multiple processes (see Section 11.5.1.1 to review the base model and algorithm), we make a few key changes:

  • using Distributed loads the Distributed standard Julia library, providing the interface for multi-processing across different hardware.
  • addprocs(n) will add n worker processes (before any are added, the main process also acts as the sole worker). Called with just a count, addprocs starts new Julia processes on the local machine (you can see these in the machine’s task manager), and they inherit the package environment (i.e. Project.toml and environment variables) from the main process; this inheritance does not happen automatically for workers that are not on the same local machine.
  • myid() is the identification number of the given processor that’s been spun up.
  • We use a RemoteChannel instead of a Channel to facilitate communication across processors.
  • Instead of @async, we use @spawnat n to create a task for processor number n (or :any will automatically assign a processor).


using Distributed
let

    # Add worker processes if not already added
    if nworkers() == 1
        addprocs(4)  # Add 4 worker processes
    end
    
    @everywhere function trade_producer(channel, i)
        sleep(rand())
        profit = randn()
        put!(channel, profit)
        println("Producer $(myid()): Trade Result #$i $(round(profit, digits=3))")
    end
    
    @everywhere function portfolio_monitor(channel, n)
        sum = 0.0
        for _ in 1:n 
            profit = take!(channel)
            sum += profit
            println("Monitor $(myid()): Received $(round(profit, digits=3)), Cumulative profit: $(round(sum, digits=3))")
        end
    end
    
    function run_distributed_simulation()
        channel = RemoteChannel(() -> Channel{Float64}(32))
    
        # Start producer and consumer tasks
        @sync begin
            for i in 1:5
                @spawnat :any trade_producer(channel, i)
            end
            @spawnat :any portfolio_monitor(channel, 5)
        end
    
        # Close the channel and wait for tasks to finish
        close(channel)
    end
    
    # Run the simulation
    run_distributed_simulation()
end
      From worker 2:    Producer 2: Trade Result #5 0.917
      From worker 3:    Monitor 3: Received 0.917, Cumulative profit: 0.917
      From worker 3:    Monitor 3: Received 0.374, Cumulative profit: 1.291
      From worker 2:    Producer 2: Trade Result #1 0.374
      From worker 3:    Monitor 3: Received 0.473, Cumulative profit: 1.764
      From worker 5:    Producer 5: Trade Result #4 0.473
      From worker 3:    Producer 3: Trade Result #2 0.669
      From worker 3:    Monitor 3: Received 0.669, Cumulative profit: 2.433
      From worker 3:    Monitor 3: Received 0.816, Cumulative profit: 3.249
      From worker 4:    Producer 4: Trade Result #3 0.816

Given the similarity to the single-process task-based version above, what’s the motivation for bothering with a distributed approach? A few differences:

  • In this simplified example, we are simply starting additional Julia processes on the same machine. Like a threaded approach, the work will be split across the same multi-core processor. In this context, the main difference is that the processes do not share memory.
    • Communicating across processes generally has a little bit more overhead than communicating across threads.
  • If distributing across machines, avoiding memory sharing is advantageous because each machine has its own memory store, which need not compete with the main process (such as when working on distributed chunks of large datasets). This helps with memory-constrained problems since you are no longer limited by the memory size or throughput of a single machine.
  • The worker processors don’t need to be the same architecture as the main processor, allowing usage of different machines or cloud computing that is communicating with a local main process.

11.8 Parallel Programming Models

The previous sections have explained the different parallelization mechanisms and how to directly utilize them to harness additional computing power. Each approach (multi-threading, GPU, distributed processing, etc.) has unique considerations and trade-offs. These approaches are generally much more accessible in Julia to beginning and intermediate users than in other languages, but they admittedly still require a decent amount of thought and care.

It is possible, if you are willing to give up some fine-grained control, to utilize some higher level approaches which look to abstract away some of the particularities of the implementation.

11.8.1 Map-Reduce

Map-Reduce (Section 6.4.4) operations are inherently parallelizable and various libraries provide parallelized versions of the base mapreduce. This is the workhorse function of many ‘big data’ workloads and many statistical operations are versions of mapreduce.
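For instance, common statistics are just particular choices of the map and reduce functions:

data = rand(1000)

# the sample mean: the map is the identity and the reduction is +
avg = mapreduce(identity, +, data) / length(data)

# the sum of squared deviations (a building block of the variance) is also a map-reduce
ssd = mapreduce(x -> (x - avg)^2, +, data)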

11.8.1.1 Multi-Threading

11.8.1.1.1 OhMyThreads

OhMyThreads.jl provides threaded versions of essential functions such as tmap, tmapreduce, tcollect, and tforeach (threaded counterparts of the functional methods discussed in Chapter 6). In most cases, the chunking and data sharing are handled automatically for you.

import OhMyThreads
@btime OhMyThreads.tmapreduce(x -> x, +, 1:100_000)
  5.542 μs (75 allocations: 5.77 KiB)
5000050000
11.8.1.1.2 ThreadsX

ThreadsX.jl is built on top of the wonderful Transducers.jl package, though the latter is a bit more advanced (more abstract, but as a result more composable and powerful). ThreadsX provides threaded versions of many popular base functions. It offers a wider set of ready-made threaded functions than OhMyThreads, but has a much more complex codebase. For the vast majority of threading needs, OhMyThreads.jl should be sufficient and performant. See the documentation for all of the implemented functions, but for our illustrative example:

import ThreadsX
@btime ThreadsX.mapreduce(x -> x, +, 1:100_000)
  42.667 μs (890 allocations: 67.48 KiB)
5000050000

11.8.1.2 Multi-Processing

reduce(op, pmap(f, collection)) will use a distributed map and then reduce the resulting array on the main process. This pattern works well if each application of f to an element of collection is costly.

@distributed (op) for x in collection; f(x); end is a way to write the loop with the reduction op for which the f need not be costly.

The difference between the two approaches is that with pmap, collection is made available to all workers. In the @distributed approach, the collection is partitioned and only a subset is sent to the designated workers.

Here’s an example of both of these, calculating a simple example of counting coin flips:

# this is an example of poor utilization of pmap, since the operation is 
# fast and the overhead of moving the whole collection dominates
@btime reduce(+,pmap(x -> rand((0,1)),1:10^3))
  118.297 ms (68142 allocations: 3.28 MiB)
505
function dist_demo()
    @distributed (+) for _ in 1:10^5
        rand((0,1))
    end
end

@btime dist_demo()
  231.667 μs (313 allocations: 13.92 KiB)
49942

11.8.2 Array-Based

Array-based approaches will often utilize the parallelism of SIMD on the CPU or the many cores of a GPU. It can be as simple as using generic library calls which are often optimized at the compiler or library level. Examples:

let
    x = rand(Float32,10^8)
    x_GPU = MtlArray(x)
    @btime sum($x)
    @btime sum($x_GPU)
end
  4.896 ms (0 allocations: 0 bytes)
  1.263 ms (800 allocations: 19.99 KiB)
4.9996484f7

sum(x) compiles to SIMD instructions on the CPU, while using the GPU array type in sum(x_GPU) is enough to let the compiler dispatch on the GPU type and emit efficient, parallelized code for the GPU.

Distributed array types allow for large datasets to effectively be partitioned across multiple processors, and have implementations in the DistributedArrays.jl and Dagger.jl libraries.
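A minimal sketch with DistributedArrays.jl (assuming the package is installed and worker processes have already been added):

using Distributed
@everywhere using DistributedArrays   # the workers need the package loaded as well

d = distribute(rand(10^6))   # partition a local array across the available workers
sum(d)                       # the reduction is computed chunk-by-chunk on the workers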

11.8.3 Loop-Based

Loops which don’t have race conditions can easily become multi-threaded. Here, we have three versions of updating a collection to square the contained values:

let v = collect(1:10000)

    for i in eachindex(v)
        v[i] = v[i]^2
    end
    v[1:3]
end
3-element Vector{Int64}:
 1
 4
 9

Using multi-threading:

let v = collect(1:10000)
    Threads.@threads  for i in eachindex(v)
        v[i] = v[i]^2
    end
    v[1:3]
end
3-element Vector{Int64}:
 1
 4
 9

Using multi-processing:

let v = collect(1:10000)
    Distributed.@distributed for i in eachindex(v)
        v[i] = v[i]^2
    end
    v[1:3]
end
3-element Vector{Int64}:
 1
 2
 3

Note that the multi-processing version above did not actually square the values seen by the main process (the first three elements are still 1, 2, 3): each worker operates on its own copy of v, since processes do not share memory. For more advanced usage, including handling shared memory across processes and threads, see Section 11.7 and Section 11.5.

11.8.4 Task-Based

Task based approaches attempt to abstract the scheduling and distribution of work from the user. Instead of saying how the computation should be done, the user specifies the intended operations and allows the library to handle the workflow. The main library for this in Julia is Dagger.jl.

Effectively, the library establishes a network topology (a map of how different processor nodes can communicate) and models the work as a directed, acyclic graph (a DAG, which is like a map of how the work is related). The library is then able to assign the work in the appropriate order to the available computation devices. The benefit of this is most apparent with complex workflows or network topologies where it would be difficult to manually assign, communicate, and schedule the workflow.

Here’s a very simple example which demonstrates Dagger waiting for the two tasks which work in parallel (we already added multiple processors to this environment in Section 11.7):

import Dagger

# This runs first:
a = Dagger.@spawn rand(100, 100)

# These run in parallel:
b = Dagger.@spawn sum(a)
c = Dagger.@spawn prod(a)

# Finally, this runs:
wait(Dagger.@spawn println("b: ", fetch(b), ", c: ", fetch(c)))
b: 4995.2353000620105, c: 0.0

11.9 Choosing a Parallelization Strategy

There is no one-size-fits-all strategy to parallelization. Here are some general guides to thinking about what parallelization technique to try given the circumstances:

  • CPU-bound workloads with manageable memory demands: If your entire dataset fits comfortably in RAM and your operations are primarily arithmetic or straightforward loops, start by optimizing your single-threaded performance and consider vectorization (SIMD) for inner loops and multi-threading for parallelizable tasks. This approach leverages your CPU cores efficiently without introducing significant complexity.

  • Large-scale linear algebra or highly data-parallel computations: If your problem involves large matrix operations, linear algebra routines, or embarrassingly parallel computations that can be batched over many independent elements, a GPU or other specialized accelerator may be beneficial. GPUs excel at uniform computations over large datasets and can provide substantial speedups, assuming data transfer overhead and memory constraints are managed effectively. Note that standard linear algebra libraries are likely to already parallelize on the CPU without any explicit parallelization needing to be coded on your part (a quick check is shown after this list).

  • Distributing work across multiple machines or heterogeneous resources: If you need to scale beyond a single machine’s CPU and GPU capabilities—whether due to extremely large datasets, the need for concurrent access to geographically distributed resources, or leveraging specialized hardware—then consider distributed computing. Spreading tasks across multiple processes, servers, or clusters can scale performance horizontally. Just keep in mind the overhead of communication, potential data partitioning strategies, and the complexity of managing a distributed environment.
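For example, the dense linear algebra routines that ship with Julia call a multi-threaded BLAS under the hood. A quick check (a sketch, assuming LinearAlgebra and BenchmarkTools are loaded):

using LinearAlgebra

BLAS.get_num_threads()   # threads the BLAS library is already configured to use

A = rand(1000, 1000); B = rand(1000, 1000)
@btime $A * $B;          # this matrix multiply is parallelized with no extra code on our part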

In practice, you may find that a combination of these approaches is ideal: start simple, measure performance, and iterate. By understanding your workload and hardware constraints, you can make informed decisions that balance complexity, cost, and the performance gains of parallel computing.

11.10 References


  1. Technically, it’s possible that the second task doesn’t write to the array first. This could happen if there’s enough tasks (from our program or others on the computer) that saturate the CPU during the first task’s sleep period such that the first task gets picked up again before the second one does.↩︎

  2. There are some chips which do not have access to the same memory in a multi-threading context, and are known as non-uniform memory access (NUMA). These architectures work more like those in Section 11.7.↩︎

  3. The KernelAbstractions.jl library actually allows you to write generic kernels which then get compiled into different code depending on the backend you are using.↩︎