Alec Loudenback
“Fundamentally, computer science is a science of abstraction—creating the right model for a problem and devising the appropriate mechanizable techniques to solve it. Confronted with a problem, we must create an abstraction of that problem that can be represented and manipulated inside a computer. Through these manipulations, we try to find a solution to the original problem.” - Al Aho and Jeff Ullman (1992)
Adapting computer science concepts to work for financial professionals. Computability, computational complexity, and the language of algorithms and problem solving. A survey of important data structures. More advanced testing and verification topics.
Computer science as a term can be a bit misleading because of the overwhelming association with the physical desktop or laptop machines that we call “computers”. The discipline of computer science is much richer than consumer electronics: at its core, computer science concerns itself with broad areas of research and tough questions.
For a reader in the twenty-first century, we hope it is patently obvious how impactful applied computer science has been: the internet, artificial intelligence, computational photography, safety control systems, and more have transformed our lives. It is a testament to the utility of being able to harness computer science ideas for practical use.
It’s common for beneficial advances in knowledge and application to occur at the boundary between two disciplines. It’s here in this chapter that we bring the financial discipline together with computer science and provide the financial practitioner with the language and concepts to leverage some of computer science’s most relevant ideas.
An Algorithm is a general term for a process that transforms an input into an output. It’s the set of instructions dictating how to carry out a process. That process needs to be specified in sufficient detail to be called an algorithm (versus a heuristic, which lacks that specificity).
An algorithm might be the directions to accomplish the following task: summing a series of consecutive integers from \(1\) to \(n\). There are multiple ways that this might be accomplished, each one considered a distinct algorithm:
iterating over each number and summing them up (starting with the smallest number)
iterating over each number and summing them up (starting with the largest number)
summing up the evens and then the odds, then adding the two subtotals
and many more distinct algorithms…
We will look at some specific examples of these alternate algorithms as we introduce the next topic, computational complexity.
We can characterize the computational complexity of a problem by looking at how long an algorithm takes to complete a task when given an input of size \(n\). We can then compare two approaches to see which is computationally less complex for a given \(n\). This is a way of systematically evaluating an algorithm to determine its efficiency when being computed.
Note that computational complexity isn’t quite the same as how fast an algorithm will run on your computer, but it’s a very good guide. Modern computer architectures can sometimes execute multiple instructions in a single CPU cycle, making an algorithm that is slower on paper than another actually run faster in practice. Additionally, some algorithms are able to substantially limit the number of computations to be performed at the expense of using a lot more memory, thereby trading CPU usage for RAM usage.
You can think of computational complexity as a measure of how much work is to be performed. Sometimes the computer is able to perform certain kinds of work more efficiently.
Further, when we analyze an algorithm, recall that ultimately our code gets translated into instructions for the computer hardware. Some instructions are implemented such that, for a given type of number (e.g. floating point), it doesn’t matter if the number is `1.0` or `0.41582574300044717`; the operation will take the exact same time and number of instructions to execute (e.g. for the addition operation).
Sometimes a higher-level operation is implemented in a way that takes many machine instructions. For example, division instructions may require many more CPU cycles than multiplication or addition. Sometimes this is an important distinction and sometimes not. For this book we will ignore this granularity of analysis.
Take for example the problem of determining the sum of integers from \(1\) to \(n\). We will explore three different algorithms and the associated computational complexity for them.
A mathematical proof can show a simple formula for the result. This allows us to compute the answer in constant time, which means that for any \(n\), our algorithm is essentially the same amount of work.
nsum_constant(n) = n * (n + 1) ÷ 2
nsum_constant (generic function with 1 method)
In this we see that we perform three operations: a multiplication, an addition, and a division, no matter what `n` is. If `n` is `10_000_000`, we’d expect this to complete in about a single unit of time.
This algorithm performs a number of operations which grows in proportion with \(n\) by individually summing up each element in \(1\) through \(n\):
function nsum_linear(n)
= 0
result for i in 1:n
+= i
result end
resultend
nsum_linear (generic function with 1 method)
If \(n\) were `10_000_000`, we’d expect it to run with roughly 10 million operations, or about 3 million times as many operations as the constant time version. We can say that this version of the algorithm will take approximately \(n\) steps to complete.
What if we were less efficient, and instead were only ever able to increment our subtotal by one? That is, instead of adding up \(1+3\), we had to do four operations: \(1+1+1+1\). We can add a second loop which increments our result by a unit instead of simply adding the current `i` to the running total `result`. This makes our algorithm work much harder since it has to add numbers so many more times (recall that to a computer, adding two numbers is the same computational effort regardless of what the numbers are).
- `i` loops over the integers \(1\) to \(n\).
- `j` does the busy work of adding \(1\) to our subtotal, `i` times (as sketched in the code below).
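The following is a reconstruction consistent with the description of `i` and `j` above and with the `nsum_quadratic` output that follows; the original listing may differ in minor details:

function nsum_quadratic(n)
    result = 0
    for i in 1:n
        for j in 1:i
            result += 1   # add one unit at a time, `i` times per outer iteration
        end
    end
    result
end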
nsum_quadratic (generic function with 1 method)
Breaking down the steps:
- When `i` is `1`, there is 1 addition in the inner loop
- When `i` is `2`, there are 2 additions in the inner loop
- …
- When `i` is `n`, there are `n` additions in the inner loop

Therefore, this computation takes \(1 + ... + (n-2) + (n-1) + n\) steps to complete. Algebraically, this simplifies down to our constant time formula \(n \times (n + 1) ÷ 2\), or \((n^2 + n) ÷ 2\), steps to complete.
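As a quick sanity check (an illustrative line, not part of the original text), all three implementations agree on the result:

nsum_constant(100) == nsum_linear(100) == nsum_quadratic(100)   # true; each returns 5050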
We can categorize the above implementations using a convention called Big-O Notation2, which is a way of distilling and classifying computational complexity. We characterize an algorithm by the most significant term in its total number of operations. Table 13.1 shows the computational cost, complexity description, Big-O order, and approximate number of steps for the examples constructed above.
| Function | Computational Cost | Complexity Description | Big-O Order | Steps (\(n=10,000\)) |
|---|---|---|---|---|
| `nsum_constant` | fixed | Constant | \(O(1)\) | ~1 |
| `nsum_linear` | \(n\) | Linear | \(O(n)\) | ~10,000 |
| `nsum_quadratic` | \((n^2 + n) ÷ 2\) | Quadratic | \(O(n^2)\) | ~100,000,000 |
Table 13.2 shows a comparison of a more extended set of complexity levels. For the most complex categories of problems, the cost to compute grows so fast that it boggles the mind. What sorts of problems fall into the most complex categories? Examples of \(O(2^n)\), or exponential complexity, include the traveling salesman problem3 solved with dynamic programming, or the recursive approach to calculating the \(n\)th Fibonacci number. The beastly \(O(n!)\) algorithms include brute-force solving of the traveling salesman problem or enumerating all partitions of a set. In financial modeling, we may encounter these sorts of problems in portfolio optimization (e.g. the brute-force approach of testing every potential combination of assets to optimize a portfolio).
| Big-O Order | Description | \(n=10\) | \(n=1,000\) | \(n=1,000,000\) |
|---|---|---|---|---|
| \(O(1)\) | Constant Time | 1 | 1 | 1 |
| \(O(n)\) | Linear Time | 10 | 1,000 | 1,000,000 |
| \(O(n^2)\) | Quadratic Time | 100 | 1,000,000 | \(10^{12}\) |
| \(O(\log(n))\) | Logarithmic Time | 3 | 7 | 14 |
| \(O(n \times \log(n))\) | Linearithmic Time | 30 | 7,000 | 14,000,000 |
| \(O(2^n)\) | Exponential Time | 1,024 | ~\(10^{300}\) | ~\(10^{301029}\) |
| \(O(n!)\) | Factorial Time | 3,628,800 | ~\(10^{2567}\) | ~\(10^{5565708}\) |
We care only about the most significant term because when \(n\) is large, the most significant term dominates. For example, our quadratic time example takes \((n^2 + n) ÷ 2\) steps; if \(n\) is a large number like \(10^6\), then we see that it results in:

\[\begin{align*} \frac{n^2 + n}{2} &= \frac{(10^6)^2 + 10^6}{2} \\ &= \frac{10^{12}}{2} + \frac{10^6}{2} \end{align*}\]

The \(\frac{10^{12}}{2}\) term is significantly more important than \(\frac{10^6}{2}\) (one million times as large, to be precise). This is why Big-O notation reduces the problem down to only the most significant complexity cost term.
If \(n\) is small, then we don’t really care much about computational complexity. This is a lesson for our efforts as developers: focus on the most intensive parts of calculations when looking to optimize, and don’t worry about seldom-used portions of the code.
The preceding examples of constant, linear, and quadratic time are conceptually correct, but if we run them in practice we see that the description doesn’t seem to hold at all for the linear time version, which runs as quickly as the constant time version.
using BenchmarkTools
@btime nsum_constant(10_000)
0.916 ns (0 allocations: 0 bytes)
50005000
@btime nsum_linear(10_000)
1.666 ns (0 allocations: 0 bytes)
50005000
@btime nsum_quadratic(10_000)
1.104 μs (0 allocations: 0 bytes)
50005000
What happened is that the compiler was able to understand and optimize the linear version such that it effectively transformed it into the constant time version, avoiding the iterative summation that we had written. For examples simple enough to use as teaching problems, the compiler can often optimize differently written code down to the same efficient machine code (this is the same Triangular Number optimization we saw in Section 5.4.3.4).
Another consideration is that there may be one approach which performs better in the majority of cases, at the expense of having very poor performance in specific cases. Sometimes we may risk those high cost cases if we expect the benefit to be worthwhile on the rest of the problem set.
This often happens when the data we are working with has some concept of “distance”. For example, in multi-stop route planning we can use the idea that it’s likely more efficient to visit nearby destinations first. Generally this works, but sometimes the nearest destination actually has a high cost (such as real-world obstacles in the way which force you to drive past other, farther-away locations to get there).
So far we have focused on computational complexity, however similar analysis could be performed for space complexity, which is how much computer memory is required to solve a problem. Sometimes, an algorithm will trade computational complexity for space complexity. That is, we might be able to solve a problem much faster if we have more memory available.
For example, there has been research into matrix multiplication algorithms which do indeed require fewer operations than the traditional techniques. However, those algorithms don’t get implemented on computers because they require far more memory than is available!
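Going the other direction, spending memory to save computation, a memoized version of the recursive Fibonacci calculation mentioned above makes the trade concrete. The sketch below is illustrative and not from the original text:

# Naive recursion: exponential time, essentially no extra memory
fib(n) = n <= 2 ? 1 : fib(n - 1) + fib(n - 2)

# Memoized version: roughly linear time, at the cost of storing every intermediate result
const fib_cache = Dict{Int,Int}()
function fib_memo(n)
    n <= 2 && return 1
    get!(fib_cache, n) do
        fib_memo(n - 1) + fib_memo(n - 2)
    end
end

fib(30) == fib_memo(30)   # true, but the memoized version performs far fewer additions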
The idea of algorithmic complexity is important because it grounds us in the harsh truth that some problems are very difficult to compute. It’s in these cases that creativity and domain-specific heuristics become the foremost consideration.
We must remember to be thoughtful about the design of our models. When searching for additional performance, look for the “loops-within-loops”, which is where combinatorial explosions tend to happen. It is by focusing on the places that have large \(n\) or poor Big-O order that you can transform the performance of the overall model. Sometimes, though, the fundamental complexity of the problem at hand forbids greater efficiency.
The science of data structures is about how data is represented conceptually and in computer applications.
For example, how should the string “abcdef” be represented and analyzed?
There are many common data structures and many specialized subtypes. We will describe some of the most common ones here. Julia has many data structures available in the Base library, but an extensive collection of other data structures can be found in the DataStructures.jl package.
An array is a contiguous block of memory containing elements of the same type, accessed via integer indices. Arrays have fast random access and are the fastest data structure for linear/iterated access of data.
In Julia, an array is a very common data structure and is implemented with a simple declaration, such as:
x = [1, 2, 3]
In memory, the integers are stored as consecutive bits representing the integer values of `1`, `2`, and `3`, and would look like this (with the different integers shown on new lines for clarity):
0000000000000000000000000000000000000000000000000000000000000001
0000000000000000000000000000000000000000000000000000000000000010
0000000000000000000000000000000000000000000000000000000000000011
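You can inspect these bit patterns yourself with Julia's built-in `bitstring` function (a small illustration, not from the original text):

bitstring(1)   # "0000000000000000000000000000000000000000000000000000000000000001"
bitstring(3)   # "0000000000000000000000000000000000000000000000000000000000000011"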
This is great for accessing the values one-by-one or in consecutive groups, but it’s not efficient if values need to be inserted in between. For example, if we wanted to insert `0` between the `1` and `2` in `x`, then we’d need to ask the operating system to allocate more memory4, shift the bytes that come after the insertion point, and write the new value into the second position. Inserting values at the end (`push!(array, value)`) is usually fast unless more memory needs to be allocated.
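To make the difference in cost concrete (an illustrative sketch, not from the original text), Julia's built-in `insert!` and `push!` correspond to the two cases:

x = [1, 2, 3]
insert!(x, 2, 0)   # x is now [1, 0, 2, 3]; the later elements had to shift (and memory may be reallocated)
push!(x, 4)        # x is now [1, 0, 2, 3, 4]; appending is usually cheap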
A linked list is a chain of nodes where each node contains a value and a pointer to the next node. Linked lists allow for efficient insertion and deletion but slower random access compared to arrays.
In Julia, a simple linked list node could be implemented as:
mutable struct Node
    value::Any
    next::Union{Node, Nothing}
end

z = Node(3, nothing)
y = Node(2, z)
x = Node(1, y)
`nothing` represents the end of the linked list.
Inserting a new node between existing nodes is efficient - if we wanted to insert a new node between the ones with value `2` and `3`, we could do this:
a = Node(0, z) # <1> Create a new `Node` with `next` equal to `z`
y.next = a # <2> Set the reference to `next` in `y` to be the new `Node` `a`.
Accessing the \(n\)th element requires traversing the list from the beginning, checking each `Node`’s `next` value. This iterative approach makes it \(O(n)\) time complexity for random access. This is in contrast to an array, where you know right away where the \(n\)th item will be in the data structure.
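A sketch of that traversal (the helper `nth_value` is hypothetical and not from the original text):

# Follow the `next` references n-1 times from the head: O(n) random access
function nth_value(head::Node, n)
    node = head
    for _ in 1:(n - 1)
        node = node.next
    end
    node.value
end

nth_value(x, 2)   # 2, found only after walking the chain from `x`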
Also, the linked list is one-directional. Items further down the chain don’t know what node points to them, so it’s impossible to traverse the list backwards.
There are many related implementations which make random access or traversal in reverse order more efficient such as doubly-linked lists or “trees” (Section 13.5.6) which organize the data not as a chain but as a tree with branching nodes.
A record is an aggregate of named fields, typically of fixed size and sequence. Records group related data together. We’ve encountered structs in Section 5.4.7, but here we’ll add that simple structs with primitive fields can themselves be represented without creating pointers to the data stored:
struct SimpleBond
    id::Int
    par::Float64
end

struct LessSimpleBond
    id::String
    par::Float64
end
a = SimpleBond(1, 100.0)
b = LessSimpleBond("1", 100.0)
isbits(a), isbits(b)
(true, false)
Because `a` is composed of simple elements, it can be represented as a contiguous set of bits in memory. It would look something like this:
0000000000000000000000000000000000000000000000000000000000000001   # id: 1
0100000001011001000000000000000000000000000000000000000000000000   # par: 100.0
In contrast, the `LessSimpleBond` uses a `String` to represent the ID of the bond. Strings are essentially arrays of characters, and arrays are mutable containers, which by definition are not a constant set of bits. In memory, `b` would look like:
.... a pointer ...                                                  # id: reference to the String "1"
0100000001011001000000000000000000000000000000000000000000000000   # par: 100.0
In performance critical code, having data that is represented with simple bits instead of references/pointers can be much faster (see Chapter 25 for an example).
For many mutable types, there are immutable, bits-type alternatives. For example:

- Arrays have a `StaticArray` counterpart (from the StaticArrays.jl package).
- Strings have `InlineStrings` (from the InlineStrings.jl package), which use fixed-width representations of strings.
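A small sketch of both alternatives (assuming the StaticArrays.jl and InlineStrings.jl packages are installed; this example is not from the original text):

using StaticArrays, InlineStrings

v = SVector(1, 2, 3)     # fixed-size, immutable counterpart to [1, 2, 3]
s = String7("ABC123")    # fixed-width inline string (up to 7 bytes)
isbits(v), isbits(s)     # (true, true): both are plain bits, no pointers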
The downsides to the immutable alternatives (other than the loss of potentially desired flexibility that mutability provides) are that they can be harder on the compiler (more upfront compilation cost) due to the specialized cases involved.
Hashes are the result of a hash function that maps arbitrary data to a fixed-size value. It’s sort of a “one way” mapping to a simpler value, which has the benefits of:

- being deterministic (the same input always produces the same hash),
- being effectively impossible to reverse (the original data can’t practically be recovered from the hash), and
- allowing large or complex data to be compared quickly by comparing the small, fixed-size hashes instead.
For example, we can calculate a type of hash called an SHA hash on any data:
import SHA
let
    a = SHA.sha256("hello world") |> bytes2hex
    b = SHA.sha256(rand(UInt8, 10^6)) |> bytes2hex
    println(a)
    println(b)
end
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
84ab8a078063583a4de14efe374454878a541b0c4575d43d5f4d979f98ac6bb9
We can easily verify that the `sha256` hash of `"hello world"` is the same each time, but it’s virtually impossible to guess `"hello world"` if we are just given the resulting hash. This is the premise of trying to “crack” a password when the stored password hash is stolen.
One way to check if two sets of data are the same is to compute the hash of each and see if the resulting hashes are equal. For example, maybe you want to see if two data files with different names contain the same data - comparing the hashes is an effectively certain way to determine whether they contain the same data.
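For instance (a sketch with hypothetical file names, not from the original text), SHA.jl can hash a file's contents directly from an open stream:

import SHA

hash_a = open(SHA.sha256, "positions_2024.csv") |> bytes2hex
hash_b = open(SHA.sha256, "positions_copy.csv") |> bytes2hex
hash_a == hash_b   # true only if the two files hold identical bytes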
Dictionaries map a key to a value. More specifically, they use the hash of a key to store a reference to the value.
Dictionaries offer constant-time average-case access but must handle potential collisions of keys (generally, more robust collision handling means a higher fixed cost per access).
Here’s an illustrative portfolio of assets indexed by CUSIPs:
assets = Dict(
    # CUSIP => Asset
    "037833AH4" => Bond("General Electric",...),
    "912828M80" => Equity("AAPL",...),
    "594918BQ1" => Bond("ENRON",...),
)
Then, lookup is performed by indexing the dictionary by the desired key:

assets["037833AH4"] # gives the General Electric Bond
A graph is a collection of nodes (also called vertices) connected by edges to represent relationships or connections between entities. Graphs are versatile data structures that can model various real-world scenarios such as social networks, transportation systems, or computer networks.
In Julia, a simple graph could be implemented using a dictionary where keys are nodes and values are lists of connected nodes:
struct Graph
    nodes::Dict{Any, Vector{Any}}
end

function add_edge!(graph::Graph, node1, node2)
    push!(get!(graph.nodes, node1, []), node2)
    push!(get!(graph.nodes, node2, []), node1)
end

g = Graph(Dict())
add_edge!(g, 1, 2)
add_edge!(g, 2, 3)
add_edge!(g, 1, 3)
This implementation represents an undirected graph. For a directed graph, you would only add the edge in one direction.
Graphs can be traversed using various algorithms such as depth-first search (DFS) or breadth-first search (BFS). These traversals are useful for finding paths, detecting cycles, or exploring connected components.
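As an illustration (a minimal sketch using the `Graph` struct above, not from the original text), a breadth-first search over the neighbors dictionary might look like this:

# Breadth-first traversal starting from `start`, returning nodes in visit order
function bfs(graph::Graph, start)
    visited = [start]
    queue = [start]
    while !isempty(queue)
        node = popfirst!(queue)
        for neighbor in get(graph.nodes, node, [])
            if neighbor ∉ visited
                push!(visited, neighbor)
                push!(queue, neighbor)
            end
        end
    end
    visited
end

bfs(g, 1)   # [1, 2, 3] for the graph built above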
For more advanced graph operations, the Graphs.jl package provides a comprehensive set of tools for working with graphs in Julia.
A tree is a hierarchical data structure with a root node and child subtrees. Each node in a tree can have zero or more child nodes, and every node (except the root) has exactly one parent node. Trees are widely used for representing hierarchical relationships, organizing data for efficient searching and sorting, and in various algorithms.
A simple binary tree node in Julia could be implemented as:
mutable struct TreeNode
    value::Any
    left::Union{TreeNode, Nothing}
    right::Union{TreeNode, Nothing}
end

# Creating a simple binary tree
root = TreeNode(1,
    TreeNode(2,
        TreeNode(4, nothing, nothing),
        TreeNode(5, nothing, nothing)
    ),
    TreeNode(3,
        nothing,
        TreeNode(6, nothing, nothing)
    )
)
Trees have various specialized forms, each with its own properties and use cases.
Trees support efficient operations like insertion, deletion, and searching, often with \(O(\log n)\) time complexity for balanced trees. They are fundamental in many algorithms and data structures, including heaps, syntax trees in compilers, and decision trees in machine learning.
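For example (a sketch, not from the original text), a recursive in-order traversal visits every node of the `TreeNode` tree built above:

# In-order traversal: left subtree, then the node itself, then the right subtree
function inorder(node::Union{TreeNode, Nothing})
    node === nothing && return []
    vcat(inorder(node.left), node.value, inorder(node.right))
end

inorder(root)   # visits 4, 2, 5, 1, 3, 6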
Data structures have strengths and weaknesses depending on whether you want to prioritize computational efficiency, memory (space) efficiency, code simplicity, and/or mutability. Due to the complexity of real-world modeling needs, it can be the case that different representations of the data are more natural or more efficient for the use case at hand.
Formal verification is a technique used to prove or disprove the correctness of algorithms with respect to a certain formal specification or property. In essence, it’s a mathematical approach to ensuring that a system behaves exactly as intended under all possible conditions.
In formal verification, we use mathematical methods to:

- precisely specify the intended behavior (the formal specification),
- model the system or algorithm itself, and
- prove (or disprove) that the model satisfies the specification under all possible conditions.
This process can be automated using specialized software tools called theorem provers or model checkers.
It sounds like the perfect risk management and regulatory technique: prove that the system works exactly as intended. However, there has been very limited deployment of formal verification in industry, for several reasons.
Testing will be discussed in more detail in Chapter 12, but an intermediate concept between Formal Verification and typical software testing is property-based testing, which tests for general rules instead of specific examples.
For example, a function which is associative (\((a + b) + c = a + (b + c)\)) or commutative (\(a + b = b + a\)) can be tested with simple examples like:
using Test
myadd(a,b) = a + b
@test myadd(1,2) == myadd(2,1)
@test myadd(myadd(1,2),3) == myadd(1,myadd(2,3))
However, we really haven’t proven the associative and commutative properties in general. There are techniques to do this, which offer a more comprehensive alternative to testing the specific examples above. Packages like Supposition.jl provide functionality for this. Note that, like Formal Verification, property-based testing is a more advanced topic.
Fuzzing is akin to property-based testing, but instead of testing general rules, we generalize the simple examples using randomness. For example, we could test the commutative property using random numbers, thereby statistically checking that the property holds:
@testset for i in 1:10000
    a = rand()
    b = rand()

    @test myadd(a,b) == myadd(b,a)
end
This is a good advancement over the simple `@test myadd(1,2) == myadd(2,1)` in terms of checking the correctness of `myadd`, but it comes at the cost of more computational time and non-deterministic tests.
This topic will be covered in Chapter 14.↩︎
“Big-O”, so named because of the “O” used in \(O(1)\), \(O(n)\), etc. The “O” refers to the “order” of growth for the function.↩︎
The Traveling Salesperson Problem is a classic computer science problem where you need to find the shortest possible route that visits each city in a given set exactly once and returns to the starting city. It’s a seemingly simple problem that becomes computationally intensive very quickly as the number of cities increases.↩︎
In practice, the operating system may have already allocated space for an array that’s larger than what the program is actually using so far, so this step may be ‘quick’ at times, while other times the operating system may actually need to extend the block of memory allocated to the array.↩︎