“By understanding the similarities, we unlock the potential for new insights, as patterns across different contexts often reveal hidden truths.” — Unknown
26.1 Chapter Overview
Similarity analysis helps us find “look‑alike” portfolios, policies, or scenarios so we can transfer insights across business lines. In practice this powers:
underwriting triage (match new submissions to historical claims experience),
customer segmentation (group households by benefit needs),
scenario analogues (locate past macro paths closest to today’s conditions),
model monitoring (compare new feature vectors to the training set).
This chapter walks through common similarity metrics for structured and unstructured data, shows their Julia implementations, and closes with a k-nearest-neighbor (kNN) search that you can plug into actuarial workflows.
26.2 The Data
Actuarial datasets tend to mix structured (tabular) and unstructured (text, image, voice) inputs. Regardless of origin, similarity measures require numeric vectors, so we usually:
scale numeric fields (premium, age, balances) to comparable ranges,
encode categorical features (education, occupation, territory) with one-hot or embedding schemes,
convert unstructured signals to vectors via domain encoders (TF‑IDF, Word2Vec, CNNs, spectrograms).
Stored data generally falls into two formats: tabular (structured) and non-tabular (unstructured). Structured data organizes values into rows and columns, resembling a table; this format is widely used because it is easy to read, analyze, and manipulate. The most common example is a spreadsheet, and structured data can also be stored in relational databases for easier lookups and matching. Unstructured data, by contrast, lacks a predefined data model or schema: it does not fit neatly into tables or databases and can include text documents, images, audio files, video files, social media posts, and more.
Structured data can be further categorized into numerical and categorical data based on the types of values it represents. The data tables below will be referenced throughout the chapter. Numerical data can readily be converted or normalized to a vector of floating-point values, and categorical data to a vector of binary indicators through one-hot encoding.
For unstructured data, given its variety, the choice of representation depends on the type of data and the task at hand: Word2Vec embeddings are commonly used for text, convolutional neural network (CNN) features for images, and spectrograms or other waveform transforms for audio. Whichever transformation is applied, unstructured data can generally be converted to a vector of floating-point values, just like numerical structured data.
df_num contains scaled benefit bases and issue ages, while arr_cat stores the one-hot encoding for sex, education, and occupation. Drop or regularize any column with zero range before scaling to avoid division-by-zero warnings. For unstructured inputs, substitute your favorite embedding model (Word2Vec for text, CNN features for images, spectrograms for audio); the downstream similarity routines stay the same once you have numeric vectors.
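Because the chapter's tables are not reproduced here, the following sketch builds a small stand-in for df_num and arr_cat under the assumptions above (min–max-scaled numeric columns, one-hot Boolean matrix); the column names, levels, and values are illustrative only.

using DataFrames, Random

# Hypothetical stand-in for the chapter's data tables
Random.seed!(2024)
n = 5
benefit_base = rand(50_000.0:1_000.0:250_000.0, n)
issue_age    = Float64.(rand(30:70, n))

# Min–max scale each numeric column to [0, 1], guarding against zero range
scale01(x) = maximum(x) == minimum(x) ? zero.(x) : (x .- minimum(x)) ./ (maximum(x) - minimum(x))
df_num = DataFrame(benefit_base = scale01(benefit_base), issue_age = scale01(issue_age))

# One-hot encode sex, education, and occupation into a Boolean matrix
sex        = rand(["F", "M"], n)
education  = rand(["HS", "College", "Graduate"], n)
occupation = rand(["Office", "Manual", "Field"], n)
onehot(v)  = reduce(hcat, [v .== lvl for lvl in unique(v)])
arr_cat    = hcat(onehot(sex), onehot(education), onehot(occupation))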
26.3 Common Similarity Measures
The following measures are commonly used to calculate similarities.
26.3.1 Euclidean Distance (L2 norm)
Euclidean distance, also known as the L2 norm, is defined as \[
d = \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2}
\] The distance is usually meaningful when applied to numerical data. The following Julia code shows the Euclidean distance for the first two rows in df_num.
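A minimal sketch, assuming the df_num stand-in defined in the data section:

using LinearAlgebra

# Euclidean (L2) distance between the first two rows of df_num
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
d₁₂ = norm(w .- v)   # same as sqrt(sum((w .- v) .^ 2))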
26.3.2 Manhattan Distance (L1 norm)
Manhattan distance, also known as the L1 norm, is defined as \[
d = \sum_{i=1}^{n} |w_i - v_i|
\] The distance is also usually meaningful when applied to numerical data. The following Julia code shows the Manhattan distance for the first two rows in df_num.
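A matching sketch for the L1 case, again assuming df_num is available:

# Manhattan (L1) distance between the first two rows of df_num
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
d₁₂ = sum(abs.(w .- v))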
26.3.3 Cosine Similarity
Cosine similarity is defined as \[
d = \frac{\sum_{i=1}^{n} w_i \cdot v_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \cdot \sqrt{\sum_{i=1}^{n} v_i^2}}
\] The measure is meaningful when applied to both numerical and categorical data.
The following Julia code shows the cosine similarity for the first two rows in df_num.
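A hedged sketch of both cases, assuming df_num and arr_cat from the data section:

using LinearAlgebra

# Cosine similarity for the first two rows of df_num (numerical case)
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
cos_num = dot(w, v) / (norm(w) * norm(v))

# Identical expression for the first two rows of arr_cat (one-hot case);
# here dot(a, b) counts the categories on which the two observations match
a = arr_cat[1, :]
b = arr_cat[2, :]
cos_cat = dot(a, b) / (norm(a) * norm(b))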
Note how similar the syntax is for processing numerical and categorical data. Multiple dispatch lets Julia select the most efficient underlying procedure for each data type. For categorical data, the dot operation on binary vectors essentially counts the number of shared 1s (the number of categories on which two observations match), while for numerical data it is the ordinary dot product provided by most numerical processing libraries.
26.3.4 Jaccard Similarity
Jaccard similarity is defined as \[
d = \frac{|W \cap V|}{|W \cup V|}
\] The distance is usually meaningful when applied to categorical data. The following Julia code shows the Jaccard similarity for the first and the third rows in arr_cat.
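A minimal sketch, assuming arr_cat is the one-hot Boolean matrix from the data section:

# Jaccard similarity for the first and third rows of arr_cat:
# shared categories divided by categories present in either row
a = arr_cat[1, :]
c = arr_cat[3, :]
j₁₃ = sum(a .& c) / sum(a .| c)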
26.3.5 Hamming Distance
Hamming distance counts the number of positions at which w and v differ: \[
d = \sum_{i=1}^{n} \mathbf{1}_{w_i \ne v_i}
\] The distance is usually meaningful when applied to categorical data. The following Julia code shows the Hamming distance for the first and the third rows in arr_cat.
# XOR marks the positions where the two one-hot rows differ; summing counts them
d₁₃ = sum(arr_cat[1, :] .⊻ arr_cat[3, :])
4
26.3.6 Choosing a Metric
| Data type | Typical metric | Notes |
|---|---|---|
| Continuous (scaled) | Euclidean, Mahalanobis | Euclidean assumes uncorrelated features; Mahalanobis accounts for covariance. |
| Sparse non-negative (counts, embeddings) | Cosine, Jaccard | Cosine stays stable when magnitude varies; Jaccard focuses on overlap. |
Pick the measure that best reflects how “similarity” should behave for your business question (e.g., two borrowers sharing underwriting flags may be “close” even if balances differ).
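For the Mahalanobis entry in the table, a hedged sketch using the sample covariance of the scaled numeric columns (assuming it is invertible) looks like this:

using Statistics, LinearAlgebra

# Mahalanobis distance between the first two rows of df_num,
# accounting for covariance between the scaled numeric columns
X = Matrix(df_num)
Σ = cov(X)                 # sample covariance across columns
Δ = X[1, :] .- X[2, :]
d_mahal = sqrt(Δ' * (Σ \ Δ))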
26.4 k-Nearest Neighbor (kNN) Clustering
kNN is primarily known as a classification algorithm, but it can also be used for clustering, particularly in the context of density-based clustering. Density-based clustering identifies regions in the data space where the density of data points is higher, and it groups points in these high-density regions. The core idea of kNN clustering is to assign each data point to a cluster based on the density of its neighbors. A data point becomes a core point if it has at least a specified number of neighbors within a certain distance.
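As a minimal sketch of that density idea (the radius and neighbor threshold below are illustrative choices, not values prescribed here), NearestNeighbors.jl's inrange query can flag core points:

using Random, NearestNeighbors

Random.seed!(1234)
pts = rand(2, 50)        # 50 synthetic points with 2 dimensions
tree = KDTree(pts)

radius = 0.15            # illustrative neighborhood radius
min_neighbors = 4        # illustrative density threshold

# A point is a "core" point if at least min_neighbors other points fall
# within radius of it (inrange returns the point itself as well)
neighborhoods = inrange(tree, pts, radius)
is_core = [length(ids) - 1 >= min_neighbors for ids in neighborhoods]
println("Core points: ", count(is_core), " of ", size(pts, 2))

The worked example below then builds a KD-tree and runs the kNN query itself.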
using Random, NearestNeighbors, CairoMakie

# Generate synthetic data
Random.seed!(1234)
data = rand(2, 10)  # 10 points with 2 dimensions
println("Dataset:\n", data)

# Create a KD-tree for efficient nearest neighbor search
kdtree = KDTree(data)

# Define a query point (for which we want to find nearest neighbors)
query_point = [0.5, 0.5]

# Specify how many neighbors to find
k = 3
indices, distances = knn(kdtree, query_point, k)

# Display the results
println("\nQuery Point: ", query_point)
println("Indices of Nearest Neighbors: ", indices)
println("Distances to Nearest Neighbors: ", distances)

# Visualize the points and the query
f = Figure()
axis = Axis(f[1, 1], title="Nearest Neighbors Search")
scatter!(data[1, :], data[2, :], label="Data Points", color=:blue)
scatter!([query_point[1]], [query_point[2]], label="Query Point", color=:red, marker=:cross, markersize=10)

# Highlight nearest neighbors
for idx in indices
    lines!([query_point[1], data[1, idx]], [query_point[2], data[2, idx]], color=:black)
end
f
Dataset:
[0.5798621201341324 0.9721360824554687 … 0.13102565622085904 0.5743234852783174; 0.4112941179498505 0.014908849285099945 … 0.9464532262313834 0.6776499075995779]
Query Point: [0.5, 0.5]
Indices of Nearest Neighbors: [3, 6, 1]
Distances to Nearest Neighbors: [0.14103817169408245, 0.07597457975710152, 0.11935950629343954]
indices points to the nearest policies and distances reports the Euclidean distance in the scaled space. Replace the numeric embedding with any other feature matrix (cosine-normalized TF‑IDF vectors, CNN image embeddings) to reuse the same workflow, or feed the neighbor set into downstream analytics (experience weighting, peer benchmarking). If the query point falls outside the observed min–max range, clip or rescale to keep the distances interpretable.
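As a brief sketch of that reuse, the same knn call accepts a whole matrix of query vectors (one column per new policy or scenario), assuming the kdtree built above:

# Batch query: each column of new_points is a hypothetical new feature vector
new_points = rand(2, 4)
idxs, dists = knn(kdtree, new_points, 3)   # neighbor indices and distances per column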