“By understanding the similarities, we unlock the potential for new insights, as patterns across different contexts often reveal hidden truths.” — Unknown
26.1 Chapter Overview
Similarity analysis helps us find “look‑alike” portfolios, policies, or scenarios so we can transfer insights across business lines. In practice this powers:
underwriting triage (match new submissions to historical claims experience),
customer segmentation (group households by benefit needs),
scenario analogues (locate past macro paths closest to today’s conditions),
model monitoring (compare new feature vectors to the training set).
This chapter walks through common similarity metrics for structured and unstructured data, shows their Julia implementations, and closes with a k-nearest-neighbor (kNN) search that you can plug into actuarial workflows.
26.2 The Data
Actuarial datasets tend to mix structured (tabular) and unstructured (text, image, voice) inputs. Regardless of origin, similarity measures require numeric vectors, so we usually:
scale numeric fields (premium, age, balances) to comparable ranges,
encode categorical features (education, occupation, territory) with one-hot or embedding schemes,
convert unstructured signals to vectors via domain encoders (TF‑IDF, Word2Vec, CNNs, spectrograms).
Stored data generally falls into two formats: tabular (structured) and non-tabular (unstructured). Structured data organizes values into rows and columns, resembling a table; this format is widely used because it is easy to read, analyze, and manipulate. The most common example is a spreadsheet, and structured data can also be stored in relational databases for easier lookups and matching. Unstructured data, by contrast, lacks a predefined data model or schema: it does not fit neatly into tables or databases and can include text documents, images, audio files, video files, social media posts, and more.
Structured data can be further categorized into numerical and categorical data based on the types of values it represents. The data tables below will be referenced throughout the chapter. Numerical data can readily be converted or normalized to a vector of floating-point values, and categorical data to a vector of binary indicators through one-hot encoding.
For unstructured data, given its variety, the choice of representation depends on the type of data and the task at hand: Word2Vec embeddings are commonly used for text, convolutional neural network (CNN) features for images, and spectrograms or other waveform transforms for audio. Whichever transformation is applied, unstructured data can generally be converted to a vector of floating-point values, just like numerical structured data.
df_num contains scaled benefit bases and issue ages, while arr_cat stores the one-hot encoding for sex, education, and occupation. Drop or regularize any column with zero range before scaling to avoid division-by-zero warnings. For unstructured inputs, substitute your favorite embedding model (Word2Vec for text, CNN features for images, spectrograms for audio); the downstream similarity routines stay the same once you have numeric vectors.
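Because the chapter's tables are not reproduced here, the following sketch builds a small stand-in for df_num and arr_cat under the assumptions above (min–max-scaled numeric columns, one-hot Boolean matrix); the column names, levels, and values are illustrative only.

using DataFrames, Random

# Hypothetical stand-in for the chapter's data tables
Random.seed!(2024)
n = 5
benefit_base = rand(50_000.0:1_000.0:250_000.0, n)
issue_age    = Float64.(rand(30:70, n))

# Min–max scale each numeric column to [0, 1], guarding against zero range
scale01(x) = maximum(x) == minimum(x) ? zero.(x) : (x .- minimum(x)) ./ (maximum(x) - minimum(x))
df_num = DataFrame(benefit_base = scale01(benefit_base), issue_age = scale01(issue_age))

# One-hot encode sex, education, and occupation into a Boolean matrix
sex        = rand(["F", "M"], n)
education  = rand(["HS", "College", "Graduate"], n)
occupation = rand(["Office", "Manual", "Field"], n)
onehot(v)  = reduce(hcat, [v .== lvl for lvl in unique(v)])
arr_cat    = hcat(onehot(sex), onehot(education), onehot(occupation))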
26.3 Common Similarity Measures
The following measures are commonly used to calculate similarities.
26.3.1 Euclidean Distance (L2 norm)
Euclidean distance, also known as the L2 norm, is defined as \[
d = \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2}
\] The distance is usually meaningful when applied to numerical data. The following Julia code shows the Euclidean distance for the first two rows in df_num.
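A minimal sketch, assuming the df_num stand-in defined in the data section:

using LinearAlgebra

# Euclidean (L2) distance between the first two rows of df_num
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
d₁₂ = norm(w .- v)   # same as sqrt(sum((w .- v) .^ 2))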
26.3.2 Manhattan Distance (L1 norm)
Manhattan distance, also known as the L1 norm, is defined as \[
d = \sum_{i=1}^{n} |w_i - v_i|
\] The distance is also usually meaningful when applied to numerical data. The following Julia code shows the Manhattan distance for the first two rows in df_num.
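A matching sketch for the L1 case, again assuming df_num is available:

# Manhattan (L1) distance between the first two rows of df_num
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
d₁₂ = sum(abs.(w .- v))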
26.3.3 Cosine Similarity
Cosine similarity is defined as \[
d = \frac{\sum_{i=1}^{n} w_i \cdot v_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \cdot \sqrt{\sum_{i=1}^{n} v_i^2}}
\] The measure is meaningful when applied to both numerical and categorical data.
The following Julia code shows the cosine similarity for the first two rows in df_num.
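A hedged sketch of both cases, assuming df_num and arr_cat from the data section:

using LinearAlgebra

# Cosine similarity for the first two rows of df_num (numerical case)
w = Vector(df_num[1, :])
v = Vector(df_num[2, :])
cos_num = dot(w, v) / (norm(w) * norm(v))

# Identical expression for the first two rows of arr_cat (one-hot case);
# here dot(a, b) counts the categories on which the two observations match
a = arr_cat[1, :]
b = arr_cat[2, :]
cos_cat = dot(a, b) / (norm(a) * norm(b))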
Note how similar the syntax is for processing numerical and categorical data. Multiple dispatch lets Julia select the most efficient underlying procedure for each data type. For categorical data, the dot operation on binary vectors essentially counts the number of shared 1s (the number of categories on which two observations match), while for numerical data it is the ordinary dot product provided by most numerical processing libraries.
26.3.4 Jaccard Similarity
Jaccard similarity is defined as \[
d = \frac{|W \cap V|}{|W \cup V|}
\] The distance is usually meaningful when applied to categorical data. The following Julia code shows the Jaccard similarity for the first and the third rows in arr_cat.
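A minimal sketch, assuming arr_cat is the one-hot Boolean matrix from the data section:

# Jaccard similarity for the first and third rows of arr_cat:
# shared categories divided by categories present in either row
a = arr_cat[1, :]
c = arr_cat[3, :]
j₁₃ = sum(a .& c) / sum(a .| c)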
26.3.5 Hamming Distance
Hamming distance counts the number of positions at which w and v differ: \[
d = \sum_{i=1}^{n} \mathbf{1}_{w_i \ne v_i}
\] The distance is usually meaningful when applied to categorical data. The following Julia code shows the Hamming distance for the first and the third rows in arr_cat.
# XOR marks the positions where the two one-hot rows differ; summing counts them
d₁₃ = sum(arr_cat[1, :] .⊻ arr_cat[3, :])
4
26.3.6 Choosing a Metric
| Data type | Typical metric | Notes |
|---|---|---|
| Continuous (scaled) | Euclidean, Mahalanobis | Euclidean assumes uncorrelated features; Mahalanobis accounts for covariance. |
| Sparse non-negative (counts, embeddings) | Cosine, Jaccard | Cosine stays stable when magnitude varies; Jaccard focuses on overlap. |
Pick the measure that best reflects how “similarity” should behave for your business question (e.g., two borrowers sharing underwriting flags may be “close” even if balances differ).
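For the Mahalanobis entry in the table, a hedged sketch using the sample covariance of the scaled numeric columns (assuming it is invertible) looks like this:

using Statistics, LinearAlgebra

# Mahalanobis distance between the first two rows of df_num,
# accounting for covariance between the scaled numeric columns
X = Matrix(df_num)
Σ = cov(X)                 # sample covariance across columns
Δ = X[1, :] .- X[2, :]
d_mahal = sqrt(Δ' * (Σ \ Δ))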
26.4 k-Nearest Neighbor (kNN) Clustering
kNN is primarily known as a classification algorithm, but it can also be used for clustering, particularly in the context of density-based clustering. Density-based clustering identifies regions in the data space where the density of data points is higher, and it groups points in these high-density regions. The core idea of kNN clustering is to assign each data point to a cluster based on the density of its neighbors. A data point becomes a core point if it has at least a specified number of neighbors within a certain distance.
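As a minimal sketch of that density idea (the radius and neighbor threshold below are illustrative choices, not values prescribed here), NearestNeighbors.jl's inrange query can flag core points:

using Random, NearestNeighbors

Random.seed!(1234)
pts = rand(2, 50)        # 50 synthetic points with 2 dimensions
tree = KDTree(pts)

radius = 0.15            # illustrative neighborhood radius
min_neighbors = 4        # illustrative density threshold

# A point is a "core" point if at least min_neighbors other points fall
# within radius of it (inrange returns the point itself as well)
neighborhoods = inrange(tree, pts, radius)
is_core = [length(ids) - 1 >= min_neighbors for ids in neighborhoods]
println("Core points: ", count(is_core), " of ", size(pts, 2))

The worked example below then builds a KD-tree and runs the kNN query itself.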
using Random, NearestNeighbors, CairoMakie

# Generate synthetic data
Random.seed!(1234)
data = rand(2, 10)  # 10 points with 2 dimensions
println("Dataset:\n", data)

# Create a KD-tree for efficient nearest neighbor search
kdtree = KDTree(data)

# Define a query point (for which we want to find nearest neighbors)
query_point = [0.5, 0.5]

# Specify how many neighbors to find
k = 3
indices, distances = knn(kdtree, query_point, k)

# Display the results
println("\nQuery Point: ", query_point)
println("Indices of Nearest Neighbors: ", indices)
println("Distances to Nearest Neighbors: ", distances)

# Visualize the points and the query
f = Figure()
axis = Axis(f[1, 1], title="Nearest Neighbors Search")
scatter!(data[1, :], data[2, :], label="Data Points", color=:blue)
scatter!([query_point[1]], [query_point[2]], label="Query Point", color=:red, marker=:cross, markersize=10)

# Highlight nearest neighbors
for idx in indices
    lines!([query_point[1], data[1, idx]], [query_point[2], data[2, idx]], color=:black)
end
f
Dataset:
[0.5798621201341324 0.9721360824554687 … 0.13102565622085904 0.5743234852783174; 0.4112941179498505 0.014908849285099945 … 0.9464532262313834 0.6776499075995779]
Query Point: [0.5, 0.5]
Indices of Nearest Neighbors: [3, 6, 1]
Distances to Nearest Neighbors: [0.14103817169408245, 0.07597457975710152, 0.11935950629343954]
indices points to the nearest policies and distances reports the Euclidean distance in the scaled space. Replace the numeric embedding with any other feature matrix (cosine-normalized TF‑IDF vectors, CNN image embeddings) to reuse the same workflow, or feed the neighbor set into downstream analytics (experience weighting, peer benchmarking). If the query point falls outside the observed min–max range, clip or rescale to keep the distances interpretable.
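As a brief sketch of that reuse, the same knn call accepts a whole matrix of query vectors (one column per new policy or scenario), assuming the kdtree built above:

# Batch query: each column of new_points is a hypothetical new feature vector
new_points = rand(2, 4)
idxs, dists = knn(kdtree, new_points, 3)   # neighbor indices and distances per column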