26  Similarity Analysis

Author

Yun-Tien Lee

“By understanding the similarities, we unlock the potential for new insights, as patterns across different contexts often reveal hidden truths.” — Unknown

26.1 Chapter Overview

Similarity analysis helps us find “look‑alike” portfolios, policies, or scenarios so we can transfer insights across business lines. In practice this powers:

  • underwriting triage (match new submissions to historical claims experience),
  • customer segmentation (group households by benefit needs),
  • scenario analogues (locate past macro paths closest to today’s conditions),
  • model monitoring (compare new feature vectors to the training set).

This chapter walks through common similarity metrics for structured and unstructured data, shows their Julia implementations, and closes with a k-nearest-neighbor (kNN) search that you can plug into actuarial workflows.

26.2 The Data

Actuarial datasets tend to mix structured (tabular) and unstructured (text, image, voice) inputs. Regardless of origin, similarity measures require numeric vectors, so we usually:

  • scale numeric fields (premium, age, balances) to comparable ranges,
  • encode categorical features (education, occupation, territory) with one-hot or embedding schemes,
  • convert unstructured signals to vectors via domain encoders (TF‑IDF, Word2Vec, CNNs, spectrograms).

Stored data generally falls into two formats: tabular (structured) and non-tabular (unstructured). Structured data organizes values in rows and columns, resembling a table, which makes it easy to read, analyze, and manipulate; the most common example is a spreadsheet, and the same layout maps naturally onto relational databases for fast lookups and matching. Unstructured data, by contrast, lacks a predefined data model or schema: it does not fit neatly into tables and includes text documents, images, audio files, video files, social media posts, and more.

Structured data can be further divided into numerical and categorical fields based on the values they hold. The sample data below will be referenced throughout the chapter. Numerical fields are easily normalized to floating-point vectors, and categorical fields become binary indicator vectors through one-hot encoding.

For unstructured data, the choice of representation depends on the data type and the task at hand: Word2Vec embeddings are common for text, Convolutional Neural Networks (CNNs) for images, and wavelet or spectrogram transforms for audio. Whichever transformation is applied, unstructured data generally ends up as a vector of floating-point numbers, just like numerical structured data.
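As a toy illustration of the text case, a simple bag-of-words count against a fixed vocabulary already yields the kind of numeric vector the rest of the chapter works with. The vocabulary and claim notes below are invented for illustration; a production pipeline would use TF-IDF weights or a trained Word2Vec embedding instead.

# Hypothetical mini-vocabulary and two claim notes (illustration only)
vocabulary = ["water", "damage", "roof", "theft", "vehicle"]
note_a = "water damage to roof after storm"
note_b = "theft of vehicle from driveway"

# Count how often each vocabulary term appears in a note
bow(text) = [count(==(w), split(lowercase(text))) for w in vocabulary]

vec_a = bow(note_a)   # [1, 1, 1, 0, 0]
vec_b = bow(note_b)   # [0, 0, 0, 1, 1]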

sample_csv_data =
    IOBuffer(
        raw"id,sex,benefit_base,education,occupation,issue_age
         1,M,100000.0,college,1,30.0
         2,F,200000.0,master,3,20.0
         3,M,150000.0,high_school,4,40.0
         4,F,50000.0,college,2,60.0
         5,M,250000.0,college,1,40.0
         6,F,200000.0,high_school,2,30.0"
    )
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=278, maxsize=Inf, ptr=1, mark=-1)
using CSV, DataFrames, TableTransforms

df = CSV.read(sample_csv_data, DataFrame)
df_num = apply(MinMax(), df[:, [:benefit_base, :issue_age]])[1]
6×2 DataFrame
 Row │ benefit_base  issue_age
     │ Float64       Float64
─────┼─────────────────────────
   1 │         0.25       0.25
   2 │         0.75       0.0
   3 │         0.5        0.5
   4 │         0.0        1.0
   5 │         1.0        0.5
   6 │         0.75       0.25
using StatsBase

arr_cat = hcat(indicatormat(df.sex)', indicatormat(df.education)', indicatormat(df.occupation)')
6×9 Matrix{Bool}:
 0  1  1  0  0  1  0  0  0
 1  0  0  0  1  0  0  1  0
 0  1  0  1  0  0  0  0  1
 1  0  1  0  0  0  1  0  0
 0  1  1  0  0  1  0  0  0
 1  0  0  1  0  0  1  0  0

df_num contains scaled benefit bases and issue ages, while arr_cat stores the one-hot encoding for sex, education, and occupation. Drop or regularize any column with zero range before scaling to avoid division-by-zero warnings. For unstructured inputs, substitute your favorite embedding model (Word2Vec for text, CNN features for images, spectrograms for audio); the downstream similarity routines stay the same once you have numeric vectors.
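If a single feature matrix per policy is convenient, the scaled numeric block and the one-hot block can simply be concatenated. This is a minimal sketch reusing df_num and arr_cat; giving both blocks equal weight is an assumption you may want to revisit (e.g., rescaling the indicator columns) depending on how much categorical matches should matter.

# One row per policy: scaled numeric features followed by one-hot indicators (6×11)
X_all = hcat(Matrix(df_num), Float64.(arr_cat))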

26.3 Common Similarity Measures

The following measures are commonly used to calculate similarities.

26.3.1 Euclidean Distance (L2 norm)

Euclidean distance, also known as the L2 norm, is defined as \[ d = \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2} \] The distance is usually meaningful for numerical data. The following Julia code computes the Euclidean distance between the two scaled columns of df_num (the benefit_base vector versus the issue_age vector); to compare two policies, index rows rather than columns, as sketched after the output.

using LinearAlgebra

# equivalently: d₁₂ = sqrt(sum((Array(df_num[:, 1]) .- Array(df_num[:, 2])) .^ 2))
d₁₂ = LinearAlgebra.norm(Array(df_num[:, 1]) .- Array(df_num[:, 2]))
1.4361406616345072
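The value above compares the two feature columns with each other. To compare two policies — the more typical question in similarity work — index rows instead. A minimal sketch, reusing df_num from above (output not shown, since it depends on the scaling applied earlier):

# Euclidean distance between the first two policies (rows) of df_num
row₁ = collect(df_num[1, :])   # scaled benefit_base and issue_age for policy 1
row₂ = collect(df_num[2, :])
d_rows = LinearAlgebra.norm(row₁ .- row₂)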

26.3.2 Manhattan Distance (L1 Norm)

Manhattan distance, also known as the L1 norm, is defined as \[ d = \sum_{i=1}^{n} |w_i - v_i| \] Like the Euclidean distance, it is usually meaningful for numerical data. The following Julia code computes the Manhattan distance between the same two columns of df_num.

using LinearAlgebra

# equivalently: d₁₂ = sum(abs.(Array(df_num[:, 1]) .- Array(df_num[:, 2])))
d₁₂ = norm(Array(df_num[:, 1]) .- Array(df_num[:, 2]), 1)
2.75

26.3.3 Cosine Similarity

Cosine similarity is defined as \[ d = \frac{\sum_{i=1}^{n} w_i \cdot v_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \cdot \sqrt{\sum_{i=1}^{n} v_i^2}} \] Unlike the two measures above, this is a similarity rather than a distance: values near 1 indicate vectors pointing in nearly the same direction. It is meaningful for both numerical and one-hot-encoded categorical data.

The following Julia code computes the cosine similarity between the same two columns of df_num.

using LinearAlgebra

d₁₂ = (Array(df_num[:, 1]) ⋅ Array(df_num[:, 2])) / norm(df_num[:, 1]) / norm(df_num[:, 2])
0.5024594344170622

The following Julia code shows the cosine similarity for the first and the third rows in arr_cat.

using LinearAlgebra

d₁₃ = (arr_cat[1, :] ⋅ arr_cat[3, :]) / norm(arr_cat[1, :]) / norm(arr_cat[3, :])
0.33333333333333337

Note how similar the syntax is for numerical and categorical data. Multiple dispatch lets Julia pick the most efficient underlying procedure for each data type: on binary vectors the dot product essentially counts the number of shared 1s (the number of categories on which two observations match), while on numerical vectors it is the usual floating-point dot product.
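Building on this, a short sketch (reusing df_num and arr_cat from above) computes the full 6×6 matrix of pairwise cosine similarities between policies; the same comprehension works unchanged on the Boolean arr_cat matrix thanks to multiple dispatch.

using LinearAlgebra

cosine_sim(w, v) = (w ⋅ v) / (norm(w) * norm(v))

# Pairwise cosine similarity between policies on the scaled numeric features
Xnum = Matrix(df_num)                      # 6×2, one row per policy
S_num = [cosine_sim(Xnum[i, :], Xnum[j, :]) for i in 1:6, j in 1:6]

# Same computation on the one-hot rows; the dot product counts shared categories
S_cat = [cosine_sim(arr_cat[i, :], arr_cat[j, :]) for i in 1:6, j in 1:6]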

26.3.4 Jaccard Similarity

Jaccard similarity is defined as \[ d = \frac{|W \cap V|}{|W \cup V|} \] It is usually meaningful for categorical (binary) data. The following Julia code computes the Jaccard similarity for the first and the third rows in arr_cat.

d₁₃ = (arr_cat[1, :] ⋅ arr_cat[3, :]) / sum(arr_cat[1, :] .| arr_cat[3, :])
0.2
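The same 0.2 can be read directly off the set definition: treat each policy as the set of its feature=value pairs (taken from the sample data above), and policies 1 and 3 share one of five distinct pairs.

# Policies 1 and 3 as sets of feature=value pairs
w = Set(["sex=M", "education=college", "occupation=1"])
v = Set(["sex=M", "education=high_school", "occupation=4"])

jaccard_sim = length(intersect(w, v)) / length(union(w, v))   # 1 / 5 = 0.2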

26.3.5 Hamming Distance

Hamming distance counts the number of positions at which w and v differ: \[ d = \sum_{i=1}^{n} \mathbf{1}_{w_i \ne v_i} \] It is usually meaningful for categorical (binary) data. The following Julia code computes the Hamming distance for the first and the third rows in arr_cat.

d₁₃ = sum(arr_cat[1, :] .⊻ arr_cat[3, :])
4

26.3.6 Choosing a Metric

Data type | Typical metric | Notes
Continuous (scaled) | Euclidean, Mahalanobis | Euclidean assumes uncorrelated features; Mahalanobis accounts for covariance.
Sparse non-negative (counts, embeddings) | Cosine, Jaccard | Cosine stays stable when magnitude varies; Jaccard focuses on overlap.
Binary/categorical | Hamming, Jaccard | Hamming counts mismatches; Jaccard ignores joint zeros.

Pick the measure that best reflects how “similarity” should behave for your business question (e.g., two borrowers sharing underwriting flags may be “close” even if balances differ).
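If the Distances.jl package is available — it is not used elsewhere in this chapter, so treat it as an optional dependency — the measures above come ready-made and are easy to compare side by side on the same pair of records. A minimal sketch, reusing df_num and arr_cat:

using Distances

x = collect(df_num[1, :]);  y = collect(df_num[2, :])   # two policies, numeric features
a = arr_cat[1, :];          b = arr_cat[3, :]           # two policies, one-hot features

euclidean(x, y)          # L2 distance
cityblock(x, y)          # L1 (Manhattan) distance
1 - cosine_dist(x, y)    # cosine similarity
1 - jaccard(a, b)        # Jaccard similarity (Distances.jl returns the Jaccard distance)
hamming(a, b)            # number of mismatching positions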

26.4 k-Nearest Neighbor (kNN) Clustering

kNN is primarily known as a classification algorithm, but the same neighbor search underpins density-based clustering, which identifies regions of the data space where points are packed closely together and groups the points in those high-density regions. The core idea is to assign each data point to a cluster based on the density of its neighbors: a point becomes a core point if it has at least a specified number of neighbors within a certain distance. A small sketch of that core-point test follows, and then the full nearest-neighbor search example.
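A minimal sketch of the core-point test using NearestNeighbors.jl's range search; the radius and minimum-neighbor count are arbitrary illustration values, and the data is synthetic:

using Random, NearestNeighbors

Random.seed!(1234)
pts = rand(2, 50)                      # 50 synthetic 2-D points
tree = KDTree(pts)

radius, min_neighbors = 0.15, 4
# A point is a "core" point if enough other points fall within the radius
is_core = [length(inrange(tree, pts[:, i], radius)) - 1 >= min_neighbors
           for i in 1:size(pts, 2)]
count(is_core)                         # number of core points under these settings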

using Random, NearestNeighbors, CairoMakie

# Generate synthetic data
Random.seed!(1234)
data = rand(2, 10) # 10 points with 2 dimensions
println("Dataset:\n", data)

# Create a KD-tree for efficient nearest neighbor search
kdtree = KDTree(data)

# Define a query point (for which we want to find nearest neighbors)
query_point = [0.5, 0.5]

# Specify how many neighbors to find
k = 3
indices, distances = knn(kdtree, query_point, k)

# Display the results
println("\nQuery Point: ", query_point)
println("Indices of Nearest Neighbors: ", indices)
println("Distances to Nearest Neighbors: ", distances)

# Visualize the points and the query
f = Figure()
axis = Axis(f[1, 1], title="Nearest Neighbors Search")
scatter!(data[1, :], data[2, :], label="Data Points", color=:blue)
scatter!([query_point[1]], [query_point[2]], label="Query Point", color=:red, marker=:cross, markersize=10)

# Highlight nearest neighbors
for idx in indices
    lines!([query_point[1], data[1, idx]], [query_point[2], data[2, idx]], color=:black)
end
f
Dataset:
[0.5798621201341324 0.9721360824554687 … 0.13102565622085904 0.5743234852783174; 0.4112941179498505 0.014908849285099945 … 0.9464532262313834 0.6776499075995779]

Query Point: [0.5, 0.5]
Indices of Nearest Neighbors: [3, 6, 1]
Distances to Nearest Neighbors: [0.14103817169408245, 0.07597457975710152, 0.11935950629343954]

Here indices identifies the nearest neighbors and distances reports their Euclidean distances in the scaled space; on policy data the same call returns the closest policies. Replace the numeric embedding with any other feature matrix (cosine-normalized TF-IDF vectors, CNN image embeddings) to reuse the same workflow, or feed the neighbor set into downstream analytics (experience weighting, peer benchmarking). If the query point falls outside the observed min-max range, clip or rescale it to keep the distances interpretable. The sketch below applies the same pattern to the sample policies from earlier in the chapter.
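A minimal sketch applying the same pattern to the six sample policies, reusing df_num and arr_cat; stacking the numeric and one-hot blocks with equal weight is again an assumption:

using NearestNeighbors

# Columns are policies: scaled numeric features stacked on one-hot indicators (11×6)
F = vcat(Matrix(df_num)', Float64.(arr_cat)')

policy_tree = KDTree(F)
idxs, dists = knn(policy_tree, F[:, 1], 3, true)   # 3 nearest to policy 1, sorted by distance

# The query policy is its own nearest neighbor, so drop it to get the closest peers
peers = [i for i in idxs if i != 1]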