27  Similarity Analysis

Yun-Tien Lee

“By understanding the similarities, we unlock the potential for new insights, as patterns across different contexts often reveal hidden truths.” — Unknown

27.1 Chapter Overview

Given a dataset of interest, understanding the relative similarity (or dissimilarity) of its records and features is useful in classification, clustering, and data compression techniques.

27.2 The Data

Stored data can generally be categorized into two formats: tabular (structured) and non-tabular (unstructured). Structured data is organized and presented in rows and columns, resembling a table. This format is widely used for storing and representing datasets, making them easy to read, analyze, and manipulate. The most common example is a spreadsheet, where data is organized into rows and columns; structured data can also be stored in relational databases for easier lookups and matching. Unstructured data, on the other hand, lacks a predefined data model or schema and does not fit neatly into tables or databases. It can include text documents, images, audio files, video files, social media posts, and more.

Structured data can be further categorized into numerical and categorical data based on the types of values they represent. Numerical data can easily be converted or normalized to vectors of floating-point values, and categorical data to vectors of binary indicators through one-hot encoding. The following data tables, containing both kinds of columns, will be referenced throughout the chapter.

sample_csv_data =
    IOBuffer(
        raw"id,sex,benefit_base,education,occupation,issue_age
         1,M,100000.0,college,1,30.0
         2,F,200000.0,master,3,20.0
         3,M,150000.0,high_school,4,40.0
         4,F,50000.0,college,2,60.0
         5,M,250000.0,college,1,40.0
         6,F,200000.0,high_school,2,30.0"
    )
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=278, maxsize=Inf, ptr=1, mark=-1)
using CSV, DataFrames, TableTransforms

df = CSV.read(sample_csv_data, DataFrame)
df_num = apply(MinMax(), df[:, [:benefit_base, :issue_age]])[1]
6×2 DataFrame
 Row │ benefit_base  issue_age
     │ Float64       Float64
─────┼─────────────────────────
   1 │         0.25       0.25
   2 │         0.75       0.0
   3 │         0.5        0.5
   4 │         0.0        1.0
   5 │         1.0        0.5
   6 │         0.75       0.25
using StatsBase

# one-hot encode each categorical column and concatenate the indicators, one observation per row
arr_cat = hcat(indicatormat(df.sex)', indicatormat(df.education)', indicatormat(df.occupation)')
6×9 Matrix{Bool}:
 0  1  1  0  0  1  0  0  0
 1  0  0  0  1  0  0  1  0
 0  1  0  1  0  0  0  0  1
 1  0  1  0  0  0  1  0  0
 0  1  1  0  0  1  0  0  0
 1  0  0  1  0  0  1  0  0

For unstructured data, given its variety, the choice of numeric representation depends on the type of data and the specific task at hand. For text data, Word2Vec-style embeddings are commonly used, while convolutional neural network (CNN) features are common for image data and wavelet transforms for audio data. Whichever transformation is applied, unstructured data can generally be converted to a vector of floating-point values, just like numerical structured data.
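As a minimal illustration, the sketch below converts two short text strings into numeric vectors using a simple bag-of-words count rather than a trained Word2Vec model; the documents and vocabulary are invented for this example, but the end result, one floating-point vector per document, is exactly what the similarity measures below consume.

# A minimal bag-of-words sketch (a trained embedding such as Word2Vec would be
# used in practice); each document becomes a vector of floating-point counts.
docs = ["term life policy with level premium",
        "whole life policy with increasing premium"]

# Tokenize and build a shared vocabulary across all documents
tokens = [split(lowercase(doc)) for doc in docs]
vocab = sort(unique(vcat(tokens...)))

# Count how often each vocabulary term appears in each document
bow(doc_tokens) = Float64[count(==(w), doc_tokens) for w in vocab]
vectors = [bow(t) for t in tokens]

The resulting vectors can then be compared with, for example, cosine similarity in exactly the same way as the structured numerical data above.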

27.3 Common Similarity Measures

The following measures are commonly used to calculate similarities.

27.3.1 Euclidean Distance (L2 norm)

Euclidean distance, also known as the L2 norm, is defined as \[ d = \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2} \] The distance is usually meaningful when applied to numerical data. The following Julia code shows the Euclidean distance for the first two rows in df_num.

using LinearAlgebra

# equivalent manual computation:
# d₁₂ = √(sum((Array(df_num[1, :]) .- Array(df_num[2, :])) .^ 2))
d₁₂ = LinearAlgebra.norm(Array(df_num[1, :]) .- Array(df_num[2, :]))
0.5590169943749475

27.3.2 Manhattan Distance (L1 Norm)

Manhattan distance, also known as the L1 norm, is defined as \[ d = \sum_{i=1}^{n} |w_i - v_i| \] The distance is also usually meaningful when applied to numerical data. The following Julia code shows the Manhattan distance for the first two rows in df_num.

using LinearAlgebra

# equivalent manual computation:
# d₁₂ = sum(abs.(Array(df_num[1, :]) .- Array(df_num[2, :])))
d₁₂ = LinearAlgebra.norm(Array(df_num[1, :]) .- Array(df_num[2, :]), 1)
0.75

27.3.3 Cosine Similarity

Cosine similarity is defined as \[ d = \frac{\sum_{i=1}^{n} w_i \cdot v_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \cdot \sqrt{\sum_{i=1}^{n} v_i^2}} \] The measure is meaningful when applied to both numerical and categorical data.

The following Julia code shows the cosine similarity for the first two rows in df_num.

using LinearAlgebra

d₁₂ = (Array(df_num[1, :]) ⋅ Array(df_num[2, :])) / norm(df_num[1, :]) / norm(df_num[2, :])
0.7071067811865475

The following Julia code shows the cosine similarity for the first and the third rows in arr_cat.

using LinearAlgebra

d₁₃ = (arr_cat[1, :] ⋅ arr_cat[3, :]) / norm(arr_cat[1, :]) / norm(arr_cat[3, :])
0.33333333333333337

Note how similar the syntax is for processing numerical and categorical data. Multiple dispatch allows Julia to select the most efficient underlying procedure for each type of data. For categorical data, the \(dot\) operation on binary vectors is essentially a count of the positions where both vectors are 1, while for numerical data it dispatches to the usual floating-point dot product of numerical processing libraries.
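A quick check of that equivalence, continuing with arr_cat from above:

using LinearAlgebra

# For Bool vectors, the dot product counts the positions where both entries are 1,
# which equals summing the element-wise AND
arr_cat[1, :] ⋅ arr_cat[3, :] == sum(arr_cat[1, :] .& arr_cat[3, :])  # true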

27.3.4 Jaccard Similarity

Jaccard similarity is defined as \[ d = \frac{|W \cap V|}{|W \cup V|} \] The measure is usually meaningful when applied to categorical data. The following Julia code shows the Jaccard similarity for the first and the third rows in arr_cat.

d₁₃ = (arr_cat[1, :] ⋅ arr_cat[3, :]) / sum(arr_cat[1, :] .| arr_cat[3, :])
0.2

27.3.5 Hamming Distance

Hamming distance is defined as \[ d = \sum_{i=1}^{n} \mathbf{1}[w_i \neq v_i], \] i.e. the number of positions at which \(w\) and \(v\) differ. The distance is usually meaningful when applied to categorical data. The following Julia code shows the Hamming distance for the first and the third rows in arr_cat.

d₁₃ = sum(arr_cat[1, :] .⊻ arr_cat[3, :])
4
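As an aside, all of the measures above are also available ready-made in the Distances.jl package. The calls below sketch how they should line up with the hand-rolled versions; note that cosine_dist and jaccard return distances, i.e. one minus the corresponding similarity.

using Distances

v₁, v₂ = Array(df_num[1, :]), Array(df_num[2, :])
c₁, c₃ = arr_cat[1, :], arr_cat[3, :]

euclidean(v₁, v₂)        # L2 norm, ≈ 0.559 as above
cityblock(v₁, v₂)        # L1 norm, ≈ 0.75 as above
1 - cosine_dist(v₁, v₂)  # cosine similarity
1 - jaccard(c₁, c₃)      # Jaccard similarity
hamming(c₁, c₃)          # Hamming distance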

27.4 k-Nearest Neighbor (kNN) Clustering

kNN is primarily known as a classification algorithm, but it can also be used for clustering, particularly in the context of density-based clustering. Density-based clustering identifies regions of the data space where the density of data points is high and groups the points in these high-density regions. The core idea is to assign each data point to a cluster based on the density of its neighbors: a data point becomes a core point if it has at least a specified number of neighbors within a certain distance. The following code demonstrates the nearest-neighbor search that underlies such a procedure.

using Random, NearestNeighbors, CairoMakie

# Step 1: Generate synthetic data
Random.seed!(1234)  # For reproducibility
data = rand(10, 2)  # 10 points with 2 dimensions
println("Dataset:\n", data)

# Step 2: Create a KD-tree for efficient nearest neighbor search
# (NearestNeighbors.jl expects one point per column, so transpose the row-per-point matrix)
kdtree = KDTree(permutedims(data))

# Step 3: Define a query point (for which we want to find nearest neighbors)
query_point = [0.5, 0.5]

# Step 4: Find the nearest neighbors
# Specify how many neighbors to find
k = 1
indices, distances = knn(kdtree, query_point, k)

# Step 5: Display the results
println("\nQuery Point: ", query_point)
println("Indices of Nearest Neighbors: ", indices)
println("Distances to Nearest Neighbors: ", distances)

# Step 6: Visualize the points and the query
f = Figure()
axis = Axis(f[1, 1], title="Nearest Neighbors Search")
scatter!(data[:, 1], data[:, 2], label="Data Points", color=:blue)
scatter!([query_point[1]], [query_point[2]], label="Query Point", color=:red, marker=:cross, markersize=10)

# Highlight nearest neighbors
for idx in indices
    lines!([query_point[1], data[idx, 1]], [query_point[2], data[idx, 2]], color=:black)
end
f
Dataset:
[0.5798621201341324 0.5667043025437501; 0.4112941179498505 0.5363685687293304; … ; 0.7897644095351307 0.5743234852783174; 0.6960406981439002 0.6776499075995779]

Query Point: [0.5, 0.5]
Indices of Nearest Neighbors: [2]
Distances to Nearest Neighbors: [0.07597457975710152]
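To tie this back to the density-based idea described above, the following sketch reuses data and kdtree from the example and flags a point as a core point when at least min_neighbors other points lie within a radius ε; both thresholds are arbitrary values chosen purely for illustration.

# Flag core points: at least `min_neighbors` other points within radius `ε`
ε = 0.25
min_neighbors = 2

core = map(1:size(data, 1)) do i
    # inrange returns the indices of all points within ε, including point i itself
    neighbors = inrange(kdtree, data[i, :], ε)
    length(neighbors) - 1 >= min_neighbors
end

println("Core points: ", findall(core))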