Cluster Analysis


Introduction

Cluster analysis is a data-mining technique that identifies and groups observations or data points that are homogeneous, meaning they share similar characteristics or properties. Because it requires no labeled outcome, it is considered an unsupervised learning method, and it helps reveal underlying patterns and structures within the data.

Cluster analysis is a powerful tool that can be used for a variety of tasks, including:

  • Customer segmentation: Cluster analysis can be used to segment customers into groups based on their purchase behavior, demographics, or other factors. This can help businesses to target their marketing campaigns more effectively.
  • Fraud detection: Cluster analysis can be used to identify fraudulent transactions by finding groups of transactions that are similar in terms of their characteristics.
  • Product recommendation: Cluster analysis can be used to recommend products to users based on their past purchase history or other factors.
  • Anomaly detection: Cluster analysis can be used to identify anomalies or outliers in data. This can be helpful for identifying fraud, detecting system errors, or identifying new trends.

Types

Centroid-based clustering and hierarchical clustering are two distinct methods commonly used in cluster analysis.

In centroid-based clustering, the number of clusters is predetermined and specified before the clustering process begins. The algorithm aims to partition the observations into the specified number of clusters by minimizing the within-cluster sum of squares.
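
For example, k-means seeks the assignment of observations to clusters \(C_1, \dots, C_K\) with centroids \(\mu_1, \dots, \mu_K\) that minimizes

\(WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2\)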

In contrast, hierarchical clustering does not require specifying the number of clusters in advance. It generates a tree-like structure called a dendrogram, which shows the hierarchical relationships between the observations at different levels of similarity. By cutting the dendrogram at different heights, we can obtain clusterings for different numbers of clusters, ranging from 1 to the total number of observations.

Centroid-based Clustering

This is a partitional (flat) approach to clustering, where objects are assigned to non-overlapping clusters based on their proximity to a central point, called a centroid. The most commonly used centroid-based clustering algorithms include k-means, k-medoids, and fuzzy c-means.

Hierarchical Clustering

Hierarchical clustering creates a hierarchy of clusters by recursively merging or splitting them based on their similarity. It can be either agglomerative (bottom-up, where individual objects are successively merged into larger clusters) or divisive (top-down, where one large cluster is successively split into smaller ones).


R Code

Step 1: Load Data

df <- read.csv("df_coffee.csv")

Taking a sample of 20 observations for demonstration purposes:

set.seed(100)
df <- df[sample(1:nrow(df), size = 20),]

Always check descriptive statistics:

skimr::skim(df)
Table 1: Data summary

Name: df
Number of rows: 20
Number of columns: 16
Column type frequency: 1 character, 15 numeric
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Color 0 1 5 12 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Aroma 0 1 7.75 0.35 7.17 7.48 7.75 8.00 8.50 ▇▆▇▆▃
Flavor 0 1 7.77 0.34 7.17 7.56 7.75 7.94 8.50 ▃▃▇▃▁
Aftertaste 0 1 7.59 0.31 7.00 7.40 7.58 7.77 8.17 ▂▆▇▅▂
Acidity 0 1 7.67 0.30 7.08 7.50 7.71 7.81 8.25 ▂▃▇▃▁
Body 0 1 7.66 0.25 7.17 7.48 7.67 7.83 8.17 ▂▃▇▃▂
Balance 0 1 7.67 0.33 7.17 7.42 7.67 7.85 8.25 ▅▃▇▃▃
Uniformity 0 1 10.00 0.00 10.00 10.00 10.00 10.00 10.00 ▁▁▇▁▁
Clean.Cup 0 1 10.00 0.00 10.00 10.00 10.00 10.00 10.00 ▁▁▇▁▁
Sweetness 0 1 10.00 0.00 10.00 10.00 10.00 10.00 10.00 ▁▁▇▁▁
Overall 0 1 7.67 0.38 7.17 7.33 7.62 7.92 8.50 ▇▃▃▁▃
Defects 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
Total.Cup.Points 0 1 83.77 2.09 80.17 82.50 83.70 84.89 87.58 ▅▇▇▆▅
Moisture.Percentage 0 1 10.94 1.31 9.00 9.70 11.20 11.83 13.10 ▇▃▃▇▃
Category.One.Defects 0 1 0.15 0.49 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
Quakers 0 1 0.35 1.14 0.00 0.00 0.00 0.00 5.00 ▇▁▁▁▁

Four of the variables have no variability (a standard deviation of zero), so they should be removed:

df <- dplyr::select(df, -c("Uniformity","Clean.Cup","Sweetness","Defects"))

For now, we use only numeric variables, since most cluster analysis functions in R accept only numeric input. (Methods for clustering mixed data types, such as the Gower distance, are discussed later.)

df_num <- df[,-ncol(df)]  # removing the last variable in the dataset which is character type
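
A sketch of a more defensive alternative (assuming dplyr 1.0+ with the tidyselect where() helper): keep columns by type rather than by position, which gives the same result here.

df_num <- dplyr::select(df, where(is.numeric))  # keep all numeric columns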

Step 2: Complete Cases

Take only complete cases:

df_com <- na.omit(df_num)

Step 3: Scaling

Scaling the dataset is important because different features may have different scales, and if they are not scaled, then the features with larger scales will have a greater influence on the clustering results.

Common scaling techniques used in clustering include standardization (subtracting the mean and dividing by the standard deviation) and normalization (scaling the values to a specific range, typically [0, 1]).

\(X_{stand} = \frac{X - \mu_x}{\sigma_x}\)
\(X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}}\)

# standardization
df_scaled <- data.frame(scale(df_com, center = TRUE, scale = TRUE))

# normalization
# df_scaled <- scale(df_com, 
#                    center = apply(df_com, 2, min), 
#                    scale = apply(df_com, 2, function(x) {max(x) - min(x)}))  

Step 4: Optimal Number of Clusters

The optimal number of clusters depends on the specific dataset and the application. However, there are a few methods that can be used to help determine the optimal number of clusters in centroid-based clustering.

Elbow Method

This method involves plotting the within-cluster sum of squares (WSS) as a function of the number of clusters. The WSS is a measure of how well the data points are clustered together. The elbow method works by looking for the point in the WSS curve where the rate of decrease starts to slow down. This point is often considered to be the optimal number of clusters.

library(factoextra)
fviz_nbclust(df_scaled, kmeans, method = "wss") +  
  geom_vline(xintercept = 2, linetype = 2) + # add line for better viz
  labs(subtitle = "Elbow method") # add subtitle

Silhouette Score/Width

The silhouette score measures how well each data point fits its own cluster compared to other clusters. It ranges from -1 to 1: a value near 1 indicates that the data point is well matched to its own cluster, a value near 0 indicates that it lies on the border between two clusters, and a negative value indicates that it is probably assigned to the wrong cluster (it is closer, on average, to a neighboring cluster). The optimal number of clusters is often the one that maximizes the average silhouette width.
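
For an observation \(i\), the silhouette width is

\(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\)

where \(a(i)\) is the average distance from \(i\) to the other points in its own cluster and \(b(i)\) is the average distance from \(i\) to the points in the nearest other cluster.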

fviz_nbclust(df_scaled, kmeans, method = "silhouette") +  
  labs(subtitle = "Silhouette method")

To compute Silhouette information for a cluster:

library(cluster)
km_res <- kmeans(df_scaled, centers = 2)  # k-means solution with 2 clusters, used to compute silhouette widths

sil <- silhouette(km_res$cluster, dist(df_scaled))
plot(sil)

fviz_silhouette(sil) + theme_minimal()
#   cluster size ave.sil.width
# 1       1   12          0.26
# 2       2    8          0.48

A positive silhouette coefficient indicates that an observation is well-matched to its own cluster.

Gap Statistic Method

The gap statistic helps determine the optimal number of clusters by comparing the observed within-cluster dispersion with a reference distribution. The idea is to select the number of clusters that maximizes the gap statistic. A larger gap indicates a more significant deviation from randomness, suggesting a better-defined clustering structure. By comparing the gaps for different numbers of clusters, we can identify the number of clusters that provides the most distinct and meaningful clustering pattern.
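
Formally, for \(k\) clusters with pooled within-cluster dispersion \(W_k\), the gap statistic is

\(\text{Gap}_n(k) = E_n^{*}[\log W_k] - \log W_k\)

where the expectation is taken under a reference distribution with no cluster structure. In practice, either the \(k\) that maximizes the gap or the smallest \(k\) satisfying \(\text{Gap}(k) \ge \text{Gap}(k+1) - s_{k+1}\) (where \(s_{k+1}\) is the simulation standard error) is chosen.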

fviz_nbclust(df_scaled, kmeans, method = "gap_stat") +
  labs(subtitle = "Gap statistic method")

Combination of All

A more reliable approach is to try several methods and go with the number of clusters that the majority of them agree on.

The parameters package helps in this case:

library(parameters)
n_clust <- n_clusters(df_scaled,
                      package = c("easystats", "NbClust", "mclust"),
                      standardize = TRUE
                      )
print(n_clust)
# # Method Agreement Procedure:
# 
# The choice of 2 clusters is supported by 7 (31.82%) methods out of 22 (Elbow, Silhouette, kl, Duda, Pseudot2, Ratkowsky, Mcclain).
plot(n_clust)

Step 5: Hierarchical Clustering

Step 5.1: Distance

Numeric Data

Use the function dist():

df_dist_euc <- dist(df_scaled, method = 'euclidean')
# df_dist_max <- dist(df_scaled, method = 'maximum')
# df_dist_man <- dist(df_scaled, method = 'manhattan')
# df_dist_can <- dist(df_scaled, method = 'canberra')
# df_dist_mink <- dist(df_scaled, method = 'minkowski')

Categorical Data

# dist(dummified_data, method = "binary")
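
For example, a sketch using the categorical Color column from this dataset (object names are illustrative): dummy-code the variable, then compute the binary distance.

df$Color <- as.factor(df$Color)  # make sure the character column is a factor
dummified_data <- model.matrix(~ Color - 1, data = df)  # one dummy column per level of Color
df_dist_bin <- dist(dummified_data, method = "binary")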

Gower Distance

Gower distance is a dissimilarity measure commonly used for clustering mixed numerical and categorical data. It takes the different variable types into account and calculates the dissimilarity between observations based on their attribute values.

For categorical variables, Gower distance measures dissimilarity as the proportion of attributes on which two observations differ. For numerical variables, it uses the absolute difference between the values scaled by the variable's range. The overall Gower distance is then the weighted average of the per-variable dissimilarities.
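
Concretely, for observations \(i\) and \(j\) measured on \(p\) variables with weights \(w_k\) (usually 1),

\(d(i,j) = \frac{\sum_{k=1}^{p} w_k\, d_{ijk}}{\sum_{k=1}^{p} w_k}\)

where \(d_{ijk} = |x_{ik} - x_{jk}|/R_k\) (the absolute difference scaled by the range \(R_k\)) for a numeric variable, and \(d_{ijk} = 1\) if the two observations differ on a categorical variable and \(0\) otherwise.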

df <- dplyr::mutate_if(df, is.character, as.factor) # convert character to factor

df_dist_gower <- cluster::daisy(df, metric = "gower") # Dissimilarity Matrix Calculation

Step 5.2: Linkage Criteria

Linkage refers to the way in which clusters are merged together. There are several different linkage criteria, and the choice of criterion can have a significant impact on the clustering results:

  • Complete linkage: Maximum distance between two sets
  • Single linkage: Minimum distance between two sets
  • Average linkage: Average distance between two sets
  • Ward’s linkage: Minimizes the increase in total within-cluster variance when two sets are merged

Using the Euclidean distance on the numeric data, the following code fits hierarchical clusterings with different linkage methods:

hclust_single <- hclust(df_dist_euc, method = "single")
hclust_complete <- hclust(df_dist_euc, method = "complete")
hclust_average <- hclust(df_dist_euc, method = "average")
hclust_centroid <- hclust(df_dist_euc, method = "centroid")
hclust_ward <- hclust(df_dist_euc, method = "ward.D2")

Get the cluster information:

cutree(hclust_ward, k = 2) # specify number of clusters
# 202 102 112 151 198   4  55  70  98 135   7 183  43 189 140  51 146  25   2 179 
#   1   1   1   1   1   2   2   2   2   1   2   1   2   1   1   2   1   2   2   1
cutree(hclust_ward, h = 0.7) # specify height to cut
# 202 102 112 151 198   4  55  70  98 135   7 183  43 189 140  51 146  25   2 179 
#   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20

Summary by clusters:

library(dplyr)
df_num %>%
  mutate(cluster = cutree(hclust_ward, k = 2)) %>% 
  group_by(cluster) %>%
  summarise(across(everything(), list(Average = mean))) 
# # A tibble: 2 × 12
#   cluster Aroma_Average Flavor_Average Aftertaste_Average Acidity_Average
#     <int>         <dbl>          <dbl>              <dbl>           <dbl>
# 1       1          7.55           7.55               7.37            7.48
# 2       2          8.00           8.03               7.85            7.91
# # ℹ 7 more variables: Body_Average <dbl>, Balance_Average <dbl>,
# #   Overall_Average <dbl>, Total.Cup.Points_Average <dbl>,
# #   Moisture.Percentage_Average <dbl>, Category.One.Defects_Average <dbl>,
# #   Quakers_Average <dbl>

A dendrogram helps visualize the hierarchy and determine the number of clusters to use, or the height at which to cut.

Step 5.3: Dendrogram

Pass the hclust object to the plot function:

plot(hclust_ward, hang = -1)
rect.hclust(hclust_ward, k = 2, border = "blue")

Using dendextend package:

# install.packages("dendextend")
library(dendextend)

plot(color_branches(dend = hclust_ward, k = 2)) # specify k, can also use h to specify height

plot(color_branches(dend = hclust_ward, k = 2), 
     # parameters for nodes
     nodePar = list(col = 3:2, 
                    cex = c(1.5, 0.75), 
                    pch =  21:22,
                    bg =  c("light blue", "pink"),
                    lab.cex = 0.75, 
                    lab.col = "tomato"))

# function to get color labels; Change color and cut based on number of clusters
colLab <- function(n, labelColors = c("#036564", "#EB6841"), cut = 2) {
  clusMember = cutree(hclust_ward, cut)
  if (is.leaf(n)) {
    a <- attributes(n)
    labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
    attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
  }
  n
}

hclust_ward %>% 
  color_branches(k = 2, col = c("#036564", "#EB6841")) %>% 
  dendrapply(colLab) %>%  # applying to all nodes
  plot()  # finally plotting

library("ape")
plot(as.phylo(hclust_ward), type = "unrooted", cex = 0.7, no.margin = TRUE)

Step 6: K-means Clustering

Categorical data cannot be passed to the kmeans() function in R: kmeans() relies on the Euclidean distance between observations, and this metric is not defined for categorical data.
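
If categorical variables must be included, one workaround (a sketch, not part of the workflow above) is k-medoids via cluster::pam(), which accepts an arbitrary dissimilarity matrix such as the Gower dissimilarity from Step 5.1:

pam_res <- cluster::pam(df_dist_gower, k = 2, diss = TRUE)  # k-medoids on the Gower dissimilarity
table(pam_res$clustering)  # cluster sizes

For the purely numeric, scaled data used here, k-means can be applied directly: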

km <- kmeans(df_scaled, centers = 2) # centers specify number of clusters
km
# K-means clustering with 2 clusters of sizes 12, 8
# 
# Cluster means:
#        Aroma     Flavor Aftertaste    Acidity       Body    Balance    Overall
# 1 -0.5530586 -0.5889311 -0.6267575 -0.5641766 -0.5577865 -0.6337761 -0.6825517
# 2  0.8295879  0.8833967  0.9401363  0.8462649  0.8366797  0.9506641  1.0238275
#   Total.Cup.Points Moisture.Percentage Category.One.Defects    Quakers
# 1       -0.6521312           0.3165048            0.2043483  0.2052711
# 2        0.9781969          -0.4747571           -0.3065225 -0.3079067
# 
# Clustering vector:
# 202 102 112 151 198   4  55  70  98 135   7 183  43 189 140  51 146  25   2 179 
#   1   1   1   1   1   2   2   2   1   1   2   1   2   1   1   2   1   2   2   1 
# 
# Within cluster sum of squares by cluster:
# [1] 88.09225 26.35215
#  (between_SS / total_SS =  45.2 %)
# 
# Available components:
# 
# [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
# [6] "betweenss"    "size"         "iter"         "ifault"

Visualizing clusters:

fviz_cluster(km, df_scaled, ggtheme = theme_minimal())

Visualizing clusters in terms of variables in the dataset:

library(ggplot2)
df_km <- mutate(df_num, cluster = km$cluster)
df_km %>% ggplot(aes(x = Aroma, y = Aftertaste)) + 
  geom_point(aes(color = as.factor(cluster))) +
  labs(color = "Clusters") + theme_minimal()

Summary on clusters:

df_km %>%
  group_by(cluster) %>%
  summarise(across(everything(), list(Average = mean))) 
# # A tibble: 2 × 12
#   cluster Aroma_Average Flavor_Average Aftertaste_Average Acidity_Average
#     <int>         <dbl>          <dbl>              <dbl>           <dbl>
# 1       1          7.56           7.57               7.40            7.50
# 2       2          8.04           8.06               7.88            7.93
# # ℹ 7 more variables: Body_Average <dbl>, Balance_Average <dbl>,
# #   Overall_Average <dbl>, Total.Cup.Points_Average <dbl>,
# #   Moisture.Percentage_Average <dbl>, Category.One.Defects_Average <dbl>,
# #   Quakers_Average <dbl>
