Principal Component Analysis - Theory and Application with R

Md Ahsanul Islam | Monday, Mar 21, 2022

Introduction

Karl Pearson invented PCA in 1901, introducing it in his paper [“On lines and planes of closest fit to systems of points in space”](https://www.tandfonline.com/doi/abs/10.1080/14786440109462720). Harold Hotelling later developed and named it independently in the 1930s.

Principal component analysis (PCA) is a technique for transforming the original set of variables into a smaller set of linear combinations of those variables, so that most of the statistical information (variance) in the original data is preserved by the linear combinations. It accomplishes this by generating new uncorrelated variables that successively maximize variance. Finding such new variables, known as principal components (PCs), is essentially the same as solving an eigenvalue/eigenvector problem.1

An eigenvector defines a direction in the space of the original dataset, and its eigenvalue gives the amount of variance in the data that is explained along that direction. So the eigenvector corresponding to the largest eigenvalue gives the first principal component.
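As a minimal illustration in R (with a made-up 2×2 covariance matrix, not the dataset analysed later), eigen() returns both pieces at once -

Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2)   # hypothetical covariance matrix
e <- eigen(Sigma)
e$values    # eigenvalues: variance explained along each direction
e$vectors   # columns are the eigenvectors; the first column points along the first PC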

Geometrically speaking, PCA is a linear transformation of a dataset $X$ containing values of a given number of variables (coordinates) for a given number of observations. The transformation fits the dataset to a new coordinate system in such a way that the first coordinate (component) captures the most variation, and each subsequent coordinate is orthogonal to the previous ones and captures less variance.2,3


Underlying Theory

Pre-requisites

Before diving into PCA, there are a few things one should be familiar with -

  1. Linear Algebra
  2. Statistics
    • Mean vector
    • Variance-Covariance Matrix
    • Correlation Matrix

Theory

Let $p$ be the number of variables, $n$ the number of observations, $X$ a $p \times n$ data matrix, the mean vector $\bar{X}$ a zero vector, and the variance-covariance matrix $\Sigma$ a real positive definite matrix, or at least positive semi-definite of rank $r \le p$.

Let $\lambda_i,\ i=1,2,\dots,p$ be the eigenvalues of $\Sigma$, where $\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_p$, and let their corresponding eigenvectors be $\gamma_1, \gamma_2, \gamma_3, \dots, \gamma_p$.

Here, $D$ is a diagonal matrix whose diagonal elements are the eigenvalues - $$ D =\begin{bmatrix} \lambda_1 & 0 & … & 0\newline 0 & \lambda_2 & … & 0\newline . & . & … & .\newline 0 & 0 & … & \lambda_p \end{bmatrix}_{p\ \times\ p} $$

and the matrix of eigenvectors, $$ \Gamma =\begin{bmatrix} \gamma_1,\ \gamma_2,\ …\ ,\gamma_p\ \end{bmatrix}_{p\ \times\ p} $$

The matrix $\Gamma$ diagonalizes $\Sigma$ such that $D = \Gamma' \Sigma \Gamma$, or equivalently, $\Sigma = \Gamma D \Gamma'$.

Since $\Gamma$ is an orthogonal matrix, $\Gamma\Gamma' = \Gamma'\Gamma = I$.4

We consider an orthogonal transformation (that will preserve all the information) of $X$ into $Y$ by -
$$Y = \Gamma'X$$

or, written out component-wise,
$$\begin{bmatrix} Y_1\newline Y_2\newline \vdots\newline Y_p \end{bmatrix} = \Gamma' \begin{bmatrix} X_1\newline X_2\newline \vdots\newline X_p \end{bmatrix}$$

Here, $Y_1, Y_2, \dots, Y_p$ are the $p$ components of $Y$ and are called the Principal Components.

For example, if $X = [x_1\ \ x_2]$ is a two-variable dataset and the first eigenvector is $[a\ \ b]'$,

then the first principal component will be - $P_1 = x_1 a + x_2 b$

Similarly, if the second eigenvector is $[c\ \ d]'$, then the second principal component will be - $P_2 = x_1 c + x_2 d$
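A small sketch of the same idea in R, using made-up two-variable data (not the dataset analysed below): each principal component score is just such a linear combination of the columns -

set.seed(1)
x1 <- rnorm(50)
x2 <- 0.6 * x1 + rnorm(50, sd = 0.5)
X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)   # centered two-variable data

gamma <- eigen(cov(X))$vectors                      # columns are the eigenvectors
P1 <- X[, 1] * gamma[1, 1] + X[, 2] * gamma[2, 1]   # first principal component
P2 <- X[, 1] * gamma[1, 2] + X[, 2] * gamma[2, 2]   # second principal component

# same scores as the matrix form (rows are observations here, so the scores are X %*% gamma)
all.equal(cbind(P1, P2), X %*% gamma, check.attributes = FALSE)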


Preserving Variance

How does the technique preserve the total variation after the transformation?

Since $\Sigma$ is the variance-covariance matrix of $X$, we can measure the following things:

  1. Overall measure of the variability of $X$ as $\operatorname{trace}(\Sigma)$ or,
  2. Generalized variance of $X$ as $|\Sigma|$

Now, $\operatorname{trace}(\Sigma)$
$= \operatorname{trace}(\Gamma D \Gamma')$
$= \operatorname{trace}(\Gamma'\Gamma D)$ [Cyclic property]
$= \operatorname{trace}(D)$ [Since $\Gamma$ is an orthogonal matrix]
$= \sum_{i=1}^{p}\lambda_i$

Again, $|\Sigma|$
$= |\Gamma D \Gamma'|$
$= |\Gamma|\,|D|\,|\Gamma'|$
$= |D|$
$= \prod_{i=1}^{p}\lambda_i$

Now we have the transformation $Y = \Gamma'X$.

Here, $E[Y] = E[\Gamma'X] = \Gamma'E[X] = 0$
and $V(Y) = E[YY'] = E[\Gamma'XX'\Gamma] = \Gamma'E[XX']\Gamma = \Gamma'\Sigma\Gamma = D$

Thus, for $Y$, the overall measure of variability is $\operatorname{trace}(D) = \sum_{i=1}^{p}\lambda_i$
and the generalized variance is $|D| = \prod_{i=1}^{p}\lambda_i$.

This shows that the total variation of $X$ remains unchanged in $Y$.


Measuring Accounted Variance by PCs

How to know the percentage of variation that is explained by a principal component?

Since $\Sigma$ is a positive definite matrix, the strict positivity of the $\lambda_i$ is guaranteed.

Now, $Y_1$ is the first principal component, corresponding to the first eigenvalue $\lambda_1$; similarly, $Y_i$ is the $i$-th principal component, corresponding to the $i$-th eigenvalue $\lambda_i$.

The percentage of variation of the original data $X$ explained by the $i$-th principal component is calculated as $$ \frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j} \times 100\% $$

Since the eigenvalues were ordered at the outset such that $\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_p$, the first principal component explains the largest amount of variation.
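Both results are easy to verify numerically. A quick sketch with a made-up covariance matrix (any positive definite matrix will do) -

Sigma <- matrix(c(2.0, 0.8, 0.3,
                  0.8, 1.5, 0.4,
                  0.3, 0.4, 1.0), nrow = 3)   # hypothetical covariance matrix
lambda <- eigen(Sigma)$values

all.equal(sum(diag(Sigma)), sum(lambda))   # trace(Sigma) equals the sum of eigenvalues
all.equal(det(Sigma), prod(lambda))        # |Sigma| equals the product of eigenvalues

round(lambda / sum(lambda) * 100, 2)       # percentage of variation explained by each PC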


Practical Demonstration

Dataset

Importing a dataset -

data <- readxl::read_excel("data/1.2 data.xlsx")
data$Family <- ifelse(data$Family==0, "Serranidae",
                      ifelse(data$Family==1, "Carangidae", "Sparidae"))
df_unstd <- data[,-8]

Download the dataset

Data description:

  1. x1: centrum’s length
  2. x2: anterior centrum’s width
  3. x3: posterior centrum’s width
  4. x4: centrum’s height
  5. x5: distance between the base of the neural spine and the posterior face of the centrum
  6. x6: distance between the base of the haemal spine and the posterior face of the centrum
  7. x7: length of anterior zygapophysis
  8. Family: family of fish specimens; 0 - Serranidae, 1 - Carangidae, 2 - Sparidae

Correlation plots of the original dataset and of the principal-component-transformed dataset (we’ll construct it later) -

library(ggcorrplot)
pc_prcomp <- prcomp(df_unstd, center = TRUE, scale. = TRUE)
corr <- cor(df_unstd)
a <- ggcorrplot(corr, show.legend = F, 
           hc.order = F, type = "upper", lab = T, 
           digits = 2, method = "square",
           colors = c("#6D9EC1", "white", "#E46726")) +
  labs(title = "Correlations Matrix: Original dataset")
b <- ggcorrplot(cor(pc_prcomp$x), show.legend = F, 
           hc.order = F, type = "upper", lab = T, 
           digits = 2, method = "square",
           colors = c("#6D9EC1", "white", "#E46726")) +
  labs(title = "Correlations Matrix: Principal Component Transformed")
ggpubr::ggarrange(a, b, 
          ncol = 2, nrow = 1)

The idea is that if many variables correlate with one another, they will all contribute strongly to the same principal component.5

Manual Calculation

The principal components can be found using either of the following -

  1. Variance-covariance matrix
  2. Correlation matrix

If the data are not standardized, it is suggested to use the correlation matrix for principal component analysis, because the variance of a variable is affected by its scale of measurement.

If the data are standardized, however, the variance-covariance matrix and the correlation matrix give the same principal components. So, it is recommended to standardize the data before performing principal component analysis.

Standardization is done using the following formula - $$ Z = \frac{X-\mu}{\sigma} $$

Calculating mean vector -

mn <- colMeans(df_unstd)   # Mean vector
mn
#       x1       x2       x3       x4       x5       x6       x7 
# 7.297222 5.811111 5.672222 5.672222 2.552778 3.677778 2.369444

Variance-covariance matrix -

var(df_unstd)
#           x1          x2         x3         x4          x5         x6        x7
# x1 1.3442778  0.90374603 0.93420635  0.9853492  0.35529365 0.52679365 0.3961984
# x2 0.9037460  1.28387302 1.17974603  1.4508889 -0.09203175 0.57996825 0.1774921
# x3 0.9342063  1.17974603 1.35520635  1.5217778  0.06007937 0.58936508 0.2674127
# x4 0.9853492  1.45088889 1.52177778  2.0803492 -0.11277778 0.79450794 0.2434127
# x5 0.3552937 -0.09203175 0.06007937 -0.1127778  0.45627778 0.04749206 0.2662302
# x6 0.5267937  0.57996825 0.58936508  0.7945079  0.04749206 0.38234921 0.1581587
# x7 0.3961984  0.17749206 0.26741270  0.2434127  0.26623016 0.15815873 0.2718968

Extracting the standard deviations -

vardf_unstd <- diag(var(df_unstd))^0.5
vardf_unstd
#        x1        x2        x3        x4        x5        x6        x7 
# 1.1594299 1.1330812 1.1641333 1.4423416 0.6754834 0.6183439 0.5214373

Subtracting each column's mean and dividing by its standard deviation to obtain the standardized dataset -

df_std <- as.matrix((df_unstd - rep(mn,each = nrow(df_unstd)))/rep(vardf_unstd, each = nrow(df_unstd))) 
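The same standardization can be done in one step with base R's scale(), which centers and scales each column by default; a quick check (ignoring attributes) that the two approaches agree -

df_std_scale <- scale(df_unstd)    # center and scale every column
all.equal(df_std, df_std_scale, check.attributes = FALSE)   # should be TRUE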

Mean vector of the new dataset -

round(colMeans(df_std))
# x1 x2 x3 x4 x5 x6 x7 
#  0  0  0  0  0  0  0

Using Correlation Matrix

Using correlation matrix of standardized data to calculate eigenvalues -

eigcorr <- eigen(cor(df_std))      # using correlation matrix on standardized data
output <- cbind(eigenvalue = round(eigcorr$values,2),
      cumul_percentage = round(cumsum(eigcorr$values/sum(eigcorr$values))*100, 2))
output   # display eigenvalues and cumulative percentage of variance

It is seen that the first principal component explains 64.32% of the total variation in the data, the second explains 25.35%, and the third explains 4.04%.

Thus the first two principal components together explain 89.66% of the total variance.

Using correlation matrix of unstandardized data to calculate eigenvalues -

eigcorr <- eigen(cor(df_unstd))      # using correlation matrix on unstandardized data
output <- cbind(eigenvalue = round(eigcorr$values,2),
                cumul_percentage = round(cumsum(eigcorr$values/sum(eigcorr$values))*100, 2))
output   # same eigenvalues as with the standardized data

So, it is verified that the principal components are not affected by the scale of measurement when the correlation matrix is used.
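This is expected, because the correlation matrix itself is invariant to rescaling of the variables -

all.equal(cor(df_std), cor(df_unstd), check.attributes = FALSE)   # should be TRUE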

The eigenvectors -

round(eigcorr$vectors,3)
#        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
# [1,] -0.402  0.223  0.778  0.114  0.097  0.398 -0.055
# [2,] -0.423 -0.263  0.218 -0.253  0.268 -0.726 -0.202
# [3,] -0.438 -0.137 -0.157 -0.619 -0.225  0.246  0.519
# [4,] -0.423 -0.271 -0.352  0.071 -0.176  0.340 -0.688
# [5,] -0.110  0.706 -0.027 -0.158 -0.574 -0.297 -0.213
# [6,] -0.439 -0.084 -0.139  0.713 -0.251 -0.210  0.407
# [7,] -0.291  0.532 -0.423  0.027  0.667  0.072  0.053

So the principal component transformation of the original data will be -

princr_df <- df_std %*% eigcorr$vectors
# View the principal components
library(knitr)       # for kable()
library(kableExtra)  # for kable_styling() and scroll_box()
scroll_box(height = "300px", 
           kable_styling(
             kable(
               princr_df, digits = 2, format = "html", col.names = paste0("PC", 1:7)
             ),
             bootstrap_options = c("striped", "hover"),
             full_width = T, font_size = 12, position = "left"))
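Apart from possible sign flips of individual columns, these manually computed scores should agree with the ones prcomp() produced earlier (pc_prcomp$x); a small check comparing absolute values to sidestep the sign ambiguity -

all.equal(abs(princr_df), abs(pc_prcomp$x), check.attributes = FALSE)   # TRUE up to column signs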

Using Variance-Covariance Matrix

Using variance-covariance matrix of standardized data to calculate eigenvalues -

eigcov <- eigen(cov(df_std))    # using variance-covariance matrix on standardized data
output <- cbind(eigenvalue = round(eigcov$values,2),
                cumul_percentage = round(cumsum(eigcov$values/sum(eigcov$values))*100,2))
output   # matches the correlation-matrix results above

Using variance-covariance matrix of unstandardized data to calculate eigenvalues -

eigcov <- eigen(cov(df_unstd))    # using variance-covariance matrix on unstandardized data
output <- cbind(eigenvalue = round(eigcov$values,2),
                cumul_percentage = round(cumsum(eigcov$values/sum(eigcov$values))*100,2))
output   # different eigenvalues: the covariance matrix depends on the scales

So the principal components are influenced by the scale of measurement when the variance-covariance matrix is used. Hence, it is recommended to standardize the data before calculating the principal components.
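The standardized case matches the correlation-based results because the variance-covariance matrix of standardized data is exactly the correlation matrix of the original data -

all.equal(cov(df_std), cor(df_unstd), check.attributes = FALSE)   # should be TRUE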

Using prcomp()

To make things easier in R, there are two built-in functions to perform PCA - prcomp() and princomp(). The function princomp() uses the spectral decomposition approach, while prcomp() and PCA() [from the FactoMineR package] use singular value decomposition (SVD).

Note that there are two general methods to perform PCA in R:

1. Spectral decomposition which examines the covariances / correlations between variables
2. Singular value decomposition which examines the covariances / correlations between individuals

Due to its slightly higher numerical accuracy, prcomp() is preferred over princomp(). Standardization can be done by passing the arguments center = TRUE and scale. = TRUE to the prcomp() function -

pc_prcomp <- prcomp(df_unstd, center = TRUE, scale. = TRUE)
summary(pc_prcomp)
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
# Standard deviation     2.1218 1.3320 0.53204 0.43966 0.38258 0.25684 0.18685
# Proportion of Variance 0.6432 0.2535 0.04044 0.02761 0.02091 0.00942 0.00499
# Cumulative Proportion  0.6432 0.8966 0.93706 0.96468 0.98559 0.99501 1.00000

The result is consistent with the manual calculation.

The function prcomp() returns the following things -

names(pc_prcomp)
# [1] "sdev"     "rotation" "center"   "scale"    "x"

x denotes the transformed dataset/principal components -

pc_prcomp$x

sdev contains the standard deviations of the principal components; squaring them gives the eigenvalues (variances) -

pc_prcomp$sdev^2   # eigenvalues (variances)
# [1] 4.50218058 1.77420306 0.28306214 0.19330296 0.14637109 0.06596874 0.03491141
apply(pc_prcomp$x, 2, var)  # variance of the principal components
#        PC1        PC2        PC3        PC4        PC5        PC6        PC7 
# 4.50218058 1.77420306 0.28306214 0.19330296 0.14637109 0.06596874 0.03491141

In rotation, the columns are the eigenvectors -

pc_prcomp$rotation
#           PC1         PC2         PC3         PC4         PC5         PC6
# x1 -0.4015216 -0.22330581  0.77779224  0.11376703 -0.09709303  0.39818828
# x2 -0.4228354  0.26263576  0.21829956 -0.25341590 -0.26839234 -0.72641133
# x3 -0.4382952  0.13669090 -0.15742089 -0.61921884  0.22529984  0.24584891
# x4 -0.4229505  0.27070943 -0.35156231  0.07051460  0.17588063  0.33970948
# x5 -0.1100766 -0.70615050 -0.02748501 -0.15839631  0.57425067 -0.29742388
# x6 -0.4385403  0.08385339 -0.13877297  0.71318889  0.25070443 -0.21049616
# x7 -0.2908030 -0.53245569 -0.42307656  0.02655884 -0.66650427  0.07182518
#            PC7
# x1 -0.05504745
# x2 -0.20161965
# x3  0.51942077
# x4 -0.68769497
# x5 -0.21252061
# x6  0.40693163
# x7  0.05327880
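These pieces fit together: the scores in pc_prcomp$x are simply the centered-and-scaled data multiplied by the rotation matrix; a quick check -

all.equal(pc_prcomp$x,
          scale(df_unstd) %*% pc_prcomp$rotation,
          check.attributes = FALSE)   # should be TRUE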

Individual Points Plot

If we visualize the data with different colors for the different levels of Family, we can identify the families as clusters, because each family is likely to have specific characteristics that are now almost entirely captured by the first two principal components -

library(tidyverse)
data.frame(pc_prcomp$x, Family = data$Family) %>% 
  ggplot(aes(x=PC1, y=PC2, color=Family)) +
  geom_point(size=2) + theme_bw() +
  labs(title = "First 2 prinipal components",
       x = "Principal Component 1", y ="Principal Component 2",
       col = "Family") 

3D Plot

The following 3D plot shows the points for the first 3 principal components -

library(plotly)
df2 <- data.frame(pc_prcomp$x, Family = data$Family)
df2 %>% 
  plot_ly(x = ~PC1, y = ~PC2, z = ~PC3, color = ~Family) %>% 
  add_markers(
         hovertemplate = paste(
           '<br><b>PC1</b>: %{x:.2f}',
           '<br><b>PC2</b>: %{y:.2f}',
           '<br><b>PC3</b>: %{z:.2f}</b>')) %>% 
  layout(
    # setting axis titles
    scene = list(xaxis = list(title = 'PC1'),
                 yaxis = list(title = 'PC2'),
                 zaxis = list(title = 'PC3')),
    legend = list(title=list(text='<b> Family </b>'),
                  orientation = "h",   # show entries horizontally
                  xanchor = "center",  # use center of legend as anchor
                  x = 0.5))            # put legend in center of x-axis 

Scree plot

A scree plot is useful for determining the number of principal components to retain in a principal component analysis, or the number of factors to keep in an exploratory factor analysis -

Using the manual calculation’s output eigcorr$values to draw a scree plot -

plot(x = 1:7, y = eigcorr$values, type = "b", pch = 19,
     col="#952C22", main = "Scree plot", las = 1, 
     ylab = "Variances Explained", xlab = "",
     xaxt = "n"  # to hide x-axis labels
     )
axis(side = 1, at = 1:7, labels = paste0("PC",1:7))

Raymond B. Cattell introduced the scree plot in 1966. It is used to identify the components to treat as significant; the procedure is also known as the scree test.

According to the scree test, the components that are at the left side of the elbow are to be retained as significant.6

Another way to select the components to retain is to inspect the variances of the principal components: a component is retained if its variance is greater than the average of all the variances, which in the case of standardized data is -

mean(eigcorr$values)
# [1] 1

The average will be 1 regardless of whether the variance-covariance matrix or the correlation matrix was used to obtain the eigenvalues and eigenvectors, provided the data were standardized beforehand.
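Applying this retention rule to the eigenvalues computed earlier (eigcorr$values) -

which(eigcorr$values > mean(eigcorr$values))   # components with above-average variance
# [1] 1 2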

The scree plot -

plot(x = 1:7, y = eigcorr$values, type = "b", pch = 19,
     col="#952C22", main = "Scree plot", las = 1, 
     ylab = "Variances Explained", xlab = "",
     xaxt = "n"  # to hide x-axis labels
     )
axis(side = 1, at = 1:7, labels = paste0("PC",1:7))
abline(h=1, col="#2E7BF9")

So the first two principal components will be kept.

Extra:

To make a prettier scree plot using ggplot2, with a secondary y-axis that shows the percentage of variance explained -

library(tidyverse)
# creating a tidy data frame
data.frame(PC= paste0("PC",1:length(pc_prcomp$sdev)),
           Variance=pc_prcomp$sdev^2) %>%
  # using the data frame to make plot
  ggplot(aes(x=PC, y=Variance, group=1)) +
  geom_point(size=1, col = "red") +
  geom_line(col = "red") + theme_bw() +
  scale_y_continuous(
    name = "Variance",
    # adding a secondary axis
    sec.axis = sec_axis(
      trans = ~ . / sum((pc_prcomp$sdev) ^ 2),
      # name of secondary axis
      name = "Percentage of Variance Explained",
      labels = function(x) {
        paste0(round(x * 100), "%")
      }
    )
    ) +
  geom_line(y = 1, color = "#2E7BF9", linetype="dashed") +
  labs(title="Scree plot", x = "Principal Components")

Constructing a scree plot using the function fviz_screeplot() (an alias of fviz_eig()) from the factoextra package -

library(factoextra)
fviz_screeplot(pc_prcomp, addlabels = T, linecolor = "red",
               choice = "eigenvalue", geom = "line") +
  geom_line(y = 1, color = "#2E7BF9", linetype="dashed") +
  xlab("Principal Components")

FactoMineR’s PCA()

Aside from using prcomp(), the function PCA() from FactoMineR package can be used which extracts principal components from the correlation matrix -

library(FactoMineR)
pc_fmr <- PCA(df_unstd, graph = F, scale.unit = T)

To see the eigenvalues of the components and their corresponding information -

pc_fmr$eig
#        eigenvalue percentage of variance cumulative percentage of variance
# comp 1 4.50218058             64.3168655                          64.31687
# comp 2 1.77420306             25.3457580                          89.66262
# comp 3 0.28306214              4.0437449                          93.70637
# comp 4 0.19330296              2.7614709                          96.46784
# comp 5 0.14637109              2.0910156                          98.55885
# comp 6 0.06596874              0.9424106                          99.50127
# comp 7 0.03491141              0.4987345                         100.00000

Plotting the individual points of the first two principal components in a scatterplot -

plot.PCA(pc_fmr, axes = c(1,2))

Biplot

We have already constructed PCs using different functions in R. The signs of the components are not the same in all cases. If we construct biplots for the two functions used to calculate the PCs (base R's prcomp() and FactoMineR's PCA()) using the function fviz_pca_biplot() from factoextra, we get -

ggpubr::ggarrange(
fviz_pca_biplot(pc_fmr,              # using - FactoMineR's PCA()
                repel = TRUE,        # to avoid overplotting
                axes = c(1, 2),      # To plot 1st and 2nd PC
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969",  # Individuals color
                title = "PCA - Using FactoMineR's PCA()"
                ),
fviz_pca_biplot(pc_prcomp,           # using - base R's prcomp()
                repel = TRUE,        # to avoid overplotting
                axes = c(1, 2),      # To plot 1st and 2nd PC
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969",  # Individuals color
                title = "PCA - Using base R's prcomp()"),
ncol = 1, nrow = 2)

Here, whether a positive or a negative PC is found, depending on the function used, simply means that the eigenvector points in one direction or 180 degrees away in the opposite direction. The interpretation remains the same regardless of the sign, because the variance that is explained is unchanged.7
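A tiny check of this in R: flipping the sign of a component changes neither its variance nor its (absolute) correlation with the corresponding component from the other function -

all.equal(var(pc_prcomp$x[, 1]), var(-pc_prcomp$x[, 1]))   # TRUE: explained variance is unchanged
cor(pc_prcomp$x[, 1], pc_fmr$ind$coord[, 1])               # ±1: the two PC1s differ only in sign (and a scale factor)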

Another visualization using the ggbiplot() function from ggbiplot package -

# remotes::install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(pc_fmr, groups=as.factor(data$Family), ellipse = TRUE) +
  labs(title = "Biplot: Using the first two PCs",
       col = "Family") + theme_bw()

References

  • Textbook: Bhuyan KC. Multivariate Analysis and Its Applications. New Central Book Agency; 2005.

1 Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016 Apr 13;374(2065):20150202.

2 Datacamp - Principal Component Analysis in R Tutorial

3 StackExchange - Making sense of principal component analysis, eigenvectors & eigenvalues

4 Scree plot

5 sthda - Principal Component Methods in R: Practical Guide

6 Does the sign of scores or of loadings in PCA or FA have a meaning? May I reverse the sign?

7 Alex Dmitrienko; Christy Chuang-Stein; Ralph B. D’Agostino (2007). Pharmaceutical Statistics Using SAS: A Practical Guide. SAS Institute. p. 380. ISBN 978-1-59994-357-2.