Introduction
Karl Pearson invented PCA in 1901 in his paper ["On lines and planes of closest fit to systems of points in space"](https://www.tandfonline.com/doi/abs/10.1080/14786440109462720). Harold Hotelling later independently developed and named the technique in the 1930s.
Principal component analysis (PCA) is a technique for transforming the original set of variables into a smaller set of linear combinations of those variables, so that most of the statistical information (variance) in the original data is preserved by those linear combinations. It accomplishes this by generating new uncorrelated variables that successively maximize variance. Finding such new variables, known as principal components (PCs), amounts to solving an eigenvalue/eigenvector problem.1
An eigenvector is a direction of the transformation of the original dataset along which no information in the data is lost, and its eigenvalue gives the amount of variance in the data that is explained in that direction. So the eigenvector corresponding to the largest eigenvalue is the first principal component.
Geometrically speaking, PCA is a linear transformation of a given dataset \(X\) that contains values of a given number of variables (or coordinates) for a given number of observations. This linear transformation fits the dataset to a new coordinate system in such a way that the first coordinate (component) has the most variation, and each subsequent coordinate is orthogonal to the previous ones and has less variance.2,3
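To make this concrete, here is a minimal sketch (simulated data, not the dataset analysed later) showing that the principal components can be read off directly from the eigendecomposition of the covariance matrix -
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)  # two correlated variables
X  <- cbind(x1, x2)

e <- eigen(cov(X))   # eigendecomposition of the covariance matrix
e$values             # variance explained along each direction
e$vectors[, 1]       # direction of the first principal component

prcomp(X)$rotation   # prcomp() returns the same directions (possibly with flipped signs)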
Underlying Theory

Pre-requisites
Before diving into PCA, there are a few things one should be familiar with -
- Linear Algebra
  - Trace of a matrix and its properties
  - Determinant of a matrix
  - Diagonalization of a matrix
  - Orthogonal transformation
  - Eigenvalues and eigenvectors
- Statistics
  - Mean vector
  - Variance-covariance matrix
  - Correlation matrix
Theory
Let \(p\) be the number of variables, \(n\) be the number of observations, \(X\) be a \(p \times n\) data matrix, the mean vector \(\bar{X}\) be a zero vector, and the variance-covariance matrix \(\Sigma\) be a real positive definite matrix, or at least positive semi-definite of rank \(r \le p\).
Let \(\lambda_i,\ i = 1, 2, \dots, p\) be the eigenvalues of \(\Sigma\), where \(\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_p\), and let their corresponding eigenvectors be \(\gamma_1, \gamma_2, \gamma_3, \dots, \gamma_p\).
Let \(D\) be the diagonal matrix whose diagonal elements are the eigenvalues - $$ D =\begin{bmatrix} \lambda_1 & 0 & \dots & 0\newline 0 & \lambda_2 & \dots & 0\newline \vdots & \vdots & \ddots & \vdots\newline 0 & 0 & \dots & \lambda_p \end{bmatrix}_{p\ \times\ p} $$
and \(\Gamma\) be the matrix of eigenvectors, $$ \Gamma =\begin{bmatrix} \gamma_1,\ \gamma_2,\ \dots\ ,\gamma_p\ \end{bmatrix}_{p\ \times\ p} $$
The matrix \(\Gamma\) diagonalizes \(\Sigma\) such that \(D = \Gamma'\Sigma\Gamma\), or \(\Sigma = \Gamma D \Gamma'\).
Since \(\Gamma\) is an orthogonal matrix, \(\Gamma\Gamma' = \Gamma'\Gamma = I\).4
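These identities are easy to check numerically; below is a small sketch with a simulated covariance matrix (all object names are illustrative) -
set.seed(2)
Sigma <- cov(matrix(rnorm(50 * 3), ncol = 3))   # a symmetric positive definite matrix

e     <- eigen(Sigma)
Gamma <- e$vectors          # matrix of eigenvectors
D     <- diag(e$values)     # diagonal matrix of eigenvalues

all.equal(t(Gamma) %*% Sigma %*% Gamma, D)                        # D = Gamma' Sigma Gamma
all.equal(Gamma %*% D %*% t(Gamma), Sigma)                        # Sigma = Gamma D Gamma'
all.equal(t(Gamma) %*% Gamma, diag(3), check.attributes = FALSE)  # Gamma' Gamma = I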
We consider an orthogonal transformation (which preserves all the information) of \(X\) into \(Y\) given by -
\(Y = \Gamma'X\)
or,
$$ \begin{bmatrix} Y_1\newline Y_2\newline \vdots\newline Y_p \end{bmatrix} = \Gamma'\begin{bmatrix} X_1\newline X_2\newline \vdots\newline X_p \end{bmatrix} $$
Here, \(Y_1, Y_2, \dots, Y_p\) are the \(p\) components of \(Y\) and are called the principal components.
For example, if \(X = [x_1\ \ x_2]\) is a two-variable dataset and the first eigenvector is \([a\ \ b]'\), then the first principal component will be \(P_1 = x_1 a + x_2 b\).
Similarly, if the second eigenvector is \([c\ \ d]'\), then the second principal component will be \(P_2 = x_1 c + x_2 d\).
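A tiny worked sketch of these formulas (the numbers are made up purely for illustration) -
x1 <- c(2.0, 3.5, 4.1, 5.2)   # illustrative values
x2 <- c(1.1, 2.9, 3.8, 4.9)

S  <- cov(cbind(x1, x2))
g1 <- eigen(S)$vectors[, 1]   # first eigenvector  [a b]'
g2 <- eigen(S)$vectors[, 2]   # second eigenvector [c d]'

P1 <- x1 * g1[1] + x2 * g1[2] # P1 = x1*a + x2*b
P2 <- x1 * g2[1] + x2 * g2[2] # P2 = x1*c + x2*d
cbind(P1, P2)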
Preserving Variance
How does the technique preserve the total variation after the transformation?
Since \(\Sigma\) is the variance-covariance matrix of \(X\), we can measure the following:
- the overall measure of the variability of \(X\) as \(\operatorname{trace}(\Sigma)\), or
- the generalized variance of \(X\) as \(|\Sigma|\)
Now, \(\operatorname{trace}(\Sigma)\)
\(= \operatorname{trace}(\Gamma D \Gamma')\)
\(= \operatorname{trace}(\Gamma'\Gamma D)\) [Cyclic property]
\(= \operatorname{trace}(D)\) [Since \(\Gamma\) is an orthogonal matrix]
\(= \sum_{i=1}^{p}\lambda_i\)
Again, \(|\Sigma|\)
\(= |\Gamma D \Gamma'|\)
\(= |\Gamma|\,|D|\,|\Gamma'|\)
\(= |D|\)
\(= \prod_{i=1}^{p}\lambda_i\)
Now we have the transformation \(Y = \Gamma'X\).
Here, \(E[Y] = E[\Gamma'X] = \Gamma'E[X] = 0\)
and \(V(Y) = E[YY'] = E[\Gamma'XX'\Gamma] = \Gamma'E[XX']\Gamma = \Gamma'\Sigma\Gamma = D\)
Thus, for \(Y\), the overall measure of variability is \(\operatorname{trace}(D) = \sum_{i=1}^{p}\lambda_i\)
or, the generalized variance is \(|D| = \prod_{i=1}^{p}\lambda_i\)
This shows that the total variation of \(X\) remains unchanged in \(Y\).
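These two results are easy to verify numerically; a small sketch with simulated data (object names are illustrative) -
set.seed(4)
X <- matrix(rnorm(200 * 4), ncol = 4)   # n = 200 observations, p = 4 variables
S <- cov(X)
e <- eigen(S)

sum(diag(S)); sum(e$values)    # trace(Sigma) equals the sum of the eigenvalues
det(S); prod(e$values)         # |Sigma| equals the product of the eigenvalues

Y <- X %*% e$vectors           # the orthogonal transformation
sum(diag(cov(Y)))              # total variance is unchanged after the transformation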
Measuring the Variance Accounted for by PCs
How do we know the percentage of variation explained by a principal component?
Since \(\Sigma\) is a positive definite matrix, the strict positivity of the \(\lambda_i\) is guaranteed.
Now, \(Y_1\) is the first principal component corresponding to the first eigenvalue \(\lambda_1\), and similarly \(Y_i\) is the i-th principal component corresponding to the i-th eigenvalue \(\lambda_i\).
The percentage of variation of the original data \(X\) explained by the i-th principal component is calculated as \(\frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j} \times 100\%\).
Since, at the beginning of the theory development, the \(\lambda_i\) were ordered such that \(\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_p\), the first principal component explains the largest amount of variation.
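In R this is a one-line computation; a minimal sketch with made-up eigenvalues (in the worked example below, eigcorr$values plays this role) -
lambda <- c(4, 2, 1, 0.5, 0.3, 0.15, 0.05)     # illustrative eigenvalues
round(lambda / sum(lambda) * 100, 2)           # % of variance explained by each PC
round(cumsum(lambda / sum(lambda)) * 100, 2)   # cumulative percentage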
Practical Demonstration
Dataset
Importing a dataset -
data <- readxl::read_excel("data/1.2 data.xlsx")
data$Family <- ifelse(data$Family==0, "Serranidae",
                      ifelse(data$Family==1, "Carangidae", "Sparidae"))
df_unstd <- data[,-8]
Data description:
- x1: centrum’s length
- x2: anterior centrum’s width
- x3: posterior centrum’s width
- x4: centrum’s height
- x5: distance between the base of the neural spine and the posterior face of the centrum
- x6: distance between the base of the haemal spine and the posterior face of the centrum
- x7: length of the anterior zygapophysis
- Family: family of fish specimens; 0 - Serranidae, 1 - Carangidae, 2 - Sparidae
Correlation plots of the original dataset and of the principal-component-transformed dataset (we'll see the transformation later) -
library(ggcorrplot)

pc_prcomp <- prcomp(df_unstd, center = TRUE, scale. = TRUE)
corr <- cor(df_unstd)

a <- ggcorrplot(corr, show.legend = F,
                hc.order = F, type = "upper", lab = T,
                digits = 2, method = "square",
                colors = c("#6D9EC1", "white", "#E46726")) +
  labs(title = "Correlations Matrix: Original dataset")

b <- ggcorrplot(cor(pc_prcomp$x), show.legend = F,
                hc.order = F, type = "upper", lab = T,
                digits = 2, method = "square",
                colors = c("#6D9EC1", "white", "#E46726")) +
  labs(title = "Correlations Matrix: Principal Component Transformed")

ggpubr::ggarrange(a, b,
                  ncol = 2, nrow = 1)
The idea is that if many variables correlate with one another, they will all contribute strongly to the same principal component.5
Manual Calculation
The principal components can be found using either of the following:
- Variance-covariance matrix
- Correlation matrix
If the data is not standardized, it is suggested to use the correlation matrix in principal component analysis, because the variance of a variable is affected by its scale of measurement.
But if the data is standardized, the variance-covariance matrix and the correlation matrix give the same principal components. So it is recommended to standardize the data before performing principal component analysis.
Standardization is done using the following formula - $$ Z = \frac{X-\mu}{\sigma} $$
Calculating mean vector -
mn <- colMeans(df_unstd) # Mean vector
mn
# x1 x2 x3 x4 x5 x6 x7
# 7.297222 5.811111 5.672222 5.672222 2.552778 3.677778 2.369444
Variance-covariance matrix -
var(df_unstd)
# x1 x2 x3 x4 x5 x6 x7
# x1 1.3442778 0.90374603 0.93420635 0.9853492 0.35529365 0.52679365 0.3961984
# x2 0.9037460 1.28387302 1.17974603 1.4508889 -0.09203175 0.57996825 0.1774921
# x3 0.9342063 1.17974603 1.35520635 1.5217778 0.06007937 0.58936508 0.2674127
# x4 0.9853492 1.45088889 1.52177778 2.0803492 -0.11277778 0.79450794 0.2434127
# x5 0.3552937 -0.09203175 0.06007937 -0.1127778 0.45627778 0.04749206 0.2662302
# x6 0.5267937 0.57996825 0.58936508 0.7945079 0.04749206 0.38234921 0.1581587
# x7 0.3961984 0.17749206 0.26741270 0.2434127 0.26623016 0.15815873 0.2718968
Extracting the standard deviations -
vardf_unstd <- diag(var(df_unstd))^0.5
vardf_unstd
# x1 x2 x3 x4 x5 x6 x7
# 1.1594299 1.1330812 1.1641333 1.4423416 0.6754834 0.6183439 0.5214373
Subtracting from each column its mean and dividing by its standard deviation to obtain the standardized dataset -
df_std <- as.matrix((df_unstd - rep(mn,each = nrow(df_unstd)))/rep(vardf_unstd, each = nrow(df_unstd)))
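As a quick sanity check (a small sketch), the same standardized matrix should be obtained with base R's scale() -
df_std_check <- scale(df_unstd)   # base R standardization (center and scale)
all.equal(df_std, df_std_check, check.attributes = FALSE)   # should be TRUE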
Mean vector of the new dataset -
round(colMeans(df_std))
# x1 x2 x3 x4 x5 x6 x7
# 0 0 0 0 0 0 0
Using Correlation Matrix
Using correlation matrix of standardized data to calculate eigenvalues -
eigcorr <- eigen(cor(df_std)) # using correlation matrix on standardized data
output <- cbind(eigenvalue = round(eigcorr$values, 2),
                cumul_percentage = round(cumsum(eigcorr$values/sum(eigcorr$values))*100, 2))
output
It is seen that the first principal component explains 64.32% of the total variation in the data, the second principal component explains 25.35%, and the third explains 4.04%.
Thus the first two principal components together explain 89.66% of the total variance.
Using correlation matrix of unstandardized data to calculate eigenvalues -
eigcorr <- eigen(cor(df_unstd)) # using correlation matrix on unstandardized data
output <- cbind(eigenvalue = round(eigcorr$values, 2),
                cumul_percentage = round(cumsum(eigcorr$values/sum(eigcorr$values))*100, 2))
output
So, it is verified that the principal components are not affected by the scales if the correlation matrix is used.
The eigenvectors -
round(eigcorr$vectors,3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] -0.402 0.223 0.778 0.114 0.097 0.398 -0.055
# [2,] -0.423 -0.263 0.218 -0.253 0.268 -0.726 -0.202
# [3,] -0.438 -0.137 -0.157 -0.619 -0.225 0.246 0.519
# [4,] -0.423 -0.271 -0.352 0.071 -0.176 0.340 -0.688
# [5,] -0.110 0.706 -0.027 -0.158 -0.574 -0.297 -0.213
# [6,] -0.439 -0.084 -0.139 0.713 -0.251 -0.210 0.407
# [7,] -0.291 0.532 -0.423 0.027 0.667 0.072 0.053
So the principal component transformation of the original data will be -
library(knitr)
library(kableExtra)

princr_df <- df_std %*% eigcorr$vectors

# View the principal components
scroll_box(height = "300px",
           kable_styling(
             kable(princr_df, digits = 2, format = "html", col.names = paste0("PC", 1:7)),
             bootstrap_options = c("striped", "hover"),
             full_width = T, font_size = 12, position = "left"))
Using Variance-Covariance Matrix
Using variance-covariance matrix of standardized data to calculate eigenvalues -
eigcov <- eigen(cov(df_std)) # using variance-covariance matrix on standardized data
output <- cbind(eigenvalue = round(eigcov$values, 2),
                cumul_percentage = round(cumsum(eigcov$values/sum(eigcov$values))*100, 2))
output
Using variance-covariance matrix of unstandardized data to calculate eigenvalues -
eigcov <- eigen(cov(df_unstd)) # using variance-covariance matrix on unstandardized data
output <- cbind(eigenvalue = round(eigcov$values, 2),
                cumul_percentage = round(cumsum(eigcov$values/sum(eigcov$values))*100, 2))
output
So the principal components are influenced by the scales. Hence, it is recommended to standardize the data before calculating the principal components.
Using prcomp()
To make things easier in R, there are two built-in functions to perform PCA - prcomp() and princomp(). The function princomp() uses the spectral decomposition approach, while the functions prcomp() and PCA() [from the FactoMineR package] use singular value decomposition (SVD).
Note that there are two general approaches to performing PCA in R (a quick comparison of the two functions follows the list below):
1. Spectral decomposition which examines the covariances / correlations between variables
2. Singular value decomposition which examines the covariances / correlations between individuals
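As a quick sketch of the two approaches on this dataset (correlation-based in both cases), the resulting variances should essentially coincide, with eigenvectors possibly differing only in sign -
pc_svd  <- prcomp(df_unstd, center = TRUE, scale. = TRUE)   # SVD approach
pc_spec <- princomp(df_unstd, cor = TRUE)                   # spectral decomposition approach

round(pc_svd$sdev^2, 4)    # variances of the PCs from prcomp()
round(pc_spec$sdev^2, 4)   # variances of the PCs from princomp()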
Due to its slightly higher numerical accuracy, prcomp() is preferred over princomp(). Standardization can be done by passing two arguments to the prcomp() function, center = TRUE and scale. = TRUE -
pc_prcomp <- prcomp(df_unstd, center = TRUE, scale. = TRUE)
summary(pc_prcomp)
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7
# Standard deviation 2.1218 1.3320 0.53204 0.43966 0.38258 0.25684 0.18685
# Proportion of Variance 0.6432 0.2535 0.04044 0.02761 0.02091 0.00942 0.00499
# Cumulative Proportion 0.6432 0.8966 0.93706 0.96468 0.98559 0.99501 1.00000
The result is consistent with the manual calculation.
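This consistency can also be checked programmatically (a small sketch; individual components may differ in sign, so absolute values are compared for the scores) -
# eigenvalues from the manual calculation vs. variances of prcomp()'s components
all.equal(eigcorr$values, pc_prcomp$sdev^2)

# principal component scores agree up to sign
all.equal(abs(princr_df), abs(pc_prcomp$x), check.attributes = FALSE)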
The function prcomp() returns the following elements -
names(pc_prcomp)
# [1] "sdev" "rotation" "center" "scale" "x"
x denotes the transformed dataset/principal components -
pc_prcomp$x
sdev denotes the standard deviations of the principal components -
pc_prcomp$sdev^2 # eigenvalues (variances)
# [1] 4.50218058 1.77420306 0.28306214 0.19330296 0.14637109 0.06596874 0.03491141
apply(pc_prcomp$x, 2, var) # variance of the principal components
# PC1 PC2 PC3 PC4 PC5 PC6 PC7
# 4.50218058 1.77420306 0.28306214 0.19330296 0.14637109 0.06596874 0.03491141
In rotation, the columns are the eigenvectors -
pc_prcomp$rotation
# PC1 PC2 PC3 PC4 PC5 PC6
# x1 -0.4015216 -0.22330581 0.77779224 0.11376703 -0.09709303 0.39818828
# x2 -0.4228354 0.26263576 0.21829956 -0.25341590 -0.26839234 -0.72641133
# x3 -0.4382952 0.13669090 -0.15742089 -0.61921884 0.22529984 0.24584891
# x4 -0.4229505 0.27070943 -0.35156231 0.07051460 0.17588063 0.33970948
# x5 -0.1100766 -0.70615050 -0.02748501 -0.15839631 0.57425067 -0.29742388
# x6 -0.4385403 0.08385339 -0.13877297 0.71318889 0.25070443 -0.21049616
# x7 -0.2908030 -0.53245569 -0.42307656 0.02655884 -0.66650427 0.07182518
# PC7
# x1 -0.05504745
# x2 -0.20161965
# x3 0.51942077
# x4 -0.68769497
# x5 -0.21252061
# x6 0.40693163
# x7 0.05327880
Individual Points Plot
If we visualize the data with different colors for the different levels of Family, we will be able to identify the families as clusters, because each family is likely to have specific characteristics that are now almost entirely captured by the first two principal components -
library(tidyverse)

data.frame(pc_prcomp$x, Family = data$Family) %>%
  ggplot(aes(x = PC1, y = PC2, color = Family)) +
  geom_point(size = 2) + theme_bw() +
  labs(title = "First 2 principal components",
       x = "Principal Component 1", y = "Principal Component 2",
       col = "Family")
3D Plot
The following 3D plot shows the points for the first 3 principal components -
library(plotly)

df2 <- data.frame(pc_prcomp$x, Family = data$Family)

df2 %>%
  plot_ly(x = ~PC1, y = ~PC2, z = ~PC3, color = ~Family) %>%
  add_markers(
    hovertemplate = paste(
      '<br><b>PC1</b>: %{x:.2f}',
      '<br><b>PC2</b>: %{y:.2f}',
      '<br><b>PC3</b>: %{z:.2f}</b>')) %>%
  layout(
    # setting axis titles
    scene = list(xaxis = list(title = 'PC1'),
                 yaxis = list(title = 'PC2'),
                 zaxis = list(title = 'PC3')),
    legend = list(title = list(text = '<b> Family </b>'),
                  orientation = "h", # show entries horizontally
                  xanchor = "center", # use center of legend as anchor
                  x = 0.5)) # put legend in center of x-axis
Scree plot
A scree plot is useful for determining the number of principal components to retain in a principal component analysis, or the number of factors to keep in an exploratory factor analysis -
Using the manual calculation’s output eigcorr$values to draw a scree plot -
plot(x = 1:7, y = eigcorr$values, type = "b", pch = 19,
     col = "#952C22", main = "Scree plot", las = 1,
     ylab = "Variances Explained", xlab = "",
     xaxt = "n" # to hide x-axis labels
)
axis(side = 1, at = 1:7, labels = paste0("PC", 1:7))
Raymond B. Cattell introduced the scree plot in 1966. A scree plot is used to identify the components that are statistically meaningful; the procedure is also known as the scree test.
According to the scree test, the components to the left of the elbow are retained as significant.6
Another way to select the components to retain is to inspect the variances of the principal components: if a component's variance is greater than the average of all the variances, the component is retained. For standardized data this average is -
mean(eigcorr$values)
# [1] 1
The average will be 1 irrespective of whether the variance-covariance matrix or the correlation matrix was used to obtain the eigenvalues and eigenvectors, as long as the data was standardized before calculating them.
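Applying this rule directly to the eigenvalues computed earlier (a short sketch) -
which(eigcorr$values > mean(eigcorr$values))   # indices of the components to retain
# with these data, only the first two components exceed the average variance of 1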
The scree plot -
plot(x = 1:7, y = eigcorr$values, type = "b", pch = 19,
     col = "#952C22", main = "Scree plot", las = 1,
     ylab = "Variances Explained", xlab = "",
     xaxt = "n" # to hide x-axis labels
)
axis(side = 1, at = 1:7, labels = paste0("PC", 1:7))
abline(h = 1, col = "#2E7BF9")
So the first two principal components will be kept.
Extra:
To make a prettier scree plot using ggplot2, with a secondary y-axis that shows the percentage of variance explained -
library(tidyverse)

# creating a tidy data frame
data.frame(PC = paste0("PC", 1:length(pc_prcomp$sdev)),
           Variance = pc_prcomp$sdev^2) %>%
  # using the data frame to make the plot
  ggplot(aes(x = PC, y = Variance, group = 1)) +
  geom_point(size = 1, col = "red") +
  geom_line(col = "red") + theme_bw() +
  scale_y_continuous(
    name = "Variance",
    # adding a secondary axis
    sec.axis = sec_axis(
      trans = ~ . / sum((pc_prcomp$sdev)^2),
      # name of secondary axis
      name = "Percentage of Variance Explained",
      labels = function(x) {
        paste0(round(x * 100), "%")
      }
    )
  ) +
  geom_line(y = 1, color = "#2E7BF9", linetype = "dashed") +
  labs(title = "Scree plot", x = "Principal Components")
Constructing a scree plot using the function fviz_screeplot() (an alias of fviz_eig()) from the factoextra package -
library(factoextra)
fviz_screeplot(pc_prcomp, addlabels = T, linecolor = "red",
choice = "eigenvalue", geom = "line") +
geom_line(y = 1, color = "#2E7BF9", linetype="dashed") +
xlab("Principal Components")
FactoMineR’s PCA()
Aside from prcomp(), the function PCA() from the FactoMineR package can be used, which extracts the principal components from the correlation matrix -
library(FactoMineR)
pc_fmr <- PCA(df_unstd, graph = F, scale.unit = T)
To see the eigenvalues of the components and their corresponding information -
pc_fmr$eig
# eigenvalue percentage of variance cumulative percentage of variance
# comp 1 4.50218058 64.3168655 64.31687
# comp 2 1.77420306 25.3457580 89.66262
# comp 3 0.28306214 4.0437449 93.70637
# comp 4 0.19330296 2.7614709 96.46784
# comp 5 0.14637109 2.0910156 98.55885
# comp 6 0.06596874 0.9424106 99.50127
# comp 7 0.03491141 0.4987345 100.00000
Plotting the individual points of the first two principal components in a scatterplot -
plot.PCA(pc_fmr, axes = c(1,2))
Biplot
We have already constructed the PCs using different functions in R. The signs of the components are not the same in all cases.
If we construct the biplots for the two functions used to calculate the PCs (base R's prcomp() and FactoMineR's PCA()) using the function fviz_pca_biplot() from the factoextra package, we get -
ggpubr::ggarrange(
fviz_pca_biplot(pc_fmr, # using - FactoMineR's PCA()
repel = TRUE, # to avoid overplotting
axes = c(1, 2), # To plot 1st and 2nd PC
col.var = "#2E9FDF", # Variables color
col.ind = "#696969", # Individuals color
title = "PCA - Using FactoMineR's PCA()"
),
fviz_pca_biplot(pc_prcomp, # using - base R's prcomp()
repel = TRUE, # to avoid overplotting
axes = c(1, 2), # To plot 1st and 2nd PC
col.var = "#2E9FDF", # Variables color
col.ind = "#696969", # Individuals color
title = "PCA - Using base R's prcomp()"),
ncol = 1, nrow = 2)
Here, if a PC comes out positive or negative depending on the function used, it just means that the eigenvector is projected pointing in one direction or 180 degrees away in the opposite direction. The interpretation remains the same regardless of the sign, because the variance explained is unchanged.7
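A quick numerical illustration of this point (a small sketch): flipping the sign of a component changes nothing about the variance it explains -
var(pc_prcomp$x[, "PC1"])    # variance explained by PC1
var(-pc_prcomp$x[, "PC1"])   # identical after reversing the sign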
Another visualization using the ggbiplot() function from ggbiplot package -
# remotes::install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(pc_fmr, groups=as.factor(data$Family), ellipse = TRUE) +
labs(title = "Biplot: Using the first two PCs",
col = "Family") + theme_bw()
References
- Textbook: Bhuyan KC. Multivariate Analysis and Its Applications. New Central Book Agency; 2005.
1. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016;374(2065):20150202. https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202
2. DataCamp - Principal Component Analysis in R Tutorial. https://www.datacamp.com/community/tutorials/pca-analysis-r
3. StackExchange - Making sense of principal component analysis, eigenvectors & eigenvalues. https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
4. Math StackExchange - Are all eigenvectors of any matrix always orthogonal? https://math.stackexchange.com/questions/142645/are-all-eigenvectors-of-any-matrix-always-orthogonal
5. STHDA - Principal Component Methods in R: Practical Guide
6. StackExchange - Does the sign of scores or of loadings in PCA or FA have a meaning? May I reverse the sign? https://stats.stackexchange.com/questions/88880/does-the-sign-of-scores-or-of-loadings-in-pca-or-fa-have-a-meaning-may-i-revers
7. Dmitrienko A, Chuang-Stein C, D'Agostino RB. Pharmaceutical Statistics Using SAS: A Practical Guide. SAS Institute; 2007. p. 380. ISBN 978-1-59994-357-2.