Pierre Legendre, Département de sciences biologiques, Université de Montréal
http://www.NumericalEcology.com/
1.1. Principal component analysis
© Pierre Legendre 2017
Outline of the presentation
1. PCA algebra and computation steps
2. Data transformations before PCA
3. Scalings in PCA
4. Equilibrium contribution of variables
5. The meaningful components
6. Algorithms for PCA
7. Some applications of PCA in ecology
8. References
PCA algebra and computation steps
This section of the ordination course will describe the most fundamental of all ordination methods. It is called Principal component analysis (PCA).
In a sense, it is “the mother” of the other ordination methods that we will study in later sections of the course because these other methods will try to produce the same type of ordination plots as PCA using data that are not quite appropriate, or that are inappropriate for PCA.
I will first describe the algebra and computation steps of PCA.
Principal component analysis (PCA) is an ordination method preserving the Euclidean distance among the objects.
Definition of PCA
PCA is only applicable to multivariate quantitative data.
PCA: Computation steps
For the means, variances and covariances to make sense, the variables must be quantitative in PCA.
Multiply the centred data by the eigenvector matrix U to obtain the positions of the objects in the PCA ordination plot:

F = [y − ȳ] U
Matrix U contains direction cosines, which are cosines of the angles between the variables and the PCA axes.
R code for this example –
Y <- matrix(c(2,3,5,7,9,1,4,0,6,2),5,2)
Y.c <- scale(Y, center=TRUE, scale=FALSE)
Y.eig <- eigen(cov(Y.c))
U <- Y.eig$vectors
F <- Y.c %*% U
biplot(F, U)
Note – The axes may be inverted because of an arbitrary decision the software must make during the calculation: it has to decide which end of each eigenvector corresponds to the positive direction.
This decision is of no fundamental consequence for the ordination:
the distances among objects are the same if any or all of the axes are inverted.
After inversion of the signs along one or several axes, PCA still preserves the Euclidean distances among the objects.
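The deck's example is computed in R; as a cross-check, here is an equivalent NumPy sketch of the same computation steps (centre, eigen-decompose the covariance matrix, project), confirming that inverting the sign of an axis leaves all inter-object Euclidean distances unchanged. The matrix Y is the deck's 5×2 numerical example; everything else is illustration only.

```python
import numpy as np

# The deck's 5x2 example (R: matrix(c(2,3,5,7,9,1,4,0,6,2),5,2), column-major)
Y = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)
Yc = Y - Y.mean(axis=0)                      # centre the variables
evals, U = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(evals)[::-1]              # eigenvalues in decreasing order
evals, U = evals[order], U[:, order]
F = Yc @ U                                   # object positions on the PCA axes

# Flip the sign of axis 2: all pairwise Euclidean distances are unchanged.
F_flipped = F * np.array([1.0, -1.0])
d  = np.linalg.norm(F[:, None] - F[None, :], axis=2)
d2 = np.linalg.norm(F_flipped[:, None] - F_flipped[None, :], axis=2)
print(np.allclose(d, d2))                    # True
```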
[Figure: (a) scatter diagram of Var.1 against Var.2; (b) the same data after centring the variables.]
Animation: the PCA example in 4 steps

[Figure: (c) computation of the PCA axes by rotating the graph; (d) the resulting PCA biplot showing objects 1–5 and Var 1, Var 2.]
Data transformations before PCA
Field data often need to be transformed before they are analysed by PCA. We will see why.
PCA is a form of variance decomposition

Properties of variances
1. The sum of the eigenvalues is equal to the sum of the variances of the variables. Yes, variances have physical dimensions!
2. Variables have physical dimensions (e.g. altitude is measured in m), and the variance of a variable has the same dimension as the variable, squared. For example, the variance of altitude is expressed in m2.
3. The variances of all variables subjected together to PCA must have the same physical dimensions; otherwise, we could not add them into a sum to be decomposed into eigenvalues during PCA.
Make physical variables dimensionless

• When the input variables do not all have the same physical dimension, they must be subjected to standardization:

  z_i = (y_i − ȳ) / s_y

  Input variable standardization is an option of PCA functions.

• Ranging is also sometimes used:

  y'_i = (y_i − y_min) / (y_max − y_min)

Both methods make the transformed variables dimensionless.
• Standardized variables have a mean of 0 and a variance of 1.
• The sum of the variances of p standardized variables is p.
• Hence, the sum of the eigenvalues of a matrix of standardized variables (argument scale=TRUE) is also p.
# PCA, spider environmental data, using vegan's rda()
spiders.env <- read.table(file.choose())
dim(spiders.env)
# [1] 28 15   <= 28 sites, 15 variables
rda.spiders.env <- rda(spiders.env, scale=TRUE)
sum(rda.spiders.env$CA$eig)
# [1] 15   <= Sum of the PCA eigenvalues
The data file used here is ‘Spiders_env_(28x15).txt’, containing 15 variables.
PCA produces the best dispersion of the points when the input variables have symmetrical distributions. The ideal situation, rarely achieved, is when their distribution is multivariate normal.
PCA graphs are easier to interpret when the distributions are at least symmetrical. Variables may be transformed to make their distributions more symmetrical. The normalizing transformations most often used are the square root (exponent ½), double square root (exponent ¼), and log.
Any log base can be used for transformation. For a data vector vec, log10(vec) is a linear transformation of loge(vec); the two transformed vectors are perfectly correlated.
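A quick numeric check of that claim (a NumPy sketch, since log10(x) = ln(x)/ln(10) is a constant multiple of ln(x)): the two transformed vectors have correlation 1.

```python
import numpy as np

# Any strictly positive data vector will do; this one is made up.
vec = np.array([1.0, 2.5, 7.0, 30.0, 450.0])

# log10 and natural log differ only by the constant factor 1/ln(10),
# so the two transformed vectors are perfectly correlated.
r = np.corrcoef(np.log10(vec), np.log(vec))[0, 1]
print(round(r, 10))   # 1.0
```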
“Normalizing” transformations
Example in two dimensions: random lognormal versus normal data. => Which graph shows the best dispersion of the points?
# Generate a matrix of random lognormal data
# with 50 rows and 5 columns
n <- 50 ; p <- 5
mat2.2 <- matrix(exp(rnorm((n*p), mean=0, sd=2.5)), n, p)
colnames(mat2.2) <- paste("Var", 1:5, sep=".")
# Apply a log transformation to the data
mat2.2.log <- log(mat2.2)
# Plot columns 1 and 2 of these data files
[Figure: scatter diagrams of Var.1 against Var.2 for the random lognormal data (left) and, after log transformation, the resulting random normal data (right).]
=> Which graph shows the best dispersion of the points?
Transformations for community composition data
1. Transformation to reduce the asymmetry of species distributions: y' = log(y + 1).
   For community data, we use y' = log(y + 1) because the lowest value in community data matrices is 0: log(0) = –Inf, but log(0 + 1) = 0. The R function to carry out this transformation is log1p(x).
2. Before PCA, community composition data can also be transformed using transformations that are appropriate for the study of beta diversity =>
The Hellinger and chord transformations (Legendre & Gallagher, 2001) are appropriate before PCA because …
• Chord transformation:
  chord-transformed data + Euclidean distance => chord distance

  y'_ij = y_ij / sqrt( sum_{j=1}^{p} y_ij^2 )

• Hellinger transformation:
  Hellinger-transformed data + Euclidean distance => Hellinger distance

  y'_ij = sqrt( y_ij / y_i+ )

  where y_i+ is the sum of the values in row (site) i.
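The two transformations can be sketched in a few lines. The deck applies them in R with vegan's decostand(); this NumPy version (function names are mine) verifies a useful property: every row of the chord-transformed and of the Hellinger-transformed data has Euclidean norm 1, which is why the Euclidean distance on the transformed data behaves so well.

```python
import numpy as np

def chord(Y):
    # Divide each row (site) by its Euclidean norm.
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def hellinger(Y):
    # Divide each value by its row sum, then take the square root.
    return np.sqrt(Y / Y.sum(axis=1, keepdims=True))

# Tiny made-up community matrix: 3 sites, 3 species.
Y = np.array([[0., 4., 8.], [1., 1., 1.], [10., 0., 0.]])

# Rows of chord-transformed data have norm 1 by construction;
# rows of Hellinger-transformed data do too (the y_ij / y_i+ sum to 1).
print(np.allclose(np.linalg.norm(chord(Y), axis=1), 1.0))      # True
print(np.allclose(np.linalg.norm(hellinger(Y), axis=1), 1.0))  # True
```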
The Hellinger and chord distances are appropriate for ordination of community data because they have 9 properties that are important for beta diversity studies.
These two distances will be described in the course on dissimilarities and their properties will be shown in the course on beta diversity.
Example: the spider data set (28 sites, 12 species) — PCA of the original abundance and Hellinger-transformed data.

spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")
# PCA using function prcomp() of {stats}
pca.spiders <- prcomp(spiders)
pca.spiders.hel <- prcomp(spiders.hel)
# PCA biplots from prcomp {stats}
par(mfrow=c(1,2))
biplot(pca.spiders, scale=0)
biplot(pca.spiders.hel, scale=0)
Which PCA biplot shows the best dispersion of the points?
[Figure: PCA biplots of the original abundance data (left) and the Hellinger-transformed data (right), showing the 28 sites and 12 species.]
Scalings in PCA

The role of scalings is to transform the eigenvectors in matrix U into coordinates that are appropriate to represent the sites and the species in biplots. Different types of scalings are available in PCA.
PCA biplots are graphs in which objects and variables (descriptors) are represented together.
Biplots
Scalings in biplots
In biplots, objects and variables can be presented together in two different ways, called scalings:
• Scaling type 1: distance biplot, used when the interest is in the positions of the objects with respect to one another.
  Ø Plot matrix F to represent the objects and U for the variables.
• Scaling type 2: correlation biplot, used when the angular relationships among the variables are of primary interest.
  Ø Plot matrix G to represent the objects and Usc2 for the variables, where G = F Λ^(–1/2) and Usc2 = U Λ^(1/2).
The generally accepted rule for representing sites and variables (e.g. species) together in a PCA biplot is the following: in biplots, we can use matrices which, together, reconstruct the centred data Yc. Hence,
Ø in a distance biplot, we can use F and U together because F U' = Yc;
Ø in a correlation biplot, we can use G and Usc2 together because G Usc2' = (F Λ^(–1/2)) (U Λ^(1/2))' = F U' = Yc.
This biplot rule was proposed by K. Ruben Gabriel in 1971.
The biplot rule
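The biplot rule is easy to verify numerically. This NumPy sketch rebuilds the deck's 5×2 example, constructs G and Usc2 as defined above, and checks that both pairs of matrices reconstruct the centred data Yc exactly.

```python
import numpy as np

# The deck's 5x2 numerical example.
Y = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)
Yc = Y - Y.mean(axis=0)

evals, U = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

F = Yc @ U                               # scaling 1 object coordinates
G = F @ np.diag(evals ** -0.5)           # G   = F Lambda^(-1/2)
U_sc2 = U @ np.diag(evals ** 0.5)        # Usc2 = U Lambda^(1/2)

# Both products reconstruct the centred data: the biplot rule.
print(np.allclose(F @ U.T, Yc))          # True
print(np.allclose(G @ U_sc2.T, Yc))      # True
```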
Extended code for the numerical example, scalings 1 and 2 –
Y <- matrix(c(2,3,5,7,9,1,4,0,6,2),5,2)
Y.c <- scale(Y, center=TRUE, scale=FALSE)
Y.eig <- eigen(cov(Y.c))
k <- length(which(Y.eig$values > 1e-10))

# Scaling 1 (distance biplot)
U <- Y.eig$vectors
F <- Y.c %*% U
biplot(F, U, expand=1.5, xlim=c(-4,4), ylim=c(-4,4))
abline(h=0, v=0, lty=2, col="grey60")

# Scaling 2 (correlation biplot)
U.sc2 <- U %*% diag(Y.eig$values[1:k]^(0.5))
G <- F %*% diag(Y.eig$values[1:k]^(-0.5))
biplot(G, U.sc2, expand=1.3, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5))
abline(h=0, v=0, lty=2, col="grey60")
[Figure: distance biplot (scaling 1, left) and correlation biplot (scaling 2, right) of the numerical example, showing objects 1–5 and Var 1, Var 2.]
• In scaling 1 (distance biplot),
  Ø the sites have variances, along each axis (or principal component), equal to the axis eigenvalue (columns of F);
  Ø the eigenvectors (columns of U) are normed to lengths = 1;
  Ø the length (norm) of each species vector in the p-dimensional ordination space (rows of U) is 1.
• In scaling 2 (correlation biplot),
  Ø the sites have unit variance along each axis (columns of G);
  Ø the eigenvectors (columns of Usc2) are normed to lengths = sqrt(eigenvalues);
  Ø the norm of each species vector in the p-dimensional ordination space (rows of Usc2) is its standard deviation.
Mathematical relationships in scaled matrices
In scaling 1 (distance biplot),
1. Distances among objects approximate their Euclidean distances in full multidimensional space.
2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
3. Since descriptors have equal lengths of 1 in the full-dimensional space, the length of the projection of a descriptor in reduced space indicates how much it contributes to the formation of that space.
4. A scaling 1 biplot thus shows which variables contribute the most to the ordination in a few dimensions (see also section: Equilibrium contribution of variables).
5. The descriptor-axes are orthogonal (90°) to one another in multidimensional space. These right angles, projected in reduced space, do not reflect the variables' correlations.
Interpretation of relationships in biplots
Revise these relationships when you compute a PCA.
In scaling 2 (correlation biplot),
1. Distances among objects approximate their Mahalanobis distances in full multidimensional space.
2. Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
3. Since descriptors have lengths s_j in full-dimensional space, the length of the projection of a descriptor j in reduced space is an approximation of its standard deviation s_j. Note: s_j is 1 when the variables have been standardized.
4. The angles between descriptors in the biplot reflect their correlations.
5. When the distance relationships among objects are important for interpretation, this type of biplot is inadequate; a distance biplot should be used.
Equilibrium contribution of variables
If p variables contributed equally to all dimensions of a reduced PCA space (e.g. in 2 dimensions), we would say that their contributions are in equilibrium with respect to the axes.
For any set of p variables, we can draw a circle on the PCA scaling 1 biplot whose radius is equal to the length of the projection of a variable that would contribute equally to all axes of the reduced space. The logic is explained in the following slides =>

[Diagram: a descriptor of length 1 and its projection onto the 2-dimensional plane.]
In PCA scaling 1, the p-dimensional space preserves the Euclidean distance among objects. The descriptors all have lengths (or norms) of 1 in multidimensional space.
Lengths of descriptors in PCA space
Hence, the lengths of their projections in 2-dimensional plots, for example, can be compared:
Ø long arrows represent variables that contribute strongly to the axes of this 2-dimensional projection;
Ø short arrows represent variables that contribute less.
In PCA scaling 1, the variables are at right angles to one another in multivariate space. Projected in reduced 2-dimensional space, these angles look acute or obtuse, but they are still right angles.
A circle can be drawn on the scaling 1 biplot, corresponding to the hypothesis of equal contributions of all descriptors to the reduced space (e.g. in 2 dimensions):
Ø it is called the equilibrium circle of descriptors, or circle of equilibrium contribution.
Ø Its radius is sqrt(d/p), where
  d = dimension of the reduced space (usually, d = 2),
  p = dimensionality of the multivariate PCA space, which is the number of eigenvalues > 0; usually equal to the number of descriptors.
Equilibrium circle of descriptors
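The radius formula is simple enough to compute directly; this short sketch reproduces the two radii used in the deck's examples (p = 12 species and p = 3 descriptors, both with d = 2).

```python
import math

def equilibrium_radius(d, p):
    # Radius of the circle of equilibrium contribution: sqrt(d/p).
    return math.sqrt(d / p)

print(round(equilibrium_radius(2, 12), 3))   # 0.408  (spider example, 12 species)
print(round(equilibrium_radius(2, 3), 3))    # 0.816  (3-descriptor example)
```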
Radius (R) of the circle of equilibrium contribution: R = sqrt(d/p), where
  d = dimension of the reduced space (d = 2 in this example),
  p = number of descriptors in the PCA (p = 3).
With an equilibrium circle, the scaling 1 biplot shows the 6 species that contribute the most to the dispersion of the sites in reduced space. Angles among descriptors are not interpretable in that type of biplot.
PCA biplot, scaling 1, of the Hellinger-transformed spider data.
The circle of equilibrium contribution is shown in red. Its radius is sqrt(2/12) = 0.408.
Six species have arrows longer than the radius of the circle.
Drawn with function cleanplot.pca() of the Numerical ecology with R book.
[Figure: PCA biplot, scaling 1, of the Hellinger-transformed spider data, with the equilibrium circle (radius 0.408) drawn in red.]
In PCA scaling 2, the p-dimensional space preserves the Mahalanobis distance among objects, not the Euclidean distance. In that multivariate space,
• the variables are not at right angles to one another but at angles that reflect their correlations;
• the lengths of the variable vectors equal their standard deviations.

In scaling 2: no equilibrium circle
Ø The lengths of the variable projections in (for example) a 2-dimensional reduced space depend on:
  • their standard deviations, which are all different if the variables have not been standardized;
  • the angles of their projections, in Mahalanobis space, onto the 2-dimensional projection plane.
For these reasons, in scaling 2,
• the angles between descriptors reflect their correlations;
• no single value of equilibrium contribution applies to all descriptors;
• hence no circle of equilibrium contribution can be drawn in PCA scaling 2 biplots.
In a scaling 2 biplot, the angles between descriptors reflect their correlations. Species with long arrows separated by small angles may indicate species associations.
PCA biplot, scaling 2, of the Hellinger-transformed spider data.
Drawn with function cleanplot.pca() of the Numerical ecology with R book.
[Figure: PCA biplot, scaling 2, of the Hellinger-transformed spider data, showing the 28 sites and 12 species.]
Environmental variables must be standardized at the beginning of a principal component analysis. A circle of equilibrium contribution can be drawn in scaling 1. The circle helps determine the variables that contribute the most to the formation of the reduced space plane.
Environmental variables
# PCA, spider environmental data, using vegan's rda()
spiders.env <- read.table(file.choose())
rda.spiders.env <- rda(spiders.env, scale=TRUE)
par(mfrow=c(1,2))
cleanplot.pca(rda.spiders.env, scaling=1, opt=TRUE)
cleanplot.pca(rda.spiders.env, scaling=2, opt=TRUE)

The data file used here is 'Spiders_env_(28x15).txt', containing 15 variables. cleanplot.pca() is an R function from the book Numerical ecology with R.
• Distances among sites in the scaling 1 biplot reflect the distances in multivariate space. A circle of equilibrium contribution can be drawn.
• In the scaling 2 biplot, the angles between variables reflect their correlations.
[Figure: PCA biplots of the spider environmental data, scaling 1 (with equilibrium circle, left) and scaling 2 (right), showing the 28 sites and 15 variables.]
The meaningful components
A PCA produces an ordination in k dimensions, with
k ≤ min(p, n – 1)
Which of these dimensions should we look at and try to interpret?
PCA provides a description of multivariate data in a space of reduced dimensionality. It is not a test of statistical significance.
1 See also Legendre & Legendre (2012), Section 9.1.6, “The meaningful components”.
Yet, users of the method would like to know: how many axes should we look at? How many axes display more than random variation? Various criteria have been proposed 1.
1. Arbitrary decision — For example, display and examine the axes that represent, together, at least 75% of the total variation.
2. The Kaiser-Guttman criterion — Interpret the axes whose eigenvalues are larger than the mean of the eigenvalues. (For standardized data, the sum of the eigenvalues is the number of variables p, so that the mean eigenvalue is 1.)
Criteria to select the meaningful components
3. Compare the eigenvalues to the broken-stick model, a null model for random distribution of the variance among the axes.
Null model: break a stick of unit length into p parts by placing (p–1) breakpoints at random along it. Measure the piece lengths and place them in decreasing order. Repeat a large number of times. Compute the mean of the longest parts, the second longest, etc.
For a unit stick broken at random into p = 2, 3, … pieces, the expected values (E) of the relative lengths of the successively smaller pieces (j) are given by the following model equation:
E(j) = (1/p) · sum_{x=j}^{p} (1/x)

Several R packages provide functions to compute scree plots that compare eigenvalues to the broken stick model.
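The broken-stick expectations can be computed directly from the model. The deck's analyses use R; this short Python sketch implements the same formula and checks two sanity properties: the expected piece lengths sum to 1 (the whole stick) and decrease from the longest piece to the shortest.

```python
def broken_stick(p):
    # E(j) = (1/p) * sum_{x=j}^{p} 1/x, for pieces j = 1..p
    return [sum(1.0 / x for x in range(j, p + 1)) / p
            for j in range(1, p + 1)]

bs = broken_stick(12)                 # p = 12, as in the spider example
print(round(sum(bs), 10))             # 1.0: the pieces of a unit stick sum to 1
print(round(bs[0], 3))                # 0.259: expected share of the longest piece
```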
# Screeplot example: the spider data
spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")
# PCA using function rda() of {vegan}
rda.spiders.hel <- rda(spiders.hel)
screeplot(rda.spiders.hel, bstick=TRUE, npcs=12)

Screeplot of the eigenvalues. Grey rectangles: the eigenvalues. Red dots: values of the broken stick model rescaled to the sum of the eigenvalues. Drawn with vegan's function screeplot.cca(). Screeplot: see http://psychologydictionary.org/scree-plot/
[Figure: screeplot of the 12 eigenvalues of rda.spiders.hel, with the broken stick values overlaid.]
The decision on how many eigenvalues should be interpreted can be based
• either on the comparison of individual eigenvalues with the corresponding broken stick values,
• or on the comparison of cumulative eigenvalues with cumulative broken stick values.
For this example, using the comparison of individual eigenvalues with the corresponding broken stick values, one could decide to interpret the first 2 or 3 eigenvalues because the first two eigenvalues are larger than the corresponding broken stick values and the third one is about equal.
In the analysis of the spider data, what proportion of the species variance is expressed on the first 2 PCA axes? On the first 3 axes? Compute pseudo-R2 statistics for m = 2 or 3 axes:

R.square = sum of the first m eigenvalues / sum of all eigenvalues

# PCA using function rda() of {vegan}
res1 <- rda(spiders.hel)
sum(res1$CA$eig[1:2]) / sum(res1$CA$eig)   # 0.744
sum(res1$CA$eig[1:3]) / sum(res1$CA$eig)   # 0.874

# Same analysis using function prcomp() of {stats}
res2 <- prcomp(spiders.hel)
sum(res2$sdev[1:2]^2) / sum(res2$sdev^2)   # 0.744
sum(res2$sdev[1:3]^2) / sum(res2$sdev^2)   # 0.874
Algorithms for PCA
Principal component analysis can be computed using different computer algorithms.
PCA is a statistical method of data analysis (not a statistical test). Three different algorithms (or methods of calculation) can be used to implement it:
Ø Eigenvalue decomposition (EVD); eigen(cov(Y)) in R.
Ø Singular value decomposition (SVD); svd(Y.c) in R.
These two algorithms are interchangeable, although statisticians often prefer svd(), which offers greater numerical accuracy.
Details are found in Legendre & Legendre (2012, Section 9.1.9).
Ø An iterative algorithm developed by Clint & Jennings (1970) was adapted to correspondence analysis by Hill (1973). It was then used by ter Braak in the Canoco ordination package.
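The interchangeability of the EVD and SVD routes can be demonstrated numerically. In this NumPy sketch (random made-up data, not one of the deck's data sets), the eigenvalues of the covariance matrix equal the squared singular values of the centred data divided by (n − 1).

```python
import numpy as np

# Random data: 30 objects, 4 variables (illustration only).
rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 4))
Yc = Y - Y.mean(axis=0)
n = Yc.shape[0]

# EVD route: eigenvalues of the covariance matrix, decreasing order.
evals = np.sort(np.linalg.eigvalsh(np.cov(Yc, rowvar=False)))[::-1]

# SVD route: singular values of the centred data (already decreasing).
sing = np.linalg.svd(Yc, compute_uv=False)

# lambda_k = s_k^2 / (n - 1): the two algorithms give the same PCA.
print(np.allclose(evals, sing**2 / (n - 1)))   # True
```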
It had to be U – The (wonderful) SVD: the SVD song
Calculation of PCA by Singular Value Decomposition (SVD) is demonstrated in a video written by Prof. Michael Greenacre, Barcelona Graduate School of Economics. The video is available at the following address:
https://www.youtube.com/watch?v=JEYLfIVvR9I
The video presents a song explaining the mathematics of singular value decomposition (SVD), one of the most useful results in matrix algebra, which has a vast range of practical applications, including PCA. It is sung by Gurdeep Stephens, accompanied on the piano by Lisa Olive. Concept, lyrics and animations by Michael Greenacre.
This video was first played at the 9th Tartu Conference of Multivariate Statistics in Tartu, Estonia, on 28 June 2011.
The video links to mathematical lectures on SVD.
Some applications of PCA in ecology
Principal component analysis can help answer different questions in ecology. Here are some examples.
Display two-dimensional ordinations of the objects with their variables. Objects are often sampling sites in ecology.
[Figure: PCA ordination of the 28 spider sites (left) and the 12 species (right).]
This is the most common application of PCA.
Example 1 –
Identify or display groups of variables that are intercorrelated, for example species associations.
Oribatid mite species associations (symbols: groups 1 and 2) plotted in a PCA scaling 2 ordination. From: Legendre (2005).
Example 2 –
Example 3 –Detect outliers or erroneous data in data files
[Figure: PCA biplots of the sites and the variables Water_content, Calamagrostis, Reflectance and Corynephorus: left, with the erroneous value at site 9; right, after correction.]
• Left: erroneous value of 1000 for Calamagrostis at site 9; no other change. • Right: biplot with corrected value (Calamagrostis = 0) at site 9.
Example 4 –
Simplify collinear data that will be used as explanatory variables or covariables in canonical analysis (RDA or CCA).
# Example: the spider data, p = 12
spiders <- read.table(file.choose())
spiders.hel <- decostand(spiders, "hellinger")
# PCA using function rda() of {vegan}
rda.spiders.hel <- rda(spiders.hel)
eigenval <- rda.spiders.hel$CA$eig
format(cumsum(eigenval)/sum(eigenval), digits=3)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 0.502 0.744 0.874 0.906 0.935 0.953 0.969 0.979 0.987 0.994 0.998 1.000
Ø Noise in data can be removed by dropping the axes with small eigenvalues.
Ø The first 4 components account for 90% of the variance of the data, and the first 6 account for 95%. Use these components instead of the 12 original variables to represent the species as explanatory variables or covariables in RDA or CCA, in order to save degrees of freedom in tests of significance.
Remove a linear component of variation, e.g. the size factor in log-transformed morphological data.
Assumption: size is a multiplicative factor for all measured morphological traits.
• After a log transformation of all morphological measures (lengths), size becomes an additive factor.
• Compute the PCA of the log-transformed measurements.
• Size should be highly correlated with one of the first PCA axes.
• From matrix F (see Computation steps), remove the PCA axis most related to size. Use the other columns of matrix F as size-detrended measurements.
Example 5 –
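The size-detrending steps above can be sketched end to end. This is a hypothetical NumPy illustration (the simulated traits, the size proxy and all variable names are mine, not from the deck): simulate traits driven by a multiplicative size factor, log-transform, run PCA, identify the axis most correlated with a crude size proxy (the row mean of the logged traits), and drop it.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate 50 individuals x 4 traits with a common multiplicative size factor.
size = np.exp(rng.normal(0, 0.3, size=50))
traits = size[:, None] * np.exp(rng.normal(0, 0.05, (50, 4)))

# Log transformation makes size an additive factor; then PCA.
logY = np.log(traits)
Yc = logY - logY.mean(axis=0)
evals, U = np.linalg.eigh(np.cov(Yc, rowvar=False))
U = U[:, np.argsort(evals)[::-1]]
F = Yc @ U

# Find the PCA axis most correlated with a crude size proxy.
proxy = logY.mean(axis=1)
corrs = [abs(np.corrcoef(F[:, k], proxy)[0, 1]) for k in range(F.shape[1])]
size_axis = int(np.argmax(corrs))

# Remove that axis; keep the other columns of F as detrended measurements.
F_detrended = np.delete(F, size_axis, axis=1)
print(size_axis, F_detrended.shape)
```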
References
Borcard, D., F. Gillet & P. Legendre. 2018. Numerical ecology with R, 2nd edition. Use R! series, Springer International Publishing, New York. xv + 435 pp.
Clint, M. & A. Jennings. 1970. The evaluation of eigenvalues and eigenvectors of real symmetric matrices by simultaneous iteration. Computer Journal 13: 76–80.
Gabriel, K. R. 1971. The biplot graphical display of matrices with applications to principal component analysis. Biometrika 58: 453–467.
Goodall, D. W. 1954. Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany 2: 304–324.
Hill, M. O. 1973. Reciprocal averaging: an eigenvector method of ordination. Journal of Ecology 61: 237–249.
Legendre, P. 2005. Species associations: the Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, and Environmental Statistics 10: 226-245.
Legendre, P. & E. D. Gallagher. 2001. Ecologically meaningful transformations for ordination of species data. Oecologia 129: 271–280.
Legendre, P. & L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam. xvi + 990 pp. ISBN-13: 978-0444538680.
ter Braak, C. J. F. & P. Smilauer. 2002. CANOCO reference manual and CanoDraw for Windows user’s guide – Software for canonical community ordination (version 4.5). Microcomputer Power, Ithaca, New York. 500 pp.
End of section