Some slides adapted from Prof. A. Alekseyenko (NYU) and Prof. S. Holmes (Stanford)
Lecture 2: Diversity, Distances, adonis
• “Diversity”: alpha, beta (, gamma)
• Beta-diversity in practice: ecological distances
• Unsupervised learning: clustering, etc.
• Ordination: e.g. PCA, UniFrac/PCoA, DPCoA
• Testing: Permutational Multivariate ANOVA
Alpha-Diversity
Alpha diversity definition(s)
• Alpha diversity describes the diversity of a single community (specimen).
• In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation.
• There are many statistics that can describe diversity, e.g. taxonomic richness, evenness, and dominance.
Rank abundance plots
Species richness
• Suppose we observe a community that can contain up to k ‘species’.
• The relative proportions of the species are P = {p1, …, pk}.
• Richness is computed as

  R = 1(p1) + 1(p2) + … + 1(pk)

  where 1(·) is an indicator function: 1(pi) = 1 if pi ≠ 0, and 0 otherwise.
• Higher R means greater diversity.
• Richness is very dependent upon depth of sampling and sensitive to the presence of rare species.
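The indicator-function definition above translates directly into code. A minimal Python sketch, using a made-up vector of proportions:

```python
def richness(proportions):
    """Species richness: R = 1(p1) + 1(p2) + ... + 1(pk),
    i.e. the count of species with nonzero proportion."""
    return sum(1 for p in proportions if p != 0)

# Hypothetical community of k = 5 possible species, 2 of them unobserved
P = [0.5, 0.3, 0.2, 0.0, 0.0]
print(richness(P))  # 3
```

Note that R only asks whether a species was seen at all, which is why it is so sensitive to sampling depth: one more read can add one more species.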
Rarefaction Curves
• Sanders (1968): non-parametric richness; estimating coverage.
• Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.
[Figure: rarefaction curves; y-axis: number of species, x-axis: # observations / library size / # reads / sample size.]
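A rarefaction curve can be sketched by repeatedly subsampling reads without replacement at increasing depths and counting the species observed. A minimal pure-Python illustration with made-up species counts (in practice one would use e.g. vegan's rarefaction tools):

```python
import random

def rarefy_richness(counts, depth, n_iter=100, seed=0):
    """Mean number of species observed when `depth` reads are drawn
    without replacement from a sample with the given species counts."""
    rng = random.Random(seed)
    # expand counts into one species label per read
    reads = [sp for sp, c in enumerate(counts) for _ in range(c)]
    total = 0
    for _ in range(n_iter):
        subsample = rng.sample(reads, depth)
        total += len(set(subsample))
    return total / n_iter

counts = [50, 30, 10, 5, 3, 1, 1]  # hypothetical species counts (100 reads)
curve = [rarefy_richness(counts, d) for d in (1, 10, 50, 100)]
# observed richness rises with depth and saturates at the true richness (7)
```

The curve flattening out is the visual cue that sampling depth is approaching adequate coverage.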
Shannon index
• Suppose we observe a community that can contain up to k ‘species’.
• The relative proportions of the species are P = {p1, …, pk}.
• The Shannon index is related to the notion of information content from information theory. It roughly represents the amount of information available in the distribution P.
• When pi = pj for all i and j, we have no information about which species a random draw will result in. As the proportions become more unequal, we gain more information about the possible outcome of the draw. The Shannon index captures this property of the distribution.
• The Shannon index is computed as

  Sk = –p1 log2 p1 – p2 log2 p2 – … – pk log2 pk

  Note that although log2 pi ➔ –∞ as pi ➔ 0, the product pi log2 pi ➔ 0; we therefore define pi log2 pi = 0 when pi = 0.
• Higher Sk means higher diversity.
“Shannon entropy”: http://en.wikipedia.org/wiki/Entropy_(information_theory)
From Shannon to Evenness
• The Shannon index for a community of k species has a maximum of log2 k.
• We can make different communities more comparable if we normalize by this maximum.
• The evenness index is computed as Ek = Sk / log2 k.
• Ek = 1 means total evenness.
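Both the Shannon index and the evenness index can be computed in a few lines. A minimal Python sketch with made-up communities:

```python
import math

def shannon(proportions):
    """Shannon index S = -sum(p * log2(p)), with p*log2(p) taken as 0 at p = 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def evenness(proportions, k):
    """Evenness E = S / log2(k) for a community of k possible species."""
    return shannon(proportions) / math.log2(k)

even = [0.25] * 4               # perfectly even community of k = 4 species
skewed = [0.7, 0.1, 0.1, 0.1]   # one dominant species
print(shannon(even))            # 2.0 (= log2(4), the maximum for k = 4)
print(evenness(even, 4))        # 1.0 (total evenness)
print(shannon(skewed) < shannon(even))  # True
```

Filtering `p > 0` in the sum is exactly the 0·log2(0) = 0 convention from the definition.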
Simpson index
• Suppose we observe a community that can contain up to k ‘species’.
• The relative proportions of the species are P = {p1, …, pk}.
• The Simpson index is the probability of resampling the same species on two consecutive draws with replacement: if the first draw picked species i (an event with probability pi), the probability of drawing that species twice is pi·pi.
• The Simpson index is usually computed in its complement form:

  D = 1 – (p1² + p2² + … + pk²)

  In this form, the index represents the probability that two individuals randomly selected from a sample will belong to different species.
• D = 0 means no diversity (1 species is completely dominant).
• D close to 1 means complete diversity.
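The complement form above is a one-liner. A minimal Python sketch with made-up proportions:

```python
def simpson(proportions):
    """Complement-form Simpson index D = 1 - sum(p_i^2): the probability
    that two individuals drawn with replacement are different species."""
    return 1 - sum(p * p for p in proportions)

print(simpson([1.0, 0.0, 0.0]))  # 0.0  -> one species completely dominant
print(simpson([0.25] * 4))       # 0.75 -> the maximum for k = 4 species
```

The even community shows why D cannot quite reach 1: its maximum for k species is 1 − 1/k.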
Numbers equivalent diversity
• It is often convenient to talk about alpha diversity in terms of equivalent units: how many equally abundant taxa would it take to get the same diversity as we see in a given community?
• For richness there is no difference in the statistic.
• For Shannon, remember that log2 k is the maximum, attained when all species are equally abundant. Hence the diversity in equivalent units is 2^Sk.
• For Simpson, the equivalent-units measure of diversity is 1/(1 – D), sometimes called the “inverse Simpson index”.
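A handy sanity check for the equivalent-units idea: for a perfectly even community of k species, all three measures recover k. A minimal Python sketch:

```python
import math

def shannon(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def simpson(p):
    return 1 - sum(x * x for x in p)

k = 8
even = [1 / k] * k
eq_richness = sum(1 for x in even if x != 0)  # richness is unchanged
eq_shannon = 2 ** shannon(even)               # 2^Sk
eq_simpson = 1 / (1 - simpson(even))          # inverse Simpson
print(eq_richness, eq_shannon, eq_simpson)    # 8 8.0 8.0
```

On real, uneven communities the three equivalents diverge, because the underlying indices weight rare species differently.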
Beta-Diversity
Beta-Diversity
http://en.wikipedia.org/wiki/Beta_diversity
• Microbial ecologists typically use “beta diversity” as a broad umbrella term that can refer to any of several indices of compositional difference (differences in species content between samples).
• For some reason this is contentious, and there appears to be an ongoing (and pointless?) argument over the possible definitions.
• For our purposes, and for microbiome research, when you hear “beta-diversity”, you can probably think:
“Diversity of species composition”
Summary of diversity “types”
• α – diversity within a community; only the number of species is considered.
• β – diversity between communities (differentiation); species identity is taken into account.
• γ – (global) diversity of the site.
• Theoretically, one would wish to use measures such that γ = α × β.
• This is only possible if α and β are independent of each other.
Beta-Diversity “in practice”
1. UniFrac or Bray-Curtis distance between samples
2. MDS (“PCoA”)
3. Plot first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to step 1, repeat
Why? Let’s back up. This is one option in an arsenal of dimensional reduction methods that come from “unsupervised learning” in “exploratory data analysis”.
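Step 1 of the recipe, for Bray-Curtis, reduces to a short formula. A minimal pure-Python sketch with made-up OTU count vectors (in practice you would use vegan::vegdist or scipy.spatial.distance.braycurtis):

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors:
    sum(|u_i - v_i|) / sum(u_i + v_i); 0 = identical, 1 = no shared species."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den

s1 = [10, 0, 5, 3]   # hypothetical OTU counts, sample 1
s2 = [8, 2, 5, 1]    # sample 2: similar composition to sample 1
s3 = [0, 12, 0, 0]   # sample 3: no species in common with sample 1
print(bray_curtis(s1, s1))  # 0.0
print(bray_curtis(s1, s3))  # 1.0
print(bray_curtis(s1, s2))  # small: compositions largely overlap
```

The resulting pairwise distance matrix is what gets fed into MDS/PCoA in step 2.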
Dimensional Reduction
[Figure: two scatterplots with fitted lines: regress disc on weight, and regress weight on disc.]
Dimensional Reduction
• Minimize the distance to the line in both directions: the purple line is the principal component line.
Dimensional Reduction
• Principal components are linear combinations of the ‘old’ variables.
• PCA finds the projection that maximizes the area of the “shadow”; an equivalent measure is the sum of squares of the distances between points in the projection.
• We want to see as much of the variation as possible; that is what PCA does.
The PCA workflow
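The “maximize the shadow” idea has a closed form for two variables: the first principal component is the leading eigenvector of the 2×2 covariance matrix. A minimal pure-Python sketch with made-up data (real workflows use prcomp or dudi.pca):

```python
import math

def first_pc_2d(xs, ys):
    """Direction of the first principal component for 2-D data, from the
    leading eigenvector of the 2x2 covariance matrix (assumes sxy != 0)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # leading eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # corresponding eigenvector (sxy, lam - sxx), normalized
    vx, vy = sxy, lam - sxx
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# perfectly correlated data: the PC line is y = x, direction (1,1)/sqrt(2)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]
print(first_pc_2d(xs, ys))  # approximately (0.707, 0.707)
```

The same machinery in higher dimensions is just an eigendecomposition (or SVD) of the full covariance matrix, one component per axis.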
Ordination Using the Tree
1. UniFrac-PCoA
2. Double principal coordinates (DPCoA)
(Un)supervised Learning: Ordination Best Practice
1. Always look at the scree plot
2. Variables, samples
3. Biplot
4. Altogether (if readable)
(Un)supervised Learning: Ordination Best Practice

pca.turtles = dudi.pca(Turtles[, -1], scannf = FALSE, nf = 2)
scatter(pca.turtles)
(Un)supervised Learning: What did we “learn”? Depends on the data.
• How many axes are probably useful?
• Are there clusters? How many?
• Are there gradients?
• Are the patterns consistent with covariates (e.g. sample observations)?
• How might we test this?
(Un)supervised Learning: What did we “learn”? Depends on the data.
• Are there clusters? How many? ➔ Gap statistic
(Un)supervised Learning: What did we “learn”? Depends on the data.
• Are there gradients? ➔ PCA regression
(Un)supervised Learning: What did we “learn”? Depends on the data.
• Are the patterns consistent with covariates?
• How might we test this?
➔ (Permutational) multivariate ANOVA: vegan::adonis()
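The idea behind adonis can be sketched as a permutation test on a distance matrix: partition the sum of squared distances into between-group and within-group parts, form a pseudo-F statistic, and permute the group labels to build a null distribution. A minimal pure-Python illustration with a made-up 6-sample distance matrix (use vegan::adonis() on real data):

```python
import random
from itertools import combinations

def pseudo_f(dist, labels):
    """Pseudo-F from a full distance matrix and group labels:
    F = (SS_between / (a - 1)) / (SS_within / (n - a))."""
    n = len(labels)
    groups = set(labels)
    a = len(groups)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(idx, 2)) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))

def permanova_p(dist, labels, n_perm=999, seed=0):
    """Permutation p-value: fraction of label shuffles with F >= observed
    (the observed statistic counts as one of the permutations)."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    hits = 1
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        if pseudo_f(dist, perm) >= f_obs:
            hits += 1
    return hits / (n_perm + 1)

# toy distance matrix: samples 0-2 close together, 3-5 close together
d = [[0, 1, 1, 9, 9, 9],
     [1, 0, 1, 9, 9, 9],
     [1, 1, 0, 9, 9, 9],
     [9, 9, 9, 0, 1, 1],
     [9, 9, 9, 1, 0, 1],
     [9, 9, 9, 1, 1, 0]]
labels = ["A", "A", "A", "B", "B", "B"]
# about 0.1: with 6 samples in 2 balanced groups, 2 of the 20 possible
# labelings tie the observed F, so p cannot go much below 0.1
print(permanova_p(d, labels))
```

This is why adonis p-values on very small designs are coarse: the permutation null has only as many values as there are distinct label arrangements.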