Statistical methods in metabolomics
David Moriña (a, b)
(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona
May 08 2014, Reus
Contents
1 Introduction
2 Basic statistics
3 Available tools
4 R basics
5 LC/MS example
6 Further reading
Introduction
Where does data come from?
Metabolomics
• Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue.
• To detect and quantify metabolites, separation techniques such as gas or liquid chromatography, followed by quantification by mass spectrometry (GC/MS or LC/MS), are often used.
• Nuclear magnetic resonance spectroscopy (NMR) is also frequently employed and has some appealing properties:
  • It is non-destructive, in the sense that it does not “destroy” the samples during the analysis process.
  • It is useful when analyzing tissues or when sequential analysis of samples is required.
Basic statistics
What do people say about statistics?
• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)
• Statistics are like bikinis: what they reveal is suggestive, but what they hide is vital. (A. Levenstein)
• About 93% of all statistics are made up. (Any newspaper)
Distributions and hypothesis testing
Normal distribution
Central Limit Theorem
Under fairly mild conditions, the distribution of the sum of independent and identically distributed random variables tends to a normal distribution as the number of observations grows.
The following example shows the distribution of the sum of the scores obtained when rolling 1, 2, 3, 5 and 10 dice:
[Figure: distributions of the sum of the scores for 1, 2, 3, 5 and 10 dice.]
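The dice example can be reproduced by simulation. The sketch below is an illustration, not the code behind the original figures; the seed, the number of simulated rolls and the helper name roll_sum are assumptions.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Sum of the scores of `n_dice` fair dice, repeated `n_rolls` times
roll_sum <- function(n_dice, n_rolls = 10000) {
  rolls <- matrix(sample(1:6, n_dice * n_rolls, replace = TRUE), ncol = n_dice)
  rowSums(rolls)
}

n_dice <- c(1, 2, 3, 5, 10)
sums <- lapply(n_dice, roll_sum)
names(sums) <- n_dice

# One histogram per number of dice, with one bar per possible sum
op <- par(mfrow = c(2, 3))
for (k in names(sums)) {
  hist(sums[[k]],
       breaks = seq(min(sums[[k]]) - 0.5, max(sums[[k]]) + 0.5, by = 1),
       main = paste("Number of dice:", k), xlab = "Sum of scores")
}
par(op)
```

With one die the histogram is flat; as the number of dice grows, the histogram of the sum takes on the familiar bell shape.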
Testing normality
There are several methods to test whether a random variable follows a normal distribution. Some of them are:
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (QQ-plot, ...)
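These checks are all available in base R. A minimal sketch on simulated data (the two samples and the seed are assumptions chosen for illustration):

```r
set.seed(2)  # arbitrary seed
x_norm <- rnorm(200, mean = 10, sd = 2)          # simulated normal data
x_skew <- rlnorm(200, meanlog = 0, sdlog = 1)    # simulated right-skewed data

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)
sw_norm$p.value
sw_skew$p.value

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
# (estimating the parameters from the same data makes the p-value approximate)
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))
ks_skew$p.value

# Graphical check: points far from the line indicate departure from normality
qqnorm(x_skew)
qqline(x_skew)
```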
There are some mathematical functions that can be applied in order to stabilize the variance of a random variable:
• log transformation
• logit transformation
• Square root transformation
[Figure: density of the original variable (X scale) and of the log-transformed variable (log(X) scale).]
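A comparison of this kind can be sketched with simulated log-normal data; the data, the seed and the helper skew() are assumptions and do not come from the original slides.

```r
set.seed(3)  # arbitrary seed
x <- rlnorm(500, meanlog = 0, sdlog = 1)  # right-skewed, strictly positive values

# Sample skewness (hypothetical helper): close to 0 for symmetric data
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(x)
skew(log(x))

# Density before and after the log transformation
op <- par(mfrow = c(1, 2))
plot(density(x), main = "Original variable", xlab = "X scale")
plot(density(log(x)), main = "log-transformed variable", xlab = "log(X) scale")
par(op)
```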
Significance
Is the interpretation of the p-value really well known? Common candidate interpretations, not all of them correct:
• Probability of a difference at least as large as the observed one if H0 is true (by chance)
• Probability of making a mistake when rejecting H0
• Evidence against H0 provided by the sample. If the p-value is small, it is unlikely to observe the sample differences by chance
• Probability that the observed differences are false
• 1 − p-value = probability that the observed differences are real
• The p-value is computed under the assumption that H0 is true, and therefore it cannot provide direct information about the probability that H0 is true
• Scientists should decide on H0 based on the evidence against it that the sample provides, without knowing the underlying reality
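A small simulation may help to see what "computed under the assumption that H0 is true" means. In the sketch below every simulated experiment has no real difference between groups; the sample sizes, the number of repetitions and the seed are assumptions.

```r
set.seed(4)  # arbitrary seed

# 2000 simulated experiments in which H0 is true:
# both groups are drawn from the same normal distribution
p_null <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of experiments wrongly declared significant at the 0.05 level
mean(p_null < 0.05)

# Under H0 the p-values are spread roughly evenly between 0 and 1
hist(p_null, breaks = 20, main = "p-values when H0 is true", xlab = "p-value")
```

The proportion of rejections stays near the chosen level even though no real effect exists in any experiment, which is why a p-value alone cannot be read as the probability that a finding is real.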
• If a confidence level 1 − α is fixed:
  p < α → statistically significant differences
  p ≥ α → no statistically significant differences
Comparing two populations
• Student’s t test was designed to compare two means.
• A t-test can also be used to determine whether 2 clusters are different.
[Figure: scatter plots of Value against Time.]
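In R, the comparison of two means is done with t.test(). A minimal sketch on simulated groups (the group means, standard deviations and seed are assumptions):

```r
set.seed(5)  # arbitrary seed
group_a <- rnorm(30, mean = 10, sd = 3)  # simulated measurements, group A
group_b <- rnorm(30, mean = 15, sd = 3)  # simulated measurements, group B

# By default t.test() runs Welch's version, which does not assume equal variances
res <- t.test(group_a, group_b)
res$p.value
res$conf.int  # confidence interval for the difference in means

# Student's original test assumes equal variances in both groups
res_student <- t.test(group_a, group_b, var.equal = TRUE)
res_student$p.value
```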
Comparing two populations (non-gaussian)
• If the distribution of the variable of interest is not Gaussian, we can still compare two populations by means of the Mann-Whitney U test (for independent samples) or the Wilcoxon signed-rank test (for paired samples).
• Formally, under additional assumptions on the shape of the distributions, these non-parametric tests compare two medians.
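Both tests are provided by wilcox.test() in base R. The sketch below uses simulated skewed data; the distributions and the seed are assumptions.

```r
set.seed(6)  # arbitrary seed

# Independent samples: Mann-Whitney U test
a <- rexp(25, rate = 1)
b <- rexp(25, rate = 1 / 3)
mw <- wilcox.test(a, b)
mw$p.value

# Paired samples: Wilcoxon signed-rank test on before/after measurements
before <- rexp(25)
after <- before * exp(rnorm(25, mean = 0.5, sd = 0.3))
sr <- wilcox.test(before, after, paired = TRUE)
sr$p.value
```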
Comparing three (or more) populations
• If we want to compare more than two groups, we can use the ANOVA technique.
• Essentially, it is a generalization of Student’s t test.
• Intra-group variances should be similar.
• Normality is not crucial.
• It only tells whether at least one of the compared groups differs from the others.
• ANOVA can also be used to determine whether 3 or more clusters are different.
[Figure: scatter plots of Value against Time.]
If H0 can be rejected, which is the different group?
• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error rate αT.
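The two stages, a global ANOVA followed by pairwise comparisons, can be sketched in base R as follows. The data are simulated so that only group C differs; group sizes, means and the seed are assumptions, and Tukey's method is used here as one possible post hoc procedure.

```r
set.seed(7)  # arbitrary seed
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
)

# Global test: is at least one group mean different?
fit <- aov(value ~ group, data = d)
summary(fit)
p_global <- summary(fit)[[1]][["Pr(>F)"]][1]

# Post hoc pairwise comparisons with Tukey's honest significant differences
tukey <- TukeyHSD(fit)
tukey$group  # one row per pair of groups, with adjusted p-values
```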
Multiple comparisons
There are several methods to control the type I error:
• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (comparisons against a control)
False Discovery Rate
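Suppose you performed 100 different t-tests and found 20 results with a p-value < 0.05.
• How many of these 20 results are likely to be false positives?
• If all 100 null hypotheses were true, we would expect 100 · 0.05 = 5 results below 0.05 by chance alone, so up to about 5 of the 20 findings may be false positives.
• To correct for this, we can require a stricter threshold, for example the Bonferroni bound p < 0.05/100 = 0.0005, or use a procedure that controls the false discovery rate, such as Benjamini-Hochberg.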
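Base R implements several of these corrections in p.adjust(). The sketch below applies three of them to a made-up vector of p-values (the values are an assumption chosen for illustration):

```r
# Hypothetical raw p-values from 10 tests
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
       0.0278, 0.0298, 0.0344, 0.0459, 0.3240)

adjusted <- data.frame(
  raw        = p,
  bonferroni = p.adjust(p, method = "bonferroni"),  # controls the family-wise error
  holm       = p.adjust(p, method = "holm"),        # step-down version of Bonferroni
  BH         = p.adjust(p, method = "BH")           # controls the false discovery rate
)
adjusted

# Number of tests still significant at 0.05 under each method
colSums(adjusted < 0.05)
```

The false discovery rate adjustment is never stricter than Holm, which in turn is never stricter than Bonferroni, so it usually keeps more findings at the cost of tolerating a controlled fraction of false positives among them.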
Correlation
• If there is some dependency between two variables, if there is a relationship between the predicted and the observed variable, or if the “before” and “after” treatments led to some effect, then clear patterns can be seen in the scatter plot
• This pattern or relationship is called correlation
[Figure: three scatter plots showing positive correlation (y against x), negative correlation (z against x) and no correlation (t against x).]
The correlation coefficient (Pearson coefficient) is computed by means of
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
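The formula can be checked against R's built-in cor(). The data below are simulated and the seed is an assumption.

```r
set.seed(9)  # arbitrary seed
x <- rnorm(50)
y <- 5 + 0.8 * x + rnorm(50, sd = 0.5)  # y depends linearly on x, plus noise

# Pearson coefficient computed term by term from the definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)  # built-in computation, Pearson by default
```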
Correlation and significance
[Figure: two scatter plots, one with r = 0.98 and one with r = 0.22.]
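The size of r and its statistical significance are different things, because significance also depends on the number of observations. The sketch below uses cor.test() on two made-up data sets (both are assumptions, not the data behind the figure):

```r
# Hypothetical data: three points lying almost on a line
x_small <- c(1, 2, 3)
y_small <- c(1.0, 2.5, 3.0)
small <- cor.test(x_small, y_small)
small$estimate  # sample correlation, above 0.9
small$p.value   # above 0.05: not significant with only 3 points

# Hypothetical data: weak linear relationship, many observations
set.seed(10)  # arbitrary seed
x_large <- rnorm(500)
y_large <- 0.2 * x_large + rnorm(500)
large <- cor.test(x_large, y_large)
large$estimate  # much smaller correlation
large$p.value   # with many observations, a weak correlation can still be significant
```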
Clustering
[Figure: CLUSPLOT(mydata), Component 2 against Component 1. These two components explain 100 % of the point variability.]
• Clustering is a process by which objects that are logically similar in characteristics are grouped together
• It is a step that precedes classification
• It requires a method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
• It uses a threshold value to decide whether an object belongs to a cluster
• There are several clustering methods, differing in how they start the clustering process
• K-means algorithm: divides a set of N objects into M clusters, with or without overlap. M must be specified by the analyst
• Hierarchical clustering: produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
K-means algorithm
1. Make the first object the centroid for the first cluster
2. For the next object, calculate the similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster
4. Return to step 2 and repeat until done
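The steps above can be sketched literally as a sequential, threshold-based procedure. This is an illustration only: the function name sequential_cluster, the use of Euclidean distance as the (inverse) similarity and the threshold value are all assumptions, and R's built-in kmeans() works differently, since it takes the number of clusters as input and iterates over all objects.

```r
# Sequential threshold clustering following the steps described in the text.
# "Similarity greater than a threshold" is expressed here as
# "Euclidean distance to the nearest centroid at most `threshold`".
sequential_cluster <- function(x, threshold) {
  x <- as.matrix(x)
  centroids <- x[1, , drop = FALSE]  # step 1: first object starts cluster 1
  labels <- integer(nrow(x))
  labels[1] <- 1L
  if (nrow(x) < 2) return(list(cluster = labels, centers = centroids))
  for (i in 2:nrow(x)) {
    # step 2: distance from object i to every existing centroid
    obj <- matrix(x[i, ], nrow(centroids), ncol(x), byrow = TRUE)
    d <- sqrt(rowSums((centroids - obj)^2))
    k <- which.min(d)
    if (d[k] <= threshold) {
      # step 3a: join the closest cluster and recompute its centroid
      labels[i] <- k
      centroids[k, ] <- colMeans(x[labels == k, , drop = FALSE])
    } else {
      # step 3b: start a new cluster with this object
      centroids <- rbind(centroids, x[i, ])
      labels[i] <- nrow(centroids)
    }
  }
  list(cluster = labels, centers = centroids)
}

# Hypothetical toy data: two well-separated groups of points
toy <- rbind(c(0, 0), c(0.1, 0), c(10, 10), c(10.1, 10), c(0, 0.1))
res <- sequential_cluster(toy, threshold = 1)
res$cluster
res$centers
```

Note that the result of such a procedure depends on the order of the objects and on the threshold, which is one reason to validate any clustering solution.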
K-means algorithm example
# Read data
> st1 <- read.table("Data/global_afegits.csv", sep=";", dec=",", header=TRUE)
# Determine number of clusters
# (st2.ado is assumed to be a data frame prepared beforehand; its construction is not shown)
> n <- nrow(st2.ado)
> wss <- numeric(10)
> wss[1] <- (n-1)*sum(apply(st2.ado[,2:8], 2, var))
> for (i in 2:10) {
+   wss[i] <- sum(kmeans(na.omit(st2.ado[,2:8]), centers=i)$withinss)
+ }
> plot(1:10, wss, type="b", xlab="Number of groups", ylab="Within groups sum of squares")
[Figure: within groups sum of squares against the number of groups (1 to 10).]
If we choose 5 clusters, then
> fit <- kmeans(st2.ado, 5)
will classify the observations into the 5 groups.
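Since the data set used in the slides is not available, the same workflow can be tried on a data set shipped with R. The sketch below uses iris as a stand-in; the choice of data, of 3 clusters, of nstart and the seed are assumptions.

```r
set.seed(12)  # k-means starts from random centres, so fix the seed
x <- scale(iris[, 1:4])  # standardize the four numeric variables

# Within-groups sum of squares for 1 to 10 clusters
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of groups", ylab = "Within groups sum of squares")

# Fit the chosen solution and inspect the assignment
fit <- kmeans(x, centers = 3, nstart = 20)
table(cluster = fit$cluster, species = iris$Species)
```

The number of groups is usually read off the plot at the point where adding one more cluster no longer reduces the within-groups sum of squares appreciably.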
Hierarchical clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2 until finished
Hierarchical clustering example
# Ward Hierarchical Clustering
> d <- dist(st2.ado, method = "euclidean")
> fit <- hclust(d, method="ward.D")
> plot(fit) # display dendrogram
> groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
> rect.hclust(fit, k=5, border="red")
[Figure: Cluster Dendrogram, hclust (*, "ward.D") on d; vertical axis: Height.]
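A self-contained version of the same workflow can be run on a data set shipped with R. The sketch below uses USArrests as a stand-in for the slides' data and cuts the tree into 4 clusters; both choices are assumptions. It uses method "ward.D2", which according to the hclust() documentation implements Ward's criterion on unsquared Euclidean distances, whereas "ward.D" expects squared distances to do so.

```r
# Standardize the variables so that no single one dominates the distances
d <- dist(scale(USArrests), method = "euclidean")

# Ward hierarchical clustering
fit <- hclust(d, method = "ward.D2")

# Cut the tree into 4 clusters and count the members of each
groups <- cutree(fit, k = 4)
table(groups)

# Dendrogram with red borders around the 4 clusters
plot(fit, cex = 0.6)
rect.hclust(fit, k = 4, border = "red")
```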
Validating cluster solutions
There are several methods to compare different clustering solutions to the same problem.
• Hubert’s gamma coefficient
• Dunn index
• Corrected Rand index
Some of them are implemented in the R package fpc.
Multivariate statistics
• Multivariate statistics means dealing with several variables at the same time
• Multivariate problems require more complex, multidimensional analyses or dimension reduction methods
• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments measure multiple variables, so metabolomic data are inherently multivariate
• The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data to univariate data
• Then we can apply the same univariate concepts, such as p-values, t-tests and ANOVA, to the data
Principal Component Analysis
• PCA is a process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
• PCA captures what should be visually detectable
• If you can’t see it, PCA probably won’t help
> data(USArrests)
> pc.cr <- princomp(USArrests, cor = TRUE)
> biplot(pc.cr)
[Figure: biplot of the first two principal components (Comp.1, Comp.2) of USArrests, showing the US states and the variables Murder, Assault, UrbanPop and Rape.]
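To judge how much information a two-dimensional biplot retains, it helps to look at the proportion of variance explained by each component. The sketch below uses prcomp(), an alternative to princomp() in base R, on the same USArrests data; the variable names are assumptions.

```r
# PCA on standardized variables (equivalent in spirit to cor = TRUE in princomp)
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of the total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 3)
round(cumsum(var_explained), 3)  # cumulative proportion

# Loadings: contribution of each original variable to each component
pc$rotation

# Scree plot of the component variances
screeplot(pc, type = "lines", main = "USArrests")
```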
Other methods
There are several other multivariate methods, increasingly used in metabolomics and related fields:
• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling
Available tools
How to analyze data?
R
• R is a freely available language and environment for statistical computing and graphics.
• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• It can be downloaded from http://cran.r-project.org.
Bioconductor
• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• It provides tools for the analysis and comprehension of high-throughput genomic data.
• It has mailing lists and a very active community of users and developers.
Installation of Bioconductor packages
Bioconductor can be installed from within an R session with
source("http://bioconductor.org/biocLite.R")
biocLite()
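In Bioconductor releases from 3.8 onward, biocLite() has been superseded by the BiocManager package from CRAN. A sketch of the equivalent installation is given below; the wrapper name install_bioc is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper (not part of the original slides): installs Bioconductor
# packages through BiocManager, installing BiocManager from CRAN first if needed.
install_bioc <- function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Usage (requires an internet connection):
# install_bioc(c("xcms", "faahKO"))
```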
R basics
Getting help
• ?mean
• help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
R packages for metabolomics
Useful packages
There are a number of useful packages in Bioconductor for metabolomics data analysis.
• flagme: Analysis of metabolomics GC/MS data
• xcms: Analysis of metabolomics LC/MS data
LC/MS example
xcms
• Can read data stored in several formats, such as netCDF, mzXML, mzData and mzML.
• Provides methods for feature detection, non-linear retention time alignment, visualization, relative quantitation and statistics.
• Is capable of simultaneously preprocessing, analyzing, and visualizing the raw data from hundreds of samples.
• It is available as an R package or as an online platform accessible through https://xcmsonline.scripps.edu/.
[Figure: typical xcms workflow.]
Reading the data
# use biocLite to install a Bioconductor package
> source("http://bioconductor.org/biocLite.R")
# Install the xcms package
> biocLite("xcms")
# Install dataset package used in this session
> biocLite("faahKO")
The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH knockout mice. The data are a subset of the original data, from 200-600 m/z and 2500-4500 seconds. They were collected in positive ionization mode.
# Load libraries
> library("xcms")
> library("faahKO")
> cdfpath <- system.file("cdf", package="faahKO")
> files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
> data <- xcmsSet(files)
Some important parameters
• scanrange=c(lower, upper): to scan part of the spectra
• fwhm=seconds: specify the full width at half maximum (default 30 s) based on the type of chromatography
• method="centWave": use a wavelet algorithm for peak detection, suitable for high-resolution spectra
Peak alignment and retention time correction
> xsg <- group(data) # peak alignment
> xsg <- retcor(xsg) # retention time correction
> xsg <- group(xsg) # re-align
• Matching peaks across samples
• Using the peak groups to correct drift
• Re-do the alignment
• Can be performed iteratively until no further change
[Figure: Retention Time Deviation vs. Retention Time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22, with the peak density along the retention time shown below.]
Filling in missing peaks
> xsg <- fillPeaks(xsg)
• A significant number of potential peaks can be missed during peak detection
• Missing values are problematic for robust statistical analysis
• We now have a better idea about where to expect real peaks and their boundaries
• Re-scan the raw spectra and integrate peaks in the regions of the missing peaks
Results of peak detection
> peaks(xsg)
The peaks() function gives a list of peaks with
• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)
Statistical analysis
> report <- diffreport(xsg, "WT", "KO")
• The diffreport() function computes Welch’s two-sample t-statistic for each analyte and ranks them by p-value.
• It returns a summary report
• Multivariate analysis and visualization can be performed using MetaboAnalyst
• The report generated by diffreport() can be directly uploaded to MetaboAnalyst
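The idea of a per-analyte Welch test followed by ranking can be illustrated in base R without xcms. The sketch below is not the diffreport() implementation: the intensity matrix is simulated, and the number of features, the sample names, the fold change and the seed are all assumptions.

```r
set.seed(16)  # arbitrary seed
n_features <- 200

# Simulated intensity matrix: one row per feature, 6 WT and 6 KO samples
intensities <- matrix(
  rlnorm(n_features * 12, meanlog = 10, sdlog = 0.5),
  nrow = n_features,
  dimnames = list(paste0("feature", seq_len(n_features)),
                  c(paste0("wt", 1:6), paste0("ko", 1:6)))
)
class_labels <- factor(rep(c("WT", "KO"), each = 6))

# Make the first 10 features truly different between the two groups
intensities[1:10, class_labels == "KO"] <- intensities[1:10, class_labels == "KO"] * 4

# Welch's two-sample t-test for each feature, on log2 intensities
pvals <- apply(log2(intensities), 1, function(v) {
  t.test(v[class_labels == "WT"], v[class_labels == "KO"])$p.value
})

# Rank features by p-value and adjust for the number of tests
ranked <- sort(pvals)
head(ranked)
head(sort(p.adjust(pvals, method = "BH")))
```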
Visualizing important peaks
# Select peaks with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
> gr <- groups(xsg)
> groupidx <- which(gr[,"rtmed"] > 3300 &
+                   gr[,"rtmed"] < 3400 &
+                   gr[,"npeaks"] >= 8)[1]
> eiccor <- getEIC(xsg, groupidx=groupidx)
> plot(eiccor, col=as.numeric(phenoData(xsg)$class))
• When significant peaks are identified, it is critical to visualize these peaks to assess quality
• This is done using the Extracted Ion Chromatogram (EIC)
[Figure: Extracted Ion Chromatogram, 300.1 − 300.2 m/z; Intensity against Retention Time (seconds).]
Further reading
Some references
• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2 (4), 171–196.
• Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics. Current metabolomics, 1 (1), 92–107.
• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D. (2009). Analytical and statistical approaches to metabolomics research. Journal of separation science, 32, 2183–2199.
• Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R package documentation.
• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in metabolomics. Methods in molecular biology, 856 (Evolutionary genomics), 381–413. Springer.
Centre for Research in Environmental Epidemiology
Parc de Recerca Biomèdica de Barcelona, Doctor Aiguader, 88, 08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00, Fax (+34) 93 214 73 02