Date post: 10-May-2015
Page 1: 0  introduction

Introduction to Metabolomic Data Analysis

Dmitry Grapov, PhD

Page 2: 0  introduction

Important

• This is an introduction to a series of 8 tutorials for metabolomic data analysis

• Download all the required files and software here:

https://sourceforge.net/projects/teachingdemos/files/Winter%202014%20LC-MS%20and%20Statistics%20Course/

• Then follow the directions in software/startup.R to launch all accompanying software

Page 3: 0  introduction

Goals?

Page 4: 0  introduction

Analysis at the Metabolomic Scale

Page 5: 0  introduction

Cycle of Scientific Discovery

[Figure: cycle diagram with the labels Data Acquisition, Data, Data Analysis, Hypothesis Generation, Data Processing, Hypothesis]

Page 6: 0  introduction

Univariate vs. Multivariate

[Figure: three panels]

• Univariate (Group 1 vs. Group 2): hypothesis testing (t-test, ANOVA, etc.)
• Multivariate: PCA
• Predictive Modeling: O-/PLS-DA

Page 7: 0  introduction

Univariate vs. Multivariate

univariate/bivariate vs. multivariate

• mixed up samples?
• outliers?

Page 8: 0  introduction

Data Analysis Goals

Exploration

• Are there any trends in my data?
– analytical sources
– meta data/covariates
• Useful Methods
– matrix decomposition (PCA, ICA, NMF)
– cluster analysis

Classification

• Differences/similarities between groups?
– discrimination, classification, significant changes
• Useful Methods
– analysis of variance (ANOVA), mixed effects models
– partial least squares discriminant analysis (O-/PLS-DA)
– Others: random forest, CART, SVM, ANN

Prediction

• What is related or predictive of my variable(s) of interest?
– Regression, correlation
• Useful Methods
– correlation
– partial least squares (O-/PLS)

Page 9: 0  introduction

Data Complexity

[Figure: data matrix of n samples by m variables, with complexity increasing from 1-D to 2-D to m-D; labels: Data, samples, variables, complexity, Meta Data, Experimental Design]

Variable # = dimensionality

Page 10: 0  introduction

Univariate Qualities

• length (sample size)
• center (mean, median, geometric mean)
• dispersion (variance, standard deviation)
• range (min / max)
• quantiles
• shape (skewness, kurtosis, normality, etc.)

[Figure: distribution annotated with its mean and standard deviation]
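The qualities listed above can be computed with a few lines of Python. This is a minimal sketch using only the standard library; the function name `summarize` and the example values are hypothetical, and the skewness shown is the simple moment-based estimator, one of several in common use.

```python
import statistics

def summarize(values):
    """Return basic univariate qualities of a numeric sample.

    Assumes all values are positive (needed for the geometric mean)
    and that the sample is not constant (needed for skewness).
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Second and third central moments for a simple skewness estimate
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "length": n,
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "sd": statistics.stdev(values),            # sample standard deviation
        "range": (min(values), max(values)),
        "quartiles": statistics.quantiles(values, n=4),
        "skewness": m3 / m2 ** 1.5,
    }

summary = summarize([1, 2, 3, 4, 10])
for name, value in summary.items():
    print(name, value)
```

A positive skewness, as in this example, indicates a long right tail, which is one reason to compare the mean against the median before choosing a parametric test.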

Page 11: 0  introduction

Data Quality

Metrics

• Precision

• Accuracy

Remedies

• normalization

• outlier detection

*Start lab 1-statistical analysis

Page 12: 0  introduction

Univariate Analyses

• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)

[Figure: data shapes labeled long, wide, and n-of-one]
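As one concrete instance of a test for differences in group means, the sketch below computes Welch's t statistic for two groups. Welch's variant is chosen here for illustration only (the slides name the t-test without specifying a variant), and the function name `welch_t` is hypothetical. Converting the statistic to a p-value needs the t distribution, which is left out to keep the example dependency-free.

```python
import math
import statistics

def welch_t(group1, group2):
    """Welch's t statistic: difference in means over its standard error."""
    mean1, mean2 = statistics.fmean(group1), statistics.fmean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Standard error without assuming equal variances in the two groups
    standard_error = math.sqrt(var1 / len(group1) + var2 / len(group2))
    return (mean1 - mean2) / standard_error

print(welch_t([2, 3, 4], [1, 2, 3]))
```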

Page 13: 0  introduction

False Discovery Rate (FDR)

• Type I Error: False Positives

• Type II Error: False Negatives

• Type I risk = $1 - (1 - \text{p.value})^{m}$, where m = number of variables tested

FDR correction

• p-value adjustment or estimate of FDR (Fdr, q-value)

Bioinformatics (2008) 24 (12):1461-1462
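The type I risk formula on this slide gives the chance of at least one false positive across m independent tests, and it grows quickly with m. The sketch below evaluates it and then applies a p-value adjustment. The Benjamini-Hochberg procedure is used as an assumption, since the slide does not say which adjustment the labs use; the function names are hypothetical.

```python
def type_one_risk(p_value, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - p_value) ** m

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

print(type_one_risk(0.05, 1))    # a single test
print(type_one_risk(0.05, 100))  # one hundred variables tested
print(benjamini_hochberg([0.04, 0.01]))
```

With 100 variables tested at 0.05, the risk of at least one false positive exceeds 99%, which is why some form of correction is needed for wide metabolomic data.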

Page 14: 0  introduction

Achieving “significance” is a function of:

significance level (α) and power (1-β )

effect size (standardized difference in means)

sample size (n)
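To see how these quantities interact, the sketch below approximates the power of a two-sided, two-group comparison from the significance level, the standardized effect size, and the sample size per group. It uses a normal (z-test) approximation rather than the exact t-based calculation, so treat the numbers as rough guides; the function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized difference in means;
    both groups are assumed to have n_per_group samples.
    """
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the alternative hypothesis
    shift = effect_size * math.sqrt(n_per_group / 2)
    return (standard_normal.cdf(shift - critical)
            + standard_normal.cdf(-shift - critical))

for n in (5, 10, 26, 50):
    print(n, round(approx_power(0.8, n), 3))
```

Holding the effect size fixed, power rises with sample size; with no true effect, the "power" collapses to the significance level.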

*finish lab 1-statistical analysis

Page 15: 0  introduction

Clustering

Identify

• patterns

• group structure

• relationships

• Evaluate/refine hypothesis

• Reduce complexity

Artist: Chuck Close

Page 16: 0  introduction

Cluster Analysis

Use the concept of similarity/dissimilarity to group a collection of samples or variables

Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self-organizing maps (SOM)

[Figure: example panels labeled Linkage, k-means, Distribution, Density]

Page 17: 0  introduction

Hierarchical Cluster Analysis

• similarity/dissimilarity defines “nearness” or distance

[Figure: distances between points in X-Y space; panels: euclidean, manhattan, Mahalanobis, non-euclidean]
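Two of the distances named on this slide are easy to write out directly. This is a minimal sketch with hypothetical function names; Mahalanobis distance is omitted because it also needs the inverse covariance matrix of the data.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
```

The same pair of points is 5 units apart by the euclidean measure and 7 by the manhattan measure, so the choice of distance changes which samples count as "near".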

Page 18: 0  introduction

Hierarchical Cluster Analysis

Agglomerative/linkage algorithm defines how points are grouped

[Figure: linkage panels: single, complete, centroid, average]
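The four linkage rules differ only in how the distance between two clusters is summarized. The sketch below shows that step for a pair of clusters, assuming euclidean distance between points; the function names are hypothetical, and a full agglomerative algorithm would repeat this calculation while merging the closest pair of clusters.

```python
import math
from itertools import product

def centroid(points):
    """Coordinate-wise mean of a cluster."""
    return [sum(axis) / len(points) for axis in zip(*points)]

def linkage_distance(cluster1, cluster2, method="single"):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(cluster1, cluster2)]
    if method == "single":      # closest pair of points
        return min(pairwise)
    if method == "complete":    # farthest pair of points
        return max(pairwise)
    if method == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":    # distance between cluster centers
        return math.dist(centroid(cluster1), centroid(cluster2))
    raise ValueError(f"unknown linkage method: {method}")

left = [(0, 0), (0, 1)]
right = [(3, 0), (4, 0)]
for method in ("single", "complete", "average", "centroid"):
    print(method, linkage_distance(left, right, method))
```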

Page 19: 0  introduction

Dendrograms

[Figure: dendrogram with a Similarity axis]

Page 20: 0  introduction

Exploration Confirmation

How does my metadata match my data structure?

Hierarchical Cluster Analysis

*finish lab 2-Cluster Analysis

Page 21: 0  introduction

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)

• unsupervised
• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)

• supervised
• maximize covariance (Y ~ X)

James X. Li, 2009, VisuMap Tech.

Page 22: 0  introduction

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings
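The three PCA outputs named here can be illustrated for the first principal component with a short, dependency-free sketch. It uses power iteration on the covariance matrix, chosen only because it needs no linear algebra library; the function name is hypothetical and real analyses would normally use an established PCA routine.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, variance_explained) for the first PC.

    rows: samples as equal-length sequences of variable values.
    Assumes the data are not constant and that the start vector is not
    orthogonal to the first PC (true for typical data).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    loadings = [1.0 / (j + 1) for j in range(m)]  # arbitrary start vector
    for _ in range(iterations):
        product = [sum(cov[i][j] * loadings[j] for j in range(m))
                   for i in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        loadings = [x / norm for x in product]
    # Eigenvalue = variance captured along the loading vector
    eigenvalue = sum(loadings[i] * cov[i][j] * loadings[j]
                     for i in range(m) for j in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    # Scores = centered samples projected onto the loadings
    scores = [sum(c[j] * loadings[j] for j in range(m)) for c in centered]
    return loadings, scores, eigenvalue / total_variance

rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
loadings, scores, explained = first_principal_component(rows)
print(loadings, scores, explained)
```

Each score is a weighted sum of one sample's centered variables, with the loadings as weights, which is how the row scores and column loadings are tied together.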

Page 23: 0  introduction

How are scores and loadings related?

Page 24: 0  introduction

Centering and Scaling

PMID: 16762068
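A minimal sketch of three common pretreatments applied to one variable (column): mean centering, autoscaling (unit variance), and Pareto scaling. These are chosen as typical examples; which options the lab uses is an assumption, and the function name `scale_column` is hypothetical.

```python
import math
import statistics

def scale_column(values, method="auto"):
    """Center one variable and optionally scale it.

    method: "center" (mean centering only),
            "auto"   (divide by the standard deviation),
            "pareto" (divide by the square root of the standard deviation).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    centered = [v - mean for v in values]
    if method == "center":
        return centered
    if method == "auto":
        return [c / sd for c in centered]
    if method == "pareto":
        return [c / math.sqrt(sd) for c in centered]
    raise ValueError(f"unknown scaling method: {method}")

print(scale_column([1, 2, 3], "auto"))
print(scale_column([2, 4, 6], "pareto"))
```

Autoscaling gives every variable equal weight in PCA, while Pareto scaling only partly reduces the dominance of high-variance variables.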

*finish lab 3-Principal Components Analysis

Page 25: 0  introduction

Use PLS to test a hypothesis

Partial Least Squares (PLS) is used to identify planes of maximum covariance between X measurements and Y (hypothesis)

[Figure: PCA and PLS projections of samples at time = 0 and 120 min.]

Page 26: 0  introduction

Modeling multifactorial relationships

dynamic changes among groups (~two-way ANOVA)

Page 27: 0  introduction

PLS Related Objects

Model
• dimensions, latent variables (LV)
• performance metrics (Q2, RMSEP, etc.)
• validation (training/testing, permutation, cross-validation)
• orthogonal correction

Samples
• scores
• predicted values
• residuals

Variables
• Loadings
• Coefficients, summary of loadings based on all LVs
• VIP, variable importance in projection
• Feature selection

Page 28: 0  introduction

“goodness” of the model is all about the perspective

Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model

• permutation tests

• training/testing
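The idea behind a permutation test is to recompute a performance statistic after shuffling the labels, then ask how often the shuffled ("random") result is at least as good as the real one. The sketch below uses the absolute difference in group means as a stand-in statistic; for a PLS model the same loop would be wrapped around a model fit and a metric such as Q2. The function name and defaults are hypothetical.

```python
import random
import statistics

def permutation_p_value(group1, group2, n_permutations=2000, seed=0):
    """Share of label shufflings with a difference at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group1) - statistics.fmean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between labels and values
        shuffled = abs(statistics.fmean(pooled[:n1])
                       - statistics.fmean(pooled[n1:]))
        if shuffled >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

control = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [101, 102, 103, 104, 105, 106, 107, 108]
print(permutation_p_value(control, treated))
```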

*finish lab 4-Partial Least Squares and lab 5-Data Analysis Case Study

Page 29: 0  introduction

Biological Interpretation

• Visualization
• Enrichment
• Networks
– biochemical
– structural
– spectral
– empirical

Projection or mapping of analysis results into a biological context.

Page 30: 0  introduction

Organism specific biochemical relationships and information

Multiple organism DBs
• KEGG
• BioCyc
• Reactome

Human
• HMDB
• SMPDB

Identification of alterations in biochemical domains
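One common way to test whether a biochemical domain is altered is an over-representation test: given how many measured metabolites belong to a pathway, how surprising is the number of pathway members among the significantly changed metabolites? The sketch below uses the hypergeometric distribution for this. It is an assumption that the lab uses this kind of test; the function name and the counts are made up for illustration.

```python
from math import comb

def enrichment_p_value(total, in_pathway, selected, overlap):
    """P(at least `overlap` pathway members among `selected` metabolites).

    total:      number of measured metabolites
    in_pathway: how many of them belong to the pathway
    selected:   number of significantly changed metabolites
    overlap:    changed metabolites that belong to the pathway
    """
    upper = min(in_pathway, selected)
    tail = sum(comb(in_pathway, k) * comb(total - in_pathway, selected - k)
               for k in range(overlap, upper + 1))
    return tail / comb(total, selected)

# Hypothetical counts: 100 measured, 10 in the pathway, 10 changed, 5 shared
print(enrichment_p_value(100, 10, 10, 5))
```

When many pathways are tested this way, the resulting p-values are themselves subject to the multiple-testing issue covered earlier.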

*finish lab 6-Metabolite Enrichment Analysis

Page 31: 0  introduction

Network Mapping

1. Generate Connections

2. Calculate Mappings

3. Create Network

Grapov D., Fiehn O., Multivariate and network tools for analysis and visualization of metabolomic data, ASMS, June 08, 2013, Minneapolis, MN

Page 32: 0  introduction

Connections and Contexts

Biochemical (substrate/product)
• Database lookup
• Web query

Chemical (structural or spectral similarity)
• fingerprint generation

Empirical (dependency)
• correlation, partial-correlation

BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99
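Of the three connection types, the empirical one is the simplest to sketch: compute the correlation between every pair of metabolites and keep the pairs above a threshold as network edges. The example below uses Pearson correlation only (partial correlation is not shown); the function names, the 0.8 threshold, and the metabolite values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

def correlation_edges(data, threshold=0.8):
    """Return (name1, name2, r) for pairs with |r| at or above threshold.

    data: dict mapping a metabolite name to its values across samples.
    """
    edges = []
    for name1, name2 in combinations(sorted(data), 2):
        r = pearson(data[name1], data[name2])
        if abs(r) >= threshold:
            edges.append((name1, name2, r))
    return edges

data = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, -1, 1, -1]}
print(correlation_edges(data))
```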

Page 33: 0  introduction

Mapping Analysis Results

[Figure: Analysis results, Network Annotation, Mapped Network]

*finish lab 7-Network Mapping I

Page 34: 0  introduction

Biochemical Relationships

http://www.genome.jp/dbget-bin/www_bget?rn:R00975

Page 35: 0  introduction

Structural Similarity

http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
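Structural similarity between two molecules is often scored by comparing their fingerprints with the Tanimoto coefficient: shared features divided by all features present in either molecule. The sketch below assumes each fingerprint is given as a set of "on" bit positions; the function name and the example fingerprints are hypothetical and do not correspond to real compounds.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Pairs scoring above a chosen cutoff can then be connected as edges in a structural similarity network.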

Page 36: 0  introduction

Mass Spectral Connections

Watrous J et al. PNAS 2012;109:E1743-E1752

*finish lab 8-Network Mapping II

