SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges

Topic: Data Integration

Katerina Kechris, PhD
Associate Professor

Biostatistics and Informatics
Colorado School of Public Health

University of Colorado Denver

Omics

• Large-scale analyses for studying a population of molecules or molecular mechanisms

• High-throughput data
• Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

Omics

[Figure: omics layers; surviving labels: Epigenome, Phenome]

Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Large-scale Projects & Databases

NCI 60 Database

Integration of Omics Data

• Each type of data gives a different snapshot of the biological or disease system

• Why integrate data?
– Reduce false positives/negatives
– Identify interactions between different molecules
– Explore functional mechanisms

Challenges

1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Challenge 1: When to integrate?

• Early
– Merging data to increase sample size

• Intermediate
– Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis

• Late
– Meta-analysis (combine effect sizes or p-values), aggregate voting for classifiers, genomic enrichment and overlap of significant results
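As an illustration of late integration, the sketch below combines p-values for one gene across studies using Fisher's method. It is a minimal example, not the approach of any group cited in these slides; it assumes the studies are independent, and the function name and input p-values are hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns the test statistic and the combined p-value.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Under the null, -2 * sum(log p) follows a chi-square with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Hypothetical p-values for one gene from three independent studies
stat, p_combined = fisher_combined_pvalue([0.04, 0.10, 0.20])
```

Three modest p-values combine into stronger joint evidence than any single study provides, which is the motivation for meta-analysis at the late stage.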

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies

Tseng Lab, U. of Pitt.

Assessing Genomic Overlap: Permutation-based Strategies

Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
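The idea of a permutation-based overlap test can be sketched with gene lists. This is a simplified illustration, not the method of the paper cited above: it resamples gene labels uniformly and ignores genomic position and dependence between neighboring features, which methods for genomic intervals need to handle. The function name and gene identifiers are hypothetical.

```python
import random

def overlap_permutation_pvalue(universe, set_a, set_b, n_perm=2000, seed=0):
    """Estimate how often a random gene set overlaps set_a as much as set_b does."""
    rng = random.Random(seed)
    universe = list(universe)
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size as set_b from the universe
        random_b = set(rng.sample(universe, len(set_b)))
        if len(set_a & random_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical example: two lists of 50 significant genes out of 1000
genes = [f"g{i}" for i in range(1000)]
observed_overlap, p_value = overlap_permutation_pvalue(
    genes, genes[0:50], genes[25:75]
)
```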

Challenge 2: Dimensionality

• Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types

• Dimension reduction
– Process each data type separately (filtering)
– Combine with model fitting
– Multivariate analysis
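The filtering option above can be sketched as a variance filter applied to one data type before integration. This is only one possible filter, chosen here for illustration; the function name and the toy expression values are hypothetical.

```python
import statistics

def top_variance_features(data, n_keep):
    """Keep the n_keep features with the largest variance across samples.

    data maps a feature name to its list of values, one per sample.
    """
    variances = {name: statistics.pvariance(values) for name, values in data.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical expression values for three genes across four samples
expression = {
    "geneA": [5.1, 5.0, 5.2, 5.1],
    "geneB": [2.0, 8.0, 3.0, 9.0],
    "geneC": [7.0, 7.5, 6.5, 7.2],
}
kept = top_variance_features(expression, n_keep=2)
```

Filtering each data type separately is simple but cannot see relationships across data types, which is what the model-based and multivariate options address.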

Sparse Multivariate Methods

• Variable Selection, Discriminant Analysis, Visualization

• Penalties (regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)

• Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)

Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Washington; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008; 7(1): Article 35
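How a penalty produces sparsity can be sketched with soft-thresholding inside an alternating rank-1 update, in the spirit of penalized matrix decomposition approaches to sparse CCA. This is a simplified sketch and not the algorithm of the cited papers: it assumes Z is the cross-product matrix of two column-standardized data types, and it uses fixed thresholds (lam_u, lam_v, hypothetical names) instead of tuned L1 constraints.

```python
import math

def soft_threshold(values, lam):
    """Shrink each value toward zero by lam; values within lam of zero become 0."""
    out = []
    for v in values:
        shrunk = abs(v) - lam
        out.append(math.copysign(shrunk, v) if shrunk > 0 else 0.0)
    return out

def normalize(values):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)

def sparse_rank1(Z, lam_u, lam_v, n_iter=50):
    """Alternate sparse updates of u (rows of Z) and v (columns of Z)."""
    p, q = len(Z), len(Z[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        Zv = [sum(Z[i][j] * v[j] for j in range(q)) for i in range(p)]
        u = normalize(soft_threshold(Zv, lam_u))
        Ztu = [sum(Z[i][j] * u[i] for i in range(p)) for j in range(q)]
        v = normalize(soft_threshold(Ztu, lam_v))
    return u, v

# Hypothetical cross-product matrix: features 0-1 of each data type co-vary strongly
Z = [
    [4.0, 3.5, 0.1],
    [3.8, 4.2, -0.2],
    [0.1, -0.1, 0.3],
]
u, v = sparse_rank1(Z, lam_u=0.5, lam_v=0.5)
```

In this toy case the third entry of both weight vectors is driven to exactly zero, so only the co-varying features of each data type are selected.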

Challenge 3: Genomic Resolution

• Base level (conservation, motif scores)

• Regular intervals (expression/binding from tiling arrays)

• Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)

Challenge 4: Heterogeneity

• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
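The data-matching issue, where several microarray probes map to one gene and some probes map to none, can be sketched as follows. Averaging probes per gene is only one possible rule and is assumed here for illustration; the probe and gene identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average probe-level values per gene and report probes without a mapping."""
    per_gene = defaultdict(list)
    unmapped = []
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)      # identifier missing from the annotation
        else:
            per_gene[gene].append(value)
    gene_values = {gene: mean(values) for gene, values in per_gene.items()}
    return gene_values, unmapped

# Hypothetical probe-level values and a many-to-one annotation
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 7.5, "p4": 1.0}
probe_to_gene = {"p1": "GENE1", "p2": "GENE1", "p3": "GENE2"}
gene_values, unmapped = collapse_probes_to_genes(probe_values, probe_to_gene)
```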

Challenge 4: Heterogeneity

• Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance

• Counts – expression data from sequencing

• 0-1 – conservation (UCSC), DNA methylation

• Binary/Categorical – Thresholding (e.g., motif scores), genotype
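A few common transformations for these data types are sketched below: standardizing continuous values, log-transforming counts, and thresholding scores into binary calls. The specific choices (z-scores, log2 with a pseudocount of 1, a cutoff of 0.5) are assumptions for illustration and are not prescribed by the slides.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Z-score continuous measurements (constant inputs map to zeros)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values] if s > 0 else [0.0 for _ in values]

def log_counts(counts):
    """Log2-transform sequencing counts with a pseudocount of 1."""
    return [math.log2(c + 1) for c in counts]

def threshold(scores, cutoff):
    """Convert continuous scores (e.g., motif scores) to 0/1 calls."""
    return [1 if s >= cutoff else 0 for s in scores]

# Hypothetical values for each data type
z_expr = standardize([5.0, 7.0, 9.0])
log_expr = log_counts([0, 1, 3])
motif_calls = threshold([0.2, 0.9, 0.5], cutoff=0.5)
```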

Case Study: Development

Ci
• Important for differentiation of appendages during development
• Transcription factor – binds to DNA near target genes

http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu

Kechris Lab, CU Denver

Hierarchical Mixture Model

• Data
– Transcriptome: Ci pathway mutants (expr) – irregular interval
– Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
• Goal: Predict gene targets of Ci
• Hidden variable is gene target – hierarchical mixture model

Dvorkin et al., 2013 (under review)
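The role of the hidden "gene target" variable can be illustrated with a much simpler model than the one cited above: a flat two-component mixture in which the three data types are assumed conditionally independent and Gaussian given target status. This is not the authors' hierarchical model; all parameter values are hypothetical, the data are assumed to be already summarized per gene, and in practice the parameters would be estimated (for example with EM) instead of fixed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_target(observations, params, prior_target):
    """Posterior probability that a gene is a target, by Bayes' rule.

    params[name] = ((mean, sd) if target, (mean, sd) if not target)
    """
    like_target, like_background = prior_target, 1.0 - prior_target
    for name, x in observations.items():
        (mu1, sd1), (mu0, sd0) = params[name]
        like_target *= normal_pdf(x, mu1, sd1)
        like_background *= normal_pdf(x, mu0, sd0)
    return like_target / (like_target + like_background)

# Hypothetical per-gene summaries and parameters for expr, bind, and cons
params = {
    "expr": ((2.0, 1.0), (0.0, 1.0)),
    "bind": ((1.5, 1.0), (0.0, 1.0)),
    "cons": ((0.8, 0.2), (0.5, 0.2)),
}
p_likely = posterior_target({"expr": 2.5, "bind": 2.0, "cons": 0.9}, params, 0.1)
p_unlikely = posterior_target({"expr": 0.0, "bind": 0.0, "cons": 0.5}, params, 0.1)
```

A gene supported by all three data types receives a high posterior probability even with a low prior, while a gene with background-like values in every data type falls below the prior.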

Challenge 5: Interactions and Pathways

• Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite – protein interactions (directed graphs)

• De novo Pathways
– Discover novel interactions
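A known pathway with directed metabolite and protein interactions can be held as a directed graph. The sketch below uses an adjacency mapping and finds everything downstream of a node; the node names are hypothetical placeholders, not real KEGG entries, and no database API is used.

```python
from collections import defaultdict

def build_directed_graph(edges):
    """Build an adjacency mapping node -> set of direct successors."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph.setdefault(target, set())  # make sure sink nodes are present too
    return graph

def downstream(graph, start):
    """Return all nodes reachable from start by following edge directions."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical pathway: metabolite -> enzyme -> metabolite -> enzyme
edges = [
    ("metabolite_M1", "enzyme_E1"),
    ("enzyme_E1", "metabolite_M2"),
    ("metabolite_M2", "enzyme_E2"),
]
pathway = build_directed_graph(edges)
affected = downstream(pathway, "metabolite_M1")
```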

Known Pathways

Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761

Joint modeling of metabolite and transcript data to identify active pathways

[Figure labels: metabolite, gene]

de novo Interactions

• Single data

INTEGRATION
• Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
• Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis

[Figure: nodes labeled gene, SNP, protein, metabolite, gene, methylation site; all linked to PHENOTYPE]
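The pair-wise correlation option (e.g., eQTL) can be sketched as a Pearson correlation between each SNP genotype, coded as minor-allele counts 0/1/2, and each gene's expression. This is a bare-bones illustration without covariates or multiple-testing correction; the SNP and gene names and all values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def pairwise_associations(genotypes, expression):
    """Correlate every SNP with every gene across the same samples."""
    return {
        (snp, gene): pearson(g, e)
        for snp, g in genotypes.items()
        for gene, e in expression.items()
    }

# Hypothetical genotypes and expression for six samples
genotypes = {"snp1": [0, 1, 2, 0, 1, 2], "snp2": [0, 0, 1, 1, 2, 2]}
expression = {
    "geneA": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
    "geneB": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
associations = pairwise_associations(genotypes, expression)
```

The number of SNP-gene pairs grows as the product of the two feature counts, which connects this slide back to the dimensionality challenge.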

de novo Interactions

Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.

Summary Methodology

1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
