SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and
Mathematical Challenges
Topic: Data Integration
Katerina Kechris, PhD
Associate Professor
Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Denver
Omics
• Large-scale analyses for studying a population of molecules or molecular mechanisms
• High-throughput data
• Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)
Omics
[Figure: omics layers; surviving labels: Epigenome, Phenome]
Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif
Large-scale Projects & Databases
NCI 60 Database
Integration of Omics Data
• Each type of data gives a different snapshot of the biological or disease system
• Why integrate data?
• Reduce false positives/negatives
• Identify interactions between different molecules
• Explore functional mechanisms
Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways
Challenge 1: When to integrate?
• Early
– Merging data to increase sample size
• Intermediate
– Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis
• Late
– Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
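As an illustration of the intermediate strategy, the sketch below converts two data sources measured on different scales into ranks so they share a common format. The function name `to_ranks` and the example values are hypothetical, not taken from the slides.

```python
def to_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# Hypothetical measurements for the same four genes on two platforms.
expression = [2.1, 8.4, 5.0, 5.0]   # continuous microarray intensities
read_counts = [15, 900, 120, 40]    # sequencing counts

common_format = {
    "expression": to_ranks(expression),
    "read_counts": to_ranks(read_counts),
}
print(common_format)
```

After ranking, both sources live on the same 1..n scale and can be compared or averaged gene by gene.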
Genomic Meta-analysis: Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.
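One classical way to combine p-values across studies is Fisher's method. The sketch below is a minimal illustration of that idea, assuming independent studies and p-values strictly greater than zero; it is not necessarily the approach shown on this slide, and the function name and example p-values are hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each > 0) with Fisher's method."""
    k = len(pvalues)
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # Under the null the statistic is chi-square with 2k degrees of freedom.
    # For even degrees of freedom the upper tail has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values for one gene in three transcriptomic studies.
stat, combined = fisher_combined_pvalue([0.01, 0.02, 0.03])
print(f"statistic={stat:.2f}, combined p={combined:.2e}")
```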
Assessing Genomic Overlap: Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
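The sketch below shows the general idea of a permutation-based overlap test at the gene-list level. It is a deliberately naive version that redraws one list at random from a gene universe; it ignores genomic structure and is not the strategy of the cited paper. All names and the example lists are hypothetical.

```python
import random


def overlap_permutation_pvalue(set_a, set_b, universe, n_perm=2000, seed=0):
    """Estimate how often a random list the size of set_a overlaps set_b
    at least as much as the observed list does."""
    rng = random.Random(seed)
    universe = sorted(universe)
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    hits = 0
    for _ in range(n_perm):
        shuffled = set(rng.sample(universe, len(set_a)))
        if len(shuffled & set_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical universe of 200 genes and two significant-gene lists.
universe = [f"gene_{i}" for i in range(200)]
list_a = universe[:20]
list_b = universe[10:30]
observed, p_value = overlap_permutation_pvalue(list_a, list_b, universe)
print(observed, p_value)
```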
Challenge 2: Dimensionality
• Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
• Dimension reduction
– Process each data type separately (filtering)
– Combine with model fitting
– Multivariate analysis
Sparse Multivariate Methods
• Variable Selection, Discriminant Analysis, Visualization
• Penalties (regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)
• Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
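To show how a penalty produces sparsity, the sketch below alternates soft-thresholding updates to find a sparse pair of weight vectors for a cross-product matrix between two data types. It is a simplified illustration, assuming fixed thresholds rather than the constraint search used in published sparse CCA methods; the function names, thresholds, and matrix are hypothetical.

```python
import math


def soft_threshold(vec, lam):
    """Shrink each entry toward zero by lam; small entries become exactly zero."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def sparse_rank1(cross, lam_u, lam_v, n_iter=50):
    """cross[i][j]: cross-product between feature i of data type 1
    and feature j of data type 2 (for example, entries of X^T Y)."""
    p, q = len(cross), len(cross[0])
    u = [0.0] * p
    v = normalize([1.0] * q)
    for _ in range(n_iter):
        u = normalize(soft_threshold(
            [sum(cross[i][j] * v[j] for j in range(q)) for i in range(p)], lam_u))
        v = normalize(soft_threshold(
            [sum(cross[i][j] * u[i] for i in range(p)) for j in range(q)], lam_v))
    return u, v


# Hypothetical cross-product matrix: features 0-1 of each type are strongly related.
cross = [
    [4.0, 3.5, 0.1],
    [3.8, 4.2, -0.2],
    [0.1, -0.1, 0.3],
    [-0.2, 0.2, 0.1],
]
u, v = sparse_rank1(cross, lam_u=0.5, lam_v=0.5)
print(u, v)
```

Entries that are exactly zero in `u` and `v` mark features dropped from the selected pair of components.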
Challenge 3: Genomic Resolution
• Base level (conservation, motif scores)
• Regular intervals (expression/binding from tiling arrays)
• Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)
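One simple way to reconcile resolutions is to summarize base-level scores over irregular gene intervals. The sketch below averages per-base scores within each interval; the function name, coordinates, and scores are hypothetical, and averaging is only one of several possible summaries.

```python
def mean_score_per_interval(base_scores, intervals):
    """base_scores: dict mapping position -> score.
    intervals: dict mapping name -> (start, end), both inclusive."""
    summary = {}
    for name, (start, end) in intervals.items():
        values = [base_scores[pos] for pos in range(start, end + 1)
                  if pos in base_scores]
        # None marks intervals with no base-level coverage.
        summary[name] = sum(values) / len(values) if values else None
    return summary


# Hypothetical base-level conservation scores and gene coordinates.
conservation = {100: 1.0, 101: 0.5, 102: 0.0, 200: 0.25, 201: 0.25}
genes = {"geneA": (100, 102), "geneB": (200, 201), "geneC": (300, 310)}
gene_level = mean_score_per_interval(conservation, genes)
print(gene_level)
```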
Challenge 4: Heterogeneity
• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
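The data-matching problem can be sketched as follows: several microarray probes may map to one gene, and some probes map to none. The example collapses probes to genes with a median and reports unmatched probes. The identifiers, values, and the choice of median are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Collapse many-to-one probe measurements to one value per gene."""
    grouped = defaultdict(list)
    unmatched = []
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmatched.append(probe)   # no identifier match in the annotation
        else:
            grouped[gene].append(value)
    collapsed = {gene: median(values) for gene, values in grouped.items()}
    return collapsed, unmatched


# Hypothetical probe intensities and annotation table.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 1.5, "p4": 9.9}
probe_to_gene = {"p1": "GENE1", "p2": "GENE1", "p3": "GENE2"}
collapsed, unmatched = collapse_probes_to_genes(probe_values, probe_to_gene)
print(collapsed, unmatched)
```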
Challenge 4: Heterogeneity
• Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
• Counts – expression data from sequencing
• 0-1 – conservation (UCSC), DNA methylation
• Binary/Categorical – thresholding (e.g., motif scores), genotype
Case Study: Development
Ci
• important for differentiation of appendages during development
• transcription factor – binds to DNA near target genes
http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver
Hierarchical Mixture Model
• Data
- Transcriptome: Ci pathway mutants (expr) – irregular interval
- Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
• Goal: Predict gene targets of Ci
• Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)
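To make the hidden-variable idea concrete, the sketch below computes the posterior probability that a gene is a target under a simple two-component mixture, assuming the three data types (expr, bind, cons) are conditionally independent Gaussians given target status. This is an illustrative simplification, not the model of Dvorkin et al.; the parameters are hypothetical and fixed by hand rather than estimated.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_target(obs, params, prior_target):
    """obs: dict data type -> observed value.
    params: dict data type -> ((mean, sd) if target, (mean, sd) if not)."""
    like_target = prior_target
    like_background = 1.0 - prior_target
    for name, value in obs.items():
        (m1, s1), (m0, s0) = params[name]
        like_target *= normal_pdf(value, m1, s1)
        like_background *= normal_pdf(value, m0, s0)
    return like_target / (like_target + like_background)


# Hypothetical component parameters for each data type.
params = {
    "expr": ((2.0, 1.0), (0.0, 1.0)),
    "bind": ((3.0, 1.0), (0.0, 1.0)),
    "cons": ((0.8, 0.2), (0.4, 0.2)),
}
likely = posterior_target({"expr": 2.2, "bind": 3.1, "cons": 0.85}, params, 0.05)
unlikely = posterior_target({"expr": 0.1, "bind": -0.2, "cons": 0.4}, params, 0.05)
print(likely, unlikely)
```

In a full analysis the component parameters and the prior would be estimated from the data, for example with an EM algorithm.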
Challenge 5: Interactions and Pathways
• Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite – protein interactions (directed graphs)
• De novo Pathways
– Discover novel interactions
Known Pathways
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
Joint modeling of metabolite and transcript data to identify active pathways
[Figure: pathway diagram; surviving node labels: metabolite, gene]
de novo Interactions
• Single data
• Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
• Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
[Figure: INTEGRATION diagram; surviving labels: gene, SNP, protein, metabolite, methylation site, PHENOTYPE]
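The pair-wise correlation approach can be sketched with a single SNP and a single gene: the Pearson correlation between genotype (coded 0/1/2) and expression is a basic eQTL-style association measure. The data below are hypothetical, and a real analysis would also test significance and adjust for covariates and multiple testing.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical genotypes (minor-allele counts) and expression for six samples.
genotype = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
r = pearson(genotype, expression)
print(f"r = {r:.3f}")
```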
de novo Interactions
Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.
Summary Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis