Date post: | 13-Jul-2015 |
Category: |
Education |
Upload: | stephen-turner |
Pathway Analysis: Adding Functional Context to High-Throughput Results
Stephen D. Turner, Ph.D. Bioinformatics Core Director [email protected]
bioinformatics.virginia.edu
Outline
• Bioinformatics & the Bioinformatics Core
• Service Highlight: Pathway Analysis
• IPA demo
December 20, 2012 bioinformatics.virginia.edu
Bioinformatics Origins
• Rooted in sequence analysis
• Driven by the need to:
  – Collect
  – Annotate
  – Analyze
What is bioinformatics?
(Diagram modified from @drewconway)
What is bioinformatics?
“There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”
M. Dayhoff, February 27, 1967
UVA Bioinformatics Core
• A centralized resource for providing expert and timely bioinformatics consulting and data analysis.
• Main goals: help you publish and get funding.
  – 1. Service
  – 2. Training
Sample prep
Sequencing
Raw data Differential expression Gene identification Novel Genes Discoveries …etc.
This is the “stuff” we do in the bioinformatics core!
Find out what this “stuff” is at bioinformatics.virginia.edu
Services
• Gene expression: Microarray Analysis
• Gene expression: RNA-seq Analysis
• Pathway analysis
• DNA Variation (GWAS, NGS)
• DNA Binding / ChIP-Seq
• DNA Methylation
• Grant / Manuscript support
• Custom development
Services
Gene expression: Microarray Analysis
• Accession and analysis of publicly available data (e.g. GEO, ArrayExpress).
• Preprocessing: background subtraction, summarization, and quantile normalization using the RMA (Robust Multichip Average) expression measure described in Irizarry et al. Biostatistics 4:249-264.
• Quality assessment:
  – Visualization of signal intensity distributions of each array using boxplots and density plots.
  – MA plots to visualize the log intensity ratio (M) against the average intensity (A).
  – Principal components analysis to visualize the overall data (dis)similarity between arrays.
• Analysis:
  – Estimation of fold changes and standard errors using a linear model.
  – Empirical Bayes smoothing of standard errors.
  – Lists of top differentially expressed genes, fold changes, statistical significance, multiple testing correction.
• Visualization:
  – Heatmaps and dendrograms.
  – Volcano plots to visualize statistical significance by fold change.
• Biological context:
  – Pathway/Functional Analysis.
Services
Gene expression: RNA-seq
• Pre-alignment quality assessment:
  – Per-base sequence quality
  – Per-base sequence content
  – Per-base GC content
  – Search for overrepresented sequences (adapters, primers, etc.)
• Alignment to a reference genome:
  – Homo sapiens
  – Mus musculus
  – Rattus norvegicus
  – Bos taurus
  – Canis familiaris
  – Gallus gallus
  – Drosophila melanogaster
  – Arabidopsis thaliana
  – Caenorhabditis elegans
  – Saccharomyces cerevisiae
• Post-alignment quality assessment:
  – Flagging duplicate reads
  – Estimation of library complexity
  – Insert size distribution (for paired-end sequencing)
  – Analysis of coverage over transcript position
• Transcript assembly
• Differential expression testing
  – Isoforms
  – Genes
  – Primary transcripts
  – Coding sequence
• Differential splicing analysis
• Differential coding output
• Differential promoter use
• Visualization: assistance with visualization using IGV.
Services
DNA Variation: Genotyping
• Study design & power calculations for SNP genotype-phenotype association studies
• Data management and quality control
• PCA for population stratification control
• Imputation to a reference population (e.g. HapMap, 1000 Genomes)
• Analysis, interpretation, visualization
• Manuscript preparation
• Grant support (compliance with NIH data sharing policies, methodology for data management, design, analysis, and interpretation)
• Acquisition of publicly available data (dbGaP)
DNA Variation: Next-Gen Sequencing
• Alignment to a reference genome
• Calibration of quality scores and duplicate read removal
• Variant calling
• Variant annotation
• SNP effect prediction
• De novo assembly
• Any of the applicable analysis, interpretation, and visualization services described above for genotyping data.
Service Highlight: “Pathway Analysis”
• You’ve done your microarray/RNA-Seq experiment
  – You have a list of genes
  – Want to put these into functional context
  – What biological processes are perturbed?
  – What pathways are being dysregulated?
  – Data reduction: hundreds or thousands of genes can be reduced to 10s of pathways
  – Identifying active pathways = more explanatory power
• “Pathway analysis” encompasses many, many techniques.
  1. 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
  2. 2nd Generation: Functional Class Scoring (e.g. GSEA)
  3. 3rd Generation (in development): Pathway Topology (e.g. SPIA)
• bit.ly/pathway-analysis
Over-representation analysis (ORA)
• Many variations on the same theme: statistically evaluate the fraction of genes in a particular pathway that show changes in expression.
• Algorithm:
  1. Create input list (e.g. “significant at p<0.05”)
  2. For each gene set:
     a. Count the number of input genes in the set
     b. Count the number of “background” genes in the set (e.g. all genes on the platform).
  3. Test each pathway for over-representation of input genes
• Gene Set: typically a gene ontology (GO) term.
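The slide does not name the test used in step 3. A minimal sketch, assuming an upper-tail hypergeometric test (one common choice for ORA); the function name and all counts are hypothetical:

```python
from math import comb

def ora_p_value(n_background, n_in_set, n_input, n_input_in_set):
    """Upper-tail hypergeometric p-value: P(X >= n_input_in_set).

    n_background   -- all genes on the platform
    n_in_set       -- background genes annotated to the gene set
    n_input        -- genes on the input (significant) list
    n_input_in_set -- input genes annotated to the gene set
    """
    upper = min(n_in_set, n_input)
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_input - i)
        for i in range(n_input_in_set, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1,000 genes on the platform, 50 in the gene set,
# 100 input genes, 15 of which fall in the set (5 expected by chance).
p = ora_p_value(1000, 50, 100, 15)
print(f"p = {p:.3g}")
```

In practice the same test is run once per gene set, so the resulting p-values still need a multiple testing correction.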
Gene Ontology
• Ontology = formal representation of a knowledge domain.
• Gene ontology = cell biology.
• GO represented by directed acyclic graph (DAG).
  – Terms are nodes, relationships are edges.
  – Parent terms are more general than their child terms.
  – Unlike a simple tree, terms can have multiple parents.
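To make the DAG structure concrete, here is a small sketch that walks from a term up to all of its more general ancestors. The child-to-parents map is a simplified, hypothetical fragment for illustration, not a copy of the real Gene Ontology:

```python
# Hypothetical child -> parents map (illustrative only, not real GO data).
PARENTS = {
    "cellular process": [],
    "metabolic process": [],
    "DNA metabolic process": ["metabolic process"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "DNA recombination": ["DNA metabolic process", "cellular process"],
    "V(D)J recombination": ["DNA recombination"],
}

def ancestors(term, parents=PARENTS):
    """Return every more general term reachable by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("V(D)J recombination")))
```

A gene annotated to a child term is implicitly annotated to every ancestor as well, so gene sets built from GO terms overlap heavily.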
Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15. doi:10.1038/nrg2363
GO ORA: Example
• Algorithm:
  1. Create input list (e.g. “significant at p<0.05”)
  2. For each gene set:
     a. Count the number of input genes in the set
     b. Count the number of “background” genes in the set (e.g. all genes on the platform).
  3. Test each pathway for over-representation of input genes
• Ex: GO “Purine Ribonucleotide Biosynthetic Process”
  – 1% of input (significant) genes are annotated with this term.
  – 1% of genes on the chip are annotated with this term.
  – Not significantly over-represented.
• Ex: GO “V(D)J Recombination”
  – 20% of input (significant) genes are annotated with this term.
  – 1% of genes on the chip are annotated with this term.
  – Highly significantly over-represented!
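As a worked sketch of the two examples, assume an upper-tail hypergeometric test and totals that are not on the slide: N = 10,000 genes on the chip and n = 100 input genes, so a term covering 1% of the chip has K = 100 annotated genes. With k input genes annotated to the term:

```latex
P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}},
\qquad
E[X] = \frac{nK}{N} = \frac{100 \cdot 100}{10000} = 1
```

Under these assumed totals, the purine term has k = 1, which equals the expected count, and P(X ≥ 1) is about 0.6. The V(D)J term has k = 20, twenty times the expected count, and the tail probability is vanishingly small.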
GO ORA: Example
GO ORA: Limitations
• Some categories are so general they’re meaningless (e.g. “cellular process”).
• ORA uses genes above a cutoff and discards everything else.
• ORA only uses the number of genes, and ignores their measured changes.
• Two assumptions violated:
  – Genes are independent (NOT! Coexpression, interaction, etc.).
  – Pathways are independent (violated by definition in a DAG).
Functional Class Scoring
• Theory: while large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes can also have significant effects.
• General Algorithm:
  1. Compute gene-level statistic (e.g. fold change, Student’s t).
  2. Aggregate gene-level statistics for all genes in pathway into single pathway-level statistic.
  3. Assess significance with permutation.
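The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a specific published method: the pathway-level statistic is taken to be the mean absolute gene-level statistic, significance comes from permuting gene labels, and every gene name and value is hypothetical.

```python
import random
from statistics import mean

def pathway_score(gene_stats, members):
    """Step 2: aggregate gene-level statistics into one pathway-level statistic."""
    return mean(abs(gene_stats[g]) for g in members)

def permutation_p_value(gene_stats, pathway, n_perm=1000, seed=0):
    """Step 3: compare the observed score with scores of random gene sets."""
    rng = random.Random(seed)
    genes = list(gene_stats)
    members = [g for g in genes if g in pathway]
    observed = pathway_score(gene_stats, members)
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(genes, len(members))  # random set of the same size
        if pathway_score(gene_stats, sampled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Step 1 stand-in: hypothetical gene-level t statistics for 200 genes.
gene_stats = {f"gene{i}": ((i % 21) - 10) / 10 for i in range(200)}
pathway = {f"gene{i}" for i in range(5)}
for g in pathway:
    gene_stats[g] = 1.5  # modest but coordinated shift in the pathway genes

observed, p = permutation_p_value(gene_stats, pathway)
print(observed, p)
```

Permuting gene labels keeps the sketch short. Permuting phenotype labels and recomputing the gene-level statistics preserves gene-gene correlation, but it requires the full expression matrix.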
Gene Set Enrichment Analysis
1. Calculate an Enrichment Score
   a) Rank genes by their expression difference
   b) For each gene set*:
      i. Compute a cumulative sum over the ranked genes
         1. Increase the sum when a gene is in the set, decrease it otherwise
         2. The magnitude of the increment depends on the gene-phenotype correlation
      ii. Record the maximum deviation from zero as the Enrichment Score (ES)
2. Assess significance
   a) Permute phenotype (or gene labels) 1000 times
   b) Compute the ES for each permutation (empirical null).
   c) Compare the ES for the actual data to the distribution of ES values from the permuted data.
   d) Normalize the ES to account for gene set size, giving the normalized ES (NES)
   e) Control multiple testing by calculating an FDR for each NES
• * Gene sets: come from MSigDB
  – http://www.broadinstitute.org/gsea/msigdb/index.jsp
  – MSigDB is a collection of annotated gene sets for use with GSEA software.
  – Positional, curated, computationally predicted, GO.
  – Curated: KEGG, Reactome, STKE, etc.
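Step 1 above can be sketched as a running sum over the ranked list. This is a simplified illustration, not the GSEA software itself: the `weight` parameter, the gene names, and the scores are assumptions, and the permutation and normalization of step 2 are left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Signed maximum deviation from zero of a weighted running sum.

    ranked_genes -- genes ordered by expression difference
    scores       -- gene -> ranking statistic (e.g. gene-phenotype correlation)
    gene_set     -- genes in the set; must contain some, but not all, ranked genes
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(
        abs(scores[g]) ** weight for g, hit in zip(ranked_genes, in_set) if hit
    )
    running, best = 0.0, 0.0
    for gene, hit in zip(ranked_genes, in_set):
        if hit:
            # The step up is proportional to the gene's ranking statistic.
            running += abs(scores[gene]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranking statistics for ten genes.
scores = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1,
          "f": -1, "g": -2, "h": -3, "i": -4, "j": -5}
ranked = sorted(scores, key=scores.get, reverse=True)
es = enrichment_score(ranked, scores, {"a", "b", "d"})
print(round(es, 3))
```

A set whose genes cluster at the top of the ranking gives a positive ES, and a set clustered at the bottom gives a negative one. Setting `weight=0` makes every hit count equally.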
GSEA: Example
FCS/GSEA: Limitations
• Violates the same assumptions as GO ORA:
  – Genes are independent
  – Pathways are independent
• Considers only the number of genes and the magnitude of their changes, and ignores other information in databases:
  – Directionality of the interaction
  – Nature of the interaction (activation, inhibition, etc.).
  – Where the interaction occurs (nucleus, cytoplasm, etc.).
Pathway Topology: SPIA
• Utilizes directionality, function, and topology.
• Computes two orthogonal p-values:
  – pNDE: Number of Differentially Expressed genes (e.g. like ORA).
  – pPERT: degree of perturbation
• pG is overall p-value (pNDE and pPERT combined)
• pGFDR is overall FDR-corrected p-value
Pathway Topology: SPIA
• TCR Signaling Pathway Results
  – pNDE: 6.5e-9
  – pPERT: 0.29
  – pGFDR: 1.2e-6
  – Conclusion: many differentially expressed genes, but pathway may not be badly perturbed.
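The slides do not show how pNDE and pPERT are combined. A minimal sketch, assuming the combination given in the SPIA publication, pG = c - c·ln(c) with c = pNDE × pPERT, applied to the TCR signaling values above (the function name is hypothetical):

```python
from math import log

def combine_spia(p_nde, p_pert):
    """Combine two independent p-values through their product c.

    Assumes pG = c - c * ln(c); both inputs must be greater than zero.
    """
    c = p_nde * p_pert
    return c - c * log(c)

p_g = combine_spia(6.5e-9, 0.29)
print(f"pG = {p_g:.2e}")
```

The pGFDR on the slide cannot be reproduced from this one pathway, because the FDR correction depends on the pG values of every pathway tested.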
Pathway Topology / SPIA: Limitations
• With SPIA, still need an arbitrary “cutoff”, e.g. top 500, or p<0.05, etc.
• True topology is dependent on type of cell due to cell-specific gene expression profiles.
• Tissue-specific topology is rarely available and fragmented in databases, even if it’s fully understood.
• Other general limitations of pathway analysis (see below)
Pathway Analysis: General Limitations
• Low resolution knowledge bases
  – E.g. RNA-seq studies have found >90% of transcriptome is alternatively spliced.
  – Different transcripts can have different or opposing functions.
• Incomplete/inaccurate annotations.
  – Oct 2007: 95% GO annotations inferred electronically (i.e. not manually curated).
• Missing condition- and cell-specific information.
• Methodological challenge: lack of benchmarks.
Pathway Analysis: Conclusions
Pathway analysis gives you more biological insight than staring at lists of genes.
Pathway analysis is complex, and has many limitations.
Pathway analysis is still more an exploratory procedure than a pure statistical endpoint.
The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.
IPA Demo
• Background: Microarray data from childhood exacerbated asthma compared to the normal state.
• Questions: Do the data support involvement of immune/inflammatory responses and viral infection in the acute asthma attack?
• Tasks:
  – View Canonical pathways that contain significant numbers of genes from this dataset.
  – Overlay a Function/Disease state that shows how key signaling pathways for fighting off respiratory infections overlap with asthmatic inflammation.
  – Overlay Biomarkers that identify genes in the infection signaling pathway that are also used for diagnosis and efficacy indicators for asthma treatments.
  – Search the Ingenuity Knowledge Base for literature references that support your findings.
  – Investigate a “weird” finding…
Thank you
Web: bioinformatics.virginia.edu
E-mail: [email protected]
Blog: www.GettingGeneticsDone.com
Twitter: twitter.com/genetics_blog