Post on 10-May-2015
Automation of Biological Data Analysis and Report Generation
Dmitry Grapov, PhD
Bots write the darndest things
http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK
• fill in the template (easy)
• human-guided automation (e.g. Metaboanalyst, intermediate)
• intelligent/reactive writing (e.g. ~AI, advanced)
http://narrativescience.com/
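The "fill in the template" level above can be sketched in a few lines. This is a minimal illustration, not the method used by any of the tools named on the slide; the sentence, the placeholder names, and the numbers are all invented for the example.

```python
from string import Template

# Hypothetical report sentence; the placeholder names are illustrative only.
REPORT_TEMPLATE = Template(
    "Of $n_tested metabolites tested, $n_significant differed between "
    "$group_a and $group_b (FDR-adjusted p < $alpha)."
)


def fill_report(results: dict) -> str:
    """Fill the template; raises KeyError if a placeholder has no value."""
    return REPORT_TEMPLATE.substitute(results)


# Made-up values, used only to show the mechanics.
sentence = fill_report({
    "n_tested": 150,
    "n_significant": 12,
    "group_a": "control",
    "group_b": "treated",
    "alpha": 0.05,
})
print(sentence)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing value stops the report instead of silently leaving a placeholder in the text.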
Humans + Bots
Interaction:
• Bots and humans combine in guided analyses
• Humans: make choices (based on bot guides)
• Bots: automate!
Facilitate:
• workflow logging and template creation
• reproducible results
Bot: Initial data and meta data parsing and quality validation
(need: template input)
Human: data cleaning and experimental design identification
(use: multiple choice, dynamic GUI)
Bot: instantiation of complex workflows
Human: overview of bot assumptions and results
Bot: Numerical and text output generation
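The first bot step above (parsing and quality validation of data and meta data) might look like the sketch below. The input layout (dictionaries keyed by sample ID, with `None` marking a missing value) is an assumption made for the example, not the BinBase template format.

```python
def validate_inputs(data: dict, meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means the input passed.

    data: sample ID -> list of measurements (None marks a missing value)
    meta: sample ID -> dict of experimental factors
    """
    problems = []
    # Every sample needs both measurements and meta data.
    for sample in sorted(set(data) - set(meta)):
        problems.append(f"sample {sample} has data but no meta data")
    for sample in sorted(set(meta) - set(data)):
        problems.append(f"sample {sample} has meta data but no data")
    # All samples should report the same number of variables.
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        problems.append("samples differ in number of variables")
    # Report missing values so the human can decide how to handle them.
    for sample in sorted(data):
        n_missing = sum(v is None for v in data[sample])
        if n_missing:
            problems.append(f"sample {sample} has {n_missing} missing value(s)")
    return problems
```

Returning a list of messages instead of raising on the first problem fits the human-in-the-loop idea: the bot reports everything it found and the human chooses what to clean.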
Humans + Bots write darnder things?
Choose Your Own Life Adventure!
https://github.com/dgrapov/AdventureR
Data Analysis Tasks
Visualization (how does it look?)
• histograms, density plots, box plots, line plots, scatter plots, networks, etc.
Statistical Analysis (what is statistically significant?)
• summary tables, ANOVA, FDR adjustment, power analysis, etc.
Exploration (what are the major patterns/trends?)
• clustering, PCA, ICA, etc.
Predictive Modeling (what explains my hypothesis?)
• mixed effects, partial least squares (O-/PLS/-DA), etc.
Network Analysis and Mapping (how are things related?)
• Functional analysis: pathway enrichment or overrepresentation
• Networks: biochemical, structural, mass spectral and empirical networks
• Mapping: projection of analysis results onto network
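As one concrete instance of the statistical tasks listed above, FDR adjustment is commonly done with the Benjamini-Hochberg procedure. The slides do not say which FDR method is used, so treat the choice of Benjamini-Hochberg as an assumption; the sketch below uses only the standard library.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A variable would then be reported as significant when its adjusted p-value falls below the chosen FDR level.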
WCMC Data Analysis Reports ™
• Statistical analysis
• Clustering
• PCA
• O-PLS-DA
• Biochemical enrichment
• Network mapping
Input template: BinBase
• inference of experimental goals from sample meta data
• mapping variables to external databases
Tasks:
Report:
Tools:
Automation Challenges
Data cleaning and quality validation
• use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.
Identification of experimental goals
• use: meta data, identify: main and accessory effects; choose: statistics, multivariate tests and visualizations
Integration of multiple tasks to evolve robust analyses
• tasks: statistics, multivariate, functional, networks, database mapping, etc.
Data analysis report generation
• use: R, LaTeX, markdown
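The use of quality control samples to assess precision, mentioned under data cleaning above, can be illustrated with the percent relative standard deviation (%RSD) of each variable across replicate QC injections. The table layout and the 30% cutoff are assumptions for this sketch, not values taken from the slides.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Relative standard deviation (%) of replicate QC measurements."""
    return 100.0 * stdev(values) / mean(values)


def flag_imprecise(qc_table, max_rsd=30.0):
    """Return the variables whose QC %RSD exceeds max_rsd (an assumed cutoff).

    qc_table: variable name -> list of measurements from QC samples
    """
    return sorted(
        name for name, values in qc_table.items()
        if percent_rsd(values) > max_rsd
    )
```

Flagged variables are candidates for removal or closer inspection by the human; the bot only surfaces them.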
Challenges to automated metabolite ID mapping
Stereochemistry?
Search: catechin
Best Match: Catechin
Biologically relevant:
D-catechin
Synonyms?
Search: UDP GlcNAc
FAIL: UDP GlcNAc
PASS: UDP-GlcNAc
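The synonym failure above comes down to punctuation, so one cheap mitigation is to normalize names before comparing them. The sketch below is an assumption about how that could be done, not how any named database matches queries, and it does nothing for the stereochemistry problem.

```python
import re


def normalize_name(name: str) -> str:
    """Lower-case a name and drop whitespace, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def lookup(query: str, database_names):
    """Return the database names that equal the query after normalization."""
    key = normalize_name(query)
    return [n for n in database_names if normalize_name(n) == key]
```

With this, `lookup("UDP GlcNAc", ["UDP-GlcNAc", "UDP-glucose"])` returns the hyphenated entry. Dropping hyphens can also merge names that are genuinely different, so the result is best treated as a candidate list for a human or a later filter to confirm.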
Strategies for automated metabolite ID mapping (from synonym)
#1: CTS+
• Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)
• Use KEGGREST and PUG to filter and choose most appropriate IDs
• Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)
#2: Web query
• Use KEGGREST + PubChem PUG to translate synonyms to IDs
• For KEGG ID: synonym → SID → KID
#3: Curated DB
• Generate a curated DB for KEGG and CID translations
• Include InChI Keys
• Map to other DBs
• Allow fuzzy matching on synonyms
• e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
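Fuzzy matching on synonyms, as suggested above, needs a string distance. The sketch below implements the restricted form of the Damerau–Levenshtein distance (often called optimal string alignment), which counts insertions, deletions, substitutions and adjacent transpositions but never edits the same substring twice. The `best_match` helper and its `max_distance` cutoff are hypothetical, added only to show how the distance could be used.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ca" <-> "ac".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def best_match(query, candidates, max_distance=2):
    """Return the closest candidate (case-insensitive) or None if none is close."""
    scored = sorted((osa_distance(query.lower(), c.lower()), c) for c in candidates)
    if scored and scored[0][0] <= max_distance:
        return scored[0][1]
    return None
```

A small cutoff keeps typos such as "catechn" matched to "Catechin" while refusing unrelated names; the right value depends on name length and would need tuning on real synonym lists.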
Interactive Analysis and Report Generation
knitr (http://yihui.name/knitr/)
Analysis Report Generation
• Analysis on rails or open sandbox
• Humans facilitate robust results generation + Bots ensure reproduction
• Generation of Methods and Results should be automatable
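If every analysis step is logged, a Methods section can be produced directly from that log. The sketch below assumes a simple log format (a list of dictionaries with a `task` and optional `params`); both the format and the function name are hypothetical, and the talk itself points to knitr in R for report generation.

```python
def methods_markdown(workflow_log):
    """Render a logged workflow as a markdown Methods section."""
    lines = ["## Methods", ""]
    for number, step in enumerate(workflow_log, start=1):
        # Parameters are optional; list them in the order they were logged.
        params = ", ".join(f"{k} = {v}" for k, v in step.get("params", {}).items())
        detail = f" ({params})" if params else ""
        lines.append(f"{number}. {step['task']}{detail}")
    return "\n".join(lines)
```

Because the text is derived from the same log that drove the analysis, the written methods cannot drift from what was actually run, which is the reproducibility point made above.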
Devium 2.0
Human-guided automated data analysis and report generator
Human-guided automation could help ensure robust results by leaving to humans the choices that are otherwise difficult to automate.
https://github.com/dgrapov/DeviumWeb
MetaMapR
Linking data analysis and biology
https://github.com/dgrapov/MetaMapR
Integration of complex workflows is key to automation.
Future Goals
+ Workflows for complex experiments (e.g. time-course)
+ Biochemical functional analysis (pathway enrichment)
+ GUI for report generation (Devium 2.0)
+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)
+ Scientific literature mining (RapportR)
+ Interactive plots and networks (JavaScript)
dgrapov@ucdavis.edu metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154