
Automated Hypothesis Testing with Large Scale Scientific Workflows

Date post: 22-Jan-2018
Upload: dgarijo
Transcript
Page 1: Automated Hypothesis Testing with Large Scale Scientific Workflows

Automated Hypothesis Testing with Large Scale Scientific Workflows

Yolanda Gil, Daniel Garijo, Rajiv Mayani, Varun Ratnakar

Information Sciences Institute & Department of Computer Science

University of Southern California
http://www.isi.edu

Parag Mallick, Ravali Adusumilli, Hunter Boyce

Stanford School of Medicine, Canary Center for Early Cancer Detection

Stanford University
http://mallicklab.stanford.edu

http://www.disk-project.org

Page 2: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 3: Automated Hypothesis Testing with Large Scale Scientific Workflows

Scientific Data Analysis Today: Inefficient, Incomplete, Irreproducible

๏ Data analysis is time consuming

๏ Not systematic

๏ Not updated when new data/methods become available

๏ Hard/impractical to reproduce prior work

๏ Overall process is manually done: inefficient and error-prone

๏ Analytic knowledge is compartmentalised

[Diagram: New hypothesis → Formulate line of inquiry (data + method) → Retrieve data → Run workflows (methods) → Meta-analysis of results]

Page 4: Automated Hypothesis Testing with Large Scale Scientific Workflows

Our Focus: Cancer Multi-Omics

๏ Data Availability and Complexity:

• The multi-omic domain is filled with multiple levels of heterogeneous data that is regularly expanding in volume and complexity through projects like The Cancer Genome Atlas (TCGA) and the associated Clinical Proteomic Tumor Analysis Consortium (CPTAC)

Page 5: Automated Hypothesis Testing with Large Scale Scientific Workflows

Our Focus: Cancer Multi-Omics

๏ Analytic Complexity:

• Multi-omic analysis requires the use of dozens of interconnected tools, each of which may require substantial domain knowledge.

[Figure: example tools: MAQ, BWA, BWA-SW (SE only), PERM, SOAPv2, MOSAIK, NOVOALIGN, SAMTOOLS, PICARD, GATK, IGVtools]

Domain Knowledge is isolated

Page 6: Automated Hypothesis Testing with Large Scale Scientific Workflows

Our Focus: Cancer Multi-Omics

๏ Multiple types and complexities of hypotheses:

• Hypotheses span the range from single-gene/single-dataset to multi-gene/multi-ome/multi-dataset

• Is this protein found in this sample?

• Is this gene found in this sample?

• Is this protein associated with a certain cancer?

• Which proteins are associated with a certain cancer?

• ...

Page 7: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Our Approach & Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 8: Automated Hypothesis Testing with Large Scale Scientific Workflows

Our Approach: Hypotheses-Driven Discovery

๏ Represent scientist hypotheses

๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued by data analysis workflows

๏ Design a meta-analysis that examines the results of lines of inquiry and either validates or revises the original hypotheses

๏ Develop an intelligent agent that can report and explain new findings to the scientist

Hypothesis

Lines of Inquiry: specify relevant analytic methods (workflows), type of data needed, and how to combine results

Query to retrieve Data

Data Analysis Workflows

Workflow Bindings

Meta-Workflows

Confidence Estimation

Benchmarking

Revised hypothesis & interesting findings

Page 9: Automated Hypothesis Testing with Large Scale Scientific Workflows

Representing Hypotheses


Page 10: Automated Hypothesis Testing with Large Scale Scientific Workflows

Requirements from Omics

๏ Graph-based hypothesis representation

• Entities are nodes

• Relationships are links

๏ Annotations on graphs

• Represent qualifications of hypotheses: confidence and evidence

๏ Representing hypothesis evolution

• Graph versioning

Graph representation in RDF

๏ Standard semantic web language

๏ Scalable reasoners available

๏ Qualifications and provenance through triple reification

๏ Versioning through multiple named graphs

Representing Hypotheses
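
The requirements on this slide can be illustrated without an RDF library. The following is a minimal sketch, assuming plain Python tuples stand in for RDF triples, a dictionary of named graphs stands in for versioning, and a separate annotation record plays the role of a reified statement. The helper names (`reify`, `annotate`, `revise`) and the `hyp:hasConfidenceValue` and `hyp:revisionOf` spellings are illustrative assumptions, not DISK's actual vocabulary or API.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Each named graph holds one version of a hypothesis (versioning via named graphs).
graphs: Dict[str, Set[Triple]] = {
    "Hy1": {("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22")},
}

# Reification: a statement identifier points at a triple so it can be annotated.
statements: Dict[str, Triple] = {}
annotations: Dict[str, Dict[str, object]] = {}


def reify(stmt_id: str, triple: Triple) -> None:
    """Give a triple its own identifier so qualifications can attach to it."""
    statements[stmt_id] = triple
    annotations[stmt_id] = {}


def annotate(stmt_id: str, key: str, value: object) -> None:
    """Attach a qualification (confidence, evidence, ...) to a reified statement."""
    annotations[stmt_id][key] = value


def revise(old_graph: str, new_graph: str) -> None:
    """Copy a graph into a new version and record the link between the two."""
    graphs[new_graph] = set(graphs[old_graph])
    graphs[new_graph].add((new_graph, "hyp:revisionOf", old_graph))


revise("Hy1", "Hy1'")
reify("Hy1'-S1", ("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22"))
annotate("Hy1'-S1", "hyp:hasConfidenceValue", 0.0)
```

Keeping each version in its own graph leaves the original hypothesis untouched, so earlier versions remain available for comparison.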

Page 11: Automated Hypothesis Testing with Large Scale Scientific Workflows

Representing Hypotheses

Biology ontology

Hypothesis ontology

User data definitions

Graph Hy1: bio:PRKCDBP hyp:expressedIn user:TCGA-AA-3561-01A-22

Graph Hy2: bio:PRKCDBP hyp:associatedWith bio:ColonCancer

Page 12: Automated Hypothesis Testing with Large Scale Scientific Workflows

Lifecycle of a hypothesis


Page 13: Automated Hypothesis Testing with Large Scale Scientific Workflows

1. Initial Hypothesis, Data & Workflows

Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData, experimentedOn, collectedFrom

Page 14: Automated Hypothesis Testing with Large Scale Scientific Workflows

2. Running workflows on Data

Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData, experimentedOn, collectedFrom

Workflow Execution W1 (links: hasWorkflowTemplate, used)

Page 15: Automated Hypothesis Testing with Large Scale Scientific Workflows

3. Meta reasoning about workflow results

Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData, experimentedOn, collectedFrom

Workflow Execution W1 (links: hasWorkflowTemplate, used)

Meta-Workflow Execution MW1 (links: used, produced)

Revised Hypothesis Statement Hy1' (revisionOf Hy1): PRKCDBP expressedIn TCGA-AA-3561-01A-22

Qualifications of Hy1': Statement Hy1'-S1 hasConfidenceValue 0

Provenance of Hy1': Statement Hy1'-S1 hasProvenance (links: produced, used)

Page 16: Automated Hypothesis Testing with Large Scale Scientific Workflows

4. New Data becomes available

Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• XX_3561_DD.zip (RNASeqData)

• AA_3561_EX1 (Experiment)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData (one per experiment), experimentedOn (one per experiment), collectedFrom

Page 17: Automated Hypothesis Testing with Large Scale Scientific Workflows

5. New Multi-Workflows are also run

Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• XX_3561_DD.zip (RNASeqData)

• AA_3561_EX1 (Experiment)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData (one per experiment), experimentedOn (one per experiment), collectedFrom

Workflow Execution W1 (link: used)

Workflow Execution W2 (link: used)

Page 18: Automated Hypothesis Testing with Large Scale Scientific Workflows

6. Hypothesis Revision

Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22

Workflows Available: Proteomics, Proteogenomics

Data Available:

• XX_3561Proteome_VU.zip (MassSpecData)

• XX_3561_DD.zip (RNASeqData)

• AA_3561_EX1 (Experiment)

• AA_3561_EX2 (Experiment)

• TCGA-AA-3561-01A-22 (Sample)

• TCGA-AA-3561 (Patient)

Links shown: producedData (one per experiment), experimentedOn (one per experiment), collectedFrom

Workflow Execution W1 (link: used)

Workflow Execution W2 (link: used)

Meta-Workflow Execution MW2 (links: used, used, produced)

Revised Hypothesis Statement Ha1' (revisionOf Ha1): PRKCDBPMutated expressedIn TCGA-AA-3561-01A-22

Qualifications of Ha1': Statement Ha1'-S1 hasConfidenceValue 0.98

Provenance of Ha1': Statement Ha1'-S1 hasProvenance (links: produced, used)
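
The provenance links in this lifecycle (`used`, `produced`) make it possible to trace a revised statement back to the data behind it. The following is a minimal sketch under stated assumptions: the slides do not show which execution used which file, so the pairing below (W1 on mass spec data only, W2 on both files) is a guess, and the intermediate names `W1-result` and `W2-result` are hypothetical.

```python
from typing import Dict, List, Set

# Each activity lists what it used and what it produced.
# Which execution used which file is an assumption made for this sketch.
used: Dict[str, List[str]] = {
    "W1": ["XX_3561Proteome_VU.zip"],
    "W2": ["XX_3561Proteome_VU.zip", "XX_3561_DD.zip"],
    "MW2": ["W1-result", "W2-result"],  # hypothetical intermediate results
}
produced: Dict[str, List[str]] = {
    "W1": ["W1-result"],
    "W2": ["W2-result"],
    "MW2": ["Ha1'-S1"],
}


def sources_of(entity: str) -> Set[str]:
    """Walk produced/used links backwards to the original data files."""
    producers = [a for a, outs in produced.items() if entity in outs]
    if not producers:
        return {entity}  # nothing produced it, so treat it as raw input data
    found: Set[str] = set()
    for activity in producers:
        for inp in used.get(activity, []):
            found |= sources_of(inp)
    return found


evidence = sources_of("Ha1'-S1")
```

With these assumed links, `evidence` contains both the mass spec and the RNASeq file, which is the kind of answer a provenance record is meant to support.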

Page 19: Automated Hypothesis Testing with Large Scale Scientific Workflows

Representing Lines of Inquiry & Data analysis workflows


Page 20: Automated Hypothesis Testing with Large Scale Scientific Workflows

Lines of Inquiry

๏ Capture how to set up potential analyses that can be pursued to test a certain type of hypothesis

Hypothesis Pattern: bio:Protein ?p hyp:expressedIn bio:Sample ?s

Data Query Pattern: DataFile ?d, Experiment ?e, Sample ?s, Patient ?p, linked by producedData, experimentedOn, collectedFrom

Data Analytic Workflows: Proteomics, Proteogenomics

Meta-workflows: Comparison, Confidence estimation, Benchmarking
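
How a hypothesis pattern and a data query pattern work together can be sketched with a naive triple matcher. This is an illustration only, not DISK's query mechanism: the edge directions in `facts` are assumed, the variables are renamed to `?protein` and `?patient` to avoid the two uses of `?p` on the slide, and `unify` and `query` are hypothetical helpers.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]


def unify(pattern: Triple, fact: Triple, env: Bindings) -> Optional[Bindings]:
    """Match one pattern against one fact, extending env; None on mismatch."""
    out = dict(env)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # Bind the variable, or check it against its existing binding.
            if out.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return out


def query(patterns: List[Triple], facts: List[Triple], env: Bindings) -> Iterator[Bindings]:
    """Yield every binding that satisfies all patterns (naive backtracking)."""
    if not patterns:
        yield env
        return
    for fact in facts:
        extended = unify(patterns[0], fact, env)
        if extended is not None:
            yield from query(patterns[1:], facts, extended)


# Assumed edge directions: experiment -> data file, experiment -> sample, sample -> patient.
facts: List[Triple] = [
    ("AA_3561_EX2", "producedData", "XX_3561Proteome_VU.zip"),
    ("AA_3561_EX2", "experimentedOn", "TCGA-AA-3561-01A-22"),
    ("TCGA-AA-3561-01A-22", "collectedFrom", "TCGA-AA-3561"),
]
hypothesis: Triple = ("bio:PRKCDBP", "hyp:expressedIn", "TCGA-AA-3561-01A-22")
hypothesis_pattern: Triple = ("?protein", "hyp:expressedIn", "?s")
data_query: List[Triple] = [
    ("?e", "experimentedOn", "?s"),
    ("?e", "producedData", "?d"),
    ("?s", "collectedFrom", "?patient"),
]

env = unify(hypothesis_pattern, hypothesis, {})
assert env is not None  # the hypothesis fits this line of inquiry
files = [b["?d"] for b in query(data_query, facts, env)]
```

Matching the hypothesis first fixes `?s`, so the data query only returns files produced by experiments on that particular sample.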

Page 21: Automated Hypothesis Testing with Large Scale Scientific Workflows

Example Multi-omics Workflow (Zhang et al. replication)

Page 22: Automated Hypothesis Testing with Large Scale Scientific Workflows

Automated Workflow Generation in WINGS by Reasoning about Semantic Constraints

Example: all input data must be from human species, i.e. must have HS in metadata

Workflow system uses this constraint to select datasets that have HS in their metadata so they are valid
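
The constraint on this slide amounts to filtering candidate inputs by metadata. WINGS does this through semantic reasoning; the sketch below only mimics the effect with a dictionary filter. The catalogue layout, the `species` key, the `MM` code, and the `mouse_control.zip` entry are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical dataset catalogue; keys and the non-human entry are assumptions.
catalogue: List[Dict[str, str]] = [
    {"name": "XX_3561Proteome_VU.zip", "type": "MassSpecData", "species": "HS"},
    {"name": "XX_3561_DD.zip", "type": "RNASeqData", "species": "HS"},
    {"name": "mouse_control.zip", "type": "RNASeqData", "species": "MM"},
]


def satisfies(dataset: Dict[str, str], constraints: Dict[str, str]) -> bool:
    """True when every constrained metadata field has the required value."""
    return all(dataset.get(key) == value for key, value in constraints.items())


def select_inputs(datasets: List[Dict[str, str]], constraints: Dict[str, str]) -> List[str]:
    """Names of the datasets that are valid inputs under the constraints."""
    return [d["name"] for d in datasets if satisfies(d, constraints)]


valid = select_inputs(catalogue, {"species": "HS"})
```

Only the two human datasets pass the filter; a dataset with no `species` field at all would also be rejected, which is the conservative choice.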

Page 23: Automated Hypothesis Testing with Large Scale Scientific Workflows

Representing Hypotheses


Page 24: Automated Hypothesis Testing with Large Scale Scientific Workflows

Meta-workflows: 1) Comparison Meta-Workflows

[Diagram: Comparison Meta-Workflow. Elements: Variant Detection; Custom Protein DB; Protein Identification (Custom DB); Protein Identification (Reference DB); Protein IDs (one set per identification); Similarity Score]

Similarity Score is data dependent:

• Peptide Level

• Protein Level

• Scan Level

๏ Goals:

• Compare results amongst multiple workflows

• Measure the global similarity amongst multiple workflows

• Provide users with explanation of workflow-dependent differences in results
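
The slides do not say how the similarity score is computed. As one plausible stand-in, the sketch below uses Jaccard similarity over two sets of protein IDs and reports which IDs appear in only one result. The function names and the IDs `P1` to `P5` are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Shared identifiers divided by all identifiers seen in either result."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty results are treated as identical
    return len(set_a & set_b) / len(union)


def explain_difference(a: Iterable[str], b: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Identifiers found only by the first workflow, and only by the second."""
    set_a, set_b = set(a), set(b)
    return set_a - set_b, set_b - set_a


# Made-up protein IDs from the two identification runs.
custom_db_ids = ["P1", "P2", "P3", "P4"]
reference_db_ids = ["P2", "P3", "P4", "P5"]

score = jaccard(custom_db_ids, reference_db_ids)  # 3 shared out of 5 distinct IDs
only_custom, only_reference = explain_difference(custom_db_ids, reference_db_ids)
```

The same two functions could be applied at peptide or scan level by passing peptide or scan identifiers instead of protein IDs.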

Page 25: Automated Hypothesis Testing with Large Scale Scientific Workflows

Meta-workflows: 2) Benchmark Meta-Workflows

๏ Goals:

• Evaluation of workflow performance

• Training of confidence estimation models (probabilistic)

[Diagram: Benchmark Meta-Workflow. Outputs: Probabilistic Models; ROC, True/False Positive Rate]
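
The true and false positive rates named on this slide can be computed from workflow scores and known ground truth. The following is a minimal sketch, assuming each workflow result carries a numeric score and a benchmark label; the scores and labels below are invented, and `rates` and `roc_points` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def rates(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) at one threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return tpr, fpr


def roc_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs, one per distinct score used as a threshold."""
    thresholds = sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tpr, fpr = rates(scores, labels, t)
        points.append((fpr, tpr))
    return points


# Invented benchmark: identification scores and whether each protein is truly present.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
curve = roc_points(scores, labels)
```

Plotting `curve` with FPR on the x-axis and TPR on the y-axis gives the ROC curve for that workflow on the benchmark.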

Page 26: Automated Hypothesis Testing with Large Scale Scientific Workflows

Meta-workflows: 3) Confidence estimation Meta-Workflows

๏ Goals:

• Composite results from multiple workflows

• Estimate confidence of the workflow result

• Use estimated confidence to update hypothesis

[Diagram: Confidence estimation Meta-Workflow. Elements: Protein Identification (Custom DB); Protein Identification (Reference DB); Protein IDs (one set per identification); Probabilistic Model (from the Benchmark Meta-Workflow); Estimate Confidence; Update Hypothesis]
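
The probabilistic model DISK uses is not described in these slides. Purely as a stand-in, the sketch below combines per-workflow confidences with a noisy-OR rule, which assumes the workflows provide independent evidence, and then writes the result into a revised hypothesis record. The dictionary layout and the `update_hypothesis` helper are hypothetical.

```python
from typing import Dict, Iterable


def noisy_or(confidences: Iterable[float]) -> float:
    """Probability that at least one independent piece of evidence holds."""
    remaining_doubt = 1.0
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt


def update_hypothesis(hypothesis: Dict[str, object], confidences: Iterable[float]) -> Dict[str, object]:
    """Return a revised copy carrying the combined confidence; the input is unchanged."""
    revised = dict(hypothesis)
    revised["id"] = f"{hypothesis['id']}'"
    revised["revisionOf"] = hypothesis["id"]
    revised["hasConfidenceValue"] = noisy_or(confidences)
    return revised


# Illustrative per-workflow confidences, not values from the talk.
ha1 = {"id": "Ha1", "statement": "PRKCDBP expressedIn TCGA-AA-3561-01A-22"}
ha1_revised = update_hypothesis(ha1, [0.5, 0.5])
```

Returning a new record instead of mutating the old one mirrors the `revisionOf` link in the hypothesis lifecycle, where the original statement is kept alongside its revision.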

Page 27: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Our Approach & Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 28: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Initial Hypothesis

๏ Initial hypothesis is provided by the user

• PRKCDBP protein is expressed in a patient sample

Page 29: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Lines of Inquiry

๏ The line of inquiry suggests finding data from different experiments done with the patient’s sample, then running multi-omic workflows, and then combining the evidence into a confidence score

General hypothesis pattern

Data query pattern: search for different experiments that produced omics data (e.g. of type RNASeq and MassSpecData)

Data analysis workflows to run on genomics and proteomics data (more omics in the future)

Meta-workflows to assess confidence on the hypothesis based on workflow results

Page 30: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Data & Workflows

To test a hypothesis that a protein is present in a patient’s sample:

๏ Retrieve mass spec and RNASeq data

๏ Use workflows

• Wf1: Proteome only

• Wf2: ProteoGenomic

Page 31: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Meta-Workflows

๏ After running the workflows, meta-workflows analyse the results and generate a confidence value

Page 32: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Revised Hypothesis

๏ The hypothesis is revised and given a confidence value:

• A mutation of the protein PRKCDBP has been expressed in the patient’s sample TCGA-AA-3561-01A-22 with a confidence 0.9887

Page 33: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK Walkthrough: Provenance Details

๏ Hypothesis provenance stores information about the workflows run and the data used

• Workflow execution provenance is published by WINGS in the W3C PROV standard.

Page 34: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Our Approach & Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 35: Automated Hypothesis Testing with Large Scale Scientific Workflows

DISK: Automated DIscovery of Scientific Knowledge

[Architecture diagram. Labels: Interactive Discovery Agent; Hypotheses; Hypothesis Evaluation; Formulate Lines of Inquiry; Lines of Inquiry; Data Retrieval; Data Repository; Analytic Workflows; Workflow Binding; Meta-Analysis of Results; Meta-Workflows; Confidence Estimation; Benchmarking; Revised hypotheses & interesting findings; WINGS Intelligent Workflow System; Workflow Constraints; Workflow Reasoning; Workflow Provenance; Open Publication of Results as Linked Data]

Page 36: Automated Hypothesis Testing with Large Scale Scientific Workflows

Our Initial Focus: Reproduce Seminal Omics Analysis [Zhang et al 2014]

Page 37: Automated Hypothesis Testing with Large Scale Scientific Workflows

Impact on Cancer Multi-Omics

๏ Replicated the proteogenomic analysis of colorectal cancer from [Zhang et al 2014]

๏ Successfully reproduced the paper's findings, comparing results at multiple levels (final figure, supplementary tables, etc.)

๏ Took months and direct conversations with the authors to replicate paper figures and supplemental figures

๏ Applying the analysis approach to a new cancer type now takes minutes

• Useful when TCGA is integrated

๏ Expanded the analysis to

• compare how sensitive findings were to workflow details

[Figure: two density plots over Spearman correlation. Panel titles: "Correlation between mRNA-protein abundance (within samples)" and "Correlation between mRNA-protein variation (across samples)"]
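
The density plots on this slide are built from Spearman correlations between mRNA and protein measurements. As a minimal sketch of that statistic, assuming paired numeric measurements and using only the standard library, Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The abundance numbers below are invented for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented abundances for one sample across four genes.
mrna = [1.0, 2.0, 3.0, 4.0]
protein = [10.0, 20.0, 30.0, 40.0]
rho = spearman(mrna, protein)
```

Computing `rho` once per sample (across genes) or once per gene (across samples) and plotting the distribution of the values gives the two kinds of density plot named in the panel titles.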

Page 38: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Our Approach & Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 39: Automated Hypothesis Testing with Large Scale Scientific Workflows

Related Work: 1) Discovery Systems

๏ [Lenat 1976]

๏ [Lindsay et al 1980]

๏ [Langley 1981]

๏ [Falkenhainer 1985]

๏ [Kulkarni and Simon 1988]

๏ [Cheeseman et al 1989]

๏ [Zytkow et al 1990]

๏ [Simon 1996]

๏ [Valdes-Perez 1997]

๏ [Todorovski et al 2000]

๏ [Schmidt and Lipson 2009]

Page 40: Automated Hypothesis Testing with Large Scale Scientific Workflows

Related Work: 2) Hypothesis Representation as Graphs

๏ Existing vocabularies are related but need to be extended to represent hypotheses in DISK

• SWAN [Gao et al 2006]

• EXPO [Soldatova and King 2006]

• Nanopublications [Groth et al 2010]

• Ovopublications [Callahan and Dumontier 2013]

• Micropublications [Clark et al 2014]

• LSC

• BEL

Page 41: Automated Hypothesis Testing with Large Scale Scientific Workflows

Talk Outline

๏ Motivation

๏ Our Approach & Research Challenges

1. Representing Hypotheses

2. Representing Lines of Inquiry

3. Meta-analysis to review workflow results

๏ DISK Scenario walkthrough

๏ Results in cancer multi-omics

๏ Related work

๏ Contributions and Future Work

Page 42: Automated Hypothesis Testing with Large Scale Scientific Workflows

Contributions

๏ Represent scientist hypotheses

• Hypothesis ontology includes revisions & provenance

๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued with a data analysis workflow

• Lines of inquiry outline what type of data and workflows to use, and customize them to the hypotheses at hand

๏ Design a meta-analysis to assess the results of lines of inquiry and revise the original hypotheses

• Meta-analysis workflows assess diverse evidence

Page 43: Automated Hypothesis Testing with Large Scale Scientific Workflows

Ongoing & Future Work

๏ Ongoing work:

• Interactive Discovery Agent that explains interesting findings

• Continuous analysis of data (TCGA/CPTAC) as it grows

• Extending and generalizing meta-workflows

• Using DISK in geosciences: Subsurface water resource modeling

๏ Future challenges:

• More complex hypotheses about several entities

• Incorporate evidence over time

• Designing domain-independent meta-workflows

• Resource-bound hypothesis exploration

Page 44: Automated Hypothesis Testing with Large Scale Scientific Workflows

Thank you

