The Scientific Method on the Semantic Web

Post on 16-Dec-2014

description

Presentation to the iCAPTURE Center, Heart + Lung Institute at St. Paul's Hospital

transcript

SADI, SHARE and the Scientific Method

The Quest for the Holy Grail

The Problem

The Holy Grail: (this slide created circa 2002)

Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.

Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M |A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.

Two novel technologies developed in our lab are getting us very close to the Holy Grail!

Holy Grail Demo #1

Imagine there is a “virtual database” containing all of the data from all of the databases, together with the output of every conceivable analysis

How do we query that database?

A Brief Digression…

“Database”

?

Boxes became ovals…

Straight lines became curvy lines…

…and you want us to give you a grant for THAT??

Relational Database

“Graph”

Protein Table: Protein Index | Protein Name | Regulates ID

Gene Table: Gene ID | Tissue ID | Type ID

http://pdb.org/114487 isRepressorOf http://ncbi.nlm/NR/NR_14487

“Foreign keys” are used to link tables in a database

Links in Graphs consist of statements called “TRIPLES”
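As a rough illustration of the contrast above, a triple can be modelled as a plain (subject, predicate, object) tuple. The sketch below is Python using only the standard library. The `Triple` type and the `objects_of` helper are illustrative names, not part of any RDF toolkit; the two URIs and the `isRepressorOf` label are copied from the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement in a graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# The link from the slide, written as a single self-describing statement.
# The predicate name carries the meaning that a foreign key leaves implicit.
graph = {
    Triple("http://pdb.org/114487", "isRepressorOf", "http://ncbi.nlm/NR/NR_14487"),
}


def objects_of(graph: set, subject: str, predicate: str) -> list:
    """Return every object linked to `subject` by `predicate` (illustrative helper)."""
    return sorted(t.obj for t in graph if t.subject == subject and t.predicate == predicate)


print(objects_of(graph, "http://pdb.org/114487", "isRepressorOf"))
```

Because each statement names both endpoints by URI, the two ends of the link do not need to live in the same store.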

Both Data Sources are on the Same Machine

Graph Data Sources (may be) on Independent Machines on the Web

“Meaning” of the connection between data-points is understood only by the database administrator

Protein regulates Gene

“Meaning” of the connection in a Graph is explicitly labeled (and machine-readable!)

Connect all of the graphs in the world to one another

And what do you get?

Mark Butler (2003) Is the semantic web hype? Hewlett Packard laboratories presentation at MMU, 2003-03-12

The lavender portion represents biology: currently ~40,000,000,000 Triples (we and our collaborators will be doubling that number in the next 12 months)

How do you find information on this “Semantic Web”??

SPARQL

The query language used to discover and extract information represented in Graphs

Unfortunately, YOU have to know which Web resources contain which Triples (HARD!)

Even if you do know this, SPARQL has significant limitations when attempting to query over disparate Graphs (SLOW AND CUMBERSOME)

If the data doesn’t exist in any Graph at all…

Basically…

A novel way of making Triples available on the Semantic Web, using a technology called Web Services

“Services” for short

We invented SADI to overcome some/all of these problems

…but I won’t bore you with the technical details…

Detour Ends. Please resume speed

Imagine there is a “virtual database” containing all of the data from all of the databases, together with the output of every conceivable analysis

Holy Grail Demo #1

How do we query that database?

SHARE: Semantic Health And Research Environment

SPARQL enhanced by SADI

A Novel SPARQL Query Engine

Overcomes some of the limitations of traditional SPARQL query-handlers

…and more…

MUCH more!!

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}

Note that there is no “FROM” clause… I have not told the system where to look for the answer; I am simply asking my question

Now stick that query into SHARE

Recap: what we just saw

A standard SPARQL query was entered into SHARE, a SADI-aware query engine

The query was interpreted to extract the individual data/relationships being requested (and any component/sub-properties, as we shall see later!)

The “triple-patterns” required to answer the query are passed to SADI for Web Service discovery

Services capable of generating those triple-patterns are automatically executed, the triples are stored, and the query is resolved.

We posed, and answered, a fairly complex database query WITHOUT A DATABASE

(in fact, the data didn’t even have to exist...)
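The recap steps can be sketched as a toy resolver. This is a simplified illustration, not SHARE or the SADI API: the `REGISTRY` dictionary, the two service functions, and every `example:` value are hypothetical placeholders, and the sketch assumes each pattern has a variable in the object position and that patterns arrive in an order where the subject is already known.

```python
from typing import Callable, Dict, List, Tuple

TriplePattern = Tuple[str, str, str]


# Hypothetical stand-ins for remote Web Services: each takes a subject and
# returns the objects it can attach through one predicate. The returned
# values are made-up placeholders, not real annotations.
def encoded_by_service(subject: str) -> List[str]:
    return ["example:geneA"] if subject == "uniprot:P47989" else []


def pathway_service(subject: str) -> List[str]:
    return ["example:pathway1", "example:pathway2"] if subject == "example:geneA" else []


# Toy registry: which service can generate triples for which predicate.
REGISTRY: Dict[str, Callable[[str], List[str]]] = {
    "pred:isEncodedBy": encoded_by_service,
    "ont:isParticipantIn": pathway_service,
}


def resolve(patterns: List[TriplePattern]):
    """Resolve triple patterns in order, calling services to create the data."""
    store = set()      # triples generated so far
    bindings = [{}]    # one dict of variable -> value per partial solution
    for subj, pred, obj in patterns:
        service = REGISTRY[pred]                 # "discovery" by predicate
        next_bindings = []
        for b in bindings:
            s = b.get(subj, subj)                # substitute a bound variable
            for value in service(s):             # execute the service
                store.add((s, pred, value))      # keep the generated triple
                next_bindings.append({**b, obj: value})
        bindings = next_bindings
    return bindings, store


query = [
    ("uniprot:P47989", "pred:isEncodedBy", "?gene"),
    ("?gene", "ont:isParticipantIn", "?pathway"),
]
solutions, triples = resolve(query)
for row in solutions:
    print(row["?gene"], row["?pathway"])
```

Note that `store` starts empty: every triple used to answer the query is produced by a service call while the query runs, which is the sense in which the data did not have to exist beforehand.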

Holy Grail Demo #1

Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.

Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M |A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.

Holy Grail Demo #2

Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels that are increasing over time

- Wilkinson MD

…but there is no “likely rejecter” column or table in our database…

only blood chemistry measurements at various time-points

?

The definition of a LikelyRejecter is encoded in a machine-readable document written in the OWL language (“Ontology”)

“the regression line over creatinine measurements should have an increasing slope”

The machine continues to burrow down through the definition and discovers that regression lines have things like slopes and intercepts, etc…

Then…

Two magical events occur…

The machine figures out, by itself, the need to do a Linear Regression analysis in order to answer your question

The machine figures out, by itself, how and where that analysis can be done, and does it automatically!

http://www.impactlab.net/2009/03/22/improve-your-brain-power/

The SHARE system utilizes SADI to discover analytical services on the Web that do linear regression analysis
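To make the definition concrete, here is a minimal sketch of the kind of test a linear regression service would support: fit a least-squares line through (time, creatinine) points and check whether the slope is positive. The function names and the sample readings are invented for illustration; this is not the service SHARE discovers, only the arithmetic behind “increasing slope”.

```python
from typing import List, Tuple


def regression_slope(points: List[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of y on x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share one time-point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(creatinine: List[Tuple[float, float]]) -> bool:
    """Apply the slide's definition: the regression line has an increasing slope."""
    return regression_slope(creatinine) > 0


# Made-up (day, creatinine) readings for two hypothetical patients.
rising = [(0, 1.0), (1, 1.5), (2, 2.0)]
falling = [(0, 2.0), (1, 1.5), (2, 1.0)]

print(is_likely_rejecter(rising), is_likely_rejecter(falling))
```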

VOILA!

How do we do that?!?

We let the data describe itself!

This is different from most of the bioinformatics world, where the person giving you the data also tells you how to interpret it

Data exhibits “late binding”

Late binding: the “purpose and meaning” of the data is not determined until the moment it is required

Benefit of late binding: Data is amenable to constant re-interpretation

Example?

Blood Creatinine measurements were not dictated to be (only) Blood Creatinine measurements!

The data had the ‘qualities/properties’ that allowed the machine to infer that they were Blood Creatinine measurements

But the data also had the ‘qualities/properties’ that allowed them to be interpreted as X/Y coordinate data by another Service
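One way to picture late binding is a record that carries only properties, with each consumer deciding at the moment of use whether the record fits its needs. The sketch below is an analogy in plain Python, not OWL reasoning; the property names, the sample values, and the two interpretation names are all invented for illustration.

```python
from typing import Callable, Dict

Record = Dict[str, object]

# A measurement described only by its properties; nothing declares its "type".
measurement: Record = {
    "analyte": "creatinine",
    "specimen": "blood",
    "time": 3.0,
    "value": 1.4,
}

# Each interpretation is a test applied when the data is needed, not before.
INTERPRETATIONS: Dict[str, Callable[[Record], bool]] = {
    "BloodCreatinineMeasurement": lambda r: (
        r.get("analyte") == "creatinine" and r.get("specimen") == "blood"
    ),
    "XYCoordinate": lambda r: (
        isinstance(r.get("time"), (int, float))
        and isinstance(r.get("value"), (int, float))
    ),
}


def interpretations_of(record: Record) -> list:
    """List every interpretation whose requirements the record satisfies."""
    return sorted(name for name, fits in INTERPRETATIONS.items() if fits(record))


print(interpretations_of(measurement))
```

Adding a third interpretation later requires no change to the stored record, which is the re-interpretation benefit the slides describe.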

http://www.flickr.com/people/faernworks/

Holy Grail Demo #2

Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.

Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M |A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.

The Holy Grail may not yet be in hand, but we can at least see it from here!

So… now what?

Mark’s Manifesto

What is my next “Holy Grail”?

Science

Support for the in silico Scientific Method

Reproducibility

Clarity (hypothesis)

Discourse

Disagreement

Clarity (experiment)

The Scientific Method

Discourse: What do you believe? What do I believe?

Disagreement: You’re wrong! And I’m gonna prove it!

Clarity: This is the experiment I am going to do

Reproducibility: This is how I did it (“provenance”)

Clarity: This is my new hypothesis

Workflows (e.g. myExperiment)

Another Brief Digression…

“Facebook” for Scientists

http://myexperiment.org

An exciting evolution in the way Researchers express and share their in silico “Materials and Methods”, through things called ‘Workflows’

Workflows are explicit representations of the method by which an analysis was done and which resources are used to do it
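A workflow in this sense can be sketched as an ordered list of steps, each naming the resource that performs it, plus a log of what was actually run. The example below is a toy in plain Python and does not reflect the Taverna or myExperiment formats; the `Step` type, the resource labels, and the two string operations are hypothetical.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str                  # what is done
    resource: str              # which tool or service does it (hypothetical label)
    run: Callable[[str], str]  # the operation itself


# A hypothetical two-step workflow applied to a short sequence string.
workflow = [
    Step("uppercase", "local:str.upper", str.upper),
    Step("reverse", "local:slice", lambda s: s[::-1]),
]


def execute(workflow: List[Step], data: str):
    """Run each step in order and record provenance for every step."""
    provenance = []
    for step in workflow:
        data = step.run(data)
        provenance.append((step.name, step.resource, data))
    return data, provenance


result, log = execute(workflow, "acgt")
for entry in log:
    print(entry)
```

Rerunning `execute` with the same `workflow` and input repeats the same method with the same resources, which is what makes the analysis reproducible.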

Workflows can be very simple…

“Blast this sequence”

Or not...

This workflow takes in a CEL file and a normalisation method, then returns a series of images/graphs which represent the same output obtained using the MADAT software package (MicroArray Data Analysis Tool)

Also returned by this workflow is a list of the top differentially expressed genes (size dependent on the number specified as input - geneNumber), which are then used to find the candidate pathways which may be influencing the observed changes in the microarray data.

Why bother?

A workbench for designing and executing Scientific Workflows

Taverna

Load-up your data and press “play”!

…Then go home for the weekend! You are just one click away from your M.Sc.!!

By the by…

The SHARE application automatically creates a Workflow and then automatically runs it.

This is where the data comes from to answer the queries…

Workflows are a Good Thing™

Detour Ends. Please resume speed

WORKFLOWS: Reproducibility

Clarity (hypothesis)

Discourse

Disagreement

Clarity (experiment)

At the moment the Semantic Web in Healthcare and Life Sciences addresses these issues by attempting to create “consensus”

Large, centralized ontologies (e.g. the Gene Ontology) that claim to represent community agreement about “biological reality”

…is that Science?

Reproducibility

Clarity (hypothesis)

Ontology Consortia

Disagreement

Clarity (experiment)

Reproducibility

Clarity (hypothesis)

Ontology Consortia

Consensus

Clarity (experiment)

Reproducibility

????

Ontology Consortia

Consensus

Clarity (experiment)

To restore the “traditions of Science” to in silico science, the Semantic Web needs to encourage/facilitate personal opinion and debate

What has this got to do with SADI and SHARE?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

Likely Rejecter

I created a small ontology describing my definition of a Likely Rejecter

… it was MY ontology!

I can re-use it

I can modify it as I change my world-view

I can publish it for others to use

Others can modify it and/or compare it to THEIR world-view

Sharing my ontology gives opportunities for “micro-attribution”

“Credit” to me is automatic when someone uses my ontology in their ontology/query

Using SADI and SHARE, my personal world-view is explicitly expressed and can be dynamically evaluated against global data and knowledge

http://www.dailymail.co.uk/femail/article-488234/Friends-dignity-self-respect---weight-wasnt-I-lost-slimming-club.html

…but there’s more…

“Likely Rejecter”

I made that up! It came out of my head!

What’s another word for a world-view that you make-up?

Hypothesis

Reproducibility

Hypotheses

Discourse

Disagreement

Clarity (experiment)

The “Likely Rejecter” OWL Class is an explicitly-expressed hypothesis; members of that class may or may not exist!

Reproducibility

Hypotheses

Discourse

Disagreement

Experiment

Ontologically-expressed Hypotheses drive the discovery, assembly, and analysis of data capable of evaluating their validity

[Figure labels: Blood Pressure; Hypertension; Ischemia; Hypothesis; Database 1; Database 2; SADI + SHARE; Analytical Algorithm]

Join us!

SADI and CardioSHARE are Open-Source projects

Come join us – we’re having a lot of fun!!

http://sadiframework.org

Credits

Benjamin VanderValk (SHARE & SADI)

Luke McCarthy (SADI, SHARE, Taverna, CardioSHARE)

Soroush Samadian (CardioSHARE)

David Withers (Taverna)

Edward Kawas (SADI Service auto-generator)

U of New Brunswick: Dr. Chris Baker, Alexandre Riazanov

Carleton University: Dr. Michel Dumontier, Marc-Alexandre Nolin, Leonid Chepelev, Steve Etlinger, Nichaella Kieth, Jose Cruz

Microsoft Research

Fin

This presentation is available on SlideShare: keywords ‘wilkinson’ ‘iCAPTURE’ ‘HLI’