SADI CSHALS 2013

Semantic Automated Discovery and Integration

SADI Services Tutorial

Mark Wilkinson
Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia
Vancouver, BC, Canada.

MOTIVATION

Part I

A lot of important information cannot be represented

on the Semantic Web

For example, all of the data that results from

analytical algorithms and statistical analyses

(I’m purposely excluding databases from the list of examples for reasons I will discuss in a moment)

Varying estimates put the size of the Deep Web between 500 and 800 times larger than the surface Web

On the WWW “automation” of access to Deep Web data happens through

“Web Services”

Traditional definitions of The Deep Web include databases that have Web FORM interfaces.

HOWEVER

The Life Science Semantic Web community is encouraging the establishment of SPARQL endpoints

as the way to serve that same data to the world (i.e. NOT through Web Services)

I am quite puzzled by this...

Historically, most* bio/informatics databases do not allow direct public SQL access

*yes, I know there are some exceptions!

“We need to commit specific hardware for that [mySQL] service. We don’t use the same servers for mySQL as for the Website...”

“...we resolve the situation by asking the user to stop hammering the server. This might involve temporary ban on the IP...”

- ENSEMBL Helpdesk

So... There appear to be good reasons why most data providers do not expose

their databases for public query!

Are SPARQL endpoints somehow “safer” or “better”?

One of the early-adopters of RDF/SPARQL in the bioinformatics domain was UniProt

How are things going for them?

From: "Mark Wilkinson"
To: Mark
Date: Tue, 19 Feb 2013 13:11:22 +0100
Subject: SPARQL or not?

Hi Bio2RDF maintainers,

I keep on noticing this rather expensive query.

CONSTRUCT {
  <http://bio2rdf.org/search/Paget> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://bio2rdf.org/bio2rdf_resource:SearchResults> .
  <http://bio2rdf.org/search/Paget> <http://bio2rdf.org/bio2rdf_resource:hasSearchResult> ?s .
  <http://bio2rdf.org/search/Paget> <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?s .
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  ?s <http://purl.org/dc/elements/1.1/title> ?title .
  ?s <http://purl.org/dc/terms/title> ?dctermstitle .
  ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
  ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?skoslabel .
  ?s ?p ?o .
}
WHERE {
  ?s ?p ?o
  FILTER contains(str(?o), "\"Paget\"")
  OPTIONAL { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
  OPTIONAL { ?s <http://purl.org/dc/elements/1.1/title> ?title }
  OPTIONAL { ?s <http://purl.org/dc/terms/title> ?dctermstitle }
  OPTIONAL { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }
  OPTIONAL { ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?skoslabel }
}
OFFSET 0
LIMIT 500

It comes from the example queries on the bio2rdf landing page.
Its extremely resource consuming and totally useless as it will never ever run in time.
Can you please change this query to something useful and workable. And at least cache the results if you ever get them.

Regards,
Jerven

A message posted to the Bio2RDF mailing list last week from Jerven Bolleman, one of the team-members behind UniProt’s push for RDF...



I keep noticing this rather expensive query



It comes from THE EXAMPLE QUERIES on the Bio2RDF landing page

(my emphasis added)



It’s extremely resource-consuming and totally useless as

it will never run in time

So even people who are world-leaders in RDF and SPARQL write “expensive” and “useless” queries

that (already!) are making life difficult for SPARQL endpoint providers

I believe that situation will only get worse as more people begin to use the Semantic Web

and as SPARQL itself becomes richer and more SQL-like

In My Opinion

History tells us, and this story IMO supports, that SPARQL endpoints might not be widely adopted

by source bioinformatics data providers

Historically, the majority of bioinformatics data hosts have opted for API/Service-based

access to their resources


Moreover, I am still obsessed with interoperability!

Having a unified way to discover, and access, bioinformatics resources

whether they be databases or algorithms

just seems like a Good Thing™


So we need to find a way to make Web Services play nicely with the Semantic Web

Design Pattern for Web Services on the Semantic Web

SADI “PHILOSOPHY” AND DESIGN

Part II

The Semantic Web

[Figure: two nodes joined by a link labeled “causally related to”]

The important bit: the link is explicitly labeled

???

http://semanticscience.org/resource/SIO_000243

SIO_000243:

<owl:ObjectProperty rdf:about="&resource;SIO_000243">
  <rdfs:label xml:lang="en">is causally related with</rdfs:label>
  <rdf:type rdf:resource="&owl;SymmetricProperty"/>
  <rdf:type rdf:resource="&owl;TransitiveProperty"/>
  <dc:description xml:lang="en">A transitive, symmetric, temporal relation in which one entity is causally related with another non-identical entity.</dc:description>
  <rdfs:subPropertyOf rdf:resource="&resource;SIO_000322"/>
</owl:ObjectProperty>

causally related with




There are many suggestions for how to bring the Deep Web

into the Semantic Web using Semantic Web Services (SWS)

OWL-S

SAWSDL

WSDL-S

Others...



Describe input data

Describe output data

Describe how the system manipulates the data

Describe how the world changes as a result



Describe input data




Usually through “semantic annotation”

of XML Schema







In the least-semantic case, the input and output data

is “vanilla” XML







In the “most semantic” case (WSDL) RDF is

converted into XML, then back to RDF again







The rigidity of XML Schema is the

antithesis of the Semantic Web!







So... Perhaps we shouldn’t be using XML

Schema at all...??



Describe how the world changes as a result: HARD!

Describe how the world changes as a result: Unnecessary?

Lord, Phillip, et al. The Semantic Web–ISWC 2004 (2004): 350-364.


Scientific Web Services are DIFFERENT!


“The service interfaces within bioinformatics are relatively simple. An extensible or constrained interoperability framework

is likely to suffice for current demands: a fully generic framework is currently not necessary.”

Scientific Web Services are DIFFERENT

They’re simpler!

Rather than waiting for a solution to the more general problem

(which may be years away... or more!)

can we solve the Semantic Web Service problem within the scientific domain

while still being fully standards-compliant?

Other “philosophical” considerations

v.v. being Semantic Webby, what is missing from this list?

Describe input data







The Semantic Web works because of relationships!




In 2008 I proposed that, in the Semantic Web world, algorithms should be viewed as “exposing” relationships

between the input and output data


[Figure: as a plain Web Service, BLAST takes the string AACTCTTCGTAGTG... as input. As a SADI service, BLAST takes a sequence node (has_seq_string AACTCTTCGTAGTG...) and links it by “has homology to” to Terminal Flower (type: gene; species: A. thal.), which has its own sequence (has_seq_string).]

SADI requires you to explicitly declare, as part of your analytical output, the biological relationship that your algorithm “exposed”.

Another “philosophical” decision was to abandon XML Schema

In a world that is moving towards RDF representations of all data

it makes no sense to convert semantically rich RDF into semantic-free Schema-based XML

then back into RDF again

The final philosophical decision was to abandon SOAP

The bioinformatics community seems to be very receptive to pure-HTTP interfaces

(e.g. the popularity of REST-like APIs)

So SADI uses simple HTTP POST of just the RDF input data

(no message scaffold whatsoever)
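As a rough illustration of what “just the RDF” could look like on the wire, the sketch below builds (but does not send) such a request with Python’s standard library. The service URL, the example.org vocabulary and the assumption that the service accepts RDF/XML are all made up for illustration; they are not taken from a real SADI service.

```python
import urllib.request

# Hypothetical endpoint and vocabulary, used for illustration only.
SERVICE_URL = "http://example.org/sadi/services/calculateBMI"

RDF_INPUT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/patients#">
  <rdf:Description rdf:about="http://example.org/patients/24601">
    <ex:height_m>1.8</ex:height_m>
    <ex:weight_kg>84</ex:weight_kg>
  </rdf:Description>
</rdf:RDF>
"""


def build_request(service_url, rdf_xml):
    """Return a POST request whose body is the RDF document and nothing else."""
    return urllib.request.Request(
        service_url,
        data=rdf_xml.encode("utf-8"),
        headers={
            "Content-Type": "application/rdf+xml",
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


request = build_request(SERVICE_URL, RDF_INPUT)
# urllib.request.urlopen(request) would send it; omitted because the URL is invented.
```

There is no envelope, operation name or parameter wrapper here: the request body is the input graph itself.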

SADI SERVICE DISCOVERY AND INVOCATION

Part III

In slightly more detail...

ID        Name          Height  Weight  Age
24601     Jean Valjean  1.8m    84kg    45
7474505B  Jake Blues    1.73m   101kg   31
6         —             1.88m   75kg    39
...       ...           ...     ...     ...




OWL-DL Classes



Property restrictions in OWL Class definition



A reasoner determines that Patient #24601 is an OWL Individual of the Input service Class
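A very reduced stand-in for that reasoning step is sketched below. It treats the service’s input class as “anything that has a height and a weight” and checks which records qualify. The property names and the `Patient:` keys are hypothetical, and a real OWL reasoner does far more than this subset test.

```python
# Records from the patient table, keyed by an illustrative URI scheme.
PATIENTS = {
    "Patient:24601": {"name": "Jean Valjean", "height_m": 1.8, "weight_kg": 84},
    "Patient:7474505B": {"name": "Jake Blues", "height_m": 1.73, "weight_kg": 101},
    "Patient:6": {"height_m": 1.88, "weight_kg": 75},  # no name recorded
}

# Hypothetical input class: defined by the properties an individual must have.
INPUT_CLASS_RESTRICTIONS = {"height_m", "weight_kg"}


def is_member(properties, restrictions):
    """True if the individual carries every property the class requires."""
    return restrictions.issubset(properties)


members = sorted(
    uri for uri, props in PATIENTS.items()
    if is_member(props, INPUT_CLASS_RESTRICTIONS)
)
```

Note that none of the records is explicitly typed as the input class; membership follows from the properties each record has.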



NOTE THE URI OF THE INPUT INDIVIDUAL: Patient:24601




BMI

25.9



NOTE THE URI OF THE OUTPUT INDIVIDUAL: Patient:24601

The URI of the input is linked by a meaningful predicate to the output

(either literal output or another URI)
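The BMI example can be sketched with plain tuples standing in for RDF triples. The predicate names (`ex:height_m`, `ex:weight_kg`, `ex:has_BMI`) are invented for this illustration; the point is only that the output triple reuses the input subject.

```python
def bmi_service(input_triples):
    """Attach a BMI value to every subject that has both height and weight.

    Triples are (subject, predicate, object) tuples; the predicate names
    are hypothetical stand-ins for real ontology properties.
    """
    heights = {s: o for s, p, o in input_triples if p == "ex:height_m"}
    weights = {s: o for s, p, o in input_triples if p == "ex:weight_kg"}
    output = []
    for subject in sorted(heights.keys() & weights.keys()):
        bmi = weights[subject] / heights[subject] ** 2
        # The output subject is the same URI that came in.
        output.append((subject, "ex:has_BMI", round(bmi, 1)))
    return output


input_triples = [
    ("Patient:24601", "ex:height_m", 1.8),
    ("Patient:24601", "ex:weight_kg", 84),
]
output_triples = bmi_service(input_triples)
print(output_triples)  # [('Patient:24601', 'ex:has_BMI', 25.9)]
```

Because the subject is unchanged, the output can be merged straight back into the graph that supplied the input.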

Therefore, by connecting SADI services together in a workflow you end up with an

unbroken chain of Linked Data

SADI TO THE EXTREME: “WEB SCIENCE 2.0”

Part IV

A proof-of-concept query engine & registry

Objective: answer biologists’ questions

The SHARE registry

indexes all of the input/output/relationship

triples that can be generated by all known services

This is how SHARE discovers services
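One way to picture such an index is a lookup from “the predicate I need” to “the services that can attach it”. The sketch below is a guess at the idea, not the SHARE implementation; the service descriptions and predicate names are hypothetical.

```python
from collections import defaultdict

# Hypothetical service descriptions: what each service needs on its input
# and which predicate it attaches to that input.
SERVICES = [
    {"name": "calculateBMI",
     "requires": {"ex:height_m", "ex:weight_kg"},
     "attaches": "ex:has_BMI"},
    {"name": "getUniprotByKeggGene",
     "requires": {"ex:kegg_gene_id"},
     "attaches": "ex:encodes"},
]


def build_index(services):
    """Map each output predicate to the names of services that generate it."""
    index = defaultdict(list)
    for service in services:
        index[service["attaches"]].append(service["name"])
    return dict(index)


def find_services(index, wanted_predicate):
    """Return the services able to produce triples with the given predicate."""
    return index.get(wanted_predicate, [])


index = build_index(SERVICES)
```

A client that needs `ex:has_BMI` for some node would call `find_services(index, "ex:has_BMI")` and then check each candidate’s `requires` set against the data it already holds.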

We wanted to duplicate a real, peer-reviewed, bioinformatics analysis

simply by building a model in the Web describing what the answer

(if one existed)

would look like

...the machine had to make every other decision

on its own

This is the study we chose:

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).

Original Study Simplified

Using what is known about interactions in fly & yeast

predict new interactions with your protein of interest

Given a protein P in Species X

Find proteins similar to P in Species Y

Retrieve interactors in Species Y

Sequence-compare Y-interactors with Species X genome

(1) Keep only those with homologue in X

Find proteins similar to P in Species Z

Retrieve interactors in Species Z

Sequence-compare Z-interactors with (1)

Putative interactors in Species X

“Pseudo-code” Abstracted Workflow
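The abstracted workflow can be written as a small function. This is one reading of the steps above, under loud assumptions: lookup tables replace the real similarity searches and sequence comparisons, the final comparison is simplified to a set intersection, and all identifiers are invented.

```python
def x_homologues_of_interactors(protein, species, similar, interactors, homologue_in_x):
    """Species-X homologues of the interactors of proteins similar to `protein`."""
    found = set()
    for model_protein in similar.get((protein, species), ()):
        for partner in interactors.get(model_protein, ()):
            x_protein = homologue_in_x.get(partner)
            if x_protein is not None:  # keep only those with a homologue in X
                found.add(x_protein)
    return found


def putative_interactors(protein, species_y, species_z,
                         similar, interactors, homologue_in_x):
    """Keep candidates supported by both model organisms."""
    from_y = x_homologues_of_interactors(
        protein, species_y, similar, interactors, homologue_in_x)
    from_z = x_homologues_of_interactors(
        protein, species_z, similar, interactors, homologue_in_x)
    return from_y & from_z


# Toy lookup tables with invented identifiers.
similar = {("P", "Y"): ["Py"], ("P", "Z"): ["Pz"]}
interactors = {"Py": ["Ay", "By"], "Pz": ["Az", "Cz"]}
homologue_in_x = {"Ay": "A", "By": "B", "Az": "A"}  # Cz has no homologue in X

result = putative_interactors("P", "Y", "Z", similar, interactors, homologue_in_x)
print(result)  # {'A'}
```

Only candidate `A` survives, because it is reached through interactors in both Species Y and Species Z.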

Modeling the science...

OWL

ProbableInteractor: is homologous to (Potential Interactor from ModelOrganism1…)

and

(Potential Interactor from ModelOrganism2…)

Probable Interactor is defined in OWL as a subClass - something that appears as a potential interactor in both comparator model organisms.

Modeling the science...

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

Running the Web Science Experiment

The tricky bit is...

In the abstract, the search for homology is “generic” – ANY Protein, ANY model system

But when the machine does the experiment, it will need to use (at least) two organism-specific resources because the answer requires information from two declared species

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}

This is the question we ask: (the query language here is SPARQL)

The URL of our OWL model (ontology) defining Probable Interactors

Each relationship (property-restriction) in the OWL Class is then matched with a SADI Service

The matched SADI Service can generate data that fulfils that property restriction (i.e. produces triples with that S/P/O pattern)

SHARE chains these SADI services into an analytical workflow...

...the outputs from that workflow are Instances (OWL Individuals) of Probable Interactors

SHARE derived (and executed) the following workflow automatically

These are different SADI Web Services...

...selected at run-time based on the same model

Keys to Success:

1: Use standards

2: Focus on predicates, not classes

3: Use these predicates to define, rather than assert, classes

4: Make sure all URIs resolve, and resolve to something useful

5: Never leave the RDF world... (abandon vanilla XML, even for Web Services!)

6: Use reasoners... Everywhere... Always!

THE TOOLS AVAILABLE

Part V

SERVICE PROVISION

Part V - A

Libraries
• Perl
• Java
• Python

Plug-in to Protege
• Perl service scaffolds
• Java service scaffolds

CLIENTS

Part V - B

SHARE

• you’ve already seen how SHARE works...

Taverna

• Contextual service discovery

• Automatic RDF serialization and deserialization between SADI and non-SADI services

• Note that Taverna is not as rich a client as SHARE. The reason is that SHARE will aggregate and re-reason after every service invocation. There is no (automatic) data aggregation in Taverna.

Using SADI services – building a workflow

The next step in the workflow is to find a SADI service that takes the genes from getKEGGGenesByPathway and returns the proteins that those genes code for.

Right-click on the service output port and click Find services that consume KEGG_Record…

Select getUniprotByKeggGene from the list of SADI services and click Connect.

The getUniprotByKeggGene service is added to the workflow and automatically connected to the output from getKEGGGenesByPathway.

Add a new workflow output called protein and connect the output from the getUniprotByKeggGene service to it.

The next step in the workflow is to find a SADI service that takes the proteins and returns sequences of those proteins. Right-click on the encodes output port and click Find services that consume UniProt_Record…

The UniProt info service attaches the property hasSequence, so select this service and click Connect.

The UniProt info service is added to the workflow and automatically connected to the output from getUniprotByKeggGene.

Add a new workflow output called sequence and connect the hasSequence output from the UniProt info service to it.

The KEGG pathway we’re interested in is “hsa00232”, so we’ll add it as a constant value. Right-click on the KEGG_PATHWAY_Record input port and click Constant value.

Enter the value hsa00232 and click OK.

The workflow is now complete and ready to run.

IO Informatics Knowledge Explorer plug-in

• “Bootstrapping” of semantics using known URI schema (identifiers.org, LSRN, Bio2RDF, etc.)

• Contextual service discovery

• Automatic packaging of appropriate data from your data-store and automated service invocation using that data.

• This uses some not-widely-known services and metadata that is in the SHARE registry!!

The SADI plug-in to the IO Informatics’

Knowledge Explorer

...a quick explanation of how we “boot-strap” semantics...

The Knowledge Explorer Personal Edition,

and the SADI plug-in, are freely available.

Sentient Knowledge Explorer is a retrieval, integration, visualization, query, and exploration

environment for semantically rich data

Most imported data-sets will already have properties (e.g. “encodes”)

…and the data will already be typed (e.g. “Gene” or “Protein”)

…so finding SADI Services to consume that data is ~trivial

Now what...??

No properties...

No rdf:type...

How do I find a service using that node?

What *is* that node anyway??

In the case of LSRN URIs, they resolve to:

<lsrn:DragonDB_Locus_Record rdf:about="http://lsrn.org/DragonDB_Locus:CHO">
  <dc:identifier>CHO</dc:identifier>
  <sio:SIO_000671>
    <lsrn:DragonDB_Locus_Identifier>
      <sio:SIO_000300>CHO</sio:SIO_000300>
    </lsrn:DragonDB_Locus_Identifier>
  </sio:SIO_000671>
</lsrn:DragonDB_Locus_Record>
</rdf:RDF>

The Semantic Science Integrated Ontology (Dumontier) has a model for how to describe database records, including explicitly making the record identifier an attribute of that record; in our LSRN metadata, we also explicitly rdf:type both records and identifiers.

Now we have enough information to start exploring global data...

Menu option provided by the plugin

Discovered the (only) service that consumes these kinds of records

Output is added to the graph (with some extra logic to make visualization of complex data structures a bit easier)

Lather, rinse, repeat...

...and of course, these links are “live”

What about URIs other than LSRN?

HTTP POST the URI to the SHARE Resolver Service

It will (try to) return you SIO-compliant RDF metadata about that URI (this is a typical SADI service)

The resolver currently recognizes a few different shared-URI schemes (e.g. Bio2RDF, Identifiers.org) and can be updated with new patterns
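A pattern-driven resolver of this kind might look roughly like the sketch below. The regular expressions are assumptions about the general shape of these URI schemes, inferred from URIs such as http://lsrn.org/DragonDB_Locus:CHO; the actual patterns used by the SHARE resolver are not given in these slides.

```python
import re

# Assumed URI shapes, for illustration only. New schemes are supported by
# appending another (label, pattern) pair to this list.
URI_PATTERNS = [
    ("LSRN",
     re.compile(r"^http://lsrn\.org/(?P<namespace>[^:/]+):(?P<id>.+)$")),
    ("Bio2RDF",
     re.compile(r"^http://bio2rdf\.org/(?P<namespace>[^:/]+):(?P<id>.+)$")),
    ("Identifiers.org",
     re.compile(r"^http://identifiers\.org/(?P<namespace>[^/]+)/(?P<id>.+)$")),
]


def parse_shared_uri(uri):
    """Return (scheme, namespace, identifier), or None if no pattern matches."""
    for scheme, pattern in URI_PATTERNS:
        match = pattern.match(uri)
        if match:
            return scheme, match.group("namespace"), match.group("id")
    return None


print(parse_shared_uri("http://lsrn.org/DragonDB_Locus:CHO"))
# ('LSRN', 'DragonDB_Locus', 'CHO')
```

Once the namespace and identifier are known, a resolver has what it needs to emit typed record and identifier metadata for the URI.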

Next problem:

Knowledge Explorer and therefore the

plug-in are written in C#

All of our interfaces are described in

OWL

C# reasoners are extremely limited at

this time

This problem manifests itself in two ways:

1. An individual on the KE canvas has all the properties required by a Service in the registry, but is not rdf:typed as that Service’s input type. How do you discover that Service so that you can add it to the menu?

2. For a selected Service from the menu, how does the plug-in know which data-elements it needs to extract from KE to send to that service in order to fulfil its input property-restrictions?

If I select a canvas node, and ask SADI to find services, it will...

The get_sequence_for_region service required ALL of this (hidden) information

Nevertheless:

(a) The service can be discovered based on JUST this node selection

(b) The service can be invoked based on JUST this node selection

Voila!

How did the plug-in discover the service, and determine which data was required to access that service based on an OWL Class definition,

without a reasoner?

Service Description

INPUT OWL Class
NamedIndividual: things with a “name” property from “foaf” ontology

OUTPUT OWL Class
GreetedIndividual: things with a “greeting” property from “hello” ontology

INDEX

The service provides a “greeting”

property based on a “name” property

Registry

SELECT ?x ?y
FROM knowledge_explorer_database
WHERE { ?x foaf:name ?y }

Convert Input OWL Class def’n into an ~equivalent SPARQL query

Store together with index
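For simple classes, the conversion can be imagined as turning each required property into one triple pattern. The sketch below does only that; it is not the SADI registry’s actual algorithm, the `registry_entry` layout is invented, and nested or alternative restrictions (which make real input queries much longer) are out of scope.

```python
def restrictions_to_sparql(required_properties, prefixes):
    """Build a SELECT query that finds subjects carrying every required property."""
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    patterns = [
        f"  ?input {prop} ?value{i} ."
        for i, prop in enumerate(required_properties)
    ]
    variables = " ".join(f"?value{i}" for i in range(len(required_properties)))
    lines = prefix_lines + [f"SELECT ?input {variables}", "WHERE {"] + patterns + ["}"]
    return "\n".join(lines)


query = restrictions_to_sparql(
    ["foaf:name"],
    {"foaf": "http://xmlns.com/foaf/0.1/"},
)

# Hypothetical registry record: the pre-computed query is stored with the index
# entry, so a client without a reasoner can run it as-is.
registry_entry = {
    "service": "hello",
    "provides": "hello:greeting",
    "input_query": query,
}
print(query)
```

A client written in a language without an OWL reasoner can then run the stored query against its own data to decide both whether the service applies and which triples to send.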

Just to ensure that I don’t over-trivialize this point,

the REAL SPARQL query that extracts the input for this service is...

CONSTRUCT {
  ?input a <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#BiopolymerRegion> .
  ?input <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#position> ?position .
  ?position a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#RangedSequencePosition> .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?start .
  ?start a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#StartPosition> .
  ?start <http://semanticscience.org/resource/SIO_000300> ?startValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?end .
  ?end a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#EndPosition> .
  ?end <http://semanticscience.org/resource/SIO_000300> ?endValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#in_relation_to> ?sequence .
  ?sequence <http://semanticscience.org/resource/SIO_000210> ?feature .
  ?feature <http://semanticscience.org/resource/SIO_000008> ?identifier .
  ?identifier <http://semanticscience.org/resource/SIO_000300> ?featureID .
  ?sequence <http://semanticscience.org/resource/SIO_000210> ?strand .
  ?strand <http://semanticscience.org/resource/SIO_000093> ?strandFeature .
  ?strandFeature a ?strandFeatureType .
  ?strandFeature <http://semanticscience.org/resource/SIO_000008> ?strandFeatureIdentifier .
  ?strandFeatureIdentifier <http://semanticscience.org/resource/SIO_000300> ?strandFeatureID .
  ?strand a ?strandType .
} WHERE {
  ?input <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#position> ?position .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?start .
  ?start a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#StartPosition> .
  ?start <http://semanticscience.org/resource/SIO_000300> ?startValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?end .
  ?end a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#EndPosition> .
  ?end <http://semanticscience.org/resource/SIO_000300> ?endValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#in_relation_to> ?sequence .
  {
    ?sequence <http://semanticscience.org/resource/SIO_000210> ?feature .
    ?feature <http://semanticscience.org/resource/SIO_000008> ?identifier .
    ?identifier <http://semanticscience.org/resource/SIO_000300> ?featureID .
  } UNION {
    ?sequence <http://semanticscience.org/resource/SIO_000210> ?strand .
    ?strand <http://semanticscience.org/resource/SIO_000093> ?strandFeature .
    {
      ?strandFeature a <http://sadiframework.org/ontologies/GMOD/Feature.owl#Feature> .
    } UNION {
      ?strandFeature <http://semanticscience.org/resource/SIO_000008> ?strandFeatureIdentifier .
      ?strandFeatureIdentifier <http://semanticscience.org/resource/SIO_000300> ?strandFeatureID .
    } .
    {
      ?strand a <http://sadiframework.org/ontologies/GMOD/Strand.owl#PlusStrand> .
      ?strand a ?strandType .
    } UNION {
      ?strand a <http://sadiframework.org/ontologies/GMOD/Strand.owl#MinusStrand> .
      ?strand a ?strandType .
    } .
  } .
}

Summary

While the Knowledge Explorer plug-in has similar functionality to other tools we have built for SADI, it takes advantage of some features of

the SADI Registry, and SADI in general, that are not widely-known.

We hope that the availability of these features encourages development of SADI tooling in other languages that have limited access to

reasoning.

Luke McCarthy Lead Developer, SADI project

Benjamin VanderValk Developer, SADI project

Date post:	16-Dec-2014