SADI CSHALS 2013

The introduction to the SADI tutorial at CSHALS 2013, Boston, Feb 27, 2013.

Semantic Automated Discovery and Integration
SADI Services Tutorial

Mark Wilkinson
Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia
Vancouver, BC, Canada.

Part I: MOTIVATION

A lot of important information cannot be represented on the Semantic Web.

For example, all of the data that results from analytical algorithms and statistical analyses.

(I'm purposely excluding databases from the list of examples for reasons I will discuss in a moment.)

Varying estimates put the size of the Deep Web at 500 to 800 times that of the surface Web.

On the WWW, "automation" of access to Deep Web data happens through "Web Services".

Traditional definitions of the Deep Web include databases that have Web FORM interfaces.

HOWEVER

The Life Science Semantic Web community is encouraging the establishment of SPARQL endpoints as the way to serve that same data to the world (i.e. NOT through Web Services).

I am quite puzzled by this...

Historically, most* bio/informatics databases do not allow direct public SQL access.

*yes, I know there are some exceptions!

“We need to commit specific hardware for that [mySQL] service. We don’t use the same servers for mySQL as for the Website...”

“...we resolve the situation by asking the user to stop hammering the server. This might involve temporary ban on the IP...”

- ENSEMBL Helpdesk

So... there appear to be good reasons why most data providers do not expose their databases for public query!

Are SPARQL endpoints somehow “safer” or “better”?

One of the early-adopters of RDF/SPARQL in the bioinformatics domain was UniProt

How are things going for them?

From: "Mark Wilkinson" <markw@illuminae.com>
To: Mark <markw@illuminae.com>
Date: Tue, 19 Feb 2013 13:11:22 +0100
Subject: SPARQL or not?

Hi Bio2RDF maintainers,

I keep on noticing this rather expensive query.

CONSTRUCT {
  <http://bio2rdf.org/search/Paget> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://bio2rdf.org/bio2rdf_resource:SearchResults> .
  <http://bio2rdf.org/search/Paget> <http://bio2rdf.org/bio2rdf_resource:hasSearchResult> ?s .
  <http://bio2rdf.org/search/Paget> <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?s .
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  ?s <http://purl.org/dc/elements/1.1/title> ?title .
  ?s <http://purl.org/dc/terms/title> ?dctermstitle .
  ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
  ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?skoslabel .
  ?s ?p ?o .
} WHERE {
  ?s ?p ?o FILTER contains(str(?o), "\"Paget\"")
  OPTIONAL { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
  OPTIONAL { ?s <http://purl.org/dc/elements/1.1/title> ?title }
  OPTIONAL { ?s <http://purl.org/dc/terms/title> ?dctermstitle }
  OPTIONAL { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }
  OPTIONAL { ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?skoslabel }
}
OFFSET 0
LIMIT 500

It comes from the example queries on the bio2rdf landing page. Its extremely resource consuming and totally useless as it will never ever run in time. Can you please change this query to something useful and workable. And at least cache the results if you ever get them.

Regards,
Jerven

A message posted to the Bio2RDF mailing list last week from Jerven Bolleman, one of the team members behind UniProt's push for RDF...

I keep noticing this rather expensive query

It comes from THE EXAMPLE QUERIES on the Bio2RDF landing page (my emphasis added)

It's extremely resource-consuming and totally useless as it will never run in time

So even people who are world-leaders in RDF and SPARQL write "expensive" and "useless" queries that (already!) are making life difficult for SPARQL endpoint providers.

I believe that situation will only get worse as more people begin to use the Semantic Web, and as SPARQL itself becomes richer and more SQL-like.

In My Opinion

History tells us, and this story IMO supports, that SPARQL endpoints might not be widely adopted by source bioinformatics data providers.

Historically, the majority of bioinformatics data hosts have opted for API/Service-based access to their resources.

In My Opinion

Moreover, I am still obsessed with interoperability!

Having a unified way to discover, and access, bioinformatics resources, whether they be databases or algorithms, just seems like a Good Thing™.

In My Opinion

So we need to find a way to make Web Services play nicely with the Semantic Web.

A Design Pattern for Web Services on the Semantic Web

Part II: SADI "PHILOSOPHY" AND DESIGN

The Semantic Web

[Diagram: two nodes joined by a link labeled "causally related to". The important bit is that the link is explicitly labeled. But what does that label actually mean... ???]

http://semanticscience.org/resource/SIO_000243

SIO_000243:

<owl:ObjectProperty rdf:about="&resource;SIO_000243">
    <rdfs:label xml:lang="en">is causally related with</rdfs:label>
    <rdf:type rdf:resource="&owl;SymmetricProperty"/>
    <rdf:type rdf:resource="&owl;TransitiveProperty"/>
    <dc:description xml:lang="en">A transitive, symmetric, temporal relation in which one entity is causally related with another non-identical entity.</dc:description>
    <rdfs:subPropertyOf rdf:resource="&resource;SIO_000322"/>
</owl:ObjectProperty>

causally related with


There are many suggestions for how to bring the Deep Web into the Semantic Web using Semantic Web Services (SWS):

OWL-S

SAWSDL

WSDL-S

Others...

These frameworks all aim to:

Describe input data

Describe output data

Describe how the system manipulates the data

Describe how the world changes as a result

...usually through "semantic annotation" of XML Schema.

In the least-semantic case, the input and output data are "vanilla" XML.

In the "most semantic" case (WSDL), RDF is converted into XML, then back to RDF again.

The rigidity of XML Schema is the antithesis of the Semantic Web!

So... perhaps we shouldn't be using XML Schema at all...??

And describing how the world changes as a result is HARD! ...and perhaps un-necessary?

Lord, Phillip, et al. "Applying Semantic Web Services to Bioinformatics: Experiences Gained, Lessons Learnt." The Semantic Web–ISWC 2004 (2004): 350-364.

Scientific Web Services are DIFFERENT!

"The service interfaces within bioinformatics are relatively simple. An extensible or constrained interoperability framework is likely to suffice for current demands: a fully generic framework is currently not necessary."

Scientific Web Services are DIFFERENT

They’re simpler!

Rather than waiting for a solution to the more general problem (which may be years away... or more!), can we solve the Semantic Web Service problem within the scientific domain, while still being fully standards-compliant?

Other "philosophical" considerations

Vis-à-vis being Semantic-Webby, what is missing from this list?

Describe input data

Describe output data

Describe how the system manipulates the data

Describe how the world changes as a result

http://semanticscience.org/resource/SIO_000243

causally related with

The Semantic Web works because of relationships!

In 2008 I proposed that, in the Semantic Web world, algorithms should be viewed as "exposing" relationships between the input and output data.

[Diagram: a conventional BLAST Web Service consumes a bare sequence string ("AACTCTTCGTAGTG...") and returns its report. The SADI version of the same BLAST analysis instead returns a graph: the input sequence (linked by has_seq_string to "AACTCTTCGTAGTG...") is connected by the predicate "has homology to" to an output sequence, which belongs to the gene "Terminal Flower" of the species A. thaliana.]

SADI requires you to explicitly declare, as part of your analytical output, the biological relationship that your algorithm "exposed".
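To make that concrete, here is a hypothetical Turtle sketch of SADI-style BLAST output (all class and predicate URIs below are invented for illustration; SADI itself does not prescribe this particular vocabulary):

@prefix ex: <http://example.org/> .

ex:querySequence
    a ex:Sequence ;
    ex:has_seq_string "AACTCTTCGTAGTG..." ;
    ex:has_homology_to ex:hitSequence .        # the relationship BLAST "exposed"

ex:hitSequence
    a ex:Sequence ;
    ex:belongs_to ex:TerminalFlower .

ex:TerminalFlower
    a ex:Gene ;
    ex:from_species ex:Arabidopsis_thaliana .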

Another "philosophical" decision was to abandon XML Schema.

In a world that is moving towards RDF representations of all data, it makes no sense to convert semantically rich RDF into semantic-free Schema-based XML, then back into RDF again.

The final philosophical decision was to abandon SOAP.

The bioinformatics community seems to be very receptive to pure-HTTP interfaces (e.g. the popularity of REST-like APIs).

So SADI uses simple HTTP POST of just the RDF input data (no message scaffold whatsoever).
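As a minimal sketch of what an invocation looks like, the following uses only the Python standard library to POST N3/Turtle input to the SADI "Hello, world" example service (the service URL and media types here are assumptions based on the published SADI examples; adjust them for the service you are actually calling):

import urllib.request

# RDF input: a single individual with a foaf:name, matching the example
# service's input class (see Part V for that class definition).
input_rdf = b"""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/guest> foaf:name "Guest" .
"""

req = urllib.request.Request(
    "http://sadiframework.org/examples/hello",  # assumed example-service URL
    data=input_rdf,
    headers={"Content-Type": "text/rdf+n3", "Accept": "text/rdf+n3"},
    method="POST",
)
with urllib.request.urlopen(req) as response:
    # The response is RDF about the SAME input URI, with new properties attached.
    print(response.read().decode("utf-8"))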

Part III: SADI SERVICE DISCOVERY AND INVOCATION

In slightly more detail...

ID        Name          Height  Weight  Age
24601     Jean Valjean  1.8m    84kg    45
7474505B  Jake Blues    1.73m   101kg   31
6         —             1.88m   75kg    39
...       ...           ...     ...     ...

The service interface is defined using OWL-DL Classes.

Property restrictions in the OWL Class definition describe the data the service consumes.

A reasoner determines that Patient #24601 is an OWL Individual of the service's Input Class.

NOTE THE URI OF THE INPUT INDIVIDUAL: Patient:24601

The service computes a BMI of 25.9 for that patient.

NOTE THE URI OF THE OUTPUT INDIVIDUAL: Patient:24601

The URI of the input is linked by a meaningful predicate to the output (either literal output or another URI).
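A hypothetical before-and-after sketch in Turtle (the prefixes and predicate names are invented for illustration):

@prefix patient: <http://example.org/patients/> .
@prefix ex:      <http://example.org/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Input sent to the service:
patient:24601
    ex:height "1.8"^^xsd:decimal ;
    ex:weight "84"^^xsd:decimal .

# Output returned by the service: the SAME URI, now linked
# to the computed value by a meaningful predicate.
patient:24601
    ex:BMI "25.9"^^xsd:decimal .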

Therefore, by connecting SADI services together in a workflow, you end up with an unbroken chain of Linked Data.

Part IV: SADI TO THE EXTREME: "WEB SCIENCE 2.0"

A proof-of-concept query engine & registry

Objective: answer biologists’ questions

The SHARE registry indexes all of the input/output/relationship triples that can be generated by all known services.

This is how SHARE discovers services.
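As a rough sketch of such an index entry (the vocabulary below is invented for illustration; the real registry has its own schema), consider the BMI service from Part III:

@prefix reg: <http://example.org/registry/> .
@prefix ex:  <http://example.org/terms/> .

<http://example.org/services/calculateBMI>
    a reg:SADIService ;
    reg:inputClass        ex:MeasuredPatient ;  # things with height and weight
    reg:attachesPredicate ex:BMI ;              # the relationship it can generate
    reg:outputClass       ex:PatientWithBMI .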

We wanted to duplicate a real, peer-reviewed bioinformatics analysis simply by building a model in the Web describing what the answer (if one existed) would look like...

...the machine had to make every other decision on its own.

This is the study we chose:

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics 9, 426 (2008).

Original Study Simplified

Using what is known about interactions in fly & yeast, predict new interactions with your protein of interest.

Given a protein P in Species X

Find proteins similar to P in Species Y

Retrieve interactors in Species Y

Sequence-compare Y-interactors with Species X genome

(1) Keep only those with homologue in X

Find proteins similar to P in Species Z

Retrieve interactors in Species Z

Sequence-compare Z-interactors with (1)

Putative interactors in Species X

“Pseudo-code” Abstracted Workflow

Modeling the science...

OWL

ProbableInteractor ≡
    (is homologous to some (Potential Interactor from ModelOrganism1...))
    and
    (is homologous to some (Potential Interactor from ModelOrganism2...))

A Probable Interactor is defined in OWL as a subclass: something that appears as a potential interactor in both comparator model organisms.
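In OWL Manchester syntax, that definition might be sketched roughly as follows (the property and class names are paraphrases of the slide text, not the published ontology):

Class: ProbableInteractor
    EquivalentTo:
        (isHomologousTo some
            (PotentialInteractor and (isFromOrganism some ModelOrganism1)))
        and
        (isHomologousTo some
            (PotentialInteractor and (isFromOrganism some ModelOrganism2)))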

Modeling the science...

In a local data-file, provide the protein we are interested in and the two species we wish to use in our comparison:

taxon:9606     a i:OrganismOfInterest .  # human
uniprot:Q9UK53 a i:ProteinOfInterest .   # ING1
taxon:4932     a i:ModelOrganism1 .      # yeast
taxon:7227     a i:ModelOrganism2 .      # fly

Running the Web Science Experiment

The tricky bit is...

In the abstract, the search for homology is "generic" – ANY protein, ANY model system.

But when the machine does the experiment, it will need to use (at least) two organism-specific resources, because the answer requires information from the two declared species:

taxon:4932 a i:ModelOrganism1 .  # yeast
taxon:7227 a i:ModelOrganism2 .  # fly

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
    ?protein a i:ProbableInteractor .
}

This is the question we ask (the query language here is SPARQL). The PREFIX is the URL of our OWL model (ontology) defining Probable Interactors.

Each relationship (property-restriction) in the OWL Class is then matched with a SADI Service.

The matched SADI Service can generate data that fulfils that property restriction (i.e. produces triples with that S/P/O pattern).
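For example (re-using names from the hypothetical sketches above), a single property restriction corresponds to a single triple pattern:

Restriction in the OWL Class (Manchester syntax, names illustrative only):
    isHomologousTo some PotentialInteractor

The S/P/O pattern a matching SADI service must be able to generate:
    ?input  i:isHomologousTo  ?candidate .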

SHARE chains these SADI services into an analytical workflow...

...the outputs from that workflow are Instances (OWL Individuals) of Probable Interactors.

SHARE derived (and executed) the following workflow automatically

These are different SADI Web Services, selected at run-time based on the same model.

Keys to Success:

1: Use standards

2: Focus on predicates, not classes

3: Use these predicates to define, rather than assert, classes

4: Make sure all URIs resolve, and resolve to something useful

5: Never leave the RDF world... (abandon vanilla XML, even for Web Services!)

6: Use reasoners... Everywhere... Always!

Part V: THE TOOLS AVAILABLE

Part V-A: SERVICE PROVISION

Libraries:
• Perl
• Java
• Python

Plug-in to Protégé:
• Perl service scaffolds
• Java service scaffolds
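Whichever library you choose, a SADI service implementation boils down to: parse the POSTed RDF, attach new properties to the same input URIs, and return the resulting graph. Here is a minimal Python sketch of that pattern using rdflib (this is NOT the SADI library's own API, just the shape of the work those libraries do for you):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF

HELLO = Namespace("http://example.org/hello#")  # hypothetical output vocabulary

def process(input_rdf: str) -> str:
    """Core of a 'hello, world' SADI-style service."""
    g_in = Graph().parse(data=input_rdf, format="turtle")
    g_out = Graph()
    # For every input individual with a foaf:name, attach a greeting
    # to the SAME URI, preserving the chain of Linked Data.
    for person in g_in.subjects(FOAF.name, None):
        name = g_in.value(person, FOAF.name)
        g_out.add((person, HELLO.greeting, Literal("Hello, %s!" % name)))
    return g_out.serialize(format="turtle")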

Part V-B: CLIENTS

SHARE

• you’ve already seen how SHARE works...

Taverna

• Contextual service discovery

• Automatic RDF serialization and deserialization between SADI and non-SADI services

• Note that Taverna is not as rich a client as SHARE. The reason is that SHARE will aggregate and re-reason after every service invocation. There is no (automatic) data aggregation in Taverna.

Using SADI services – building a workflow

1. The next step in the workflow is to find a SADI service that takes the genes from getKEGGGenesByPathway and returns the proteins that those genes code for.

2. Right-click on the service output port and click "Find services that consume KEGG_Record…".

3. Select getUniprotByKeggGene from the list of SADI services and click "Connect".

4. The getUniprotByKeggGene service is added to the workflow and automatically connected to the output from getKEGGGenesByPathway.

5. Add a new workflow output called "protein" and connect the output from the getUniprotByKeggGene service to it.

6. The next step is to find a SADI service that takes the proteins and returns their sequences. Right-click on the "encodes" output port and click "Find services that consume UniProt_Record…".

7. The "UniProt info" service attaches the property hasSequence, so select this service and click "Connect".

8. The "UniProt info" service is added to the workflow and automatically connected to the output from getUniprotByKeggGene.

9. Add a new workflow output called "sequence" and connect the hasSequence output from the "UniProt info" service to it.

10. The KEGG pathway we're interested in is "hsa00232", so we'll add it as a constant value. Right-click on the KEGG_PATHWAY_Record input port and click "Constant value".

11. Enter the value hsa00232 and click OK.

12. The workflow is now complete and ready to run.

IO Informatics Knowledge Explorer plug-in

• “Bootstrapping” of semantics using known URI schema (identifiers.org, LSRN, Bio2RDF, etc.)

• Contextual service discovery

• Automatic packaging of appropriate data from your data-store and automated service invocation using that data.

• This uses some not-widely-known services and metadata that is in the SHARE registry!!

The SADI plug-in to the IO Informatics' Knowledge Explorer

...a quick explanation of how we "boot-strap" semantics...

The Knowledge Explorer Personal Edition, and the SADI plug-in, are freely available.

Sentient Knowledge Explorer is a retrieval, integration, visualization, query, and exploration environment for semantically rich data.

Most imported data-sets will already have properties (e.g. “encodes”)

…and the data will already be typed (e.g. “Gene” or “Protein”)

…so finding SADI Services to consume that data is ~trivial

Now what...??

No properties...

No rdf:type...

How do I find a service using that node?

What *is* that node anyway??

In the case of LSRN URIs, they resolve to:

<lsrn:DragonDB_Locus_Record rdf:about="http://lsrn.org/DragonDB_Locus:CHO">
    <dc:identifier>CHO</dc:identifier>
    <sio:SIO_000671> <!-- has identifier -->
        <lsrn:DragonDB_Locus_Identifier>
            <sio:SIO_000300>CHO</sio:SIO_000300> <!-- has value -->
        </lsrn:DragonDB_Locus_Identifier>
    </sio:SIO_000671>
</lsrn:DragonDB_Locus_Record>

The Semantic Science Integrated Ontology (Dumontier) has a model for how to describe database records, including explicitly making the record identifier an attribute of that record; in our LSRN metadata, we also explicitly rdf:type both records and identifiers.

Now we have enough information to start exploring global data...

Menu option provided by the plugin

Discovered the (only) service that consumes these kinds of records

Output is added to the graph (with some extra logic to make visualization of complex data structures a bit easier)

Lather, rinse, repeat...

...and of course, these links are “live”

What about URIs other than LSRN?

HTTP POST the URI to the SHARE Resolver Service.

It will (try to) return SIO-compliant RDF metadata about that URI (this is a typical SADI service).

The resolver currently recognizes a few different shared-URI schemes (e.g. Bio2RDF, Identifiers.org) and can be updated with new patterns.

Next problem:

Knowledge Explorer, and therefore the plug-in, are written in C#.

All of our interfaces are described in OWL.

C# reasoners are extremely limited at this time.

This problem manifests itself in two ways:

1. An individual on the KE canvas has all the properties required by a Service in the registry, but is not rdf:typed as that Service's input type. How do you discover that Service so that you can add it to the menu?

2. For a selected Service from the menu, how does the plug-in know which data-elements it needs to extract from KE to send to that service in order to fulfil its input property-restrictions?

If I select a canvas node, and ask SADI to find services, it will...

The get_sequence_for_region service required ALL of this (hidden) information

Nevertheless:

(a) The service can be discovered based on JUST this node selection

(b) The service can be invoked based on JUST this node selection

Voila!

How did the plug-in discover the service, and determine which data was required to access that service based on an OWL Class definition, without a reasoner?

Service Description

INPUT OWL Class (NamedIndividual): things with a "name" property from the "foaf" ontology

OUTPUT OWL Class (GreetedIndividual): things with a "greeting" property from the "hello" ontology

INDEX

The service provides a "greeting" property based on a "name" property
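Rendered in Turtle, that input class is essentially one property restriction (a sketch only; the actual ontology URIs and cardinality details may differ):

@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix hello: <http://example.org/hello#> .

hello:NamedIndividual
    a owl:Class ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty foaf:name ;
        owl:minCardinality 1    # things with at least one foaf:name
    ] .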

Registry

SELECT ?x ?y
FROM <knowledge_explorer_database>
WHERE { ?x foaf:name ?y }

Convert Input OWL Class def’ninto an ~equivalent SPARQL query

Store togetherwith index

Just to ensure that I don't over-trivialize this point, the REAL SPARQL query that extracts the input for this service is...

CONSTRUCT {
  ?input a <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#BiopolymerRegion> .
  ?input <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#position> ?position .
  ?position a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#RangedSequencePosition> .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?start .
  ?start a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#StartPosition> .
  ?start <http://semanticscience.org/resource/SIO_000300> ?startValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?end .
  ?end a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#EndPosition> .
  ?end <http://semanticscience.org/resource/SIO_000300> ?endValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#in_relation_to> ?sequence .
  ?sequence <http://semanticscience.org/resource/SIO_000210> ?feature .
  ?feature <http://semanticscience.org/resource/SIO_000008> ?identifier .
  ?identifier <http://semanticscience.org/resource/SIO_000300> ?featureID .
  ?sequence <http://semanticscience.org/resource/SIO_000210> ?strand .
  ?strand <http://semanticscience.org/resource/SIO_000093> ?strandFeature .
  ?strandFeature a ?strandFeatureType .
  ?strandFeature <http://semanticscience.org/resource/SIO_000008> ?strandFeatureIdentifier .
  ?strandFeatureIdentifier <http://semanticscience.org/resource/SIO_000300> ?strandFeatureID .
  ?strand a ?strandType .
} WHERE {
  ?input <http://sadiframework.org/ontologies/GMOD/BiopolymerRegion.owl#position> ?position .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?start .
  ?start a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#StartPosition> .
  ?start <http://semanticscience.org/resource/SIO_000300> ?startValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#coordinate> ?end .
  ?end a <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#EndPosition> .
  ?end <http://semanticscience.org/resource/SIO_000300> ?endValue .
  ?position <http://sadiframework.org/ontologies/GMOD/RangedSequencePosition.owl#in_relation_to> ?sequence .
  {
    ?sequence <http://semanticscience.org/resource/SIO_000210> ?feature .
    ?feature <http://semanticscience.org/resource/SIO_000008> ?identifier .
    ?identifier <http://semanticscience.org/resource/SIO_000300> ?featureID .
  } UNION {
    ?sequence <http://semanticscience.org/resource/SIO_000210> ?strand .
    ?strand <http://semanticscience.org/resource/SIO_000093> ?strandFeature .
    {
      ?strandFeature a <http://sadiframework.org/ontologies/GMOD/Feature.owl#Feature> .
    } UNION {
      ?strandFeature <http://semanticscience.org/resource/SIO_000008> ?strandFeatureIdentifier .
      ?strandFeatureIdentifier <http://semanticscience.org/resource/SIO_000300> ?strandFeatureID .
    } .
    {
      ?strand a <http://sadiframework.org/ontologies/GMOD/Strand.owl#PlusStrand> .
      ?strand a ?strandType .
    } UNION {
      ?strand a <http://sadiframework.org/ontologies/GMOD/Strand.owl#MinusStrand> .
      ?strand a ?strandType .
    } .
  } .
}

Summary

While the Knowledge Explorer plug-in has similar functionality to other tools we have built for SADI, it takes advantage of some features of the SADI Registry, and SADI in general, that are not widely known.

We hope that the availability of these features encourages development of SADI tooling in other languages that have limited access to reasoning.

Luke McCarthy, Lead Developer, SADI project

Benjamin VanderValk, Developer, SADI project