WWW 2010 Tutorial "How to Consume Linked Data on the Web"
Querying Linked Data with SPARQL
Brief Introduction to SPARQL
● SPARQL: query language for RDF data*
● Main idea: pattern matching
  ● Describe subgraphs of the queried RDF graph
  ● Subgraphs that match your description yield a result
  ● Means: graph patterns (i.e. RDF graphs with variables)

  ?v  --rdf:type-->  http://.../Volcano
* http://www.w3.org/TR/rdf-sparql-query/
Brief Introduction to SPARQL

Graph pattern:  ?v  --rdf:type-->  http://.../Volcano

Queried graph:
  http://.../Mount_Baker  --rdf:type-->        http://.../Volcano
  http://.../Mount_Baker  --p:lastEruption-->  "1880"
  http://.../Mount_Etna   --rdf:type-->        http://.../Volcano

Results:
  ?v
  ----------------------
  http://.../Mount_Baker
  http://.../Mount_Etna
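The pattern-matching idea can be sketched in a few lines of Python: a toy matcher over an in-memory set of triples, using the volcano example from above (`match_pattern` is an illustrative helper, not a real SPARQL engine):

```python
# Toy illustration of SPARQL-style pattern matching.
# A triple pattern is a 3-tuple; strings starting with '?' are variables.

def match_pattern(triples, pattern):
    """Return one binding dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):          # variable: bind it
                binding[p] = t
            elif p != t:                   # constant: must match exactly
                break
        else:
            results.append(binding)
    return results

graph = [
    ("http://.../Mount_Baker", "rdf:type", "http://.../Volcano"),
    ("http://.../Mount_Baker", "p:lastEruption", '"1880"'),
    ("http://.../Mount_Etna",  "rdf:type", "http://.../Volcano"),
]

print(match_pattern(graph, ("?v", "rdf:type", "http://.../Volcano")))
# [{'?v': 'http://.../Mount_Baker'}, {'?v': 'http://.../Mount_Etna'}]
```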
SPARQL Endpoints
● Linked Data sources usually provide a SPARQL endpoint for their dataset(s)
● SPARQL endpoint: a SPARQL query processing service that supports the SPARQL protocol*
● Send your SPARQL query, receive the result
* http://www.w3.org/TR/rdf-sparql-protocol/
SPARQL Endpoints
More complete list: http://esw.w3.org/topic/SparqlEndpoints
Data Source Endpoint Address
DBpedia http://dbpedia.org/sparql
Musicbrainz http://dbtune.org/musicbrainz/sparql
U.S. Census http://www.rdfabout.com/sparql
Semantic Crunchbase http://cb.semsol.org/sparql
Accessing a SPARQL Endpoint
● SPARQL endpoints are RESTful Web services
● Issuing a SPARQL query to a remote SPARQL endpoint is basically an HTTP GET request to the endpoint with the parameter query:

GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1

The query parameter carries the SPARQL query as a URL-encoded string.
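In Python, for example, such a request URL can be assembled with the standard library alone (the query below is a made-up example):

```python
from urllib.parse import urlencode

endpoint = "http://dbpedia.org/sparql"
# Hypothetical example query; it must be sent URL-encoded
# in the 'query' parameter of the GET request.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

url = endpoint + "?" + urlencode({"query": query})
print(url)
# http://dbpedia.org/sparql?query=SELECT+%3Fs+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D+LIMIT+5
```

Passing `url` to `urllib.request.urlopen` would then execute the query against the endpoint.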
Query Results Formats
● SPARQL endpoints usually support different result formats:
  ● XML, JSON, plain text (for ASK and SELECT queries)
  ● RDF/XML, N-Triples, Turtle, N3 (for DESCRIBE and CONSTRUCT queries)
Query Results Formats

PREFIX dbp: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?name ?bday WHERE {
  ?p dbp:birthplace <http://dbpedia.org/resource/Berlin> .
  ?p dbpprop:dateOfBirth ?bday .
  ?p dbpprop:name ?name .
}

 name                   | bday
------------------------+------------
 Alexander von Humboldt | 1769-09-14
 Ernst Lubitsch         | 1892-01-28
 ...
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
    <variable name="bday"/>
  </head>
  <results distinct="false" ordered="true">
    <result>
      <binding name="name">
        <literal xml:lang="en">Alexander von Humboldt</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1769-09-14</literal>
      </binding>
    </result>
    <result>
      <binding name="name">
        <literal xml:lang="en">Ernst Lubitsch</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1892-01-28</literal>
      </binding>
    </result>
    <!-- … -->
  </results>
</sparql>
http://www.w3.org/TR/rdf-sparql-XMLres/
{
  "head": { "link": [], "vars": ["name", "bday"] },
  "results": {
    "distinct": false,
    "ordered": true,
    "bindings": [
      { "name": { "type": "literal", "xml:lang": "en",
                  "value": "Alexander von Humboldt" },
        "bday": { "type": "typed-literal",
                  "datatype": "http://www.w3.org/2001/XMLSchema#date",
                  "value": "1769-09-14" } },
      { "name": { "type": "literal", "xml:lang": "en",
                  "value": "Ernst Lubitsch" },
        "bday": { "type": "typed-literal",
                  "datatype": "http://www.w3.org/2001/XMLSchema#date",
                  "value": "1892-01-28" } }
      // ...
    ]
  }
}
http://www.w3.org/TR/rdf-sparql-json-res/
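A result document in this format is straightforward to process; for example, in Python with only the standard library (abbreviated version of the result above):

```python
import json

# SPARQL JSON result document (abbreviated example)
doc = json.loads("""{
  "head": { "vars": ["name", "bday"] },
  "results": { "bindings": [
    { "name": { "type": "literal", "xml:lang": "en",
                "value": "Alexander von Humboldt" },
      "bday": { "type": "typed-literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#date",
                "value": "1769-09-14" } }
  ] }
}""")

# Flatten each binding into a plain {variable: value} dict
rows = [{var: b[var]["value"] for var in b}
        for b in doc["results"]["bindings"]]
print(rows)
# [{'name': 'Alexander von Humboldt', 'bday': '1769-09-14'}]
```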
Query Result Formats
● Use the HTTP Accept header to request the preferred result format:

GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
Accept: application/sparql-results+json
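With Python's standard library, setting the header might look like this (request construction only; `urlopen(req)` would actually send it):

```python
import urllib.request
from urllib.parse import urlencode

endpoint = "http://dbpedia.org/sparql"
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"  # placeholder query

# Ask the endpoint for JSON results via the Accept header.
req = urllib.request.Request(
    endpoint + "?" + urlencode({"query": query}),
    headers={"Accept": "application/sparql-results+json",
             "User-agent": "my-sparql-client/0.1"})

print(req.get_header("Accept"))
# application/sparql-results+json
```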
Query Result Formats
● As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out:

GET /sparql?out=json&query=... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
Accessing a SPARQL Endpoint
● More convenient: use a library
● Libraries:
  ● SPARQL JavaScript Library
    http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
  ● ARC for PHP
    http://arc.semsol.org/
  ● RAP – RDF API for PHP
    http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html
Accessing a SPARQL Endpoint
● Libraries (cont.):
  ● Jena / ARQ (Java) http://jena.sourceforge.net/
  ● Sesame (Java) http://www.openrdf.org/
  ● SPARQL Wrapper (Python) http://sparql-wrapper.sourceforge.net/
  ● PySPARQL (Python) http://code.google.com/p/pysparql/
Accessing a SPARQL Endpoint
● Example with Jena / ARQ:
import com.hp.hpl.jena.query.*;

String service = "..."; // address of the SPARQL endpoint
String query = "SELECT ..."; // your SPARQL query

QueryExecution e = QueryExecutionFactory.sparqlService( service, query );
ResultSet results = e.execSelect();
while ( results.hasNext() ) {
    QuerySolution s = results.nextSolution();
    // ...
}
e.close();
● Querying a single dataset is quite boring compared to issuing SPARQL queries over multiple datasets
● How can you do this?
  1. Issue follow-up queries to different endpoints
  2. Query a central collection of datasets
  3. Build a store with copies of relevant datasets
  4. Use a query federation system
Follow-up Queries
● Idea: issue follow-up queries over other datasets based on results from previous queries
● Substituting placeholders in query templates
Find a list of companies filtered by some criteria, return their DBpedia URIs (first query), and look up a comment for each of them at DBpedia (follow-up queries):

String s1 = "http://cb.semsol.org/sparql";
String s2 = "http://dbpedia.org/sparql";

String qTmpl = "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

String q1 = "SELECT ?s WHERE { ...";
QueryExecution e1 = QueryExecutionFactory.sparqlService( s1, q1 );
ResultSet results1 = e1.execSelect();
while ( results1.hasNext() ) {
    QuerySolution sol = results1.nextSolution();
    String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
    QueryExecution e2 = QueryExecutionFactory.sparqlService( s2, q2 );
    ResultSet results2 = e2.execSelect();
    while ( results2.hasNext() ) {
        // ...
    }
    e2.close();
}
e1.close();
Follow-up Queries
● Advantage:
  ● Queried data is up to date
● Drawbacks:
  ● Requires the existence of a SPARQL endpoint for each dataset
  ● Requires program logic
  ● Very inefficient
Querying a Collection of Datasets
● Idea: Use an existing SPARQL endpoint that provides access to a set of copies of relevant datasets
● Example:
  ● SPARQL endpoint by OpenLink SW over a majority of datasets from the LOD cloud at:
    http://lod.openlinksw.com/sparql
Querying a Collection of Datasets
● Advantage:
  ● No need for specific program logic
● Drawbacks:
  ● Queried data might be out of date
  ● Not all relevant datasets might be in the collection
Own Store of Dataset Copies
● Idea: Build your own store with copies of relevant datasets and query it
● Possible stores:
  ● Jena TDB http://jena.hpl.hp.com/wiki/TDB
  ● Sesame http://www.openrdf.org/
  ● OpenLink Virtuoso http://virtuoso.openlinksw.com/
  ● 4store http://4store.org/
  ● AllegroGraph http://www.franz.com/agraph/
  ● etc.
Populating Your Store
● Get the RDF dumps provided for the datasets
● (Focused) crawling
  ● ldspider http://code.google.com/p/ldspider/
    ● Multithreaded API for focused crawling
    ● Crawling strategies (breadth-first, load balancing)
    ● Flexible configuration with callbacks and hooks
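The breadth-first strategy can be illustrated with a minimal, self-contained sketch: an in-memory link graph stands in for dereferencing URIs, and the URIs are invented (this is not ldspider's API):

```python
from collections import deque

# Toy "Web": URI -> URIs linked from the retrieved RDF (hypothetical data)
LINKS = {
    "http://ex.org/a": ["http://ex.org/b", "http://ex.org/c"],
    "http://ex.org/b": ["http://ex.org/c", "http://ex.org/d"],
    "http://ex.org/c": [],
    "http://ex.org/d": ["http://ex.org/a"],
}

def crawl(seed, lookup, max_uris=100):
    """Breadth-first traversal starting from a seed URI."""
    seen, queue = {seed}, deque([seed])
    order = []
    while queue and len(order) < max_uris:
        uri = queue.popleft()
        order.append(uri)               # here: dereference URI, store triples
        for link in lookup(uri):
            if link not in seen:        # frontier: follow each link once
                seen.add(link)
                queue.append(link)
    return order

print(crawl("http://ex.org/a", lambda u: LINKS.get(u, [])))
# ['http://ex.org/a', 'http://ex.org/b', 'http://ex.org/c', 'http://ex.org/d']
```

A focused crawler would additionally filter the followed links by some relevance criterion instead of taking all of them.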
Own Store of Dataset Copies
● Advantages:
  ● No need for specific program logic
  ● Can include all datasets
  ● Independent of the existence, availability, and efficiency of SPARQL endpoints
● Drawbacks:
  ● Requires effort to set up and to operate the store
  ● Ideally, data sources provide RDF dumps; if not, the data must be crawled
  ● How to keep the copies in sync with the originals?
  ● Queried data might be out of date
Federated Query Processing
● Idea: query a mediator which distributes subqueries to the relevant sources and integrates the results
Federated Query Processing
● Instance-based federation
  ● Each thing is described by only one data source
  ● Untypical for the Web of Data
● Triple-based federation
  ● No restrictions
  ● Requires more distributed joins
● Statistics about the datasets are required (in both cases)
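The distributed-join aspect can be illustrated with a toy mediator: each source answers one triple pattern with variable bindings, and the mediator joins the binding sets on their shared variables (all names and data below are invented):

```python
# Two "sources", each answering a single triple pattern with bindings.
source1 = [{"?c": "http://geo.example/Italy"},
           {"?c": "http://geo.example/Spain"}]            # e.g. ?m filmedIn ?c
source2 = [{"?c": "http://geo.example/Italy", "?u": "8.4"}]  # e.g. ?c unempRate ?u

def join(left, right):
    """Nested-loop join of two binding sets on their shared variables."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[v] == r[v] for v in shared):
                out.append({**l, **r})       # compatible: merge the bindings
    return out

print(join(source1, source2))
# [{'?c': 'http://geo.example/Italy', '?u': '8.4'}]
```

A real mediator would additionally decide, based on dataset statistics, which sources can contribute to which triple pattern before shipping the subqueries.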
Federated Query Processing
● DARQ (Distributed ARQ) http://darq.sourceforge.net/
  ● Query engine for federated SPARQL queries
  ● Extension of ARQ (the query engine for Jena)
  ● Last update: June 28, 2006
Federated Query Processing
● Semantic Web Integrator and Query Engine (SemWIQ) http://semwiq.sourceforge.net/
  ● Actively maintained by Andreas Langegger
Federated Query Processing
● Advantages:
  ● No need for specific program logic
  ● Queried data is up to date
● Drawbacks:
  ● Requires the existence of a SPARQL endpoint for each dataset
  ● Requires effort to set up and configure the mediator
In any case:
● You have to know the relevant data sources
  ● When developing the app using follow-up queries
  ● When selecting an existing SPARQL endpoint over a collection of dataset copies
  ● When setting up your own store with a collection of dataset copies
  ● When configuring your query federation system
● You restrict yourself to the selected sources

There is an alternative. Remember: URIs link to data.
Automated Link Traversal
Automated Link Traversal
● Idea: Discover further data by looking up relevant URIs in your application
● Can be combined with the previous approaches
Link Traversal Based Query Execution
● Applies the idea of automated link traversal to the execution of SPARQL queries
● Idea:
  ● Intertwine query evaluation with the traversal of RDF links
  ● Discover data that might contribute to query results during query execution
● Alternate between:
  ● Evaluating parts of the query
  ● Looking up URIs in intermediate solutions
Link Traversal Based Query Execution
● Example: return the unemployment rate of the countries in which the movie http://mymovie.db/movie2449 was filmed.

SELECT ?c ?u WHERE {
  <http://mymovie.db/movie2449> mov:filming_location ?c .
  ?c geo:statistics ?cStats .
  ?cStats stat:unempRate ?u . }
Walkthrough (the queried data is empty at first and grows as links are traversed):
1. Look up the URI http://mymovie.db/movie2449 mentioned in the query and add the retrieved RDF graph to the queried data.
2. The retrieved data contains the triple
     <http://mymovie.db/movie2449> mov:filming_location <http://geo.../Italy> .
   Evaluating the first triple pattern over the queried data binds ?c to http://geo.../Italy.
3. Look up http://geo.../Italy and add the retrieved graph to the queried data.
4. The newly retrieved data contains the triple
     <http://geo.../Italy> geo:statistics <http://example.db/stat/IT> .
   Evaluating the second triple pattern binds ?cStats to http://example.db/stat/IT.
5. Proceed with this strategy (traverse RDF links during query execution).
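A drastically simplified sketch of this strategy in Python: a toy "Web" maps URIs to the triples their lookup returns, and the engine alternates between evaluating one triple pattern and dereferencing the URI bound by it (all URIs and data below are invented; a real engine such as SQUIN handles arbitrary patterns, not just chains):

```python
# Toy Web of Data: URI -> triples obtained by dereferencing that URI.
WEB = {
    "http://mymovie.example/movie2449": [
        ("http://mymovie.example/movie2449", "mov:filming_location",
         "http://geo.example/Italy"),
    ],
    "http://geo.example/Italy": [
        ("http://geo.example/Italy", "geo:statistics",
         "http://stat.example/IT"),
    ],
    "http://stat.example/IT": [
        ("http://stat.example/IT", "stat:unempRate", '"8.4"'),
    ],
}

def traverse(start, patterns):
    """Follow one chain of triple patterns, dereferencing each
    intermediate URI (a drastically simplified link traversal)."""
    data, node = [], start
    for _, predicate, _ in patterns:
        data.extend(WEB.get(node, []))          # look up the current URI
        matches = [t for t in data
                   if t[0] == node and t[1] == predicate]
        if not matches:
            return None
        node = matches[0][2]                    # bind the object, continue
    return node

query = [("?m", "mov:filming_location", "?c"),
         ("?c", "geo:statistics", "?cStats"),
         ("?cStats", "stat:unempRate", "?u")]

print(traverse("http://mymovie.example/movie2449", query))
# "8.4"
```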
Link Traversal Based Query Execution
● Advantages:
  ● No need to know all data sources in advance
  ● No need for specific programming logic
  ● Queried data is up to date
  ● Does not depend on the existence of SPARQL endpoints provided by the data sources
● Drawbacks:
  ● Not as fast as a centralized collection of copies
  ● Unsuitable for some queries
  ● Results might be incomplete
Implementations
● Semantic Web Client Library (SWClLib) for Java http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
● SWIC for Prolog http://moustaki.org/swic/
Implementations
● SQUIN http://squin.org
  ● Provides SWClLib functionality as a Web service
  ● Accessible like a SPARQL endpoint
  ● Public SQUIN service at: http://squin.informatik.hu-berlin.de/SQUIN/
  ● Install package: unzip and start
  ● Convenient access with the SQUIN PHP tools:

$s = 'http:// …'; // address of the SQUIN service
$q = new SparqlQuerySock( $s, '… SELECT ...' );
$res = $q->getJsonResult(); // or getXmlResult()
Real-World Examples

Return the phone numbers of authors of ontology engineering papers at ESWC'09:

SELECT DISTINCT ?author ?phone WHERE {
  ?pub swc:isPartOf <http://data.semanticweb.org/conference/eswc/2009/proceedings> .
  ?pub swc:hasTopic ?topic .
  ?topic rdfs:label ?topicLabel .
  FILTER regex( str(?topicLabel), "ontology engineering", "i" ) .
  ?pub swrc:author ?author .
  { ?author owl:sameAs ?authorAlt } UNION { ?authorAlt owl:sameAs ?author }
  ?authorAlt foaf:phone ?phone .
}

# of query results:     2
# of retrieved graphs:  297
# of accessed servers:  16
avg. execution time:    1 min 30 sec