
Seminario Cristian Lai, 06-09-2012

Date post: 08-May-2015
Upload: crs4-research-center-in-sardinia
Description:
The seminar presents the emerging topic of the Web of Data, within the Semantic Web. It examines the difficulties encountered in accessing the enormous amount of information currently available on the Web and the advantages of an approach based on the interactive construction of queries.

Query modeling and information retrieval within the Web of Data

Cristian Lai

CRS4

september 6, 2012 1 / 37


Outline

• Motivation
• Unstructured Data
• Structured Data
• Query building
• Applications
• Conclusion


Context: Semantic Web

http://www.w3.org/2006/Talks/1023-sb-W3CTechSemWeb/


Motivation: Search on the Web

http://www.slideshare.net/novaspivack/web-evolution-nova-spivack-twine


Wikipedia

• Started in 2001.
• Is a multilingual, web-based, free-content encyclopedia project based on an openly editable model.
• Is the 5th most visited site on the web and served 454 million unique visitors monthly as of March 2011.
• Has fewer than 100 employees.
• Holds an annual fundraiser instead of accepting advertising. You may have seen "A personal appeal from Wikipedia founder Jimmy Wales" if you used the online encyclopedia during the last weeks of 2011. Google co-founder Sergey Brin and his wife, Anne Wojcicki, gave a 500,000-dollar grant to help Wikipedia fund its 28.3-million-dollar annual budget.


Wikipedia

• Pros:
  ◦ Is a highly efficient not-for-profit organization.
  ◦ Is the finest example of truly collaboratively created content: >19M articles; >270 languages, >82k active contributors.
  ◦ Covers many topics and domains; articles are the result of community consensus.
• Cons:
  ◦ Contains many inconsistencies.
    • Disclaimer: "Wikipedia cannot guarantee the validity of the information found here."
  ◦ Is not very well integrated with other data sources.
  ◦ Queries and search are not facilitated, due to the lack of a structured representation.


Issues

• Unstructured data, keyword-based search.
• Simple questions are hard to answer:
  ◦ People who were born in Rome before 1900.
  ◦ Italian musicians with English and French descriptions.
  ◦ The official websites of companies with more than 500 employees.
• The information required to answer these is contained in Wikipedia.
• Transforming Wikipedia into a knowledge base:
  ◦ To reveal the structure and semantics of Wikipedia content.
  ◦ The DBpedia project.


Structure in Wikipedia

• Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates, and links to external Web pages.
• Title
• Abstract
• Infobox Template
• Geo-coordinates
• Categories
• Images
• Links
  ◦ other language versions
  ◦ other Wikipedia pages
  ◦ redirects
  ◦ disambiguation


Structured Information in Wikipedia


RDF representation: Knowledge Base

dbp:Cagliari rdf:type dbp:City
dbp:Cagliari dbp:Title "Cagliari"
dbp:Cagliari dbp:Country dbp:Italy
dbp:Cagliari dbp:postalCode 09100
dbp:Cagliari geo:lat "39.246387"^^xsd:float
dbp:Cagliari geo:long "9.057500"^^xsd:float
dbp:Cagliari rdf:type yago:MediterraneanPortCitiesAndTownsInItaly
...

• An environment for collecting and structuring data.
• Well-defined classification structure.


RDF

• Triples: (subject, predicate, object)
• Subject and object
  ◦ are either both URIs that each identify a resource, or a URI and a string literal respectively.
• Predicate
  ◦ specifies how the subject and object are related, and is also represented by a URI.
• For example:
  ◦ A knows B
  ◦ C isAuthorOf D
  ◦ Two resources linked in this fashion can be drawn from different data sets on the Web, allowing data in one data source to be linked to that in another, thereby creating a Web of Data.
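The triple model can be sketched in a few lines of Python. This is only an illustration, not an RDF library: triples are plain tuples, and the `http://example.org/...` URIs and the `people` and `books` data sets are hypothetical names introduced for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are URIs, the object is a URI or a literal."""
    subject: str
    predicate: str
    obj: str


# Two hypothetical data sets, imagined as published by different providers.
people = [
    Triple("http://example.org/people/A", "http://example.org/vocab/knows",
           "http://example.org/people/B"),
]
books = [
    Triple("http://example.org/people/B", "http://example.org/vocab/isAuthorOf",
           "http://example.org/books/D"),
]


def objects(triples, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


# Both data sets name B with the same URI, so concatenating them yields one graph
# and a path can be followed across the original data set boundary.
web_of_data = people + books
friends = objects(web_of_data, "http://example.org/people/A", "http://example.org/vocab/knows")
works = [d for f in friends
         for d in objects(web_of_data, f, "http://example.org/vocab/isAuthorOf")]
print(works)  # prints ['http://example.org/books/D']
```

The shared URI for B is what links the two sources; with different identifiers for the same person, the second lookup would return nothing.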


DBpedia

• Started in 2007.
• Is the result of a community effort to extract structured information from Wikipedia.
• Makes Wikipedia data available as RDF.
• Results: The DBpedia Data Set
  ◦ describes 3.64 million "things" with over half a billion "facts" (July 2011), including 364k persons, 462k places, 99k music albums, 54k films, 148k organisations;
  ◦ extraction in 97 different languages;
  ◦ 672M RDF triples.
• It is maintained by Universität Leipzig, Freie Universität Berlin, and OpenLink Software, Inc.
• See http://wiki.dbpedia.org/Team


Nucleus of the Web of Data

• Within the W3C Linking Open Data (LOD) community effort.
• Tim Berners-Lee's Linked Data principles:
  ◦ URI
  ◦ HTTP
  ◦ RDF, SPARQL
  ◦ Interlinking among data providers
• An increasing number of data providers have started to publish and interlink data on the Web.
• The resulting Web of Data comprises several billion RDF triples and covers domains such as geographic information, people, companies, online communities, films, music, books and scientific publications.


LOD Datasets


SPARQL Query Language

• RDF is a directed, labeled graph data format for representing information (including on the Web).
• SPARQL is a language for querying RDF graphs by specifying templates against which graph components are compared. Data that matches a template is returned by the query.
• A triple template contains variables that stand for triple components (e.g., ?s, ?p, or ?o within a triple).
• Example:
  ◦ ?person ex:age "20"^^xsd:integer .
  ◦ Identifies the triple subjects that have an ex:age property of "20". Analogous to asking "Who has age 20?".
  ◦ The SPARQL query engine returns the list of subjects of the triples that satisfy the template, obtained through value substitution.
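Template matching through value substitution can be sketched as follows. This is a toy matcher for a single triple template over an in-memory list, not how a real SPARQL engine is implemented; the `ex:alice`, `ex:bob` and `ex:carol` resources are hypothetical, and literals are kept as plain strings for simplicity.

```python
def is_variable(term):
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def match(pattern, triples):
    """Return one {variable: value} binding per triple that satisfies `pattern`."""
    solutions = []
    for triple in triples:
        binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_variable(p_term):
                # A variable repeated within the pattern must bind to the same value each time.
                if binding.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                # A constant in the pattern must equal the triple component exactly.
                break
        else:
            solutions.append(binding)
    return solutions


# Hypothetical data set.
data = [
    ("ex:alice", "ex:age", "20"),
    ("ex:bob", "ex:age", "35"),
    ("ex:carol", "ex:age", "20"),
]

print(match(("?person", "ex:age", "20"), data))
# prints [{'?person': 'ex:alice'}, {'?person': 'ex:carol'}]
```

Each returned dictionary is one substitution of values for variables that turns the template into a triple present in the data.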


SPARQL Queries

SELECT variables_list
FROM <RDF_source_URL>
WHERE {
  { triple_pattern_1 .
    ...
    triple_pattern_n . } .
}

SELECT ?person
FROM <http://ex.com>
WHERE {
  ?person ex:age "20"^^xsd:integer .
}

?person
------------------
_p1
_p2
...
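The SELECT template on this slide can also be filled in programmatically, which is the basic step behind any query-building tool. The sketch below is a minimal illustration: `build_select` is a hypothetical helper, it performs no validation or escaping, and the typed literal is written with `^^` as SPARQL syntax requires.

```python
def build_select(variables, patterns, source=None):
    """Assemble a SPARQL SELECT query string from its parts.

    variables: variable names without the leading '?'
    patterns:  (subject, predicate, object) tuples already written in SPARQL syntax
    source:    optional URL of the RDF graph for the FROM clause
    """
    lines = ["SELECT " + " ".join("?" + v for v in variables)]
    if source is not None:
        lines.append("FROM <" + source + ">")
    lines.append("WHERE {")
    for s, p, o in patterns:
        lines.append("  {} {} {} .".format(s, p, o))
    lines.append("}")
    return "\n".join(lines)


query = build_select(
    ["person"],
    [("?person", "ex:age", '"20"^^xsd:integer')],
    source="http://ex.com",
)
print(query)
```

Keeping the patterns as data until the last moment lets an interface add or remove them one at a time before the query text is generated.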


The DBpedia SPARQL endpoint

• All data sets are available for queries via the DBpedia SPARQL endpoint (http://dbpedia.org/sparql).
• Querying the data set:
  ◦ ...
  ◦ Abstracts of movies starring Tom Cruise, released before 1999.
  ◦ The official websites of companies with more than 50000 employees.
  ◦ Cities with more than 2 million inhabitants.
  ◦ ...
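A query can be sent to the endpoint with a plain HTTP GET, as defined by the SPARQL protocol. The following is a sketch using only the Python standard library: `build_request`, `run_query` and `EXAMPLE_QUERY` are illustrative names, and whether the endpoint is reachable and how it responds at any given time is an assumption.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Create an HTTP GET request following the SPARQL protocol.

    The query text travels in the `query` URL parameter; the Accept header
    asks for the standard SPARQL JSON results format.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the decoded JSON document (requires network access)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)


# Illustrative query: the classes assigned to the Cagliari resource.
EXAMPLE_QUERY = "SELECT ?type WHERE { <http://dbpedia.org/resource/Cagliari> a ?type } LIMIT 5"
request = build_request(EXAMPLE_QUERY)
print(request.full_url)
```

Calling `run_query(EXAMPLE_QUERY)` would perform the actual network request; it is kept separate from `build_request` so the request can be inspected without contacting the server.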


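Abstracts of movies starring Tom Cruise, released before 1999

# Prefixes used below; dbpedia2 is the DBpedia property namespace.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia2: <http://dbpedia.org/property/>

SELECT ?subject ?label ?released ?abstract WHERE {
  ?subject rdf:type <http://dbpedia.org/ontology/Film> .
  ?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise> .
  ?subject rdfs:comment ?abstract .
  ?subject rdfs:label ?label .
  FILTER(lang(?abstract) = "en" && lang(?label) = "en") .
  ?subject <http://dbpedia.org/ontology/releaseDate> ?released .
  # "Before 1999" means strictly earlier than 1 January 1999.
  FILTER(xsd:date(?released) < "1999-01-01"^^xsd:date) .
} ORDER BY ?released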
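A SELECT query such as this one returns a table of variable bindings. When the answer is requested in the W3C SPARQL JSON results format, it can be flattened into rows as sketched below. The `SAMPLE` document is hypothetical: its film titles and date are placeholders, not actual DBpedia output, and `rows` is an illustrative helper.

```python
import json

# Hypothetical response in the SPARQL JSON results format (placeholder values).
SAMPLE = """
{
  "head": {"vars": ["label", "released"]},
  "results": {"bindings": [
    {"label": {"type": "literal", "xml:lang": "en", "value": "Example Film A"},
     "released": {"type": "literal",
                  "datatype": "http://www.w3.org/2001/XMLSchema#date",
                  "value": "1990-05-01"}},
    {"label": {"type": "literal", "xml:lang": "en", "value": "Example Film B"}}
  ]}
}
"""


def rows(document):
    """Flatten a results document into one dict per solution.

    Variables left unbound in a solution are reported as None.
    """
    variables = document["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None for var in variables}
        for binding in document["results"]["bindings"]
    ]


table = rows(json.loads(SAMPLE))
for row in table:
    print(row["released"], row["label"])
```

The column names come from `head.vars`, so the same function works for any SELECT query regardless of which variables it projects.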


Linked Data Search Engines and Indexes

• A number of search engines have been developed that crawl Linked Data from the Web by following RDF links, and provide query capabilities over aggregated data.
  Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
• Google, Bing and Yahoo! have agreed to create and support a common vocabulary for structured data markup on web pages.
• Facebook has started to support RDF and Linked Data URIs and now provides access to parts of its user data via a Linked Data API.


Google rich snippets


Twitter, #annotations: Twitter API-based client


Twitter, #annotations: Lookup annotations


Twitter, #annotations: Resource #dbpedia:Cagliari


Question answering: Cagliari resource


Question answering: Template


Question answering: RDF/XML


Conclusion

• Data on the Web is a major challenge; technologies are needed to use it, to interact with it, and to integrate it.
• Semantic Web technologies (RDF, SPARQL, etc.) can play a major role in publishing and using data on the Web.
• Users can benefit greatly from the wide world of structured content.
• Content providers joining the Linking Open Data project are helping to create more meaningful navigation paths, not only within websites but across the whole Web.


Q & A
