Date posted: 15-Jan-2017
Category: Data & Analytics
Uploaded by: dimitris-kontokostas
Graph databases & data integration
The case of RDF
By Dimitris Kontokostas
AKSW/KILT - Leipzig
DBpedia Association
Thessaloniki Java Meetup / 09.05.2016
About me
● I live in Veria
● I am an ex-ICT teacher
● Since 2003 I have worked mainly on R&D projects
○ + some web development
● Since 2012 doing a PhD & working in the AKSW group in Leipzig
○ Focusing on semantic web technologies (RDF, SPARQL, and many other scary terms)
○ aka Knowledge Engineer
● I am an open source enthusiast (DBpedia, RDFUnit)
● Recently became a W3C specification editor for SHACL
● Walked across many langs but ended up in Scala, Java, & Bash
○ With bash / CLI as a first choice ;)
Before we start… who knows?
LOD Cloud
Linked Data
Agenda*
● Graphs
● RDF Graphs
● Data integration
● Who uses RDF
● Quick overview of:
○ DBpedia
○ SPARQL
○ RelFinder
○ Schema.org & actions
○ JSON-LD
○ Entity disambiguation
○ Data Quality
(*) focusing mostly on getting familiar with basic terms and concepts
(**) Apologies in advance for mixing Greek with English
The four V’s heatmap for Graph Databases
A study in 2013 found:
● many organizations find the “variety” dimension a greater challenge than volume or velocity.
Graph DBs to the rescue:
● Combine multiple sources with different structures
● Retain the flexibility to add new ones without adapting schemas
● Query combined data, or multiple sources at once
● Detect patterns in the data
(*) See also this
© Image by Max De Margi
Graphs
● A graph is a way of specifying relationships among a collection of items
● Items
○ Nodes - Alice, Bob, …
○ Edges
■ undirected - knows, …
■ directed - follows, …
○ Values - weights, distances, scores, 0-5 scale, …
○ Attributes - name, time, ...
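The graph terms on this slide can be sketched in a few lines of Python. This is only an illustration: the node names, the edge labels and the score of 3 are made-up examples, and an adjacency list is just one of several possible representations.

```python
from collections import defaultdict

# Nodes with attributes (names are illustrative).
nodes = {
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "carol": {"name": "Carol"},
}

# Adjacency list: node -> list of (neighbour, label, value).
edges = defaultdict(list)

def add_directed(src, label, dst, value=None):
    """A directed edge is stored once, from src to dst."""
    edges[src].append((dst, label, value))

def add_undirected(a, label, b, value=None):
    """An undirected edge is stored in both directions."""
    add_directed(a, label, b, value)
    add_directed(b, label, a, value)

add_undirected("alice", "knows", "bob")
add_directed("carol", "follows", "alice", value=3)  # e.g. a score on a 0-5 scale

def neighbours(node, label):
    """All nodes reachable from `node` over edges with the given label."""
    return [dst for dst, lbl, _ in edges[node] if lbl == label]
```

With this data, "knows" can be followed from either side, while "follows" only leads from carol to alice.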
Graph Data Models
Property graphs
● Industry standards
○ Neo4j, Titan, Apache TinkerPop, ...
○ Application-specific ways of querying, exporting, importing, etc.
○ Optimized for specific operations and, in many cases, faster
RDF Graphs
● W3C standards
○ Like XML / HTML: define once, run everywhere™
○ Standardised way of querying, exporting, importing
Property Graphs
● Each node has a
○ unique identifier.
○ set of outgoing edges.
○ set of incoming edges.
○ collection of key-value properties.
● Each edge
○ is directed.
○ has a unique identifier.
○ has a label that denotes the type of relationship between its source and target nodes.
○ has a collection of key-value properties.
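The property graph model described above maps naturally onto two small classes. The sketch below is a plain-Python illustration, not the API of Neo4j, Titan or TinkerPop; the class names, the counter-based identifiers and the "since" property are assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple unique-identifier generator (an assumption of this sketch)

@dataclass
class Node:
    properties: dict = field(default_factory=dict)        # key-value properties
    id: int = field(default_factory=lambda: next(_ids))   # unique identifier
    outgoing: list = field(default_factory=list)          # edges leaving this node
    incoming: list = field(default_factory=list)          # edges arriving at this node

@dataclass
class Edge:
    source: Node
    target: Node
    label: str                                            # type of relationship
    properties: dict = field(default_factory=dict)        # key-value properties
    id: int = field(default_factory=lambda: next(_ids))   # unique identifier

    def __post_init__(self):
        # Register the directed edge on both of its endpoints.
        self.source.outgoing.append(self)
        self.target.incoming.append(self)

dimitris = Node({"name": "Dimitris"})
petros = Node({"name": "Petros"})
knows = Edge(dimitris, petros, "knows", {"since": 2010})  # "since" is illustrative
```

Note that both nodes and edges carry properties here, which is the main structural difference from an RDF triple.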
RDF - Resource Description Framework
● An RDF Graph is a set of RDF Triples
● An RDF triple consists of (only) three components:
○ the subject (is an IRI)
○ the predicate (is an IRI)
○ the object (can be an IRI or Literal)
○ (subjects and objects can also be blank nodes but let’s leave it for now)
[Figure: example graph read as Subject, Predicate, Object. http://dbpedia.org/resource/Java has dbo:latestReleaseVersion “1.8.0_60”; dbo:influencedBy edges link the nodes Java, http://dbpedia.org/resource/C++ and http://dbpedia.org/resource/C#.]
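As a rough sketch of the definition above, an RDF graph can be modelled in Python as a set of three-part tuples. The `Triple` and `Literal` types and the `objects` helper are names invented for this example; a real application would use an RDF library such as Apache Jena instead.

```python
from typing import NamedTuple

DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

class Literal(NamedTuple):
    """A literal value; IRIs are kept as plain strings in this sketch."""
    value: str

class Triple(NamedTuple):
    subject: str     # IRI
    predicate: str   # IRI
    obj: object      # IRI (str) or Literal

# An RDF graph is a *set* of triples: no order, no duplicates.
graph = {
    Triple(DBR + "Java", DBO + "latestReleaseVersion", Literal("1.8.0_60")),
    Triple(DBR + "Java", DBO + "latestReleaseVersion", Literal("1.8.0_60")),  # same triple again
}

def objects(graph, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}
```

Because the graph is a set, stating the same triple twice leaves it unchanged.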
RDF is an abstract data model
The same triple in different serializations:

Turtle
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex: <http://example.com/> .
ex:Dimitris a dbo:Person .

NTriples (no prefixes or "a" shorthand, full IRIs only)
<http://example.com/Dimitris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .

JSON-LD
{ "@id": "http://example.com/Dimitris", "@type": "http://dbpedia.org/ontology/Person" }

XML
<rdf:Description rdf:about="http://example.com/Dimitris">
  <rdf:type rdf:resource="http://dbpedia.org/ontology/Person"/>
</rdf:Description>

RDFa (embedded in html)
<div xmlns="http://www.w3.org/1999/xhtml"
  prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# dbo: http://dbpedia.org/ontology/ rdfs: http://www.w3.org/2000/01/rdf-schema#">
  <div typeof="dbo:Person" about="http://example.com/Dimitris"></div>
</div>
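Since RDF is an abstract model, a serialization is only a way of writing triples down. The sketch below writes in-memory tuples as N-Triples lines; it is deliberately simplified (no datatypes, language tags or blank nodes), and the `term` and `to_ntriples` helpers are hypothetical names, not part of any library.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def term(value, is_literal=False):
    """Format one term: literals are quoted, IRIs go in angle brackets."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return f'"{escaped}"'
    return f"<{value}>"

def to_ntriples(triples):
    """triples: iterable of (subject, predicate, object, object_is_literal)."""
    lines = [
        f"{term(s)} {term(p)} {term(o, lit)} ."
        for s, p, o, lit in triples
    ]
    return "\n".join(sorted(lines))  # sorted only to make the output deterministic

doc = to_ntriples([
    ("http://example.com/Dimitris", RDF_TYPE, "http://dbpedia.org/ontology/Person", False),
])
```

N-Triples has no prefixes and no `a` shorthand, so every IRI is written out in full, one triple per line.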
RDF & Graphs (Separate)
File1.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:knows ex:Petros .

File2.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
ex:Dimitris a foaf:Person .
ex:Petros a foaf:Person .

File3.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:interest dbpedia:RDF .
ex:Petros foaf:interest dbpedia:Cassandra .
RDF & Graphs (merge)
File_all.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:knows ex:Petros .
ex:Dimitris a foaf:Person .
ex:Petros a foaf:Person .
ex:Dimitris foaf:interest dbpedia:RDF .
ex:Petros foaf:interest dbpedia:Cassandra .
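Merging the three files amounts to taking the union of their triple sets. A minimal Python sketch, with the triples from File1 to File3 typed in by hand as tuples (no parser involved):

```python
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.com/"
DBR = "http://dbpedia.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

file1 = {(EX + "Dimitris", FOAF + "knows", EX + "Petros")}
file2 = {
    (EX + "Dimitris", RDF_TYPE, FOAF + "Person"),
    (EX + "Petros", RDF_TYPE, FOAF + "Person"),
}
file3 = {
    (EX + "Dimitris", FOAF + "interest", DBR + "RDF"),
    (EX + "Petros", FOAF + "interest", DBR + "Cassandra"),
}

# Merging graphs is a set union: shared IRIs line up automatically.
merged = file1 | file2 | file3

# Everything the merged graph says about ex:Dimitris, regardless of source file.
about_dimitris = sorted(t for t in merged if t[0] == EX + "Dimitris")
```

The plain union works here because the files contain no blank nodes; graphs with blank nodes need them kept apart before merging.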
RDF & Graphs (dataset / multi-graph) TriG (.trig) files
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .

<http://example.com/relations-graph> {
  ex:Dimitris foaf:knows ex:Petros .
}

<http://example.com/types-graph> {
  ex:Dimitris a foaf:Person .
  ex:Petros a foaf:Person .
}

<http://example.com/interests-graph> {
  ex:Dimitris foaf:interest dbpedia:RDF .
  ex:Petros foaf:interest dbpedia:Cassandra .
}
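A dataset keeps the graphs apart by attaching a graph name to every triple, which turns triples into quads. The sketch below shows one possible in-memory layout; the `dataset` dictionary and `union_graph` helper are illustrative names, and only two of the three graphs are filled in to keep it short.

```python
from collections import defaultdict

EX = "http://example.com/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Quads: (graph name, subject, predicate, object).
quads = [
    (EX + "relations-graph", EX + "Dimitris", FOAF + "knows", EX + "Petros"),
    (EX + "types-graph", EX + "Dimitris", RDF_TYPE, FOAF + "Person"),
    (EX + "types-graph", EX + "Petros", RDF_TYPE, FOAF + "Person"),
]

# Dataset: graph name -> set of triples in that graph.
dataset = defaultdict(set)
for g, s, p, o in quads:
    dataset[g].add((s, p, o))

def union_graph(dataset):
    """Ignore the graph names and view the dataset as one merged graph."""
    out = set()
    for triples in dataset.values():
        out |= triples
    return out
```

Keeping the graph name makes it possible to query one source at a time or all of them at once.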
RDF & Linked Data
● Using HTTP(s) based IRIs we get the Web of Data
○ See TED talk from Tim Berners-Lee (Creator of WWW)
● Every RDF Resource becomes like a REST GET API that returns all the RDF triples it is associated with
○ content negotiation for RDF (machine) or HTML (human)
○ Follow-your-nose pattern
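Content negotiation comes down to the `Accept` header of an ordinary HTTP GET. The sketch below only builds the request with Python's standard library and does not send it; `rdf_request` is a hypothetical helper, and which media types a given server actually offers is an assumption to verify per site.

```python
from urllib.request import Request

def rdf_request(iri, media_type="text/turtle"):
    """Build (but do not send) a GET request that asks for an RDF serialization."""
    return Request(iri, headers={"Accept": media_type}, method="GET")

req = rdf_request("http://dbpedia.org/resource/Java")

# To actually dereference the IRI (needs network access):
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       turtle = resp.read().decode("utf-8")
```

A browser sends `Accept: text/html` for the same IRI and gets the human-readable page; following the IRIs found in the returned triples is the follow-your-nose pattern.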
[Figure: a Linked Data graph spanning several sites. http://dbpedia.org/resource/Java has dbo:latestReleaseVersion “1.8.0_60”; dbo:influencedBy edges link Java, http://dbpedia.org/resource/C++ and http://dbpedia.org/resource/C#; http://aksw.org/DimitrisKontokostas has an ex:learns edge and a dbo:birthPlace edge to http://www.geonames.org/733905/, which has geo:lat 40.52437 and geo:long 22.20242.]
LOD CLOUD
>1K Datasets
>50B Triples
>100M links
Vocabularies & Semantics
● Vocabularies/Ontologies define classes and predicates (properties) in RDF
○ ex:Dimitris a dbo:Person
○ ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
● Existing vocabularies capture many use cases
○ DBpedia ontology (general purpose)
○ Schema.org (general purpose / new, backed by Google, Yahoo, Bing & Yandex)
○ FOAF (Friend of a friend)
○ Geo (geographical)
○ PROV-O (data provenance)
○ SKOS (classifications)
○ Org (organization structure)
○ … http://lov.okfn.org lists more than 400
Vocabularies & Semantics
● Classes and predicates (properties) have definitions (semantics)
● ex:Dimitris a dbo:Person
○ dbo:Person belongs to a class hierarchy
● ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
○ dbo:birthDate expects a dbo:Person as subject
○ dbo:birthDate expects an xsd:date as object
● Reusing existing vocabularies (classes & properties) with defined semantics is a good practice
○ Get part of the data modeling for free
○ Using common terms makes data integration easier
○ Validation (or inference) for free
■ ex:Thessaloniki dbo:birthDate “1981-06-06”^^xsd:date (is Thessaloniki a Person?)
■ ex:Dimitris dbo:birthDate ex:Thessaloniki (ex:Thessaloniki is not an xsd:date)
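The two failing examples above can be checked mechanically once the expected subject class and object datatype of a property are known. This is a hand-rolled sketch of the validation reading, not how RDFUnit or SHACL work internally; the `DOMAIN`, `RANGE` and `types` tables are assumptions (including typing ex:Thessaloniki as dbo:City), and the class hierarchy is ignored.

```python
DOMAIN = {"dbo:birthDate": "dbo:Person"}  # expected class of the subject
RANGE = {"dbo:birthDate": "xsd:date"}     # expected datatype of the object

# Known types of resources (assumed for this example).
types = {"ex:Dimitris": {"dbo:Person"}, "ex:Thessaloniki": {"dbo:City"}}

def violations(triple):
    """Return human-readable problems for one triple.

    Literals are (value, datatype) tuples; IRIs are plain strings.
    """
    s, p, o = triple
    problems = []
    if p in DOMAIN and DOMAIN[p] not in types.get(s, set()):
        problems.append(f"{s} is not a {DOMAIN[p]}")
    if p in RANGE:
        datatype = o[1] if isinstance(o, tuple) else None
        if datatype != RANGE[p]:
            problems.append(f"object of {p} is not an {RANGE[p]}")
    return problems
```

Under the inference reading the first bad triple would instead lead a reasoner to conclude that ex:Thessaloniki is a dbo:Person, which is why the slide says "validation (or inference)".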
Data integration with RDF
● Very simple graph data model
● Convert your data to RDF and model against common vocabularies
○ Design applications against vocabularies
○ Integrate multiple different sources
● Local identifiers are a common integration problem
● Link to data authorities
○ ex:Dimitris dbo:birthPlace geonames:733905 (instead of the local ex:Veria)
○ (or) ex:Veria owl:sameAs geonames:733905
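One way to use such links at integration time is to rewrite local identifiers to the authority's identifier before combining sources. The sketch below treats owl:sameAs as a one-way mapping towards the authority, which is a simplification (the property itself is symmetric); `canonical` and `canonicalise` are hypothetical helper names.

```python
# owl:sameAs links, stored as local identifier -> authority identifier.
same_as = {"ex:Veria": "geonames:733905"}

def canonical(term):
    """Follow sameAs links until an identifier with no further link is reached."""
    seen = set()
    while term in same_as and term not in seen:  # `seen` guards against cycles
        seen.add(term)
        term = same_as[term]
    return term

def canonicalise(triples):
    """Rewrite subjects and objects to their canonical identifiers."""
    return {(canonical(s), p, canonical(o)) for s, p, o in triples}

source_a = {("ex:Dimitris", "dbo:birthPlace", "ex:Veria")}
source_b = {("geonames:733905", "geo:lat", "40.52437")}

integrated = canonicalise(source_a | source_b)
```

After rewriting, the birth place from the first source and the latitude from the second attach to the same node.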
Pay as you go Data Integration
● RDF views on top of RDBMS (e.g. MySQL) with R2RML (W3C spec)
○ Mapping files define how SQL queries / tables translate to RDF
○ Queryable through a virtual SPARQL endpoint translating SPARQL to SQL
● Convert XML/JSON/CSV/… to RDF with RML.io using mapping files
● Find links to external databases with Limes & Silk
○ e.g.: ex:Veria owl:sameAs geonames:733905
● You can get some benefit with low effort
● The more time you invest the better the results
● (Common practice) work on secondary RDF views of your data
Who uses RDF (in public)
https://github.com/json-ld/json-ld.org/wiki/Users-of-JSON-LD
Some More Statistics
● Based on the Common Crawl of Nov 2015
● 30% of HTML pages (541M / 1.77B pages) contained structured data.
● This 30% originates from 2.72M different pay-level-domains out of the 14.41 million pay-level-domains covered by the crawl (19%).
○ 521K websites use RDFa
○ 1.1 million use Microdata
○ 586K have embedded JSON-LD (mostly for search actions)
● Altogether, the extracted data sets consist of 24.38 billion RDF quads.
http://webdatacommons.org/structureddata/2015-11/stats/stats.html#results-2015-1
DBpedia
Let’s look at John Cleese (Monty Python)
SPARQL
"Which films starred John Cleese without any other members of Monty Python?"
SPARQL examples by Markus Ackermann & Markus Freudenberg
Basic Graph Pattern
Group Graph Pattern
Filtering Unwanted Results
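The slide images with the actual queries are not part of this transcript, so here is a rough Python sketch of what the three steps do: match triple patterns with variables, join them, then filter. The data is made up (ex:FilmA, ex:FilmB and ex:memberOf are placeholders, not DBpedia facts), and the set difference at the end stands in for what SPARQL would express with a filter such as FILTER NOT EXISTS.

```python
# Toy data: placeholder films and a placeholder membership property.
triples = {
    ("ex:FilmA", "dbo:starring", "ex:JohnCleese"),
    ("ex:FilmA", "dbo:starring", "ex:OtherPython"),
    ("ex:FilmB", "dbo:starring", "ex:JohnCleese"),
    ("ex:FilmB", "dbo:starring", "ex:SomeoneElse"),
    ("ex:JohnCleese", "ex:memberOf", "ex:MontyPython"),
    ("ex:OtherPython", "ex:memberOf", "ex:MontyPython"),
}

def match(pattern, binding):
    """Yield extended bindings for one triple pattern; '?x' terms are variables."""
    for triple in triples:
        new = dict(binding)
        for p_term, t_term in zip(pattern, triple):
            if p_term.startswith("?"):
                if new.setdefault(p_term, t_term) != t_term:
                    break  # variable already bound to a different value
            elif p_term != t_term:
                break      # constant does not match
        else:
            yield new

def bgp(patterns):
    """Basic graph pattern: join the triple patterns one after another."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings

cleese_films = {b["?film"] for b in bgp([("?film", "dbo:starring", "ex:JohnCleese")])}

with_other_python = {
    b["?film"]
    for b in bgp([
        ("?film", "dbo:starring", "?actor"),
        ("?actor", "ex:memberOf", "ex:MontyPython"),
    ])
    if b["?actor"] != "ex:JohnCleese"  # the filtering step
}

answer = cleese_films - with_other_python
```

On the toy data only ex:FilmB remains, because ex:FilmA also stars another group member.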
RelFinder demo (flash)
Schema.org
● Vocabulary backed by the major search engines
● RDF data model
○ Normative format is JSON-LD
○ RDF is not actively mentioned (to not scare people away)
○ Allows use as general structured data (e.g. microdata)
● Enriches a lot of (at least) Google’s applications
○ Search (try e.g. recipes)
○ Gmail (travel, events, actions, ...)
○ Google Now
○ Google Knowledge Graph
○ ...
Schema.org actions
JSON-LD
● Like normal JSON but better ;)
● @context makes the difference
● Append your own context
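What "@context makes the difference" means can be shown with a toy example: the context maps short JSON keys to IRIs, so ordinary-looking JSON becomes unambiguous. The function below is a strongly simplified stand-in for the expansion algorithm in the JSON-LD specification, written for illustration only; the document and the use of foaf:name are assumptions.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name"
  },
  "@id": "http://example.com/Dimitris",
  "@type": "http://dbpedia.org/ontology/Person",
  "name": "Dimitris"
}
""")

def expand_keys(doc):
    """Replace plain keys with the IRIs given in @context (keywords stay as-is)."""
    context = doc.get("@context", {})
    return {
        (key if key.startswith("@") else context.get(key, key)): value
        for key, value in doc.items()
        if key != "@context"
    }

expanded = expand_keys(doc)
```

Appending your own context to existing JSON works the same way: the payload keeps its keys, and the context supplies their meaning.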
JSON-LD links
● Previous examples
● JSON-LD specification & playground
● Hypermedia self-described APIs with Hydra
Entity disambiguation
aka NERD (Named Entity Resolution & Disambiguation)
● George Bush is sitting in front of the White House
○ George: some George?
○ Bush: a small plant
○ George Bush: former president of USA
○ White: Colour
○ House: a house
○ White House:
● http://dbpedia-spotlight.github.io/demo/
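The George Bush example shows why longer phrases should win over their individual words. A toy way to get that behaviour is a longest-match lookup against a dictionary of surface forms, sketched below. This is not how DBpedia Spotlight works internally (real systems also score candidates using the surrounding context); the dictionary entries and `ex:` identifiers are invented for the example.

```python
# Surface forms (as token tuples) -> entity identifiers (all hypothetical).
surface_forms = {
    ("george", "bush"): "ex:George_Bush",
    ("white", "house"): "ex:White_House",
    ("bush",): "ex:Bush_(plant)",
    ("house",): "ex:House",
}

def annotate(sentence):
    """Greedy longest-match annotation over whitespace-separated tokens."""
    tokens = sentence.lower().split()
    longest = max(len(k) for k in surface_forms)
    i, found = 0, []
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):  # try longest first
            candidate = tuple(tokens[i:i + n])
            if candidate in surface_forms:
                found.append((" ".join(candidate), surface_forms[candidate]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return found
```

For the sentence on the slide this picks the two-word entities and never falls back to the plant or the generic house.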
Data Quality
● As mentioned earlier, we can (re)use the vocabulary semantics for automatic data validation
● RDFUnit - https://github.com/AKSW/RDFUnit
○ Automatically generates data unit tests based on the vocabularies your data uses
○ Custom JUnit runner
● SHACL - http://w3c.github.io/data-shapes/shacl/
○ Language to define advanced data constraints on RDF Graphs
○ (In progress) W3C recommendation
ALIGNED project
● Aligning software & data engineering
● Tools & techniques for agility in changes in code / data
● http://aligned-project.eu
● Offers free consultancy on ALIGNED tools
○ See website for more info
Wrapping up / Key points
● Data variety is a common problem
● Integrating data can be a pain :)
● Graph databases can help; RDF can sometimes be more appropriate
● Pay as you go data integration
○ Map your data to RDF
○ Keep RDF as a copy of your source data
● RDF helps you develop reusable applications against schemas
● Schema.org
○ For website markups
○ For defining actions
● JSON-LD (embedded mappings)
● RDF for text annotations
● There is very good tool support for RDF in Java
Links
● http://json-ld.org/
● http://wiki.dbpedia.org
● http://dbpedia-spotlight.github.io/demo/
● http://schema.org
● http://aksw.org - Many interesting tools
● http://wikidata.org
● Apache Jena - RDF Java library
● Virtuoso - Open Source RDF & RDBMS DB
Thank you!
Questions?
Slides available at slideshare.net/jimkont