Sieve: Linked Data Quality Assessment and Fusion
Pablo N. Mendes
Hannes Mühleisen
Christian Bizer
With contributions from:
Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele
“A sieve, or sifter, separates wanted elements
from unwanted material using a woven screen
such as a mesh or net.” Source: http://en.wikipedia.org/wiki/Sieve
What is Linked Data?
• Raw data (RDF)
• Accessible on the Web
• Data can link to other data sources
• Benefits: ease of access and reuse; enables discovery
[Figure: datasets A, B, C, D and E, each describing several Things, connected by data links]
Linking Open Data Cloud
http://lod-cloud.net
Linked Data Challenges
• Data providers differ in intentions, experience, and knowledge
• Data may be inaccurate, outdated, spam, etc.
• Data sources that overlap in content may use…
• … different RDF schemata
• … different identifiers for the same real-world entity
• … conflicting values for properties
• Integrating public datasets with internal databases poses the same problems
An Architecture for Linked Data Applications
LDIF – Linked Data Integration Framework
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics
1. Collect data: managed download and update
2. Translate data into a single target vocabulary
3. Resolve identifier aliases into local target URIs
4. Assess quality, filter bad results, resolve conflicts
5. Output
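The five steps can be read as a chain of functions over a list of quads. The sketch below is only an illustration in Python, not LDIF's actual API: the `Quad` tuple, the stage functions, and all identifiers in the usage example are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List, NamedTuple, Set


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # named graph, i.e. where the statement came from


Stage = Callable[[List[Quad]], List[Quad]]


def translate(vocab: Dict[str, str]) -> Stage:
    """Step 2: rewrite predicates into the target vocabulary."""
    return lambda quads: [
        q._replace(predicate=vocab.get(q.predicate, q.predicate)) for q in quads
    ]


def resolve(aliases: Dict[str, str]) -> Stage:
    """Step 3: rewrite subject aliases to one local target URI."""
    return lambda quads: [
        q._replace(subject=aliases.get(q.subject, q.subject)) for q in quads
    ]


def fuse(trusted_graphs: Set[str]) -> Stage:
    """Step 4 (placeholder policy): keep only quads from trusted graphs."""
    return lambda quads: [q for q in quads if q.graph in trusted_graphs]


def run_pipeline(collected: Iterable[Quad], stages: List[Stage]) -> List[Quad]:
    """Step 1 supplies `collected`; step 5 is the returned list."""
    return reduce(lambda quads, stage: stage(quads), stages, list(collected))
```

For instance, two collected quads that use different predicates and subject identifiers for the same city end up as one statement about a single local URI once the stages are applied in order.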
LDIF Pipeline, step 1: Collect data
Supported data sources:
• RDF dumps (various formats)
• SPARQL endpoints
• Crawling Linked Data
LDIF Pipeline, step 2: Translate data (R2R)
Data sources use a wide range of different RDF vocabularies
Example terms: dbpedia-owl:City, schema:Place, fb:location.citytown, local:City
R2R:
• Mappings expressed in RDF (Turtle)
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Transformation functions
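To make the idea of vocabulary translation concrete, here is a minimal Python sketch of a simple class mapping. It is not R2R itself (R2R mappings are written in RDF); the dictionary, the function name, and the choice of `local:City` as the target class are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical class mappings: source class -> target class.
# schema:Place is left out on purpose: a place is not necessarily a city,
# so it would need a more complex mapping than a plain rename.
CLASS_MAPPINGS: Dict[str, str] = {
    "dbpedia-owl:City": "local:City",
    "fb:location.citytown": "local:City",
}


def translate_types(triples: List[Triple], mappings: Dict[str, str]) -> List[Triple]:
    """Rewrite the object of rdf:type statements into the target vocabulary."""
    translated = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj in mappings:
            obj = mappings[obj]
        translated.append((subject, predicate, obj))
    return translated
```

Statements that are not `rdf:type` assertions, or whose class has no mapping, pass through unchanged.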
LDIF Pipeline, step 3: Resolve identities (Silk)
Data sources use different identifiers for the same entity
[Figure: the label “Berlin” could refer to Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; or Berlin, MA. Here, Berlin = Berlin, Germany]
Silk:
• Profiles expressed in XML
• Supports various comparators and transformations
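The following Python sketch illustrates the general idea of combining comparators with a threshold to decide whether two records denote the same entity. It is not Silk's XML profile format or API; the record fields (`id`, `label`, `country`), the function names, and the threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Record = Dict[str, str]


def label_similarity(a: str, b: str) -> float:
    """String comparator: similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_entity(a: Record, b: Record, threshold: float = 0.9) -> bool:
    """Two records match if the countries agree and the labels are similar enough."""
    return (
        a["country"] == b["country"]
        and label_similarity(a["label"], b["label"]) >= threshold
    )


def find_matches(record: Record, candidates: List[Record]) -> List[str]:
    """Return the ids of all candidates judged to be the same entity."""
    return [c["id"] for c in candidates if same_entity(record, c)]
```

Comparing the label alone would match every Berlin in the figure; adding a second comparator (here, the country) is what separates Berlin, Germany from the US towns.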
LDIF Pipeline, step 4: Filter and fuse (Sieve)
Sources provide different values for the same property
[Figure: sources report Total Area as 891.85 km² and 891.82 km²; after quality assessment, a single Total Area value of 891.85 km² is kept]
Sieve:
• Profiles expressed in XML
• Supports various scoring and fusion functions
LDIF Pipeline, step 5: Output
• Output options:
  • N-Quads
  • N-Triples
  • SPARQL Update Stream
• Provenance tracking using Named Graphs
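As an illustration of how a named graph carries provenance in the output, the Python sketch below serializes one statement with a typed literal as an N-Quads line. The helper names and the example.org IRIs are hypothetical; this is not LDIF's writer.

```python
def escape_literal(value: str) -> str:
    """Escape the characters that N-Quads does not allow raw inside a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_nquad(subject: str, predicate: str, literal: str, datatype: str, graph: str) -> str:
    """Serialize one typed-literal statement; the fourth term is the named graph."""
    return (
        f'<{subject}> <{predicate}> "{escape_literal(literal)}"'
        f"^^<{datatype}> <{graph}> ."
    )
```

The graph IRI in fourth position is what lets a consumer trace each value back to the source it came from.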
An Architecture for Linked Data Applications
Data Quality and Fusion Module
Data Fusion
“fusing multiple records representing the same
real-world object into a single, consistent, and
clean representation”
(Bleiholder & Naumann, 2008)
Conflict resolution strategies
• Independent of quality assessment metrics
  • Pick most frequent (democratic voting)
  • Average, max, min, concatenation
  • Within interval
• Based on task-specific quality assessment
  • Keep highest scored
  • Keep all that pass a threshold
  • Trust some sources over others
  • Weighted voting
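A few of these strategies can be sketched in Python as follows. This is an illustration under assumptions, not Sieve's FusionFunction implementations: a claim is modelled as a hypothetical `(value, source)` pair, and `scores` maps each source to a quality score in [0, 1].

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Claim = Tuple[float, str]  # (value, source); hypothetical shape


def most_frequent(claims: List[Claim]) -> float:
    """Democratic voting: the value asserted by the most sources."""
    return Counter(value for value, _ in claims).most_common(1)[0][0]


def average(claims: List[Claim]) -> float:
    """Quality-independent: arithmetic mean of all claimed values."""
    return mean([value for value, _ in claims])


def keep_highest_scored(claims: List[Claim], scores: Dict[str, float]) -> float:
    """Keep the value from the source with the highest quality score."""
    return max(claims, key=lambda claim: scores[claim[1]])[0]


def keep_above_threshold(
    claims: List[Claim], scores: Dict[str, float], threshold: float
) -> List[float]:
    """Keep every value whose source passes the threshold."""
    return [value for value, source in claims if scores[source] >= threshold]


def weighted_vote(claims: List[Claim], scores: Dict[str, float]) -> float:
    """Each source votes for its value with a weight equal to its score."""
    totals: Dict[float, float] = defaultdict(float)
    for value, source in claims:
        totals[value] += scores[source]
    return max(totals, key=totals.get)
```

Note that the strategies can disagree: with two low-scored sources agreeing on one value and one high-scored source asserting another, plain voting picks the majority value while weighted voting may pick the minority one.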
Data Fusion
• Input:
  • (Potentially) conflicting data
  • Quality metadata describing the input
• Execution:
  • Use existing or custom FusionFunctions
• Output:
  • Clean data, according to the user’s definition of clean
Configuration: Data Fusion
Sieve: Quality Assessment
• Quality as “fitness for use”:
  • Subjective: what is good enough for me might not be good enough for you
  • Task-dependent: the temperature precision needed to plan a weekend differs from that needed for a biology experiment
  • Multidimensional: even correct data may be outdated or unavailable
• Quality assessment therefore has to be task-specific
Data Quality: Conceptual Framework
Dimensions:
Accuracy
Consistency
Objectivity
Timeliness
Validity
Believability
Completeness
Understandability
Relevancy
Reputation
Verifiability
Amount of Data
Interpretability
Representational Conciseness
Representational Consistency
Availability
Response Time
Security
Configuration: Quality Assessment
• Quality assessment metrics are composed of:
  • a ScoringFunction (generically applicable to given data types)
  • a quality indicator as input (adaptable to the use case)
• Output: a score in [0, 1] that describes the input within a quality dimension, according to a user’s definition of quality
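As a sketch of how a scoring function and a quality indicator combine into a metric, the Python example below scores recency from a last-update date. The function name, its parameters, and the linear decay are assumptions for illustration, not Sieve's own ScoringFunction code.

```python
from datetime import date


def time_closeness(last_updated: date, today: date, max_age_days: int) -> float:
    """Score recency in [0, 1]: 1.0 for data updated today, decaying
    linearly to 0.0 once the data is `max_age_days` old or older."""
    age_days = (today - last_updated).days
    if age_days <= 0:
        return 1.0
    return max(0.0, 1.0 - age_days / max_age_days)
```

The scoring function is generic (any date works), while the choice of indicator and of `max_age_days` is where the user's task-specific definition of quality enters: a news application might set it to a few days, an encyclopedic one to a few years.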
More about Sieve
• Software: open source (Apache License, Version 2.0)
• Scoring functions and fusion functions can be extended
  • Scala/Java interface with the methods score/fuse and fromXML
• Quality scores can be stored and shared with other applications
• Website: http://sieve.wbsg.de
  • Documentation, examples, downloads, support
Use Case
[Figure labels: Multiple data sources; Heterogeneous; Complementary; Conflicting values; Quality indicators; Multidimensional; User config; Task-dependent; Conflict resolution strategies; Voilà!]
Evaluating Quality of Data Integration
• Completeness
  • How many cities did we find?
  • How many of the properties did we fill with values?
• Conciseness
  • How much redundancy is there in the object identifiers?
  • How much redundancy is there in the property values?
• Consistency
  • How many conflicting values are there?
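One way to turn these questions into numbers is sketched below in Python. The exact formulas are assumptions chosen for illustration (the slides only pose the questions), and the triples in any usage are made-up sample data.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (entity, property, value)


def completeness(triples: List[Triple], entities: List[str], properties: List[str]) -> float:
    """Fraction of expected (entity, property) slots that have at least one value."""
    filled = {(s, p) for s, p, _ in triples if s in entities and p in properties}
    return len(filled) / (len(entities) * len(properties))


def conciseness(triples: List[Triple]) -> float:
    """Fraction of statements that are unique; 1.0 means no redundant values."""
    return len(set(triples)) / len(triples)


def conflicts(triples: List[Triple]) -> int:
    """Number of (entity, property) slots holding more than one distinct value."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, p, o in triples:
        values[(s, p)].add(o)
    return sum(1 for distinct in values.values() if len(distinct) > 1)
```

Computing the same three numbers before and after fusion gives a simple way to check whether the integrated data improved over the original sources.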
Results
Generated data that is more complete, concise, and consistent than in the original sources
Linked Data Application Architecture
My view on this data space can also be shared and reused.
We can “pay as we go”
• Twitter: @pablomendes
• E-mail: [email protected]
• Website: http://sieve.wbsg.de
• Google Group: http://bit.ly/ldifgroup
THANK YOU!
Supported in part by:
• Vulcan Inc. as part of its Project Halo
• EU FP7 projects:
  • LOD2 - Creating Knowledge out of Interlinked Data
  • PlanetData - A European Network of Excellence on Large-Scale Data Management