ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino

University of Milano-Bicocca

Outline

• Motivation
    • Dataset Understanding
    • State of the Art
• Summarization Framework
    • Abstract Knowledge Patterns (AKPs)
    • Pattern Minimalization
    • Summary extraction, storage and presentation
• Evaluation
    • Compactness
    • Informativeness
    • User Study
• Conclusion and Future Work

Introduction

• What types of resources are there in a data set?
• How are they described?
• What types of resources are linked by a certain property, and how frequently?

Motivation

Understanding the content of data sets is challenging. Looking at the ontology is not enough:

• Ontologies may be large and underspecified
    • DBpedia 2015-04: 2795 properties, domain not specified for 259 properties, range not specified for 187 properties
• No information about the usage

Explorative queries are too expensive:

• Significant server overload
• High response time/timeout

State of the Art

Relevance Based Summarization (Troullinou et al. 2015; Zhang et al. 2007)
Identifies subsets of data sets or ontologies that are considered to be more relevant.

Pattern Based Approaches (Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012)
Aim at extracting knowledge patterns for a complete representation of the data set.

Schema Induction (Völker and Niepert, 2011)
Induces a schema from the data and aims at extracting stronger assertions.

Statistics about the dataset (Konrath et al. 2012; Langegger and W. Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies, http://lov.okfn.org/)
Aim at reporting statistics about the usage of different vocabularies, properties and types in the data.

ABSTAT

ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework.

A summary provides a complete but compact schema-level representation of a data set:

• A set of Abstract Knowledge Patterns (AKPs)
• Statistics

For example, an AKP can represent the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace.

The statistics report:

• How many times a pattern occurs in the data set
• How many times a type occurs as a minimal type, and how many times a property occurs in the data set
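A minimal sketch of how such a summary could be held in memory, assuming a plain Python representation. The names `AKP`, `Summary` and `record` are hypothetical and are not taken from ABSTAT's code base or API; the counts below are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import NamedTuple


class AKP(NamedTuple):
    """Abstract Knowledge Pattern: subject type, property, object type."""
    subject_type: str
    prop: str
    object_type: str


@dataclass
class Summary:
    """Hypothetical in-memory summary: patterns plus their statistics."""
    pattern_occurrences: Counter = field(default_factory=Counter)
    minimal_type_occurrences: Counter = field(default_factory=Counter)
    property_occurrences: Counter = field(default_factory=Counter)

    def record(self, pattern: AKP) -> None:
        """Count one relational assertion abstracted by the given pattern."""
        self.pattern_occurrences[pattern] += 1
        self.property_occurrences[pattern.prop] += 1


summary = Summary()
# Illustrative data: two assertions matching the Person/birthplace/Settlement pattern.
summary.record(AKP("Person", "birthplace", "Settlement"))
summary.record(AKP("Person", "birthplace", "Settlement"))
# Illustrative data: one instance whose minimal type is Person.
summary.minimal_type_occurrences["Person"] += 1
```

Keeping the three counters separate mirrors the three kinds of statistics listed on the slide: pattern occurrences, minimal type occurrences and property occurrences.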

Abstract Knowledge Patterns (AKPs)

ABSTAT adopts a minimalization mechanism based on minimal type patterns

Minimalization is based on a subtype graph which represents the data ontology

Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns

An AKP is a triple (C, P, D) such that C and D are types and P is a property

ABSTAT represents only a subset of the AKPs occurring in the data set: those built on minimal types

An example of how AKPs are extracted

[Figure: example graph. Types: Person, Sportist, FootballPlayer, Lawyer, Artist, XMLSchema#Date. Instances: Jim Brown, George Clooney, Amal Clooney. Literal: "1936-02-17". Edge labels: subclassOf, type, hasWife, birthDate.]

The (minimal-type) patterns extracted by ABSTAT are:
<Artist, hasWife, Lawyer>
<FootballPlayer, birthDate, XMLSchema#Date>
<Artist, birthDate, XMLSchema#Date>

Redundant patterns excluded by the summary:
<Person, hasWife, Person>
<Sportist, birthDate, XMLSchema#Date>
<Person, birthDate, XMLSchema#Date>
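The extraction in the example above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ABSTAT's implementation: the subtype edges and the typing assertions are read off the example figure as best they can be recovered (the exact edges are an assumption), and the helper names `supertypes`, `minimal_types` and `extract_patterns` are hypothetical.

```python
from collections import Counter

# ASSUMPTION: subtype graph (type -> direct supertypes) read off the example figure.
SUBCLASS_OF = {
    "Sportist": {"Person"},
    "FootballPlayer": {"Sportist"},
    "Lawyer": {"Person"},
    "Artist": {"Person"},
}

# ASSUMPTION: typing assertions for the three instances in the example.
TYPES = {
    "George Clooney": {"Person", "Artist"},
    "Amal Clooney": {"Person", "Lawyer"},
    "Jim Brown": {"Person", "Sportist", "FootballPlayer", "Artist"},
}

# Relational assertions; a literal object is abstracted by its datatype.
ASSERTIONS = [
    ("George Clooney", "hasWife", "Amal Clooney"),
    ("Jim Brown", "birthDate", '"1936-02-17"'),
]
LITERAL_DATATYPES = {'"1936-02-17"': "XMLSchema#Date"}


def supertypes(t):
    """All direct and indirect supertypes of t in the subtype graph."""
    seen, stack = set(), list(SUBCLASS_OF.get(t, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen


def minimal_types(types):
    """Keep only the types that are not a supertype of another asserted type."""
    return {t for t in types
            if not any(t in supertypes(u) for u in types if u != t)}


def extract_patterns(assertions):
    """Count one (C, P, D) pattern per pair of minimal subject/object types."""
    patterns = Counter()
    for subject, prop, obj in assertions:
        subject_types = minimal_types(TYPES[subject])
        if obj in LITERAL_DATATYPES:
            object_types = {LITERAL_DATATYPES[obj]}
        else:
            object_types = minimal_types(TYPES[obj])
        for c in subject_types:
            for d in object_types:
                patterns[(c, prop, d)] += 1
    return patterns


patterns = extract_patterns(ASSERTIONS)
for pattern in sorted(patterns):
    print(pattern)
```

Because Person and Sportist are supertypes of other types asserted for the same instances, patterns such as <Person, hasWife, Person> never enter the result; they remain derivable from the minimal-type patterns and the subtype graph.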

Summary Extraction Workflow

ABSTAT User Interfaces

• ABSTAT homepage (http://abstat.disco.unimib.it)
• ABSTATBrowse (http://abstat.disco.unimib.it/browse)
• ABSTATSearch (http://abstat.disco.unimib.it/search)
• SPARQL Endpoint (http://abstat.disco.unimib.it/sparql)

Experimental Evaluation

Summary compactness
• Number of patterns in the summary vs. number of triples in the data set
• Comparison with a similar approach without minimalization

Summary informativeness
• Insights about the semantics of the properties
• Small-scale user study

Compactness

Data sets and summaries statistics:

Dataset              Relational  Typing  Assertions  Types (Ext.)  Properties (Ext.)  Patterns
DBpedia Core 2014    40.5M       29.7M   70.1M       869 (85)      1439 (15)          171340
DBpedia 3.9 Infobox  96.3M       19.7M   116.4M      821 (58)      62572 (14)         732418
Linked Brainz        180.1M      39.6M   221.7M      21 (9)        33 (0)             161

Reduction rate = Number of patterns / Number of assertions in the data set

Reduction rate:

Dataset            ABSTAT        LOUPE
DBpedia Core 2014  0.002         0.01
Linked Brainz      6.72 × 10^-7  7.1 × 10^-7

LOUPE is similar to ABSTAT without minimalization.

• Minimalization produces more compact summaries
• The advantage of minimalization is more observable for data sets with richer subtype graphs and typing assertions
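As a quick arithmetic check of the reduction rate definition, the sketch below divides the number of patterns by the number of assertions for DBpedia Core 2014, using the figures reported in the statistics table. The function name `reduction_rate` is hypothetical.

```python
def reduction_rate(num_patterns: int, num_assertions: float) -> float:
    """Number of patterns in the summary divided by assertions in the data set."""
    return num_patterns / num_assertions


# DBpedia Core 2014, from the statistics table: 171340 patterns, 70.1M assertions.
dbpedia_core = reduction_rate(171_340, 70.1e6)
print(f"{dbpedia_core:.3f}")  # prints 0.002
```

A lower value means a more compact summary relative to the size of the data set.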

Informativeness

ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set

Dataset              Missing Domain (%)  Missing Range (%)  Missing Domain & Range (%)
DBpedia Core 2014    259 (18%)           187 (13%)          48 (3.3%)
DBpedia 3.9 Infobox  61368 (98%)         61309 (98%)        61161 (97%)
Linked Brainz        13 (39%)            15 (45%)           13 (39%)

Inferred domain and range for DBpedia Core 2014

[Figure: bar chart "Extracted minimal types (domain)". Y-axis: number of minimal types (0 to 160). X-axis: the properties dbo:type, dbo:successor, dbo:division, dbo:isPartOf, dbo:series, dbo:gender, dbo:source, dbo:localAutho..., dbo:royalAnthem, dbo:mainInterest, dbo:chairLabel, dbo:format, dbo:manage..., dbo:related, dbo:hasVariant, dbo:variantOf, dbo:namedAfter.]
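One way such usage-based insights could be derived from a summary is to group, for each property, the subject and object types of its minimal-type patterns. The sketch below is an assumption about how this might be done, not ABSTAT's method; the function name `observed_domain_and_range` is hypothetical, and the input reuses the three patterns of the Jim Brown / George Clooney example rather than DBpedia data.

```python
from collections import defaultdict


def observed_domain_and_range(patterns):
    """Group subject types (domain) and object types (range) by property."""
    domains = defaultdict(set)
    ranges = defaultdict(set)
    for subject_type, prop, object_type in patterns:
        domains[prop].add(subject_type)
        ranges[prop].add(object_type)
    return dict(domains), dict(ranges)


# Patterns from the example slide, used here only for illustration.
example_patterns = [
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
    ("Artist", "birthDate", "XMLSchema#Date"),
]

domains, ranges = observed_domain_and_range(example_patterns)
for prop in sorted(domains):
    # Number of minimal types observed as domain of each property.
    print(prop, len(domains[prop]))
```

The per-property count printed at the end corresponds to the kind of quantity plotted in the chart: how many distinct minimal types appear as the domain of a property.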

User Study: Setup

Can ABSTAT be useful to support query formulation?

• Queries to DBpedia 3.9 Infobox from the Question Answering over Linked Data (QALD) benchmark
• 5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
• 20 participants, 2 groups:
    • abstat group uses ABSTAT (after 20 min of training)
    • control group does not use ABSTAT
• Measures:
    • Time needed to formulate the query
    • Accuracy of the answer

User Study: Questionnaire


User Study: Results

Query 1 (length 1): How many employees does Google have?

    Group    Avg. Completion Time (s)  Accuracy
    abstat   358.9                     0.9
    control  380.6                     0.8

Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin.

    Group    Avg. Completion Time (s)  Accuracy
    abstat   356.3                     1
    control  346.9                     0.8

Query 3 (length 2): Which professional surfers were born in Australia?

    Group    Avg. Completion Time (s)  Accuracy
    abstat   476.6                     0.6
    control  234.24                    0.7

Query 4 (length 3): In which films directed by Garry Marshall was Julia Roberts starring?

    Group    Avg. Completion Time (s)  Accuracy
    abstat   333.4                     0.9
    control  445.6                     0.9

Query 5 (length 3): Give me all books by William Goldman with more than 300 pages.

    Group    Avg. Completion Time (s)  Accuracy
    abstat   233.4                     1
    control  569.8                     0.7

An independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005

User Study: Results Analysis

• Users in the abstat group benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
• Accuracy increases as query difficulty increases, and tasks are performed faster
• The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing

Participants from the control group used two strategies to answer the queries:

• Directly accessing the public web pages describing the DBpedia named individuals mentioned in the query
• Submitting explorative SPARQL queries to the endpoint (very few participants)

Conclusion and Future Work

• ABSTAT: ontology-driven summarization with minimalization
• Significant reduction rate and promising results about the informativeness of the summary
• Currently extending the user study

Apply relevance-oriented summarization methods based on connectivity analysis

ABSTAT summary should consider the inheritance of properties to produce even more compact summaries

We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)

APIs available soon

Thank you for your attention!

www.abstat.unimib.it


Feedback is welcome!

Date post:	15-Jan-2017
Upload:	blerina-spahiu