ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

Date post: 15-Jan-2017
Author: blerina-spahiu
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization. Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino. University of Milano-Bicocca.
Transcript

Slide 1

ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino

University of Milano-Bicocca

Outline
Motivation
  Dataset Understanding
  State of the Art
Summarization Framework
  Abstract Knowledge Patterns (AKPs)
  Pattern Minimalization
  Summary extraction, storage and presentation
Evaluation
  Compactness
  Informativeness
  User Study
Conclusion and Future Work

University of Milan - Bicocca

This is the outline for today's talk. First I will introduce the motivation, then I will present the summarization framework. After that I will describe the experiments we ran to evaluate ABSTAT, and I will close with some conclusions and future work.

Introduction

What types of resources are there in a data set? How are they described? What types of resources are linked by a certain property and how frequently?

Many of us have looked at the LOD cloud and thought: wow, so many data sets, so good! But if a user wants to evaluate whether a data set is useful for her, or to formulate queries over it, she first needs to understand its content and organization by answering questions such as: What types of resources are described in the data set? What properties are used to describe them? What types of resources are linked by a certain property, and how frequently? How many resources have a certain type, and how often is a given property used?

Motivation
Understanding the content of data sets is challenging
Looking at the ontology is not enough:
  Ontologies may be large and underspecified
    DBpedia 2015-04: 2795 properties, domain not specified for 259 properties, range not specified for 187 properties
  No information about the usage
Explorative queries are too expensive:
  Significant server overload
  High response time/timeout

Linked data sets use ontologies to describe the semantics of their data, but ontologies may be large and underspecified. The DBpedia release of April 2015 has about 2795 properties; the domain is not specified for 259 of them and the range is not specified for 187. Moreover, the ontology does not tell how frequently a certain modelling pattern occurs in a data set. The questions we posed before can be answered with explorative queries, but at the price of a significant server overload for data publishers and high response times, or even timeouts, for data consumers.

State of the Art

Relevance Based Summarization
  Identifying subsets of data sets or ontologies that are considered to be more relevant
  Troullinou et al. 2015; Zhang et al. 2007

Pattern Based Approaches
  Aim at extracting knowledge patterns for a complete representation of the data set
  Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012

Schema Induction
  Induces a schema from the data and aims at extracting stronger assertions
  Völker and Niepert, 2011

Statistics about the dataset
  Aim at reporting statistics about the usage of different vocabularies, properties and types in the data
  Konrath et al. 2012; Langegger and Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies (http://lov.okfn.org/)

Four bodies of work try to summarize Linked Data sets and RDF data. The first focuses on identifying subsets of data sets or ontologies that are considered more relevant; unlike these approaches, ours aims at providing a summary that is complete with respect to the data set. The second aims at extracting knowledge patterns for a complete representation of the data set. The third aims at inducing a schema from the data; its axioms represent stronger patterns than those extracted by our approach. The fourth aims at reporting statistics about the usage of different features of the data set, such as the vocabularies, properties and types it uses.

ABSTAT

ABSTAT is complementary to the other approaches and aims at providing knowledge patterns that represent the complete data set. Along with the patterns, ABSTAT also produces statistics about the occurrence of these patterns, types and properties. The approach most similar to ours is Loupe, which also extracts patterns and reports statistics; unlike Loupe, ABSTAT represents only a subset of the extracted patterns.

ABSTAT
ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework
A summary provides a complete but compact schema-level representation of a data set:
  A set of Abstract Knowledge Patterns (AKPs)
  Statistics

An AKP represents the fact that there are instances of type Person linked with instances of type Settlement by the property birthPlace

How many times does this pattern occur in the data set?

How many times does a certain type occur as a minimal type, and how many times does the property occur in the data set?

ABSTAT, accessible at abstat.disco.unimib.it, is an ontology-driven linked data summarization framework proposed to help users understand a data set. A summary aims at providing a compact but complete representation of a data set. By complete we mean that every relation between concepts that is not in the summary can be inferred from it. The summary is composed of a set of Abstract Knowledge Patterns, a subtype graph and statistics.

This is what a summary looks like: an Abstract Knowledge Pattern tells us that there are instances of some type C connected to instances of some type D by a property P.

Abstract Knowledge Patterns (AKPs)
ABSTAT adopts a minimalization mechanism based on minimal type patterns
Minimalization is based on a subtype graph which represents the data ontology
Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns
An AKP is a triple (C, P, D) such that C and D are types and P is a property
In ABSTAT we represent only a subset of the AKPs occurring in the data set: those whose types are minimal

What kind of patterns do we represent in the summary? One distinguishing feature of ABSTAT is that it adopts a minimalization mechanism based on minimal type patterns. Minimalization is based on a subtype graph introduced to represent the data ontology. The informative unit of ABSTAT is the Abstract Knowledge Pattern (AKP), an abstract representation of a Knowledge Pattern. A minimal type pattern is a triple (C, P, D) that represents the occurrences of relational assertions P(a, b) in the RDF data such that C is a minimal type of the subject a and D is a minimal type of the object b. So an AKP states that there are instances of type C that are linked to instances of type D by a property P. Instead of representing every AKP occurring in the data set, ABSTAT summaries include only a base of minimal type patterns, i.e., a subset of the patterns such that every other pattern can be derived using the subtype graph.

[Figure: An example of how AKPs are extracted. A subtype graph over the types Person, Sportist, FootballPlayer, Lawyer and Artist (subclassOf edges); the instances Jim Brown, Amal Clooney and George Clooney with their type edges; the literal 1936-02-17 of type XMLSchema#Date; and the properties hasWife and birthDate. Legend: types, instances, literals. Caption: the (minimal-type) patterns extracted by ABSTAT.]

Consider the case in which the instance Jim Brown has two types, FootballPlayer and Person. From the subtype graph we know that FootballPlayer is a subtype of Person, so of the two FootballPlayer is the minimal type. The ABSTAT summary therefore includes only the AKP (FootballPlayer, birthDate, XMLSchema#Date).
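The Jim Brown case can be sketched in a few lines of Python. This is only an illustration, not ABSTAT's implementation: the subclassOf edges in SUBTYPE_OF are an assumption loosely based on the type names in the figure, and the helper names (supertypes, minimal_types) are hypothetical.

```python
from itertools import product

# Assumed subtype graph: type -> set of direct supertypes (hypothetical edges).
SUBTYPE_OF = {
    "FootballPlayer": {"Sportist"},
    "Sportist": {"Person"},
    "Lawyer": {"Person"},
    "Artist": {"Person"},
}


def supertypes(t, graph):
    """Return all strict ancestors of type t in the subtype graph."""
    seen = set()
    stack = list(graph.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(graph.get(s, ()))
    return seen


def minimal_types(types, graph):
    """Drop every asserted type that is an ancestor of another asserted type."""
    types = set(types)
    redundant = set()
    for t in types:
        redundant |= supertypes(t, graph) & types
    return types - redundant


# Jim Brown is typed as both FootballPlayer and Person.
jim_brown_min = minimal_types({"FootballPlayer", "Person"}, SUBTYPE_OF)

# Minimal type patterns for the assertion birthDate(Jim Brown, 1936-02-17).
base = {(c, "birthDate", "XMLSchema#Date") for c in jim_brown_min}

# Patterns that are NOT stored because they follow from the base via the
# subtype graph (here the literal datatype has no supertypes).
redundant_patterns = {
    (c2, p, d2)
    for (c, p, d) in base
    for c2, d2 in product(supertypes(c, SUBTYPE_OF) | {c}, supertypes(d, SUBTYPE_OF) | {d})
} - base
```

Under these assumptions the base holds the single pattern (FootballPlayer, birthDate, XMLSchema#Date), while the patterns for Sportist and Person are derivable and therefore left out of the summary.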

[Figure, continued: the same example, additionally showing the redundant patterns excluded by the summary.]

Summary Extraction Workflow

The picture shows the summarization workflow. As input, our framework takes a data set in RDF and an ontology in OWL or RDFS. From the data set we extract the typing and relational assertions, while the subtype graph is extracted from the ontology. Then the typing assertions are processed and the set of minimal types of each named individual is computed. Finally, the relational assertions are processed to compute the minimal type patterns that form the minimal pattern base. During each phase we keep track of the occurrences of types, properties and patterns, which are included as statistics in the summary.
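The three phases of the workflow can be sketched as a small in-memory pipeline. This is a simplified illustration under stated assumptions, not the actual ABSTAT code: triples are plain Python tuples, the subtype graph is given as a dictionary, and any subject or object without a typing assertion (including literals) is treated as owl:Thing. The function name summarize and the sample data are hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
TOP = "owl:Thing"  # assumption: fallback type for untyped resources


def ancestors(t, subtype_of):
    """All strict supertypes of t in the subtype graph."""
    seen, stack = set(), list(subtype_of.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(subtype_of.get(s, ()))
    return seen


def summarize(triples, subtype_of):
    # Phase 1: split typing assertions from relational assertions.
    types_of, relations = {}, []
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of.setdefault(s, set()).add(o)
        else:
            relations.append((s, p, o))

    # Phase 2: minimal types of each individual, counting type occurrences.
    min_types, type_counts = {}, Counter()
    for individual, ts in types_of.items():
        redundant = set()
        for t in ts:
            redundant |= ancestors(t, subtype_of) & ts
        min_types[individual] = ts - redundant
        type_counts.update(min_types[individual])

    # Phase 3: minimal type patterns, counting properties and patterns.
    prop_counts, akp_counts = Counter(), Counter()
    for s, p, o in relations:
        prop_counts[p] += 1
        for c in min_types.get(s, {TOP}):
            for d in min_types.get(o, {TOP}):
                akp_counts[(c, p, d)] += 1
    return type_counts, prop_counts, akp_counts


# Hypothetical toy data set.
triples = [
    ("George_Clooney", RDF_TYPE, "Artist"),
    ("George_Clooney", RDF_TYPE, "Person"),
    ("Amal_Clooney", RDF_TYPE, "Lawyer"),
    ("George_Clooney", "hasWife", "Amal_Clooney"),
]
subtype_of = {"Artist": {"Person"}, "Lawyer": {"Person"}}

type_counts, prop_counts, akp_counts = summarize(triples, subtype_of)
```

Keeping the three counters separate mirrors the statistics described in the notes: occurrences of minimal types, of properties, and of patterns.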

ABSTAT User Interfaces
ABSTAT homepage (http://abstat.disco.unimib.it)
ABSTATBrowse (http://abstat.disco.unimib.it/browse)
ABSTATSearch (http://abstat.disco.unimib.it/search)
SPARQL Endpoint (http://abstat.disco.unimib.it/sparql)

At the time of speaking ABSTAT has four user interfaces. The homepage is accessible at abstat.disco.unimib.it. The browsing interface helps users browse, for a given data set, the types, properties and AKPs in its summary, and it also returns the statistics for every element of the summary. ABSTATSearch implements full-text search over a set of summaries; types, properties and patterns are represented by means of their local names (e.g., Person, birthPlace or Person birthPlace Country). Finally, the SPARQL endpoint allows users to execute SPARQL queries against the summary.
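As a rough sketch of how the SPARQL endpoint could be addressed programmatically, the snippet below builds a GET request URL following the standard SPARQL Protocol, in which the query text is passed in the query parameter. The endpoint URL is the one listed on the slide; the query is deliberately generic because the vocabulary used to publish the summaries is not described here, and whether the endpoint is still reachable is not assumed. The function name build_request_url is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint URL as listed on the slide.
ENDPOINT = "http://abstat.disco.unimib.it/sparql"

# Generic query: it makes no assumption about the summary's vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request_url(endpoint, query):
    """Build a SPARQL Protocol GET URL; 'query' is the standard parameter name."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client; the request itself is left out so that the sketch runs without network access.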

Experimental Evaluation
Summary compactness
  Number of patterns in the summary vs. number of triples in the data set
  Comparison with a similar approach without minimalization
Summary informativeness
  Insights about the semantics of the properties
  Small-scale user study

We evaluate our summaries from different perspectives. To measure the compactness of ABSTAT summaries, we compare the number of patterns in the summary with the number of triples in the data set, and we also compare our approach with Loupe, a similar approach without minimalization. The informativeness of our summaries is evaluated with two experiments. In the first we show that our summaries provide useful insights about the semantics of properties, based on their usage within a data set. In the second we conduct a preliminary user study to evaluate whether exploring the summaries helps users in query formulation tasks.

Compactness

Data sets and summaries statistics

Dataset              Relational  Typing  Assertions  Types (Ext.)  Properties (Ext.)  Patterns
DBpedia Core 2014    40.5M       29.7M   70.1M       869 (85)      1439 (15)          171340
DBpedia 3.9 Infobox  96.3M       19.7M   116.4M      821 (58)      62572 (14)         732418
Linked Brainz        180.1M      39.6M   221.7M      21 (9)        33 (0)             161

Reduction rate = number of patterns / number of assertions in the data set

Dataset            ABSTAT        LOUPE
DBpedia Core 2014  0.002         0.01
Linked Brainz      6.72 × 10^-7  7.1 × 10^-7

LOUPE: similar to ABSTAT without minimalization

Minimalization produces more compact summaries
Advantage of minimalization is more observable for datasets with richer subtype graphs and typing assertions

To evaluate the compactness of a summary we measure the reduction rate, defined as the ratio between the number of patterns in a summary and the number of assertions from which the summary has been extracted. In our evaluation we use the summaries extracted from three linked data sets: DBpedia Core 2014, DBpedia 3.9 with Infoboxes and Linked Brainz. The first table shows some statistics about these three data sets, while the second compares the reduction rate achieved by ABSTAT with that of Loupe, which does not use minimalization. For DBpedia Core 2014, the reduction rate of ABSTAT is on the order of thousandths, while that of Loupe is on the order of hundredths. The summaries computed by ABSTAT are more compact because we include only minimal type patterns. The effect of minimalization is more observable on the DBpedia data sets, since DBpedia has a richer subtype graph and more typing assertions. Note also that external types, which were not part of the original terminology, are added to the subtype graph during the minimal types computation phase and are therefore considered minimal types by default.
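The reduction rate is a plain ratio, so it can be recomputed from the statistics table. The sketch below uses the pattern and assertion counts shown on the slide; since the assertion counts are rounded to the nearest 0.1M there, the computed values only approximate the reported ones and may differ from them slightly. The function name reduction_rate is hypothetical.

```python
# (number of patterns, number of assertions), as given in the statistics table.
# Assertion counts are rounded on the slide, so the results are approximate.
DATASETS = {
    "DBpedia Core 2014": (171340, 70.1e6),
    "DBpedia 3.9 Infobox": (732418, 116.4e6),
    "Linked Brainz": (161, 221.7e6),
}


def reduction_rate(patterns, assertions):
    """Ratio between patterns in the summary and assertions in the data set."""
    return patterns / assertions


rates = {name: reduction_rate(p, a) for name, (p, a) in DATASETS.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2e}")
```

A smaller value means a more compact summary relative to the size of the data set.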

Informativeness
ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set

Dataset              Missing Domain (%)  Missing Range (%)  Missing Domain & Range (%)
DBpedia Core 2014    259 (18%)           187 (13%)          48 (3.3%)
DBpedia 3.9 Infobox  61368 (98%)         61309 (98%)        61161 (97%)
Linked Brainz        13 (39%)            15 (45%)           13 (39%)

In the second experiment we evaluated whether ABSTAT summaries can provide useful insights about the semantics of properties, based on their usage within a data set. As the table shows, around 18% of the properties of DBpedia Core have no domain information, 13% have no range information, and both domain and range are missing for 3.3% of the properties. DBpedia Core is the most curated subset of DBpedia, as it includes only triples generated by user-validated mappings to Wikipedia templates. In contrast, for the db3.9-infobox data set, which also includes triples generated by information extraction algorithms, most of the properties (i.e., the ones from the dbpedia.org/property namespace) are not specified within the terminology.

Inferred domain and range for DBpedia Core 2014

Here we provide an overview of the number of different minimal types that constitute the domain and range of unspecified properties, extracted from the summary of the db2014-core data set. The left part of the plot shows the properties whose semantics is "less clear", in the sense that their domain and range cover a higher number of different minimal types, e.g., the dbo:type property. Surprisingly, the dbo:religion property is among them: its semantics is not as clear as one might think, as its range covers 54 disparate minimal types, such as dbo:Organization, dbo:Sport or dbo:EthnicGroup. Conversely, the property dbo:variantOf, whose semantics is intuitively harder to guess, is used within the data set with a very specific meaning, as its domain and range cover only 2 minimal types: dbo:Automobile and dbo:Colour.
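The idea of reading a property's observed domain and range off the summary can be sketched by grouping AKPs by property. The AKP list below is purely illustrative and is not taken from an actual ABSTAT summary; it only reuses type and property names mentioned in the notes. The function name observed_domain_range is hypothetical.

```python
from collections import defaultdict

# Illustrative AKPs (subject type, property, object type); NOT real summary data.
AKPS = [
    ("dbo:Automobile", "dbo:variantOf", "dbo:Automobile"),
    ("dbo:Colour", "dbo:variantOf", "dbo:Colour"),
    ("dbo:Person", "dbo:religion", "dbo:Organization"),
    ("dbo:Person", "dbo:religion", "dbo:EthnicGroup"),
    ("dbo:Person", "dbo:religion", "dbo:Sport"),
]


def observed_domain_range(akps):
    """Collect, per property, the minimal types seen as subject and as object."""
    domain, range_ = defaultdict(set), defaultdict(set)
    for subject_type, prop, object_type in akps:
        domain[prop].add(subject_type)
        range_[prop].add(object_type)
    return domain, range_


domain, range_ = observed_domain_range(AKPS)

# A property whose range spans many unrelated minimal types is a candidate
# for having "less clear" semantics in the sense used in the notes.
spread = {prop: len(types) for prop, types in range_.items()}
```

Sorting properties by such a spread is one way the plot's ordering could be approximated.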

User Study: Setup
Can ABSTAT be useful to support query formulation?
Queries to DBpedia 3.9 Infobox from the Questions and Answering in Linked Open Data benchmark
5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
20 participants, 2 groups:
  abstat group uses ABSTAT (after 20 min of training)
  control group does not use ABSTAT
Measures:
  Time needed to formulate the query
  Accuracy of the answer

In the last experiment we evaluated whether ABSTAT can help users formulate queries. We selected a set of queries to the db3.9-infobox data set from the Questions and Answering in Linked Open Data benchmark. We selected five queries of increasing length, defined as the number of triple patterns within the WHERE clause: one query of length one, two of length two and two of length three. Overall, 20 participants with no prior knowledge of the ABSTAT framework were selected and split into two groups, abstat and control. Only the participants of the first group were trained, for about 20 minutes, on how to use ABSTAT. Both groups executed SPARQL queries against the db3.9-infobox data set through the same interface and were asked to submit the results they considered correct for each query. We measured the time spent to complete each query and the correctness of the answers.

User Study: Questionnaire

This is a screenshot from the questionnaire. We gave the participants the query in natural language and a "template" of the corresponding SPARQL query, with spaces intentionally left blank for properties and/or concepts. Besides the properties, they also had to fill in a box with the results they obtained by executing that SPARQL query.

User Study: Results

Query 1 (length 1): How many employees does Google have?
Group    Avg. Completion Time (s)  Accuracy
abstat   358.9                     0.9
control  380.6                     0.8

Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin.
Group    Avg. Completion Time (s)  Accuracy
abstat   356.3                     1
control  346.9                     0.8

Query 3 (length 2): Which professional surfers were born in Australia?
Group    Avg. Completion Time (s)  Accuracy
abstat   476.6                     0.6
control  234.24                    0.7

Query 4 (length 3): In which films directed by Gary Marshall was Julia Roberts starring?
Group    Avg. Completion Time (s)  Accuracy
abstat   333.4                     0.9
control  445.6                     0.9

Query 5 (length 3): Give me all books by William Goldman with more than 300 pages.
Group    Avg. Completion Time (s)  Accuracy
abstat   233.4                     1
control  569.8                     0.7

The independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005

This table shows the results of the user study. As a general observation, participants of the abstat group took advantage of the summary, obtaining benefits in terms of average completion time, accuracy, or both. Moreover, they achieved increasing accuracy over queries of increasing difficulty while still performing the tasks faster. We interpret the latter trend as a classical cognitive pattern, as the participants became more familiar with ABSTATBrowse and ABSTATSearch. The independent t-test showed that the time needed to correctly answer Q5, the most difficult query, differed significantly between the two groups, t(16) = 10.32, p < .005, with the mean time to answer Q5 correctly being significantly higher (+336 s) for the control group than for the abstat group.
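The gap of about 336 s quoted for Q5 can be reproduced directly from the results table. The short sketch below recomputes, for each query, how much longer the control group took on average than the abstat group; the numbers are copied from the table, and the variable names are hypothetical.

```python
# query number -> {group: (average completion time in seconds, accuracy)},
# values copied from the results table.
RESULTS = {
    1: {"abstat": (358.9, 0.9), "control": (380.6, 0.8)},
    2: {"abstat": (356.3, 1.0), "control": (346.9, 0.8)},
    3: {"abstat": (476.6, 0.6), "control": (234.24, 0.7)},
    4: {"abstat": (333.4, 0.9), "control": (445.6, 0.9)},
    5: {"abstat": (233.4, 1.0), "control": (569.8, 0.7)},
}

# Positive values: the abstat group was faster; negative: the control group was.
time_saved = {
    query: groups["control"][0] - groups["abstat"][0]
    for query, groups in RESULTS.items()
}

for query, seconds in time_saved.items():
    print(f"Query {query}: {seconds:+.1f} s")
```

A negative value for a query means the control group was faster on average there, as happens for query 3, which the following slide discusses.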

User Study: Results Analysis
abstat group users benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
Increasing accuracy over increasing difficulty, performing the tasks faster
Exception is query 3, because the individual Surfing is classified with no type other than owl:Thing
Two strategies used by participants from the control group to answer the queries:
  Directly accessing the public web page describing the DBpedia named individuals mentioned in the query
  Very few submitted explorative SPARQL queries to the endpoint

As we noticed in the previous table, abstat users achieve increasing accuracy over increasing difficulty while performing the tasks faster. The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing. As a consequence, participants from the abstat group went through a more time-consuming trial-and-error process in order to guess the right type and property. Participants from the control group used two strategies to answer the queries: directly accessing the public web pages describing the DBpedia named individuals mentioned in the query and, for very few of them, submitting explorative SPARQL queries to the endpoint. Most of the users searched on Google for some entity in the query, then consulted DBpedia web pages to find the correct answer. DBpedia is arguably the best searchable data set, which is why this explorative approach was successful for relatively simple queries. However, this approach does not work with other, non-indexed data sets (e.g., LinkedBrainz) or for complex queries.

Conclusion and Future Work
ABSTAT: ontology-driven summarization with minimalization
Considerable reduction rate and promising results about the informativeness of the summary
Currently extending the user study
Apply relevance-oriented summarization methods based on connectivity analysis
ABSTAT summary should consider the inheritance of properties to produce even more compact summaries
We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)
APIs available soon

So, to conclude, we presented ABSTAT, an ontology-driven summarization model with minimalization. Using ABSTAT we obtained a considerable reduction rate and promising results about the informativeness of the summary. We are currently extending the user study, and we are planning to apply relevance-oriented summarization methods based on connectivity analysis. We also plan to consider the inheritance of properties to produce even more compact summaries. More than 20 data sets are already summarized, and we plan to summarize the most important data sets in the LOD cloud. APIs will be available soon.

Thank you for your attention!

I will be around the whole week, so please try ABSTAT, and if you have any questions or feedback you are more than welcome. Thank you!

www.abstat.unimib.it

Feedback is welcome!