Semantic Web 0 (2017) 1–0, IOS Press

Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Editor(s): Amrapali Zaveri, University of Leipzig
Solicited review(s): Zhigang Wang, Beijing Normal University, China; Anonymous; Sebastian Mellor, Newcastle University, U.K.

Michael Färber *,**, Frederic Bartscherer, Carsten Menne, and Achim Rettinger ***

Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe, Germany

Abstract. In recent years, several noteworthy large, cross-domain, and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we provide data quality criteria according to which KGs can be analyzed, and we analyze and compare the above-mentioned KGs. Furthermore, we propose a framework for finding the most suitable KG for a given setting.

Keywords: Knowledge Graph, Linked Data Quality, Data Quality Metrics, Comparison, DBpedia, Freebase, OpenCyc, Wikidata, YAGO

1. Introduction

The vision of the Semantic Web is to publish and query knowledge on the Web in a semantically structured way. According to Guns [23], the term “Semantic Web” had already been used in fields such as Educational Psychology before it became prominent in Computer Science. Freedman and Reynolds [21], for instance, describe “semantic webbing” as organizing information and relationships in a visual display. Berners-Lee had mentioned his idea of using typed links as a vehicle of semantics as early as 1989 and proposed it under the term Semantic Web for the first time at the INET conference in 1995 [23].

* Corresponding author. E-mail: michael.faerber@kit.edu.
** This work was carried out with the support of the German Federal Ministry of Education and Research (BMBF) within the Software Campus project SUITE (Grant 01IS12051).

*** The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.

The idea of a Semantic Web was introduced to a wider audience by Berners-Lee in 2001 [10]. According to his vision, the traditional Web as a Web of Documents should be extended to a Web of Data, where not only documents and links between documents, but any entity (e.g., a person or organization) and any relation between entities (e.g., isSpouseOf) can be represented on the Web.

When it comes to realizing the idea of the Semantic Web, knowledge graphs (KGs) are currently seen as one of the most essential components. The term “knowledge graph” was reintroduced by Google in 2012 [42] and is intended for any graph-based knowledge repository. Since RDF graphs are used in the Semantic Web, we use the term knowledge graph for any RDF graph. An RDF graph consists of a finite set of RDF triples, where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject s ∈ U ∪ B, a predicate p ∈ U, and an object o ∈ U ∪ B ∪ L. An RDF term is either a URI u ∈ U, a blank node b ∈ B, or a literal l ∈ L. U, B, and L are infinite sets and pairwise

1570-0844/17/$27.50 © 2017 – IOS Press and the authors. All rights reserved.


disjoint. We denote the system that hosts a KG g with h_g.
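The formal definition above can be illustrated in code: a KG is a finite set of triples over URIs (U), blank nodes (B), and literals (L). The following minimal Python sketch is our own illustration (the tagged-tuple term representation and the example triples are assumptions, not part of the survey):

```python
# Minimal model of an RDF graph g as a finite set of triples (s, p, o).
# Term kinds: URIs, blank nodes, and literals; the tags are our own convention.
URI, BNODE, LITERAL = "uri", "bnode", "literal"

def term(kind, value):
    return (kind, value)

def is_valid_triple(s, p, o):
    """Subject must be a URI or blank node, the predicate a URI, and the
    object a URI, blank node, or literal -- mirroring s in U u B, p in U,
    o in U u B u L."""
    return (s[0] in (URI, BNODE)
            and p[0] == URI
            and o[0] in (URI, BNODE, LITERAL))

g = {
    (term(URI, "dbr:Berlin"), term(URI, "rdf:type"), term(URI, "dbo:City")),
    (term(URI, "dbr:Berlin"), term(URI, "rdfs:label"), term(LITERAL, "Berlin")),
}

assert all(is_valid_triple(s, p, o) for (s, p, o) in g)
```

A triple with a literal in subject position would be rejected by `is_valid_triple`, reflecting that subjects are drawn from U ∪ B only.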

In this survey, we focus on those KGs having the following aspects:

1. The KGs are freely accessible and freely usable within the Linked Open Data (LOD) cloud. Linked Data refers to a set of best practices^1 for publishing and interlinking structured data on the Web, defined by Berners-Lee [8] in 2006. Linked Open Data refers to Linked Data which can be freely used, modified, and shared by anyone for any purpose.^2 The aim of the Linking Open Data community project^3 is to publish RDF datasets on the Web and to interlink these datasets.

2. The KGs should cover general knowledge (often also called cross-domain or encyclopedic knowledge) instead of knowledge about special domains such as biomedicine.

Thus, out of scope are KGs which are not openly available, such as the Google Knowledge Graph^4 and the Google Knowledge Vault [13]. Excluded are also KGs which are only accessible via an API but which are not provided as dump files (see WolframAlpha^5 and the Facebook Graph^6), as well as KGs which are not based on Semantic Web standards at all or which are only unstructured or weakly structured knowledge collections (e.g., The World Factbook of the CIA^7).

For selecting the KGs for analysis, we regarded all datasets which had been registered at the online dataset catalog http://datahub.io^8 and which were tagged as “crossdomain”. Besides that, we took Wikidata into consideration, since it also fulfilled the above-mentioned requirements. Based on that, we selected DBpedia, Freebase, OpenCyc, Wikidata, and YAGO as KGs for our comparison.

^1 See http://www.w3.org/TR/ld-bp, requested on April 5, 2016.
^2 See http://opendefinition.org, requested on Apr 5, 2016.
^3 See http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData, requested on Apr 5, 2016.
^4 See http://www.google.com/insidesearch/features/search/knowledge.html, requested on Apr 3, 2016.
^5 See http://products.wolframalpha.com/api, requested on Aug 30, 2016.
^6 See https://developers.facebook.com/docs/graph-api, requested on Aug 30, 2016.
^7 See https://www.cia.gov/library/publications/the-world-factbook, requested on Aug 30, 2016.
^8 This catalog is also used for registering Linked Open Data datasets.

In this paper, we give a systematic overview of these KGs in their current versions (as of April 2016) and discuss how the knowledge in these KGs is modeled, stored, and queried. To the best of our knowledge, such a comparison between these widely used KGs has not been presented before. Note that the focus of this survey is not the life cycle of KGs on the Web or in enterprises; we refer in this respect to [5]. Instead, the focus of our KG comparison is on data quality, as this is one of the most crucial aspects when it comes to considering which KG to use in a specific setting.

Furthermore, we provide a KG recommendation framework for users who are interested in using one of the mentioned KGs in a research or industrial setting but who are inexperienced in which KG to choose for their concrete settings.

The main contributions of this survey are:

1. Based on existing literature on data quality, we provide 34 data quality criteria according to which KGs can be analyzed.

2. We calculate key statistics for the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

3. We analyze DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along the mentioned data quality criteria.^9

4. We propose a framework which enables users to find the most suitable KG for their needs.

The survey is organized as follows:

– In Section 2, we introduce formal definitions used throughout the article.

– In Section 3, we describe the data quality dimensions which we later use for the KG comparison, including their subordinated data quality criteria and corresponding data quality metrics.

– In Section 4, we describe the selected KGs.
– In Section 5, we analyze the KGs using several key statistics and using the data quality metrics introduced in Section 3.

– In Section 6, we present our framework for assessing and rating KGs according to the user's setting.

– In Section 7, we present related work on (linked) data quality criteria and on key statistics for KGs.

– In Section 8, we conclude the survey.

^9 The data and detailed evaluation results for both the key statistics and the metric evaluations are available online at http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Jan 31, 2017).


2. Important Definitions

We define the following sets that are used in formalizations throughout the article. If not otherwise stated, we use the prefixes listed in Listing 1 for indicating namespaces throughout the article.

– C_g denotes the set of classes in g:
C_g = {x | (x, rdfs:subClassOf, o) ∈ g ∨ (s, rdfs:subClassOf, x) ∈ g ∨ (x, wdt:P279, o) ∈ g ∨ (s, wdt:P279, x) ∈ g ∨ (x, rdf:type, rdfs:Class) ∈ g}

– An instance of a class is a resource which is a member of that class. This membership is given by a corresponding instantiation assignment.^10 I_g denotes the set of instances in g:
I_g = {s | (s, rdf:type, o) ∈ g ∨ (s, wdt:P31, o) ∈ g}

– Entities are defined as instances which represent real-world objects. E_g denotes the set of entities in g:
E_g = {s | (s, rdf:type, owl:Thing) ∈ g ∨ (s, rdf:type, wdo:Item) ∈ g ∨ (s, rdf:type, freebase:common.topic) ∈ g ∨ (s, rdf:type, cych:Individual) ∈ g}

– Relations (interchangeably used with properties) are links between RDF terms^11 defined on the schema level (i.e., T-Box). To emphasize this characterization, we also call them explicitly defined relations. P_g denotes the set of all those relations in g:
P_g = {s | (s, rdf:type, rdf:Property) ∈ g ∨ (s, rdf:type, rdfs:Property) ∈ g ∨ (s, rdf:type, wdo:Property) ∈ g ∨ (s, rdf:type, owl:FunctionalProperty) ∈ g ∨ (s, rdf:type, owl:InverseFunctionalProperty) ∈ g ∨ (s, rdf:type, owl:DatatypeProperty) ∈ g ∨ (s, rdf:type, owl:ObjectProperty) ∈ g ∨ (s, rdf:type, owl:SymmetricProperty) ∈ g ∨ (s, rdf:type, owl:TransitiveProperty) ∈ g}

– Implicitly defined relations embrace all links used in the KG, i.e., on instance and schema level. We also call them predicates. P^imp_g denotes the set of all implicitly defined relations in g:
P^imp_g = {p | (s, p, o) ∈ g}

^10 See https://www.w3.org/TR/rdf-schema, requested on Aug 29, 2016.
^11 RDF terms comprise URIs, blank nodes, and literals.

– U_g denotes the set of all URIs used in g:
U_g = {x | ((x, p, o) ∈ g ∨ (s, x, o) ∈ g ∨ (s, p, x) ∈ g) ∧ x ∈ U}

– U^local_g denotes the set of all URIs in g with local namespace, i.e., those URIs start with the prefix dedicated to the KG g (cf. Listing 1).

– Complementary, U^ext_g consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving those URIs.

Note that knowledge about the KGs which were analyzed for this survey was taken into account when defining these sets. These definitions may not be appropriate for other KGs.

Furthermore, the sets' extensions would be different when assuming a certain semantics (e.g., RDF, RDFS, or OWL-LD). Under the assumption that all entailments under one of these semantics were added to a KG, the definition of each set could be simplified and the extensions would be of larger cardinality. However, for this article we did not derive entailments.
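Several of the sets defined above can be computed directly from a triple set by set comprehension. The following Python sketch shows this for I_g, E_g, P^imp_g, and U_g; the toy graph, the plain-string QNames, and the quote-marking of literals are our own assumptions for illustration:

```python
# Toy KG: triples as (s, p, o) string tuples; literal objects carry quotes.
g = {
    ("dbr:Berlin", "rdf:type", "owl:Thing"),
    ("dbr:Berlin", "dbo:populationTotal", '"3469849"'),
    ("wd:Q64", "wdt:P31", "wd:Q515"),
}

def is_literal(x):
    return x.startswith('"')

# I_g: subjects of rdf:type or wdt:P31 statements.
I_g = {s for (s, p, o) in g if p in ("rdf:type", "wdt:P31")}

# E_g: instances typed as owl:Thing, wdo:Item, freebase:common.topic,
# or cych:Individual.
entity_classes = {"owl:Thing", "wdo:Item", "freebase:common.topic",
                  "cych:Individual"}
E_g = {s for (s, p, o) in g if p == "rdf:type" and o in entity_classes}

# P_imp_g: implicitly defined relations, i.e., every predicate in use.
P_imp_g = {p for (s, p, o) in g}

# U_g: all URIs occurring in any triple position (literals excluded).
U_g = {x for t in g for x in t if not is_literal(x)}

print(sorted(I_g))  # both dbr:Berlin and wd:Q64 are instances
print(sorted(E_g))  # only dbr:Berlin is typed with an entity class
```

Note that `wd:Q64` is in I_g but not in E_g here: it is typed via wdt:P31 with an ordinary class, not with one of the four entity-marking classes from the definition of E_g.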

3. Data Quality Assessment w.r.t. KGs

Everybody on the Web can publish information. Therefore, a data consumer does not only face the challenge of finding a suitable data source, but is also confronted with the issue that data on the Web can differ very much regarding its quality. Data quality can thereby be viewed not only in terms of accuracy but in multiple other dimensions. In the following, we introduce concepts regarding the data quality of KGs in the Linked Data context which are used in the subsequent sections. The data quality dimensions are then presented in Sections 3.2 – 3.5.

Data quality (DQ) – in the following interchangeably used with information quality^12 – is defined by Juran et al. [32] as fitness for use. This means that data quality is dependent on the actual use case.

One of the most important and foundational works on data quality is that of Wang et al. [47]. They developed a framework for assessing the data quality of datasets in the database context. In this framework, Wang et al.

^12 As soon as data is considered w.r.t. usefulness, the data is seen in a specific context. It can thus already be regarded as information, leading to the term “information quality” instead of “data quality”.

4 M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO

Listing 1: Default prefixes for namespaces used throughout this article

@prefix cc: <http://creativecommons.org/ns#> .
@prefix cyc: <http://sw.opencyc.org/concept/> .
@prefix cych: <http://sw.opencyc.org/2012/05/10/concept/en/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dby: <http://dbpedia.org/class/yago/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freebase: <http://rdf.freebase.com/ns/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix umbel: <http://umbel.org/umbel/sc/> .
@prefix void: <http://www.w3.org/TR/void/> .
@prefix wdo: <http://www.wikidata.org/ontology#> .
@prefix wdt: <http://www.wikidata.org/entity/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix yago: <http://yago-knowledge.org/resource/> .

distinguish between data quality criteria, data quality dimensions, and data quality categories.^13 In the following, we reuse these concepts for our own framework, which has a particular focus on the data quality of KGs in the context of Linked Open Data.

A data quality criterion (Wang et al. also call it “data quality attribute”) is a particular characteristic of data w.r.t. its quality and can be either subjective or objective. An example of a subjectively measurable data quality criterion is Trustworthiness on KG level. An example of an objective data quality criterion is the Syntactic validity of RDF documents (see Section 3.2 and [46]).

In order to measure the degree to which a certain data quality criterion is fulfilled for a given KG, each criterion is formalized and expressed in terms of a function with the value range [0, 1]. We call this function the data quality metric of the respective data quality criterion.

A data quality dimension – in the following just called dimension – is a main aspect of how data quality can be viewed. A data quality dimension comprises one or several data quality criteria [47]. For instance, the

^13 The quality dimensions are defined in [47]; the sub-classification into parameters/indicators in [46, p. 354].

criteria Syntactic validity of RDF documents, Syntactic validity of literals, and Semantic validity of triples form the Accuracy dimension.

Data quality dimensions and their respective data quality criteria are further grouped into data quality categories. Based on empirical studies, Wang et al. specified four categories:

– Criteria of the category of intrinsic data quality focus on the fact that data has quality in its own right.

– Criteria of the category of contextual data quality cannot be considered in general but must be assessed depending on the application context of the data consumer.

– Criteria of the category of representational data quality reveal in which form the data is available.

– Criteria of the category of accessibility data quality determine how the data can be accessed.

Since its publication, the presented framework of Wang et al. has been used extensively, either in its original version or in an adapted or extended version. Bizer [11] and Zaveri et al. [49] worked on data quality in the Linked Data context. They make the following adaptations to Wang et al.'s framework:


– Bizer [11] compared the work of Wang et al. [47] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.

– Zaveri et al. [49] follow Wang et al. [47] but introduce licensing and interlinking as new dimensions in the Linked Data context.

In this article, we use the DQ dimensions as defined by Wang et al. [47] and as extended by Bizer [11] and Zaveri et al. [49]. More precisely, we make the following adaptations to Wang et al.'s framework:

1. Consistency is treated by us as a separate DQ dimension.

2. Verifiability is incorporated within the DQ dimension Trustworthiness as the criterion Trustworthiness on statement level.

3. The Offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.

4. We extend the category of accessibility data quality by the dimensions License and Interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

3.1. Criteria Weighting

When applying our framework to compare KGs, the single DQ metrics can be weighted differently, so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a KG g, a set of criteria C = {c_1, ..., c_n}, a set of metrics M = {m_1, ..., m_n}, and a set of weights W = {w_1, ..., w_n}. Each metric m_i corresponds to the criterion c_i, and m_i(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a KG regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion c_i is weighted by w_i.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted normalized sum of the fulfillment degrees w.r.t. the criteria c_1, ..., c_n:

h(g) = (Σ_{i=1}^{n} w_i · m_i(g)) / (Σ_{j=1}^{n} w_j)

Based on the quality dimensions introduced by Wang et al. [47], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 7.

Note also that our metrics are to be understood as possible ways of how to evaluate the DQ dimensions. Other definitions of the DQ metrics might be possible and reasonable. We defined the metrics along the characteristics of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, but kept the definitions as generic as possible. In the evaluations, we then used those metric definitions and applied them, e.g., on the basis of self-created gold standards.

3.2. Intrinsic Category

“Intrinsic data quality denotes that data have quality in their own right” [47]. This kind of data quality can therefore be assessed independently of the context. The intrinsic category embraces the three dimensions Accuracy, Trustworthiness, and Consistency, which are defined in the following subsections. The dimensions Believability, Objectivity, and Reputation, which are separate dimensions in Wang et al.'s classification system [47], are subsumed by us under the dimension Trustworthiness.

3.2.1. Accuracy
Definition of dimension: Accuracy is “the extent to which data are correct, reliable, and certified free of error” [47].

Discussion: Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [47]. Hence, accuracy has often been used as a synonym for data quality [39]. Bizer [11] highlights in this context that Accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish between syntactic and semantic accuracy. Syntactic accuracy describes the formal compliance to syntactic rules without reviewing whether the value reflects reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for Accuracy as follows.

Definition of metric: The dimension Accuracy is determined by the criteria
– Syntactic validity of RDF documents,
– Syntactic validity of literals, and


– Semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows.

Syntactic validity of RDF documents: The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF validator.^14

m_synRDF(g) =
  1  if all RDF documents are valid
  0  otherwise
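Full RDF validation requires a dedicated validator; as a rough stand-in, one can at least check that an RDF/XML document is well-formed XML, which is a necessary but not sufficient condition for syntactic validity. The following Python sketch uses only the standard library and is our own approximation, not the procedure used in the survey:

```python
import xml.etree.ElementTree as ET

def m_syn_rdf(documents):
    """Returns 1 if all RDF/XML documents parse as well-formed XML, else 0.
    XML well-formedness only approximates full RDF syntactic validity."""
    for doc in documents:
        try:
            ET.fromstring(doc)
        except ET.ParseError:
            return 0
    return 1

valid = ('<?xml version="1.0"?>'
         '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
         '</rdf:RDF>')
broken = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'  # unclosed tag

print(m_syn_rdf([valid]))          # 1
print(m_syn_rdf([valid, broken]))  # 0
```

A real deployment would instead run each document through a proper RDF parser, which additionally checks RDF-specific constraints beyond XML well-formedness.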

Syntactic validity of literals: Assessing the syntactic validity of literals means determining to which degree literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

m_synLit(g) = |{(s, p, o) ∈ g | o ∈ L ∧ synValid(o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
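A regular-expression rule of the kind mentioned above can be sketched for ISO 8601 calendar dates; the pattern is a simplified illustration of a date rule (not the full ISO 8601 grammar), and the datatype-tagged triple representation is our own assumption:

```python
import re

# Simplified ISO 8601 calendar date (YYYY-MM-DD); illustrative only,
# not the complete ISO 8601 grammar.
ISO_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def syn_valid(literal, datatype):
    if datatype == "xsd:date":
        return bool(ISO_DATE.match(literal))
    return True  # no rule available for this datatype: treat as valid

def m_syn_lit(triples):
    """Share of literal objects that pass the syntax rule of their datatype.
    Triples: (s, p, o, datatype), with datatype None for non-literal objects."""
    literals = [(o, dt) for (s, p, o, dt) in triples if dt is not None]
    if not literals:
        return 1.0  # empty denominator: metric evaluates to 1
    return sum(syn_valid(o, dt) for (o, dt) in literals) / len(literals)

g = [
    ("dbr:Barack_Obama", "dbo:birthDate", "1961-08-04", "xsd:date"),
    ("dbr:Barack_Obama", "dbo:birthDate", "04.08.1961", "xsd:date"),  # not ISO 8601
    ("dbr:Barack_Obama", "rdf:type", "dbo:Person", None),
]
print(m_syn_lit(g))  # 1 of 2 literals valid -> 0.5
```

Only the two date literals enter the denominator; the rdf:type triple has a URI object and is ignored by the metric, mirroring the restriction to o ∈ L.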

Semantic validity of triples: The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer et al. [11] note that a triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File), if it is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has similar guidelines implemented to determine whether a fact needs to be sourced.^15

^14 See http://www.w3.org/RDF/Validator, requested on Feb 29, 2016.

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as gold standard. We determine the fulfillment degree as the precision with which the triples that are in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let no_gGS = |{(s, p, o) | (s, p, o) ∈ g ∧ (x, y, z) ∈ GS ∧ equi(s, x) ∧ equi(p, y) ∧ equi(o, z)}| be the number of triples in g for which semantically corresponding triples in the gold standard GS exist. Let no_g = |{(s, p, o) | (s, p, o) ∈ g ∧ (x, y, z) ∈ GS ∧ equi(s, x) ∧ equi(p, y)}| be the number of triples in g where the subject-relation pairs (s, p) are semantically equivalent to subject-relation pairs (x, y) in the gold standard. Then we can state:

m_semTriple(g) = no_gGS / no_g

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
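The precision computation against a gold standard can be sketched as follows; equating terms by plain string equality stands in for the semantic equivalence relation equi, and the toy graphs are our own examples:

```python
def equi(a, b):
    # Placeholder for semantic equivalence; here plain string equality.
    return a == b

def m_sem_triple(g, gold):
    """Precision of g against a gold standard: among triples whose
    subject-relation pair also occurs in the gold standard, the share
    whose object agrees as well (no_gGS / no_g)."""
    no_g = no_g_gs = 0
    for (s, p, o) in g:
        matching = [(x, y, z) for (x, y, z) in gold
                    if equi(s, x) and equi(p, y)]
        if matching:
            no_g += 1
            if any(equi(o, z) for (x, y, z) in matching):
                no_g_gs += 1
    if no_g == 0:
        return 1.0  # empty denominator: metric evaluates to 1
    return no_g_gs / no_g

g = {("dbr:Berlin", "dbo:country", "dbr:Germany"),
     ("dbr:Paris", "dbo:country", "dbr:Germany"),    # wrong object
     ("dbr:Rome", "dbo:mayor", "dbr:Some_Person")}   # no gold counterpart
gold = {("dbr:Berlin", "dbo:country", "dbr:Germany"),
        ("dbr:Paris", "dbo:country", "dbr:France")}
print(m_sem_triple(g, gold))  # 1 correct of 2 comparable triples -> 0.5
```

The Rome triple does not lower the score: it has no subject-relation counterpart in the gold standard and thus never enters the denominator.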

3.2.2. Trustworthiness
Definition of dimension: Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is “the extent to which data are accepted or regarded as true, real, and credible” [47].

– Reputation: Reputation is “the extent to which data are trusted or highly regarded in terms of their source or content” [47].

– Objectivity: Objectivity is “the extent to which data are unbiased (unprejudiced) and impartial” [47].

– Verifiability: Verifiability is “the degree and ease with which the data can be checked for correctness” [39].

^15 See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.


Discussion: In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, and objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the “expected accuracy” of a data source.

– Reputation: The essential difference between believability and accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since also biased statements can be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric: We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level: The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding data insertion, we can differentiate between 1. whether the data is entered by experts (of a specific domain), 2. whether the knowledge comes from volunteers contributing in a community, and 3. whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_graph(h_g) =
  1     manual data curation, manual data insertion in a closed system
  0.75  manual data curation and insertion, both by a community
  0.5   manual data curation; data insertion by a community or by automated knowledge extraction
  0.25  automated data curation; data insertion by automated knowledge extraction from structured data sources
  0     automated data curation; data insertion by automated knowledge extraction from unstructured data sources

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be used for defining the DQ metrics.

Trustworthiness on statement level: The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for assessing statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms^16 with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology^17 with properties such as prov:wasDerivedFrom.

^16 See http://purl.org/dc/terms, requested on Feb 4, 2017.

^17 See https://www.w3.org/TR/prov-o, requested on Dec 27, 2016.


m_fact(g) =
  1    provenance on statement level is used
  0.5  provenance on resource level is used
  0    otherwise

Indicating unknown and empty values: If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow representing that a person has no children, and unknown values allow representing that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_NoVal(g) =
  1    unknown and empty values are used
  0.5  either unknown or empty values are used
  0    otherwise

3.2.3. Consistency
Definition of dimension: Consistency implies that “two or more values [in a dataset] do not conflict with each other” [37].

Discussion: Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource; this means that the subject is the only resource linked to the given object via the given relation.

Definition of metric. We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG $g$ w.r.t. the dimension Consistency is measured by the metrics $m_{checkRestr}$, $m_{conClass}$, and $m_{conRelat}$, which are defined as follows.

Check of schema restrictions during insertion of new statements. Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side, in the user interface. For instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actual inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

\[
m_{checkRestr}(h_g) =
\begin{cases}
1 & \text{schema restrictions are checked}\\
0 & \text{otherwise}
\end{cases}
\]

Consistency of statements w.r.t. class constraints. This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. I.e., let $CC$ be the set of all class constraints, defined as $CC = \{(c_1, c_2) \mid (c_1,\ \text{owl:disjointWith},\ c_2) \in g\}$.¹⁸ Furthermore, let $cg(e)$ be the set of all classes of instance $e$ in $g$, defined as $cg(e) = \{c \mid (e,\ \text{rdf:type},\ c) \in g\}$. Then we define $m_{conClass}(g)$ as follows:

\[
m_{conClass}(g) = \frac{|\{(c_1, c_2) \in CC \mid \nexists e\colon (c_1 \in cg(e) \wedge c_2 \in cg(e))\}|}{|\{(c_1, c_2) \in CC\}|}
\]

In case of an empty set of class constraints $CC$, the metric should evaluate to 1.
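The metric $m_{conClass}$ can be sketched as follows, assuming the KG is a set of (subject, predicate, object) string triples with prefix-form predicate names (an assumption; real KGs use full URIs):

```python
def m_conClass(triples):
    """Sketch of m_conClass: share of owl:disjointWith constraints that
    no instance violates. Empty constraint set evaluates to 1."""
    cc = {(c1, c2) for (c1, p, c2) in triples if p == "owl:disjointWith"}
    if not cc:
        return 1.0
    # Collect the classes of each instance from rdf:type statements.
    classes = {}
    for s, p, o in triples:
        if p == "rdf:type":
            classes.setdefault(s, set()).add(o)
    satisfied = sum(
        1 for (c1, c2) in cc
        if not any(c1 in cs and c2 in cs for cs in classes.values())
    )
    return satisfied / len(cc)
```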

Consistency of statements w.r.t. relation constraints. The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics $m_{conRelat_i}$, indicating the consistency of statements w.r.t. different relation constraints:

\[
m_{conRelat}(g) = \frac{1}{n} \sum_{i=1}^{n} m_{conRelat_i}(g)
\]

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements¹⁹, we can state:

\[
m_{conRelat}(g) = \frac{m_{conRelatRg}(g) + m_{conRelatFct}(g)}{2}
\]

Let $R_r$ be the set of all rdfs:range constraints:

\[
R_r = \{(p, d) \mid (p,\ \text{rdfs:range},\ d) \in g \wedge isDatatype(d)\}
\]

¹⁸Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also applies to dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

¹⁹We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

and let $R_f$ be the set of all owl:FunctionalProperty constraints:

\[
R_f = \{(p, d) \mid (p,\ \text{rdf:type},\ \text{owl:FunctionalProperty}) \in g \wedge (p,\ \text{rdfs:range},\ d) \in g \wedge isDatatype(d)\}
\]

Then we can define the metrics $m_{conRelatRg}(g)$ and $m_{conRelatFct}(g)$ as follows:

\[
m_{conRelatRg}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r\colon datatype(o) = d\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r\}|}
\]

\[
m_{conRelatFct}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f \wedge \nexists (s, p, o_2) \in g\colon o \neq o_2\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f\}|}
\]

In case of an empty set of relation constraints ($R_r$ or $R_f$), the respective metric should evaluate to 1.
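Both relation-constraint metrics can be sketched over a triple set. The sketch below (not from the paper) simplifies the definitions: the `isDatatype(d)` condition is folded into a caller-supplied `datatype_of` function, and predicate names are assumed to be in prefix form.

```python
def m_conRelatRg(triples, datatype_of):
    """Share of range-constrained triples whose object has the declared
    datatype. `datatype_of` maps an object literal to its datatype (assumed)."""
    ranges = {s: o for (s, p, o) in triples if p == "rdfs:range"}
    constrained = [(s, p, o) for (s, p, o) in triples if p in ranges]
    if not constrained:
        return 1.0
    ok = sum(1 for (s, p, o) in constrained if datatype_of(o) == ranges[p])
    return ok / len(constrained)

def m_conRelatFct(triples):
    """Share of functional-property triples whose subject has a unique
    object for that property."""
    functional = {s for (s, p, o) in triples
                  if p == "rdf:type" and o == "owl:FunctionalProperty"}
    constrained = [(s, p, o) for (s, p, o) in triples if p in functional]
    if not constrained:
        return 1.0
    ok = sum(1 for (s, p, o) in constrained
             if not any(s2 == s and p2 == p and o2 != o
                        for (s2, p2, o2) in constrained))
    return ok / len(constrained)
```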

3.3. Contextual Category

Contextual data quality “highlights the requirement that data quality must be considered within the context of the task at hand” [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy

Definition of dimension. Relevancy is “the extent to which data are applicable and helpful for the task at hand” [47].

Discussion. According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension Relevancy is determined by the criterion Creating a ranking of statements.²⁰ The fulfillment degree of a KG $g$ w.r.t. the dimension Relevancy is measured by the metric $m_{Ranking}$, which is defined as follows.

²⁰We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions, which he no longer holds, are ranked with normal rank (wdo:NormalRank).

\[
m_{Ranking}(g) =
\begin{cases}
1 & \text{ranking of statements supported}\\
0 & \text{otherwise}
\end{cases}
\]

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness

Definition of dimension. Completeness is “the extent to which data are of sufficient breadth, depth, and scope for the task at hand” [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is “the extent to which the quantity or volume of available data is appropriate” [47].

– Value-added: Value-added is “the extent to which data are beneficial and provide advantages from their use” [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;
2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing; and
3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. The completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG $g$ w.r.t. the dimension Completeness is measured by the metrics $m_{cSchema}$, $m_{cCol}$, and $m_{cPop}$, which are defined as follows.

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes, such as people and locations in different granularities, and (ii) basic relations, such as birth date and number of inhabitants. We define the schema completeness $m_{cSchema}$ as the ratio of the number of classes and relations of the gold standard existing in $g$, $noclat_g$, to the number of classes and relations in the gold standard, $noclat$:

\[
m_{cSchema}(g) = \frac{noclat_g}{noclat}
\]

Column completeness. In the traditional database area (with a fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric $m_{cCol}(g)$ as the ratio of the number of instances having class $k$ and a value for the relation $p$, $no_{k,p}$, to the number of all instances having class $k$, $no_k$. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

\[
m_{cCol}(g) = \frac{1}{|H|} \sum_{(k,p) \in H} \frac{no_{k,p}}{no_k}
\]

We thereby let $H = \{(k, p) \in (K \times P) \mid k \in C_g \wedge \exists (x, p, o) \in g\colon p \in P^{imp}_g \wedge (x,\ \text{rdf:type},\ k) \in g\}$ be the set of all combinations of the considered classes $K = \{k_1, \dots, k_n\}$ and considered relations $P = \{p_1, \dots, p_m\}$.

Note that there are also relations which are dedicated to the instances of a specific class but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.²¹ For measuring the Column completeness, we selected only those relations for an assessment where a value of the relation typically exists for all given instances.
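The averaged ratio can be sketched as follows, assuming the KG is a set of (subject, predicate, object) string triples and that the considered classes and relations are passed in explicitly (in the paper these come from a manually selected list):

```python
def m_cCol(triples, classes, relations):
    """Sketch of m_cCol: average, over observed (class, relation) pairs,
    of the share of the class's instances having a value for the relation."""
    instances = {}  # class -> set of its instances, from rdf:type
    for s, p, o in triples:
        if p == "rdf:type" and o in classes:
            instances.setdefault(o, set()).add(s)
    has_rel = {}    # (class, relation) -> instances with a value
    for s, p, o in triples:
        if p in relations:
            for k, members in instances.items():
                if s in members:
                    has_rel.setdefault((k, p), set()).add(s)
    if not has_rel:
        return 1.0
    return sum(len(v) / len(instances[k])
               for (k, _), v in has_rel.items()) / len(has_rel)
```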

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard, which covers both well-known entities (called “short head”, e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called “long tail”, e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let $GS$ be the set of entities in the gold standard. Then we can define:

\[
m_{cPop}(g) = \frac{|\{e \mid e \in GS \wedge e \in E_g\}|}{|\{e \mid e \in GS\}|}
\]

3.3.3. Timeliness

Definition of dimension. Timeliness is “the extent to which the age of the data is appropriate for the task at hand” [47].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional, isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is

²¹For an evaluation of predicting which relations are of this nature, see [1].

measured depends on the application context. For some situations, years are sufficient, while in other situations one may need days [39].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements.

The fulfillment degree of a KG $g$ w.r.t. the dimension Timeliness is measured by the metrics $m_{Freq}$, $m_{Validity}$, and $m_{Change}$, which are defined as follows.

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately but the RDF export files are available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

\[
m_{Freq}(g) =
\begin{cases}
1 & \text{continuous updates}\\
0.5 & \text{discrete periodic updates}\\
0.25 & \text{discrete non-periodic updates}\\
0 & \text{otherwise}
\end{cases}
\]

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start and possibly end dates of statements by means of providing suitable forms of representation.

\[
m_{Validity}(g) =
\begin{cases}
1 & \text{specification of validity period supported}\\
0 & \text{otherwise}
\end{cases}
\]

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

\[
m_{Change}(g) =
\begin{cases}
1 & \text{specification of modification dates for statements supported}\\
0 & \text{otherwise}
\end{cases}
\]


3.4. Representational Data Quality

Representational data quality “contains aspects related to the format of the data [...] and meaning of data” [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being a part of the dimension Interoperability.

3.4.1. Ease of Understanding

Definition of dimension. The ease of understanding is “the extent to which data are clear without ambiguity and easily comprehended” [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here, a KG) can be improved by features such as descriptive labels and literals in multiple languages.

Definition of metric. The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG $g$ w.r.t. this dimension is measured by the metrics $m_{Descr}$, $m_{Lang}$, $m_{uSer}$, and $m_{uURI}$, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

\[
m_{Descr}(g) = \frac{|\{u \mid u \in U^{local}_g \wedge \exists (u, p, o) \in g\colon p \in P_{lDesc}\}|}{|\{u \mid u \in U^{local}_g\}|}
\]

$P_{lDesc}$ is the set of implicitly used relations in $g$ indicating that the value is a label or a description (e.g., $P_{lDesc}$ = {rdfs:label, rdfs:comment}).

Moreover, the result of the evaluation on the basis of the entities is interesting: DBpedia deviates considerably, since some entities (intermediate node mapping) have no rdfs:label. Consequently, the definition of the metric is kept general (restricted to proprietary resources, i.e., in the same namespace), while the evaluation is performed only on the entities.

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.²² The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the “basic language”. The now introduced metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

\[
m_{Lang}(g) =
\begin{cases}
1 & \text{labels provided in English and at least one other language}\\
0 & \text{otherwise}
\end{cases}
\]

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats, such as N3, N-Triples, and Turtle. We measure this criterion by determining the supported serialization formats during the dereferencing of resources.

\[
m_{uSer}(h_g) =
\begin{cases}
1 & \text{other RDF serializations than RDF/XML available}\\
0 & \text{otherwise}
\end{cases}
\]

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.²³ recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and memorize by humans compared to opaque URIs,²⁴

²²Using the namespace http://www.w3.org/2004/02/skos/core#.

²³See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

²⁴For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.


such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources.

\[
m_{uURI}(g) =
\begin{cases}
1 & \text{self-describing URIs always used}\\
0.5 & \text{self-describing URIs partly used}\\
0 & \text{otherwise}
\end{cases}
\]

3.4.2. Interoperability

Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is “the extent to which data are in appropriate language and units and the data definitions are clear” [47].

– Representational consistency: Representational consistency is “the extent to which data are always presented in the same format and are compatible with previous data” [47].

– Concise representation: Concise representation is “the extent to which data are compactly represented without being overwhelming” [47].

Discussion regarding interpretability. In contrast to the dimension Understandability, which focuses on the understandability of RDF KG data for the user as data consumer, Interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of one's own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,²⁵ (ii) RDF collections and RDF

²⁵In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree on that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG $g$ w.r.t. the dimension Interoperability is measured by the metrics $m_{Reif}$, $m_{iSerial}$, $m_{extVoc}$, and $m_{propVoc}$, which are defined as follows.

Avoiding blank nodes and RDF reification. The use of RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

\[
m_{Reif}(g) =
\begin{cases}
1 & \text{no blank nodes and no RDF reification}\\
0.5 & \text{either blank nodes or RDF reification}\\
0 & \text{otherwise}
\end{cases}
\]

Provisioning of several serialization formats. The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default in the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

\[
m_{iSerial}(h_g) =
\begin{cases}
1 & \text{RDF/XML and further formats are supported}\\
0.5 & \text{only RDF/XML is supported}\\
0 & \text{otherwise}
\end{cases}
\]

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows a comfortable data integration. We measure the criterion of using an external vocabulary by relating the number of triples with external vocabulary in predicate position to the number of all triples in the KG:

\[
m_{extVoc}(g) = \frac{|\{(s, p, o) \mid (s, p, o) \in g \wedge p \in P^{external}_g\}|}{|\{(s, p, o) \in g\}|}
\]

Interoperability of proprietary vocabulary. Linking on schema level means to link the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources:

\[
m_{propVoc}(g) = \frac{|\{x \in P_g \cup C_g \mid \exists (x, p, o) \in g\colon (p \in P_{eq} \wedge o \in U \wedge o \in U^{ext}_g)\}|}{|P_g \cup C_g|}
\]

where $P_{eq}$ = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and $U^{ext}_g$ consists of all URIs in $U_g$ which are external to the KG $g$, which means that $h_g$ is not responsible for resolving these URIs.

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility

Definition of dimension. Accessibility is “the extent to which data are available or easily and quickly retrievable” [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability “of a data source is the probability that a feasible query is correctly answered in a given time range” [39]. According to Naumann [39], the availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries), usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the time of day, the worldwide distribution of servers, the planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors, such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG $g$ w.r.t. the dimension Accessibility is measured by the metrics $m_{Deref}$, $m_{Avai}$, $m_{SPARQL}$, $m_{Export}$, $m_{Negot}$, $m_{HTMLRDF}$, and $m_{Meta}$, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here, all URIs $U_g$), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

\[
m_{Deref}(h_g) = \frac{|dereferenceable(U_g)|}{|U_g|}
\]

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query, mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.²⁶

\[
m_{Avai}(h_g) = \frac{\text{number of successful requests}}{\text{number of all requests}}
\]

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (including potentially many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions of this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query. However,

²⁶See http://pingdom.com, requested on Mar 1, 2016.

we do not measure these restrictions here.

\[
m_{SPARQL}(h_g) =
\begin{cases}
1 & \text{SPARQL endpoint publicly available}\\
0 & \text{otherwise}
\end{cases}
\]

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

\[
m_{Export}(h_g) =
\begin{cases}
1 & \text{RDF export available}\\
0 & \text{otherwise}
\end{cases}
\]

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents during the dereferencing of resources in the desired RDF serialization format. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to the fact that serialized RDF data is not processed further. An example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response.

\[
m_{Negot}(h_g) =
\begin{cases}
1 & \text{CN supported and correct content types returned}\\
0.5 & \text{CN supported, but wrong content types returned}\\
0 & \text{otherwise}
\end{cases}
\]

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.²⁷ We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

\[
m_{HTMLRDF}(h_g) =
\begin{cases}
1 & \text{Autodiscovery pattern used at least once}\\
0 & \text{otherwise}
\end{cases}
\]

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, also the meta-information about KGs needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary²⁸ [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also a meta-information about the KG) is considered in the data quality dimension License later on.

mMeta(g) =
    1   if machine-readable metadata about g is available
    0   otherwise

3.5.2. License

Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See namespace http://www.w3.org/TR/void/.
29 See http://creativecommons.org, requested on Mar 1, 2016.

contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 declares the respective data as public domain and without any restrictions.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric mmacLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format:

mmacLicense(g) =
    1   if machine-readable licensing information is available
    0   otherwise
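A naive way to test for machine-readable licensing information is to scan an N-Triples export for the licensing predicates named above (a sketch; a complete check would also parse a separate VoID file, and the example dump is illustrative):

```python
# Predicates commonly used for machine-readable licensing information
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",
    "http://purl.org/dc/terms/license",
    "http://purl.org/dc/terms/rights",
}

def m_mac_license(ntriples_lines):
    """Return 1 if any triple uses a licensing predicate, 0 otherwise."""
    for line in ntriples_lines:
        parts = line.split(None, 2)   # subject, predicate, rest
        if len(parts) == 3 and parts[1].strip("<>") in LICENSE_PREDICATES:
            return 1
    return 0

dump = ['<http://example.org/kg> <http://purl.org/dc/terms/license> '
        '<http://creativecommons.org/licenses/by-sa/4.0/> .']
print(m_mac_license(dump))  # → 1
```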

3.5.3. Interlinking

Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.
31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.
32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.
33 Using the namespace http://creativecommons.org/ns#.


linked to each other, be it within or between two or more data sources" [49].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is on the instance level usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin, the capital,35 (ii) Berlin, the state,36 and (iii) Berlin, the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics mInst and mURIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs:

34 See http://www.geonames.org, requested on Dec 31, 2016.
35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.
36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.
37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.
38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

mInst(g) = |{x ∈ Ig \ (Pg ∪ Cg) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ Ug^ext}| / |Ig \ (Pg ∪ Cg)|
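The metric can be sketched as follows (assuming the set of instances and the external namespace prefixes have been extracted from the KG beforehand; all URIs in the example are illustrative):

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def m_inst(instances, triples, external_prefixes):
    """Share of instances with at least one owl:sameAs link
    to an external KG (sketch of m_Inst)."""
    linked = {s for s, p, o in triples
              if p == OWL_SAME_AS and s in instances
              and o.startswith(tuple(external_prefixes))}
    return len(linked) / len(instances) if instances else 0.0

instances = {"http://dbpedia.org/resource/Berlin",
             "http://dbpedia.org/resource/Karlsruhe"}
triples = [("http://dbpedia.org/resource/Berlin", OWL_SAME_AS,
            "http://sws.geonames.org/2950159/")]
print(m_inst(instances, triples, ["http://sws.geonames.org/"]))  # → 0.5
```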

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs are not available anymore. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx):

mURIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g : p ∈ Peq ∧ x ∈ Ug \ (Cg ∪ Pg) ∧ x ∈ Ug^local ∧ y ∈ Ug^ext} and resolvable(x) returns true if HTTP status code 200 is returned. Peq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric evaluates to 1.
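A sketch of mURIs with a pluggable resolvable() function (the stubbed resolver in the example stands in for real HTTP requests; http_ok shows what a real resolver could look like, treating timeouts and 4xx/5xx responses as failures):

```python
def m_uris(external_uris, resolvable):
    """Share of external URIs answering with HTTP 200 (sketch of m_URIs)."""
    uris = list(external_uris)
    if not uris:
        return 1.0   # an empty sample set A evaluates to 1 by definition
    return sum(1 for u in uris if resolvable(u)) / len(uris)

def http_ok(uri, timeout=5):
    """Real resolver: True only if the URI answers with status 200."""
    import urllib.request
    req = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:    # timeout, client error (4xx), server error (5xx)
        return False

# offline example with a stubbed resolver
print(m_uris(["http://a.example/ok", "http://b.example/gone"],
             {"http://a.example/ok": True,
              "http://b.example/gone": False}.get))  # → 0.5
```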

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category

  ∗ Accuracy
      ∗ Syntactic validity of RDF documents
      ∗ Syntactic validity of literals
      ∗ Semantic validity of triples

  ∗ Trustworthiness
      ∗ Trustworthiness on KG level
      ∗ Trustworthiness on statement level
      ∗ Using unknown and empty values

  ∗ Consistency
      ∗ Check of schema restrictions during insertion of new statements
      ∗ Consistency of statements w.r.t. class constraints
      ∗ Consistency of statements w.r.t. relation constraints

– Contextual category

  ∗ Relevancy
      ∗ Creating a ranking of statements

  ∗ Completeness
      ∗ Schema completeness
      ∗ Column completeness
      ∗ Population completeness

  ∗ Timeliness
      ∗ Timeliness frequency of the KG
      ∗ Specification of the validity period of statements
      ∗ Specification of the modification date of statements

– Representational data quality

  ∗ Ease of understanding
      ∗ Description of resources
      ∗ Labels in multiple languages
      ∗ Understandable RDF serialization
      ∗ Self-describing URIs

  ∗ Interoperability
      ∗ Avoiding blank nodes and RDF reification
      ∗ Provisioning of several serialization formats
      ∗ Using external vocabulary
      ∗ Interoperability of proprietary vocabulary

– Accessibility category

  ∗ Accessibility
      ∗ Dereferencing possibility of resources
      ∗ Availability of the KG
      ∗ Provisioning of public SPARQL endpoint
      ∗ Provisioning of an RDF export
      ∗ Support of content negotiation
      ∗ Linking HTML sites to RDF serializations
      ∗ Provisioning of KG metadata

  ∗ License
      ∗ Provisioning machine-readable licensing information

  ∗ Interlinking
      ∗ Interlinking via owl:sameAs
      ∗ Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 UniProt,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.
40 There is also DBpedia live, which started in 2009 and which gets updated when Wikipedia is updated; see http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.
41 See http://umbel.org, requested on Dec 31, 2016.
42 See http://musicbrainz.org, requested on Dec 31, 2016.
43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.
44 See http://www.dblp.org, requested on Dec 31, 2016.
45 See https://www.gutenberg.org, requested on Dec 31, 2016.
46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.
47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.
48 See http://www.uniprot.org, requested on Dec 31, 2016.
49 See http://bio2rdf.org, requested on Dec 31, 2016.
50 See a complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase51 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,52 FMD,53 and MusicBrainz.54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common sense facts such as "every tree is a plant". The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.
52 See http://www.nndb.com, requested on Dec 31, 2016.
53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
54 See http://musicbrainz.org, requested on Dec 31, 2016.
55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
56 See http://www.cyc.com, requested on Dec 31, 2016.
57 See http://www.opencyc.org, accessed on Nov 1, 2016.
58 See http://researchcyc.com, requested on Dec 31, 2016.
59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymy), and GeoNames.61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2 we provide an overview of related work w.r.t. those key statistics.

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
61 See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples

Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG, with over 3.1B triples, while OpenCyc is the smallest KG, with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form; in particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefor.

Influence of reification on the number of triples.

DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identifiable. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
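The effect of n-ary relations on the triple count can be illustrated as follows (all URIs, the statement ID, and the qualifier values are hypothetical, merely mimicking Wikidata's modeling style as described above):

```python
# Hypothetical sketch: one fact ("Berlin has population 3,700,000,
# as of 2020") reified via an intermediate statement node.
wd = "http://www.wikidata.org/entity/"
p = "http://www.wikidata.org/prop/"
stmt = wd + "statement/Q64-abc123"           # the n-ary intermediate node

triples = [
    (wd + "Q64", p + "P1082", stmt),          # entity -> statement node
    (stmt, "rdf:type", "wdo:Statement"),      # node instantiated as Statement
    (stmt, p + "statement/P1082", "3700000"), # the actual value
    (stmt, p + "qualifier/P585", "2020"),     # point-in-time qualifier
]
# one user-visible fact, four triples: reification inflates triple counts
print(len(triples))  # → 4
```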

5.1.2. Classes

Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but uses instead only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does it come to this gap between DBpedia and YAGO with respect to the number of classes, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).

64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and "instance of" (wdt:P31) in case of Wikidata) on the instance level into account. However, this would result only in a lower bound estimation, as those classes which have no instances would not be considered.

Fig. 1. Coverage of classes having at least one instance.

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes, which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on instance level. Note, however, that in some scenarios solely the schema level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.
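The coverage measured in Fig. 1 can be sketched as follows (assuming the class set and the instance-class pairs were extracted from the KG's type statements beforehand; class and instance names are illustrative):

```python
def class_coverage(classes, instance_of_pairs):
    """Share of classes with at least one instance."""
    instantiated = {cls for _, cls in instance_of_pairs}
    return len(classes & instantiated) / len(classes)

classes = {"Person", "City", "Protein", "Galaxy"}
pairs = [("Berlin", "City"), ("Ada_Lovelace", "Person")]
print(class_coverage(classes, pairs))  # → 0.5
```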

Correlation between number of classes and number of instances. In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                   DB    FB    OC    WD    YA
Reach of method    88%   92%   81%   41%   82%

5.1.3. Domains

All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.66 If the ratio were at 100%, we would be able to assign all entities of a KG to the chosen domains.
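The reach computation can be sketched as follows (a minimal sketch; entity identifiers are illustrative, and the union over domains deliberately counts each entity only once, since entities may belong to several domains):

```python
def domain_reach(domain_entities, all_entities):
    """Ratio of unique entities covered by the chosen domains
    to all entities of the KG."""
    covered = set().union(*domain_entities.values())
    return len(covered & all_entities) / len(all_entities)

# "e2" is in two domains but must be counted only once
domains = {"people": {"e1", "e2"}, "media": {"e2", "e3"}}
print(domain_reach(domains, {"e1", "e2", "e3", "e4", "e5"}))  # → 0.6
```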

Fig. 3 shows the number of entities per domain in the different KGs, with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).
66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.

Fig. 2. Distribution of classes w.r.t. the number of instances per KG.

Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% would mean that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.

Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organization. This is relatively few, considering that the total amount of entities is around 18.7M and considering the number of organizations in other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has not so many organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6000 times,67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organization belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates

Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as short term for explicitly defined relations – refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2 we used Pg to denote this set.

– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as Pg^imp, is nothing else than the set of unique RDF terms in the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on schema level but not used on instance level.

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.
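The distinction between Pg and Pg^imp can be sketched as follows (a minimal sketch over an in-memory triple list; the example property names are illustrative):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"

def relations_and_predicates(triples):
    """P_g: terms explicitly declared as properties on schema level;
    P_g^imp: every term appearing in predicate position."""
    explicit = {s for s, p, o in triples
                if p == RDF_TYPE and o == RDF_PROPERTY}
    used = {p for _, p, _ in triples}
    return explicit, used

triples = [
    ("ex:spouse", RDF_TYPE, RDF_PROPERTY),     # declared, but never used
    ("ex:Berlin", "ex:population", "3700000"), # used, but never declared
]
rel, pred = relations_and_predicates(triples)
print(len(rel), len(pred))  # → 1 2
```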

Evaluation results.

Relations.

Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments; therefore, they are generally of lower quality. We count 58776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |Pg| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned69 and may overlap until DBpedia version 2016-04.70

Freebase. The high number of Freebase relations can be explained by two facts: (1) About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. (2) Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69 For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70 For instance, dbp:alias and dbo:alias.

criteria are met.71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process can be mentioned as a likely reason why in Wikidata, in relative terms, more relations are actually used than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations. The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.

2. Granularity of relations. Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.

3. Date specification. The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards, so that no multiple relations are needed.

4. Inverse relations. YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.

5. Reification. YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations: Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In the case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 25

Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG. (Bar chart; y-axis: relative occurrences in percent.)

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In the case of Freebase, only 5% of the relations are used more than 500 times and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can again mention the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.
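The grouping underlying Fig. 5 can be sketched as follows. The triples and the set of defined relations below are hypothetical toy data, not taken from any of the KGs:

```python
from collections import Counter

# Hypothetical sample of (subject, predicate, object) triples.
triples = [
    ("dbr:Hamburg", "dbo:country", "dbr:Germany"),
    ("dbr:Hamburg", "dbo:populationTotal", "1787408"),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
]
# All relations defined in the (hypothetical) ontology, used or not.
defined_relations = {"dbo:country", "dbo:populationTotal", "dbo:birthDate"}

# Count how often each relation actually occurs on the instance level.
usage = Counter(p for _, p, _ in triples)

def bucket(count):
    """Group a usage count into the three classes of Fig. 5."""
    if count == 0:
        return "0"
    return "1-500" if count <= 500 else ">500"

buckets = Counter(bucket(usage.get(r, 0)) for r in defined_relations)
shares = {b: 100 * n / len(defined_relations) for b, n in buckets.items()}
# Here: dbo:birthDate falls into bucket "0", the other two into "1-500".
```

Run over the full relation and triple sets of a KG, this yields the per-KG percentages plotted in the figure.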

Predicates
Ranking regarding predicates: Freebase is here – as in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows:

DBpedia: DBpedia is ranked third in terms of the absolute number of predicates; about 60K predicates are used in DBpedia. The set of relations and the set of predicates varies considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04 the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase: We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once. This relativizes the high number. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc: In contrast to the 18,028 unique relations, we measure only 164 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata: We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) via an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows referring to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].
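This expansion of a plain statement into reified triples can be sketched as follows. The counter-based statement node IDs are an assumption for illustration only; Wikidata derives its statement node identifiers differently:

```python
import itertools

counter = itertools.count(1)

def reify(subject, prop, value):
    """Expand a plain statement into Wikidata-style reified triples.

    The statement node ID is invented here (sequential counter);
    real Wikidata uses opaque statement identifiers.
    """
    stmt = f"{subject}S{next(counter)}"   # intermediate statement node
    return [
        (subject, f"{prop}s", stmt),      # "s": object is a statement node
        (stmt, f"{prop}v", value),        # "v": the actual value
    ]

triples = reify("wdt:Q76", "wdt:P31", "wdt:Q5")
# [('wdt:Q76', 'wdt:P31s', 'wdt:Q76S1'),
#  ('wdt:Q76S1', 'wdt:P31v', 'wdt:Q5')]
```

Each plain statement thus yields two triples and one additional instance (the statement node), which is also why the number of predicates and instances grows with this modeling style.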

YAGO: YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,72 while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method: We distinguish between instances I_g and entities E_g of a KG (cf. Section 2).

1. Instances belong to classes. They are identified by retrieving the subjects of all triples where the predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG. (Bar chart; y-axis: number of instances, logarithmic scale from 10^0 to 10^9.)

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata, instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.73 In this way, abstract classes such as cych:ExistingObjectType are neglected.
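The two definitions can be sketched on toy triples, here using the DBpedia/YAGO rule (entities are the instances of owl:Thing); the triples themselves are invented for illustration:

```python
# Hypothetical toy triples: (subject, predicate, object).
triples = [
    ("dbr:Hamburg", "rdf:type", "owl:Thing"),
    ("dbr:Hamburg", "rdf:type", "dbo:City"),
    ("dbo:City", "rdf:type", "owl:Class"),
]

# Instances I_g: subjects of class-affiliation triples.
instances = {s for s, p, o in triples if p == "rdf:type"}

# Entities E_g (DBpedia/YAGO rule): instances of owl:Thing.
entities = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:Thing"}

# dbo:City is an instance (of owl:Class) but not an entity.
```

For Freebase or Wikidata, the membership test would use freebase:common.topic or wdo:Item instead of owl:Thing.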

Ranking wrt the number of instances: Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking wrt the number of entities: Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom with only about 41K entities.

Differences in the number of entities: The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums by Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In the case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered – such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") of Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG. (Bar chart; y-axis: average number of entities, logarithmic scale from 10^0 to 10^4.)

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class: Fig. 7 shows the average number of entities per class, which can be written as |E_g| / |C_g|. Obvious is the difference between DBpedia and YAGO (despite the similar number of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually), while it is large in YAGO (as it is created automatically).

Comparing the number of instances with the number of entities: Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities, but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG. (Bar chart; y-axis from 0 to 8.)

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).
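Using the instance and entity counts from Table 2, the ratios plotted in Fig. 8 can be recomputed directly:

```python
# Key statistics from Table 2: (|I_g| instances, |E_g| entities) per KG.
counts = {
    "DBpedia":  (20_764_283, 4_298_433),
    "Freebase": (115_880_761, 49_947_799),
    "OpenCyc":  (242_383, 41_029),
    "Wikidata": (142_213_806, 18_697_897),
    "YAGO":     (12_291_250, 5_130_031),
}

# Ratio |I_g| / |E_g| for each KG.
ratios = {kg: i / e for kg, (i, e) in counts.items()}
# Wikidata shows the highest ratio (about 7.6), OpenCyc about 5.9,
# DBpedia about 4.8, YAGO and Freebase about 2.4 and 2.3.
```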

5.1.6. Subjects and Objects
Evaluation method: The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) in the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources in the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementarily, the number of literals is given as O_g^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.

Ranking of KGs regarding the number of unique subjects: The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects: The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects: The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 9). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                                DBpedia      Freebase       OpenCyc    Wikidata     YAGO
Number of triples |{(s, p, o) ∈ g}|             411,885,960  3,124,791,156  2,412,520  748,530,833  1,001,461,792
Number of classes |C_g|                         736          53,092         116,822    302,280      569,751
Number of relations |P_g|                       2,819        70,902         18,028     1,874        106
No. of unique predicates |P_g^imp|              60,231       784,977        165        4,839        88,736
Number of entities |E_g|                        4,298,433    49,947,799     41,029     18,697,897   5,130,031
Number of instances |I_g|                       20,764,283   115,880,761    242,383    142,213,806  12,291,250
Avg. number of entities per class |E_g|/|C_g|   5,840.3      940.8          0.35       61.9         9.0
No. of unique subjects |S_g|                    31,391,413   125,144,313    261,097    142,278,154  331,806,927
No. of unique non-literals in obj. pos. |O_g|   83,284,634   189,466,866    423,432    101,745,685  17,438,196
No. of unique literals in obj. pos. |O_g^lit|   161,398,382  1,782,723,759  1,081,818  308,144,682  682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO. Facts are stored as N-Quads in order to allow making statements about statements (for instance, storing the provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used in the first position. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to comply with the N-Triples format. However, the statements about statements are also transformed into triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics
Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of the number of triples, while OpenCyc is the smallest. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge are lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also the way of modeling data has a great impact on the number of triples: for instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage differs considerably among the KGs. Which domains are well represented heavily depends on which datasets have been integrated into the KGs. MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as the data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations are used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy
The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3
Evaluation results for the KGs regarding the dimension Accuracy

               DB     FB     OC     WD     YA
m_synRDF       1      1      1      1      1
m_synLit       0.99   1      1      1      0.62
m_semTriple    0.99   <1     1      0.99   0.99

Syntactic validity of RDF documents (m_synRDF)
Evaluation method: For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, RDF/XML serializations of the resource are available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.77

Evaluation result: All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit)
Evaluation method: We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains – namely people, cities, and books – and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://w3.org/RDF/Validator, requested on Mar 2, 2016.

77 See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.78 If no data type is provided or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.
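Such manually created regular expressions for untyped literals might look as follows; the patterns are our own illustrative approximations, not the exact expressions used in the evaluation:

```python
import re

# Approximate xsd:date shape (optional minus for years BC).
DATE = re.compile(r'^-?\d{4}-\d{2}-\d{2}$')
# Inhabitant counts: only digits 0-9, periods, and commas.
INHABITANTS = re.compile(r'^[0-9.,]+$')

assert DATE.match("1940-05-17")
assert not DATE.match("470-##-##")        # YAGO wildcard date fails
assert INHABITANTS.match("1,787,408")
assert not INHABITANTS.match("ca. 1.8 million")
```

A literal is counted as syntactically valid if its value matches the pattern for the respective relation.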

Evaluation results: All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We made the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available; out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN; seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of the data, so that the comments were either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
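A simplified syntactic ISBN check in the spirit of this evaluation (form only, no checksum validation; deliberately not Gupta's exact regular expression) could look like this:

```python
import re

def looks_like_isbn(value):
    """Heuristic syntactic check for ISBN-10/ISBN-13 strings."""
    # Strip an optional "ISBN"/"ISBN-10:"/"ISBN-13:" prefix.
    v = re.sub(r'^ISBN(?:-1[03])?:?\s*', '', value.strip())
    # Remove hyphen/space delimiters.
    digits = re.sub(r'[-\s]', '', v)
    if re.fullmatch(r'\d{9}[\dX]', digits):              # ISBN-10
        return True
    return bool(re.fullmatch(r'97[89]\d{10}', digits))   # ISBN-13

assert looks_like_isbn("ISBN 0755111974")
assert looks_like_isbn("978-0-307-98693-1")
assert not looks_like_isbn("9789780307986931")   # 16 digits, as found in Freebase
assert not looks_like_isbn("ISBN 0755111974 (hardcover edition)")
```

The last case mirrors the DBpedia extraction problem described above: a Wikipedia infobox comment fused with the actual ISBN makes the whole string syntactically invalid.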

Semantic validity of triples (m_semTriple)
Evaluation method: The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file concerning especially persons and corporate bodies and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities wrt the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result: We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as the death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.
84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.
85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3 errors each, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is in those cases hard to perform.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete wrt entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of only a year.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possibly wrong values within the interval are not detected.

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness
The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)
Evaluation method: Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted average of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4
Evaluation results for the KGs regarding the dimension Trustworthiness

            DB     FB     OC     WD     YA
m_graph     0.5    0.5    1      0.75   0.25
m_fact      0.5    1      0      1      1
m_NoVal     0      1      0      1      0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results: The KGs differ considerably wrt this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO wrt community involvement: Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level
We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of a rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times larger than the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference do not count as sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVT), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.
93 This is the number of instances of wdo:Statement.
94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5
Evaluation results for the KGs regarding the dimension Consistency

               DB     FB     OC     WD     YA
m_checkRestr   0      1      0      1      0
m_conClass     0.88   1      <1     1      0.33
m_conRelat     0.99   0.45   1      0.50   0.99

Indicating unknown and empty values (m_NoVal)
This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases.95 Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.
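That such wildcard strings are not valid date literals can be verified mechanically; a small sketch using Python's standard library (the exact wildcard pattern is assumed from the year-only example above):

```python
from datetime import date

def is_valid_xsd_date(literal: str) -> bool:
    """Check whether a string is a syntactically valid xsd:date of the form YYYY-MM-DD."""
    try:
        date.fromisoformat(literal)
        return True
    except ValueError:
        return False

print(is_valid_xsd_date("1940-01-01"))  # True
print(is_valid_xsd_date("1940-##-##"))  # False -- YAGO-style wildcard, not a valid date
```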

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method: For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant owl:disjointWith dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and as dbo:Animal.

95 E.g., freebase:freebase.valuenotation.has_no_value.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO
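The constraint check described above can be sketched as follows (toy data: the disjointness pair is the dbo:Plant/dbo:Animal example from the text, while the instance IDs are hypothetical):

```python
from collections import defaultdict

def check_disjointness(type_assertions, disjoint_pairs):
    """Return the owl:disjointWith pairs violated by some doubly-typed resource."""
    classes_of = defaultdict(set)
    for resource, cls in type_assertions:
        classes_of[resource].add(cls)
    return {(a, b) for a, b in disjoint_pairs
            if any({a, b} <= classes for classes in classes_of.values())}

types = [("ex:x1", "dbo:Plant"), ("ex:x1", "dbo:Animal"),  # ex:x1 violates the constraint
         ("ex:x2", "dbo:Plant")]
pairs = [("dbo:Plant", "dbo:Animal"), ("dbo:Agent", "dbo:Place")]

violated = check_disjointness(types, pairs)
m_conClass = 1 - len(violated) / len(pairs)  # fraction of consistent constraints
print(violated)     # {('dbo:Plant', 'dbo:Animal')}
print(m_conClass)   # 0.5
```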

Evaluation results: We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method: Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistency checks regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.
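As an illustration of the owl:FunctionalProperty part of this check, a minimal sketch (the triples are hypothetical; a functional property must have at most one value per subject):

```python
from collections import Counter

def functional_violations(triples, functional_props):
    """Return (subject, property) pairs where a functional property has more than one value."""
    usage = Counter((s, p) for s, p, _ in triples if p in functional_props)
    return sorted(pair for pair, count in usage.items() if count > 1)

triples = [
    ("dbr:Alice", "dbo:birthDate", "1980-01-01"),
    ("dbr:Alice", "dbo:birthDate", "1981-05-05"),  # second value -> violation
    ("dbr:Bob",   "dbo:birthDate", "1975-03-02"),
]
print(functional_violations(triples, {"dbo:birthDate"}))
# [('dbr:Alice', 'dbo:birthDate')]
```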

Evaluation results: In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy

             DB    FB    OC    WD    YA
m_Ranking    0     1     0     1     0

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relation usages, the data type xsd:gYear is used, though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction by setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7. Evaluation results for the KGs regarding the dimension Completeness

                  DB     FB     OC     WD     YA
m_cSchema         0.91   0.76   0.92   1      0.95
m_cColumn         0.40   0.43   0      0.29   0.33
m_cPop            0.93   0.94   0.48   0.99   0.89
m_cPop (short)    1      1      0.82   1      0.90
m_cPop (long)     0.86   0.88   0.14   0.98   0.88
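A consumer applying the Wikidata rank semantics described above might select the statements to display roughly as follows (a sketch; the population values are made up):

```python
RANK_ORDER = {"wdo:PreferredRank": 2, "wdo:NormalRank": 1, "wdo:DeprecatedRank": 0}

def best_statements(statements):
    """Select displayable values: preferred beats normal; deprecated is never shown."""
    eligible = [(value, rank) for value, rank in statements
                if rank != "wdo:DeprecatedRank"]
    if not eligible:
        return []
    top = max(RANK_ORDER[rank] for _, rank in eligible)
    return [value for value, rank in eligible if RANK_ORDER[rank] == top]

population = [("600000", "wdo:NormalRank"),
              ("631000", "wdo:PreferredRank"),   # most up-to-date value
              ("100",    "wdo:DeprecatedRank")]  # known-wrong value
print(best_statements(population))  # ['631000']
```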

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method: Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.
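The metric itself reduces to a set overlap; a minimal sketch (the class and relation names are hypothetical placeholders, not the actual gold standard):

```python
def schema_completeness(gold_schema_terms, kg_schema_terms):
    """Fraction of gold-standard classes/relations that appear in the KG schema."""
    gold = set(gold_schema_terms)
    return len(gold & set(kg_schema_terms)) / len(gold)

gold = {"Person", "City", "Book", "birthDate", "author"}       # hypothetical gold standard
kg   = {"Person", "City", "Book", "birthDate", "population"}   # hypothetical KG schema
print(schema_completeness(gold, kg))  # 0.8
```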

Evaluation results: Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness; its schema is mainly limited due to the characteristics of how information is stored and extracted from Wikipedia.

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class "tree" but the class "ginkgo," which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are covered considerably well in the DBpedia ontology. Some missing relations or modeling failures are due to the characteristics of Wikipedia infoboxes. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and is not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete (sportsmen) are logically subclasses of the class freebase:people.person (people), but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthily, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as "tree"100 and "ginkgo."101 The ginkgo tree is not classified as a tree but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.

101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes quite a high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes, using the domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness (m_cColumn)

Evaluation method: For evaluating the KGs w.r.t. Column completeness, for each KG, 25 class-relation-combinations102 were created based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8. Metric values of m_cCol for single class-relation-pairs

Relation            DB     FB     OC     WD     YA
Person–birthdate    0.48   0.48   0      0.70   0.77
Person–sex          –      0.57   0      0.94   0.64
Book–author         0.91   0.93   0      0.82   0.28
Book–ISBN           0.73   0.63   –      0.18   0.01
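Column completeness for one class-relation pair can be sketched as follows (the instance IDs are hypothetical):

```python
def column_completeness(class_instances, subjects_with_relation):
    """Fraction of instances of a class that have at least one value for the relation."""
    instances = set(class_instances)
    if not instances:
        return 0.0
    return len(instances & set(subjects_with_relation)) / len(instances)

persons       = {"ex:p1", "ex:p2", "ex:p3", "ex:p4"}
has_birthdate = {"ex:p1", "ex:p3", "ex:q9"}  # ex:q9 is not a person and is ignored
print(column_completeness(persons, has_birthdate))  # 0.5
```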

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on the instance level, while the remaining pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs:

DBpedia: DBpedia fails regarding the relation "sex" for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25 (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also, the coverage of ISBNs for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation-pairs depended on which class-relation-pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation-pairs were used if 25 pairs were not available in the respective KG.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness (m_cPop)

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called "short head") and two rather unknown entities (called "long tail") for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
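Given such a gold standard, the metric is again a containment ratio, computed separately for short-head and long-tail entities (the entity names here are illustrative, not the published gold standard):

```python
def population_completeness(gold_entities, kg_entities):
    """Fraction of gold-standard entities found in the KG."""
    return sum(entity in kg_entities for entity in gold_entities) / len(gold_entities)

kg         = {"Albert Einstein", "Mount Everest", "Maria Höfl-Riesch"}  # toy KG population
short_head = ["Albert Einstein", "Mount Everest"]            # well-known entities
long_tail  = ["Maria Höfl-Riesch", "Some Obscure Athlete"]   # rather unknown entities

print(population_completeness(short_head, kg))               # 1.0
print(population_completeness(long_tail, kg))                # 0.5
print(population_completeness(short_head + long_tail, kg))   # 0.75
```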

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that those Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: a Wikidata entry is added as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measured that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Figure: grouped bar chart with one group per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO) and one bar per domain (People, Media, Organizations, Geography, Biology); y-axis from 0 to 1.]

Fig. 10. Population completeness regarding the different domains per KG.

Table 9. Evaluation results for the KGs regarding the dimension Timeliness

              DB    FB    OC     WD    YA
m_Freq        0.5   0     0.25   1     0.25
m_Validity    0     1     0      1     1
m_Change      0     1     0      0     0

Timeliness frequency of the KG (m_Freq)

Evaluation results: The KGs are very diverse regarding the frequency with which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once or twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions were published.107 Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its shutdown and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc, and OpenCyc respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date for the next release has not been published.

Specification of the validity period of statements (m_Validity)

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily realized.

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

Table 10. Evaluation results for the KGs regarding the dimension Ease of understanding

           DB     FB     OC    WD    YA
m_Descr    0.70   0.97   1     <1    1
m_Lang     1      1      0     1     1
m_uSer     1      1      0     1     1
m_uURI     1      0.5    1     0     1

DBpedia and OpenCyc do not provide any such specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations of higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997."

Specification of the modification date of statements (m_Change)

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the other criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (m_Descr)

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dcelements:description, while Freebase provides freebase:common.topic.description.112

Evaluation results: For all KGs, the rule applies that if no label is available, usually no description is available either. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations of higher arity are modeled by means of intermediate nodes, which have no labels.114

Labels in multiple languages (m_Lang)

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "@de" for literals in German.
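Counting language-annotated labels could be sketched as follows (a toy label set; a real evaluation would read the labels from the RDF dump of each KG):

```python
from collections import Counter

labels = [  # (entity, label, language tag) -- illustrative toy data
    ("wdt:Q5",  "human",       "en"),
    ("wdt:Q5",  "Mensch",      "de"),
    ("wdt:Q5",  "être humain", "fr"),
    ("wdt:Q64", "Berlin",      "en"),
]

languages = Counter(lang for _, _, lang in labels)
entities = {entity for entity, _, _ in labels}
english_coverage = len({e for e, _, lang in labels if lang == "en"}) / len(entities)

print(len(languages))      # 3 distinct languages
print(english_coverage)    # 1.0 -- every entity has an English label
```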

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:CareerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.

Understandable RDF serialization (m_uSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (m_uURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia articles; the mapping to the English Wikipedia is thus trivial. In the case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116

5.2.8. Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (m_Reif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In the case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta-information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11. Evaluation results for the KGs regarding the dimension Interoperability

             DB     FB     OC     WD     YA
m_Reif       0.5    0.5    0.5    0      0.5
m_iSerial    1      0      0.5    1      1
m_extVoc     0.61   0.11   0.41   0.68   0.13
m_propVoc    0.15   0      0.51   >0     0

Blank nodes are non-dereferenceable anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (m_iSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in the Turtle format.

Using external vocabulary (m_extVoc)

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG, we divide the number of occurrences of triples with external relations by the number of all relations in this KG.
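One plausible reading of this ratio, computed here on the triple level (the prefix lists and triples are hypothetical):

```python
def external_vocab_ratio(triples, internal_prefixes):
    """Fraction of triples whose predicate does not belong to the KG's own namespaces."""
    internal = tuple(internal_prefixes)
    external = sum(1 for _, predicate, _ in triples
                   if not predicate.startswith(internal))
    return external / len(triples)

triples = [
    ("dbr:Hamburg", "rdf:type",    "dbo:City"),     # external vocabulary (RDF)
    ("dbr:Hamburg", "dbo:country", "dbr:Germany"),  # proprietary vocabulary
    ("dbr:Hamburg", "rdfs:label",  "Hamburg"),      # external vocabulary (RDFS)
    ("dbr:Hamburg", "owl:sameAs",  "wdt:Q1055"),    # external vocabulary (OWL)
]
print(external_vocab_ratio(triples, ["dbo:", "dbp:", "dbr:"]))  # 0.75
```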

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for this: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are used for instantiations of statements, i.e., for reification.

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.

Interoperability of proprietary vocabulary (m_propVoc)

Evaluation method: This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. We made the following individual findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on the instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of all Wikidata classes are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked [121].

[118] OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL primer states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

[119] See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

[120] E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

           DB      FB      OC   WD      YA
mDeref     1       0.44    1    0.41    1
mAvai      <1      <1      1    <1      0.73
mSPARQL    1       0       0    1       1
mExport    1       1       1    1       1
mNegot     0.5     0       0    1       1
mHTMLRDF   1       1       0    1       1
mMeta      1       0       1    0       0

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby. However, as YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (mDeref)

Evaluation method. We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 1.5K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
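Such a dereferencing check can be sketched as follows. This is a hedged reconstruction of our own; the helper names are hypothetical, and the actual experiments may have differed in details such as redirect handling:

```python
import urllib.request

RDF_MIME = "application/rdf+xml"

def build_request(uri):
    # Ask for RDF via the HTTP Accept header to trigger content negotiation
    return urllib.request.Request(uri, headers={"Accept": RDF_MIME})

def dereferenced_ok(status, content_type):
    # A URI counts as successfully dereferenced if the server answers 200
    # with an RDF media type (parameters such as charset are ignored)
    return status == 200 and content_type.split(";")[0].strip() == RDF_MIME

req = build_request("http://dbpedia.org/resource/Hamburg")
print(req.get_header("Accept"))  # application/rdf+xml
# resp = urllib.request.urlopen(req)   # actual network call, not executed here
# print(dereferenced_ok(resp.status, resp.headers.get("Content-Type", "")))
```
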

Evaluation results. In the case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfill this criterion completely. For DBpedia, 4.5K URIs were analyzed; for OpenCyc, only around 3K due to the small number of unique predicates. We observed almost the same picture for YAGO, namely no notable errors during dereferencing.

[121] Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 3.5K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, nearly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary; in our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (mAvai)

Evaluation method. We measured the availability of the officially hosted KGs with the monitoring service Pingdom [122]. For each KG, an uptime test was set up which checked the availability of the resource Hamburg, as representative resource for successful URI resolving (i.e., returning the status code HTTP 200), every minute over the time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result. While the other KGs showed almost no outages and were on average online again after some minutes, YAGO outages took place frequently and lasted on average 3.5 hours [123]. In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.
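The resulting availability score is simply the share of successful minute-wise checks. A small illustration (the outage numbers below are made up for the example, not the measured values):

```python
def availability(outage_minutes, days=60):
    """Fraction of one-minute uptime checks that succeeded over `days`."""
    total_checks = days * 24 * 60
    return 1 - outage_minutes / total_checks

print(availability(0))                       # 1.0 (no outages at all)
# e.g. 100 hypothetical outages of 3.5 hours each within the 60-day window:
print(round(availability(100 * 210), 3))     # 0.757
```
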

Availability of a public SPARQL endpoint (mSPARQL)

The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server [124]; the Wikidata SPARQL endpoint runs on Blazegraph [125]. Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

[122] See https://www.pingdom.com, requested Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

[123] See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions. The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in the case of large result sets with more than 1.5M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (mExport)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (mNegot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase currently does not provide any content negotiation, and only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (mHTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate" type="[content type]" href="[URL]"> in the HTML header.

[124] See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

[125] See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13
Evaluation results for the KGs regarding the dimension License

             DB  FB  OC  WD  YA
mmacLicense  1   0   0   1   0

Provisioning of metadata about the KG (mMeta)

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file [126]. DBpedia integrates the VoID vocabulary directly in its KG [127] and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (mmacLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA [128] and the GNU Free Documentation License (GNU FDL) [129]. Wikidata embeds licensing information during the dereferencing of resources in the RDF document by linking with cc:license to the license CC0 [130]. YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC-BY [131]. OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form [132].

[126] See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

[127] See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

[128] See http://creativecommons.org/licenses/by-sa/3.0, requested on Feb 4, 2017.

[129] See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

[130] See http://creativecommons.org/publicdomain/zero/1.0, requested on Feb 4, 2017.

[131] See http://creativecommons.org/licenses/by/3.0, requested on Feb 4, 2017.

[132] License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

        DB     FB     OC     WD       YA
mInst   0.25   0      0.38   0 (0.9)  0.31
mURIs   0.93   0.91   0.89   0.96     0.96

5.2.11. Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (mInst)

Evaluation method. Given all owl:sameAs triples in each KG, we queried all those subjects which are instances but neither classes nor relations [133] and where the resource in the object position of the triple is an external source, i.e., does not belong to the namespace of the KG.
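A simplified sketch of this instance-level measurement (data structures and names are our assumptions, not the paper's code):

```python
def sameas_fraction(instances, triples, internal_namespaces):
    """Share of instances with at least one owl:sameAs link whose
    target lies outside the KG's own namespaces."""
    linked = {
        s for s, p, o in triples
        if p == "owl:sameAs"
        and not any(o.startswith(ns) for ns in internal_namespaces)
    }
    return len(linked & set(instances)) / len(instances) if instances else 0.0

instances = ["dbr:Hamburg", "dbr:Berlin", "dbr:Bremen", "dbr:Kiel"]
triples = [
    ("dbr:Hamburg", "owl:sameAs", "geonames:2911298"),
    ("dbr:Berlin", "owl:sameAs", "de.dbpedia:Berlin"),  # localized edition: internal
]
print(sameas_fraction(instances, triples, ("dbr:", "de.dbpedia:")))  # 0.25
```
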

Evaluation result. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided nor is a corresponding proprietary relation available. Instead, Wikidata uses a proprietary relation (called identifier) for each linked data set to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we viewed each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 8.6% of all instances. If we consider only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents but sometimes HTML web pages. Therefore, we cannot simply subsume all identifiers (equivalence statements) under owl:sameAs.

[133] The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc [134], the commercial version of OpenCyc, were considered as internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances having at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs [135].

Validity of external URIs (mURIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method. External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
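The per-KG score can then be derived from the collected responses, e.g. as follows (a sketch with hypothetical helper names; timeouts are encoded as None):

```python
def uri_validity_score(outcomes):
    """Share of external links that resolve without error.
    `outcomes` maps a URI to its HTTP status code, or None for a timeout."""
    ok = sum(
        1 for status in outcomes.values()
        if status is not None and status < 400   # 4xx/5xx and timeouts are errors
    )
    return ok / len(outcomes) if outcomes else 0.0

outcomes = {
    "https://www.wikipedia.org/": 200,
    "http://wikicompany.org/": None,     # timeout: domain no longer reachable
    "http://example.org/gone": 404,
}
print(round(uri_validity_score(outcomes), 2))  # 0.33
```
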

Evaluation result. The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources, namely to wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation reference URL (wdt:P854), which states provenance information, belongs, among other relations, to the links linking to external Web resources. Here, we were able to resolve around 95.5% without errors.

[134] I.e., sw.cyc.com.

[135] See Interoperability of proprietary vocabulary in Section 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore [136]. One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents. All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals. In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions, which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBN values, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN values) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples. All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding their correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level. Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level. Here, especially good values are achieved by Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one-third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

[136] E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values. Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements. Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints. Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints. The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements. Only Wikidata supports a ranking of statements. This is particularly worthwhile in the case of statements whose validity is limited in time.

11. Schema completeness. Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard existed in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness. DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness. Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG. Wikidata is the only KG achieving the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements. In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements. Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources. YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that: by means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them [137].

18. Labels in multiple languages. YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization. DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

[137] An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs. We can find mixed paradigms regarding the URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (in which, in part, classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification. DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats. Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data solely in the serialization format RDF/Turtle.

23. Using external vocabulary. DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary. We obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. As a reason, we can mention the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources. Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferenceable at all, as well as blank nodes. For Freebase, we measured a quite considerable number of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large number of requests.

26. Availability of the KG. While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, taking on average 3.5 hours.

27. Provisioning of a public SPARQL endpoint. DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export. RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation. DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations. All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata. Only DBpedia and OpenCyc integrate metadata about the KG in some form. DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information. Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on the resource interlinkage, DBpedia is justifiably called a Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs. The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
- Identifying the preselection criteria P
- Assigning a weight wi to each DQ criterion ci ∈ C

Step 2: Preselection based on the Preselection Criteria
- Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
- Calculating the DQ metric mi(g) for each DQ criterion ci ∈ C
- Calculating the fulfillment degree h(g) for each KG g ∈ G_P
- Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
- Assessing the selected KG g w.r.t. qualitative aspects
- Comparing the selected KG g with other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion; the license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs which do not fulfill the preselection criteria are discarded. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics), and, if necessary, an alternative KG can be selected for the given scenario.
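For Step 3, the fulfillment degree can be sketched as a weighted average of the DQ metric values, consistent with the "Weighted Average" row of Table 15. The function name and the sample values below are ours, not the paper's:

```python
def fulfillment_degree(metric_values, weights):
    """h(g) = sum_i(w_i * m_i(g)) / sum_i(w_i): weighted average
    of the DQ metric values m_i(g) under the user-chosen weights w_i."""
    total_weight = sum(weights.values())
    return sum(weights[c] * metric_values[c] for c in weights) / total_weight

# Illustrative values only (a tiny subset of the criteria in Table 15)
scores = {"mFreq": 1.0, "mcPop": 0.99, "mDeref": 0.414}
weights = {"mFreq": 3, "mcPop": 3, "mDeref": 2}
print(round(fulfillment_degree(scores, weights), 3))  # 0.85
```
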

Use case application. In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case. The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. To be able to integrate the musicians' information into the articles and to enable such linking, editors shall tag each article based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and that the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried [138].

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative Assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

[138] We assume that in this use case the dereferencing of HTTP URIs rather than the execution of SPARQL queries is desired.


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension               Metric        DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example of User Weighting wi
Accuracy                msynRDF       1        1         1        1         1       1
                        msynLit       0.994    1         1        1         0.624   1
                        msemTriple    0.990    0.995     1        0.993     0.993   1
Trustworthiness         mgraph        0.5      0.5       1        0.75      0.25    0
                        mfact         0.5      1         0        1         1       1
                        mNoVal        0        1         0        1         0       0
Consistency             mcheckRestr   0        1         0        1         0       0
                        mconClass     0.875    1         0.999    1         0.333   0
                        mconRelat     0.992    0.451     1        0.500     0.992   0
Relevancy               mRanking      0        0         0        1         0       1
Completeness            mcSchema      0.905    0.762     0.921    1         0.952   1
                        mcCol         0.402    0.425     0        0.285     0.332   2
                        mcPop         0.93     0.94      0.48     0.99      0.89    3
Timeliness              mFreq         0.5      0         0.25     1         0.25    3
                        mValidity     0        1         0        1         1       0
                        mChange       0        1         0        0         0       0
Ease of understanding   mDescr        0.704    0.972     1        0.9999    1       1
                        mLang         1        1         0        1         1       0
                        muSer         1        1         0        1         1       0
                        muURI         1        0.5       1        0         1       1
Interoperability        mReif         0.5      0.5       0.5      0         0.5     0
                        miSerial      1        0         0.5      1         1       1
                        mextVoc       0.61     0.108     0.415    0.682     0.134   1
                        mpropVoc      0.150    0         0.513    0.001     0       1
Accessibility           mDeref        1        0.437     1        0.414     1       2
                        mAvai         0.9961   0.9998    1        0.9999    0.7306  2
                        mSPARQL       1        0         0        1         1       1
                        mExport       1        1         1        1         1       0
                        mNegot        0.5      0         0        1         1       0
                        mHTMLRDF      1        1         0        1         1       0
                        mMeta         1        0         1        0         0       0
Licensing               mmacLicense   1        0         0        1         0       0
Interlinking            mInst         0.251    0         0.382    0         0.310   3
                        mURIs         0.929    0.908     0.894    0.957     0.956   1

Unweighted Average                    0.683    0.603     0.496    0.752     0.625
Weighted Average                      0.701    0.493     0.556    0.714     0.648


4. Qualitative Assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately.

The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions and extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (mgraph), Indicating unknown and empty values (mNoVal), Check of schema restrictions during insertion of new statements (mcheckRestr), Creating a ranking of statements (mRanking), Timeliness frequency of the KG (mFreq), Specification of the validity period of statements (mValidity), and Availability of the KG (mAvai), have not been proposed so far to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (mDescr) and Column completeness (mcCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of public SPARQL endpoint (mSPARQL) and Provisioning of an RDF export (mExport). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (mpropVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]

msynRDF X X

msynLit X X X X

msemTriple X X X X

mfact X X

mconClass X X X

mconRelat X X X X X X

mcSchema X X

mcCol X X X X

mcPop X X

mChange X X

mDescr X X X X

mLang X

muSer X

muURI X

mReif X X X

miSerial X

mextVoc X X

mpropVoc X

mDeref X X X X

mSPARQL X

mExport X X

mNegot X X X

mHTMLRDF X

mMeta X X X

mmacLicense X X X

mInst X X X

mURIs X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (mLang) and Validity of external URIs (mURIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that introduces criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.
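To illustrate the general idea of such parameterized test templates, the following sketch instantiates a SPARQL pattern per property. The template text is our own hypothetical example, not one of RDFUnit's actual 17 templates; the property and class URIs are only illustrative:

```python
# Illustration of the idea behind parameterized SPARQL test templates:
# a generic pattern that is instantiated per property to be tested.
# This template is a hypothetical example, not taken from RDFUnit itself.
TEMPLATE = """
SELECT ?s WHERE {{
  ?s <{prop}> ?o .
  FILTER NOT EXISTS {{ ?s a <{expected_domain}> . }}
}}
"""

# Instantiating the template yields a test query that returns all subjects
# using the property while violating its expected domain class.
query = TEMPLATE.format(
    prop="http://dbpedia.org/ontology/birthPlace",
    expected_domain="http://dbpedia.org/ontology/Person",
)
print("FILTER NOT EXISTS" in query)  # True
```

A test then passes if the instantiated query returns no results against the KG's SPARQL endpoint.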

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce with the system OntoQA metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the number of classes without instances. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and their subclasses. In our case, we cannot use this approach, since Freebase has no hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances as a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverages of KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
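The deduplication just described can be sketched as follows. This is a minimal illustration with a hypothetical excerpt of a class-to-domain mapping (the mapping and example instances are ours, not the article's exact data):

```python
# Sketch of the domain-coverage counting described above: an instance typed
# with several classes of the same domain is counted only once per domain.
# CLASS_TO_DOMAIN is a hypothetical excerpt of a manual mapping.
CLASS_TO_DOMAIN = {
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_counts(instance_types):
    """instance_types: dict mapping an instance URI to its set of classes."""
    counts = {}
    for instance, classes in instance_types.items():
        # Collapse the instance's classes to the set of distinct domains,
        # so duplicates within one domain are dropped.
        domains = {CLASS_TO_DOMAIN[c] for c in classes if c in CLASS_TO_DOMAIN}
        for domain in domains:
            counts[domain] = counts.get(domain, 0) + 1
    return counts

counts = domain_counts({
    "dbr:Karlsruhe": {"dbo:Place", "dbo:PopulatedPlace"},  # counted once
    "dbr:Nina_Simone": {"dbo:MusicalArtist"},
})
print(counts)  # {'geography': 1, 'media': 1}
```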

8. Conclusion

Freely available knowledge graphs (KGs) have not been in the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects and proposed a framework, as well as a process, to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of Linked-Data-publishing data sources). Diploma thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Online; accessed July 20, 2015].

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737, Berlin, Heidelberg, 2009. Springer.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355, Berlin, Heidelberg, 2010. Springer.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. [Online; retrieved Aug 29, 2016].

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.



disjoint. We denote the system that hosts a KG g with h_g.

In this survey, we focus on those KGs having the following aspects:

1. The KGs are freely accessible and freely usable within the Linked Open Data (LOD) cloud. Linked Data refers to a set of best practices1 for publishing and interlinking structured data on the Web, defined by Berners-Lee [8] in 2006. Linked Open Data refers to the Linked Data which can be freely used, modified, and shared by anyone for any purpose2. The aim of the Linking Open Data community project3 is to publish RDF datasets on the Web and to interlink these datasets.

2. The KGs should cover general knowledge (often also called cross-domain or encyclopedic knowledge) instead of knowledge about special domains such as biomedicine.

Thus, out of scope are KGs which are not openly available, such as the Google Knowledge Graph4 and the Google Knowledge Vault [13]. Excluded are also KGs which are only accessible via an API but which are not provided as dump files (see WolframAlpha5 and the Facebook Graph6), as well as KGs which are not based on Semantic Web standards at all or which are only unstructured or weakly structured knowledge collections (e.g., The World Factbook of the CIA7).

For selecting the KGs for analysis, we regarded all datasets which had been registered at the online dataset catalog http://datahub.io8 and which were tagged as "crossdomain". Besides that, we took Wikidata into consideration, since it also fulfilled the above mentioned requirements. Based on that, we se-

1 See http://www.w3.org/TR/ld-bp/, requested on April 5, 2016.

2 See http://opendefinition.org/, requested on Apr 5, 2016.

3 See http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData, requested on Apr 5, 2016.

4 See http://www.google.com/insidesearch/features/search/knowledge.html, requested on Apr 3, 2016.

5 See http://products.wolframalpha.com/api/, requested on Aug 30, 2016.

6 See https://developers.facebook.com/docs/graph-api, requested on Aug 30, 2016.

7 See https://www.cia.gov/library/publications/the-world-factbook/, requested on Aug 30, 2016.

8 This catalog is also used for registering Linked Open Data datasets.

lected DBpedia, Freebase, OpenCyc, Wikidata, and YAGO as KGs for our comparison.

In this paper, we give a systematic overview of these KGs in their current versions (as of April 2016) and discuss how the knowledge in these KGs is modeled, stored, and queried. To the best of our knowledge, such a comparison between these widely used KGs has not been presented before. Note that the focus of this survey is not the life cycle of KGs on the Web or in enterprises; we can refer in this respect to [5]. Instead, the focus of our KG comparison is on data quality, as this is one of the most crucial aspects when it comes to considering which KG to use in a specific setting.

Furthermore, we provide a KG recommendation framework for users who are interested in using one of the mentioned KGs in a research or industrial setting, but who are inexperienced regarding which KG to choose for their concrete settings.

The main contributions of this survey are:

1. Based on existing literature on data quality, we provide 34 data quality criteria according to which KGs can be analyzed.

2. We calculate key statistics for the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

3. We analyze DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along the mentioned data quality criteria.9

4. We propose a framework which enables users to find the most suitable KG for their needs.

The survey is organized as follows:

– In Section 2, we introduce formal definitions used throughout the article.

– In Section 3, we describe the data quality dimensions which we later use for the KG comparison, including their subordinated data quality criteria and corresponding data quality metrics.

– In Section 4, we describe the selected KGs.
– In Section 5, we analyze the KGs using several key statistics and using the data quality metrics introduced in Section 3.

– In Section 6, we present our framework for assessing and rating KGs according to the user's setting.

– In Section 7, we present related work on (linked) data quality criteria and on key statistics for KGs.

– In Section 8, we conclude the survey.

9 The data and detailed evaluation results for both the key statistics and the metric evaluations are available online at http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Jan 31, 2017).


2. Important Definitions

We define the following sets that are used in formalizations throughout the article. If not otherwise stated, we use the prefixes listed in Listing 1 for indicating namespaces throughout the article.

– C_g denotes the set of classes in g:
C_g = {x | (x, rdfs:subClassOf, o) ∈ g ∨ (s, rdfs:subClassOf, x) ∈ g ∨ (x, wdt:P279, o) ∈ g ∨ (s, wdt:P279, x) ∈ g ∨ (x, rdf:type, rdfs:Class) ∈ g}

– An instance of a class is a resource which is a member of that class. This membership is given by a corresponding instantiation assignment.10 I_g denotes the set of instances in g:
I_g = {s | (s, rdf:type, o) ∈ g ∨ (s, wdt:P31, o) ∈ g}

– Entities are defined as instances which represent real world objects. E_g denotes the set of entities in g:
E_g = {s | (s, rdf:type, owl:Thing) ∈ g ∨ (s, rdf:type, wdo:Item) ∈ g ∨ (s, rdf:type, freebase:common.topic) ∈ g ∨ (s, rdf:type, cych:Individual) ∈ g}

– Relations (interchangeably used with properties) are links between RDF terms11 defined on the schema level (i.e., T-Box). To emphasize this characterization, we also call them explicitly defined relations. P_g denotes the set of all those relations in g:
P_g = {s | (s, rdf:type, rdf:Property) ∈ g ∨ (s, rdf:type, rdfs:Property) ∈ g ∨ (s, rdf:type, wdo:Property) ∈ g ∨ (s, rdf:type, owl:FunctionalProperty) ∈ g ∨ (s, rdf:type, owl:InverseFunctionalProperty) ∈ g ∨ (s, rdf:type, owl:DatatypeProperty) ∈ g ∨ (s, rdf:type, owl:ObjectProperty) ∈ g ∨ (s, rdf:type, owl:SymmetricProperty) ∈ g ∨ (s, rdf:type, owl:TransitiveProperty) ∈ g}

– Implicitly defined relations embrace all links used in the KG, i.e., on instance and schema level.

10 See https://www.w3.org/TR/rdf-schema/, requested on Aug 29, 2016.

11 RDF terms comprise URIs, blank nodes, and literals.

We also call them predicates. P^imp_g denotes the set of all implicitly defined relations in g:
P^imp_g = {p | (s, p, o) ∈ g}

– U_g denotes the set of all URIs used in g:
U_g = {x | ((x, p, o) ∈ g ∨ (s, x, o) ∈ g ∨ (s, p, x) ∈ g) ∧ x ∈ U}

– U^local_g denotes the set of all URIs in g with local namespace, i.e., those URIs start with the dedicated prefix of the KG g (cf. Listing 1).

– Complementary, U^ext_g consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving those URIs.
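The set definitions above can be sketched over a toy KG represented as a set of triples. This is our own illustration, restricted to the RDFS/OWL patterns (the Wikidata-, Freebase-, and Cyc-specific patterns are omitted); the example triples are made up:

```python
# Minimal sketch of the set definitions above, over a KG g modeled as a
# set of (subject, predicate, object) triples in prefixed notation.
# The toy triples are our own example data.
g = {
    ("dbo:City", "rdfs:subClassOf", "dbo:Place"),
    ("dbr:Karlsruhe", "rdf:type", "dbo:City"),
    ("dbr:Karlsruhe", "rdf:type", "owl:Thing"),
}

# C_g: classes, here only via the rdfs:subClassOf and rdfs:Class patterns
C = ({s for s, p, o in g if p == "rdfs:subClassOf"}
     | {o for s, p, o in g if p == "rdfs:subClassOf"}
     | {s for s, p, o in g if p == "rdf:type" and o == "rdfs:Class"})

# I_g: instances, i.e. subjects of rdf:type statements
I = {s for s, p, o in g if p == "rdf:type"}

# E_g: entities, here only via the owl:Thing membership pattern
E = {s for s, p, o in g if p == "rdf:type" and o == "owl:Thing"}

print(sorted(C))  # ['dbo:City', 'dbo:Place']
print(sorted(I))  # ['dbr:Karlsruhe']
print(sorted(E))  # ['dbr:Karlsruhe']
```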

Note that knowledge about the KGs which were analyzed for this survey was taken into account when defining these sets. These definitions may not be appropriate for other KGs.

Furthermore, the sets' extensions would be different when assuming a certain semantics (e.g., RDF, RDFS, or OWL-LD). Under the assumption that all entailments under one of these semantics were added to a KG, the definition of each set could be simplified and the extensions would be of larger cardinality. However, for this article we did not derive entailments.

3. Data Quality Assessment w.r.t. KGs

Everybody on the Web can publish information. Therefore, a data consumer does not only face the challenge of finding a suitable data source, but is also confronted with the issue that data on the Web can differ very much regarding its quality. Data quality can thereby be viewed not only in terms of accuracy, but in multiple other dimensions. In the following, we introduce concepts regarding the data quality of KGs in the Linked Data context which are used in the following sections. The data quality dimensions are then exposed in Sections 3.2 – 3.5.

Data quality (DQ) – in the following interchangeably used with information quality12 – is defined by Juran et al. [32] as fitness for use. This means that data quality is dependent on the actual use case.

One of the most important and foundational works on data quality is that of Wang et al. [47]. They developed a framework for assessing the data quality of datasets in the database context. In this framework, Wang et al.

12 As soon as data is considered w.r.t. usefulness, the data is seen in a specific context. It can thus already be regarded as information, leading to the term "information quality" instead of "data quality".


Listing 1: Default prefixes for namespaces used throughout this article

prefix cc: <http://creativecommons.org/ns#>
prefix cyc: <http://sw.opencyc.org/concept/>
prefix cych: <http://sw.opencyc.org/2012/05/10/concept/en/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix dbr: <http://dbpedia.org/resource/>
prefix dby: <http://dbpedia.org/class/yago/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix freebase: <http://rdf.freebase.com/ns/>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix prov: <http://www.w3.org/ns/prov#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix schema: <http://schema.org/>
prefix umbel: <http://umbel.org/umbel/sc/>
prefix void: <http://www.w3.org/TR/void/>
prefix wdo: <http://www.wikidata.org/ontology#>
prefix wdt: <http://www.wikidata.org/entity/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix yago: <http://yago-knowledge.org/resource/>

distinguish between data quality criteria, data quality dimensions, and data quality categories.13 In the following, we reuse these concepts for our own framework, which has the particular focus on the data quality of KGs in the context of Linked Open Data.

A data quality criterion (Wang et al. also call it "data quality attribute") is a particular characteristic of data w.r.t. its quality and can be either subjective or objective. An example of a subjectively measurable data quality criterion is Trustworthiness on KG level. An example of an objective data quality criterion is the Syntactic validity of RDF documents (see Section 3.2 and [46]).

In order to measure the degree to which a certain data quality criterion is fulfilled for a given KG, each criterion is formalized and expressed in terms of a function with the value range [0, 1]. We call this function the data quality metric of the respective data quality criterion.

A data quality dimension – in the following just called dimension – is a main aspect of how data quality can be viewed. A data quality dimension comprises one or several data quality criteria [47]. For instance, the

13 The quality dimensions are defined in [47]; the sub-classification into parameters/indicators in [46, p. 354].

criteria Syntactic validity of RDF documents, Syntactic validity of literals, and Semantic validity of triples form the Accuracy dimension.

Data quality dimensions and their respective data quality criteria are further grouped into data quality categories. Based on empirical studies, Wang et al. specified four categories:

– Criteria of the category of the intrinsic data quality focus on the fact that data has quality in its own right.

– Criteria of the category of the contextual data quality cannot be considered in general, but must be assessed depending on the application context of the data consumer.

– Criteria of the category of the representational data quality reveal in which form the data is available.

– Criteria of the category of the accessibility data quality determine how the data can be accessed.

Since its publication, the presented framework of Wang et al. has been extensively used, either in its original version or in an adapted or extended version. Bizer [11] and Zaveri [49] worked on data quality in the Linked Data context. They make the following adaptations to Wang et al.'s framework:


– Bizer [11] compared the work of Wang et al. [47] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.

– Zaveri et al. [49] follow Wang et al. [47], but introduce licensing and interlinking as new dimensions in the Linked Data context.

In this article, we use the DQ dimensions as defined by Wang et al. [47] and as extended by Bizer [11] and Zaveri [49]. More precisely, we make the following adaptations to Wang et al.'s framework:

1. Consistency is treated by us as a separate DQ dimension.

2. Verifiability is incorporated within the DQ dimension Trustworthiness as the criterion Trustworthiness on statement level.

3. The Offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.

4. We extend the category of the accessibility data quality by the dimensions License and Interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

3.1. Criteria Weighting

When applying our framework to compare KGs, the single DQ metrics can be weighted differently, so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a KG g, a set of criteria C = {c_1, ..., c_n}, a set of metrics M = {m_1, ..., m_n}, and a set of weights W = {w_1, ..., w_n}. Each metric m_i corresponds to the criterion c_i, and m_i(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a KG regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion c_i is weighted by w_i.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted, normalized sum of the fulfillment degrees w.r.t. the criteria c_1, ..., c_n:

h(g) = (Σ_{i=1}^{n} w_i · m_i(g)) / (Σ_{j=1}^{n} w_j)
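This weighted aggregation can be sketched in a few lines of Python (a minimal illustration; the metric scores and weights below are invented, not taken from the paper's evaluation):

```python
def fulfillment_degree(metric_scores, weights):
    """Weighted, normalized sum h(g) of the per-criterion scores m_i(g)."""
    assert len(metric_scores) == len(weights) and weights
    return sum(w * m for w, m in zip(weights, metric_scores)) / sum(weights)

# Illustrative: three criteria, the second weighted twice as heavily.
scores = [1.0, 0.5, 0.75]   # m_1(g), m_2(g), m_3(g)
weights = [1.0, 2.0, 1.0]   # w_1, w_2, w_3
print(fulfillment_degree(scores, weights))  # 0.6875
```

Dividing by the sum of the weights keeps h(g) in [0, 1] regardless of how the weights are scaled.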

Based on the quality dimensions introduced by Wang et al. [47], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 7.

Note also that our metrics are to be understood as possible ways of how to evaluate the DQ dimensions. Other definitions of the DQ metrics might be possible and reasonable. We defined the metrics along the characteristics of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, but kept the definitions as generic as possible. In the evaluations, we then used those metric definitions and applied them, e.g., on the basis of self-created gold standards.

3.2. Intrinsic Category

“Intrinsic data quality denotes that data have quality in their own right” [47]. This kind of data quality can therefore be assessed independently from the context. The intrinsic category embraces the three dimensions Accuracy, Trustworthiness, and Consistency, which are defined in the following subsections. The dimensions Believability, Objectivity, and Reputation, which are separate dimensions in Wang et al.'s classification system [47], are subsumed by us under the dimension Trustworthiness.

3.2.1. Accuracy
Definition of dimension: Accuracy is “the extent to which data are correct, reliable, and certified free of error” [47].

Discussion: Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [47]. Hence, accuracy has often been used as a synonym for data quality [39]. Bizer [11] highlights in this context that Accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish between syntactic and semantic accuracy. Syntactic accuracy describes the formal compliance with syntactic rules, without reviewing whether the value reflects reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for Accuracy as follows.

Definition of metric: The dimension Accuracy is determined by the criteria

– Syntactic validity of RDF documents,
– Syntactic validity of literals, and
– Semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows.

Syntactic validity of RDF documents: The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way, normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF validator.14

m_synRDF(g) =
  1, if all RDF documents are valid,
  0, otherwise.

Syntactic validity of literals: Assessing the syntactic validity of literals means determining to which degree literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

m_synLit(g) = |{(s, p, o) ∈ g | o ∈ L ∧ synValid(o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
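A minimal sketch of such a rule-based check, assuming for illustration that every literal in the sample is date-typed and must match a simplified ISO 8601 pattern (synValid from the formula becomes syn_valid; the triples are invented examples):

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # simplified ISO 8601 date rule

def syn_valid(literal):
    """Toy syntactic rule: a date literal must match YYYY-MM-DD."""
    return bool(ISO_DATE.match(literal))

def m_syn_lit(triples):
    """Share of literal objects passing their syntactic rule (1.0 if none)."""
    literals = [o for (_s, _p, o) in triples if isinstance(o, str)]
    if not literals:
        return 1.0
    return sum(syn_valid(o) for o in literals) / len(literals)

triples = [
    ("ex:Obama", "ex:birthDate", "1961-08-04"),   # syntactically valid
    ("ex:Obama", "ex:deathDate", "still alive"),  # syntactically invalid
]
print(m_syn_lit(triples))  # 0.5
```

A real evaluation would dispatch on the declared datatype of each literal and apply one rule per datatype.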

Semantic validity of triples: The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer et al. [11] note that a triple is semantically correct if it is also available from a trusted source (e.g., Name Authority File), if it is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has similar guidelines implemented to determine whether a fact needs to be sourced.15

14 See http://www.w3.org/RDF/Validator/, requested on Feb 29, 2016.

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as gold standard. We determine the fulfillment degree as the precision with which the triples that are in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let nog_GS = |{(s, p, o) | (s, p, o) ∈ g ∧ (x, y, z) ∈ GS ∧ equi(s, x) ∧ equi(p, y) ∧ equi(o, z)}| be the number of triples in g to which semantically corresponding triples in the gold standard GS exist. Let nog = |{(s, p, o) | (s, p, o) ∈ g ∧ (x, y, z) ∈ GS ∧ equi(s, x) ∧ equi(p, y)}| be the number of triples in g where the subject-relation pairs (s, p) are semantically equivalent to subject-relation pairs (x, y) in the gold standard. Then we can state:

m_semTriple(g) = nog_GS / nog

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
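A possible implementation sketch of this precision measure, with the paper's equi() equivalence simplified to exact identity of identifiers (the triples are invented examples):

```python
def m_sem_triple(kg, gold):
    """Precision of KG triples against a gold standard (m_semTriple)."""
    gold_set = set(gold)
    gold_sp = {(s, p) for (s, p, _o) in gold}
    # nog: triples whose (subject, relation) pair also occurs in the gold standard.
    candidates = [t for t in kg if (t[0], t[1]) in gold_sp]
    if not candidates:
        return 1.0
    # nog_GS: those candidates that also agree with the gold standard on the object.
    matching = [t for t in candidates if t in gold_set]
    return len(matching) / len(candidates)

gold = [("ex:Obama", "ex:birthPlace", "ex:Honolulu")]
kg = [("ex:Obama", "ex:birthPlace", "ex:Honolulu"),
      ("ex:Obama", "ex:birthPlace", "ex:Kenya")]
print(m_sem_triple(kg, gold))  # 0.5
```

A full implementation of equi() would additionally resolve co-referring identifiers (e.g., via owl:sameAs links) before comparing.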

3.2.2. Trustworthiness
Definition of dimension: Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is “the extent to which data are accepted or regarded as true, real, and credible” [47].

– Reputation: Reputation is “the extent to which data are trusted or highly regarded in terms of their source or content” [47].

– Objectivity: Objectivity is “the extent to which data are unbiased (unprejudiced) and impartial” [47].

– Verifiability: Verifiability is “the degree and ease with which the data can be checked for correctness” [39].

15 See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.


Discussion: In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the “expected accuracy” of a data source.

– Reputation: The essential difference of believability to accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since also biased statements can be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric: We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level: The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between (1) whether the data is entered by experts (of a specific domain), (2) whether the knowledge comes from volunteers contributing in a community, and (3) whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_graph(h_g) =
  1, if manual data curation and manual data insertion in a closed system,
  0.75, if manual data curation and insertion, both by a community,
  0.5, if manual data curation, and data insertion by a community or by automated knowledge extraction,
  0.25, if automated data curation, and data insertion by automated knowledge extraction from structured data sources,
  0, if automated data curation, and data insertion by automated knowledge extraction from unstructured data sources.

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be taken for defining the DQ metrics.

Trustworthiness on statement level: The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for assessing statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms16 with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology17 with properties such as prov:wasDerivedFrom.

16 See http://purl.org/dc/terms/, requested on Feb 4, 2017.

17 See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


m_fact(g) =
  1, if provenance on statement level is used,
  0.5, if provenance on resource level is used,
  0, otherwise.

Indicating unknown and empty values: If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow to represent that a person has no children, and unknown values allow to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_NoVal(g) =
  1, if unknown and empty values are used,
  0.5, if either unknown or empty values are used,
  0, otherwise.

3.2.3. Consistency
Definition of dimension: Consistency implies that “two or more values [in a dataset] do not conflict with each other” [37].

Discussion: Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations: owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource. This means that the subject is the only resource linked to the given object via the given relation.

Definition of metric: We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG, and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements: Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side in the user interface. For instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

m_checkRestr(h_g) =
  1, if schema restrictions are checked,
  0, otherwise.

Consistency of statements w.r.t. class constraints: This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. I.e., let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.18 Furthermore, let cg(e) be the set of all classes of instance e in g, defined as cg(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_conClass(g) = |{(c1, c2) ∈ CC | ¬∃e: (c1 ∈ cg(e) ∧ c2 ∈ cg(e))}| / |{(c1, c2) ∈ CC}|

In case of an empty set of class constraints CC, the metric should evaluate to 1.
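The class-constraint check can be sketched as follows (the disjointness pairs and the instance typings are invented; a real evaluation would read them from the KG):

```python
def m_con_class(disjoint_pairs, types):
    """Share of owl:disjointWith pairs violated by no instance (m_conClass)."""
    if not disjoint_pairs:
        return 1.0
    def violated(c1, c2):
        # A pair is violated if some instance carries both classes at once.
        return any(c1 in cs and c2 in cs for cs in types.values())
    satisfied = [pair for pair in disjoint_pairs if not violated(*pair)]
    return len(satisfied) / len(disjoint_pairs)

# types maps each entity to its set of rdf:type classes.
types = {"ex:Fido": {"dbo:Animal"},
         "ex:Error1": {"dbo:Person", "dbo:Place"}}  # violates Person/Place
pairs = [("dbo:Person", "dbo:Place"), ("dbo:Animal", "dbo:Place")]
print(m_con_class(pairs, types))  # 0.5
```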

Consistency of statements w.r.t. relation constraints: The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_conRelat(g) = (1/n) Σ_{i=1}^{n} m_conRelat_i(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,19 we can state:

m_conRelat(g) = (m_conRelatRg(g) + m_conRelatFct(g)) / 2

Let Rr be the set of all rdfs:range constraints

Rr = {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}

and Rf be the set of all owl:FunctionalProperty constraints

Rf = {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}.

18 Implicit restrictions which can be deduced from the class hierarchy (e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal) are not considered by us here.

19 We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

m_conRelatRg(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr: datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr}|

m_conRelatFct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf ∧ ¬∃(s, p, o2) ∈ g: o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf}|

In case of an empty set of relation constraints (Rr or Rf), the respective metric should evaluate to 1.
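A sketch of the two relation-constraint checks (the datatype() oracle is a toy stand-in for real datatype detection; constraints and triples are invented):

```python
from collections import defaultdict

def m_con_relat_rg(triples, ranges, datatype):
    """Share of range-constrained triples whose object has the declared datatype."""
    constrained = [t for t in triples if t[1] in ranges]
    if not constrained:
        return 1.0
    ok = [t for t in constrained if datatype(t[2]) == ranges[t[1]]]
    return len(ok) / len(constrained)

def m_con_relat_fct(triples, functional):
    """Share of functional-relation triples whose (s, p) pair has a unique object."""
    constrained = [t for t in triples if t[1] in functional]
    if not constrained:
        return 1.0
    objects = defaultdict(set)
    for s, p, o in constrained:
        objects[(s, p)].add(o)
    ok = [t for t in constrained if len(objects[(t[0], t[1])]) == 1]
    return len(ok) / len(constrained)

ranges = {"ex:birthDate": "xsd:date"}
functional = {"ex:birthDate"}
datatype = lambda o: "xsd:date" if o.count("-") == 2 else "xsd:string"  # toy oracle
triples = [("ex:a", "ex:birthDate", "1961-08-04"),
           ("ex:b", "ex:birthDate", "unknown")]
print((m_con_relat_rg(triples, ranges, datatype) +
       m_con_relat_fct(triples, functional)) / 2)  # 0.75
```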

3.3. Contextual Category

Contextual data quality “highlights the requirement that data quality must be considered within the context of the task at hand” [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension: Relevancy is “the extent to which data are applicable and helpful for the task at hand” [47].

Discussion: According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric: The dimension Relevancy is determined by the criterion Creating a ranking of statements.20 The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

20 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements: By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions, which he holds no more, are ranked with normal rank (wdo:NormalRank).

m_Ranking(g) =
  1, if a ranking of statements is supported,
  0, otherwise.

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension: Completeness is “the extent to which data are of sufficient breadth, depth, and scope for the task at hand” [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is “the extent to which the quantity or volume of available data is appropriate” [47].

– Value-added: Value-added is “the extent to which data are beneficial and provide advantages from their use” [47].

Discussion: Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;

2. Column completeness, i.e., the extent to which values of relations on instance level (i.e., facts) are not missing; and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. The completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric: We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness: By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes, such as people and locations in different granularities, and (ii) basic relations, such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_cSchema(g) = noclat_g / noclat

Column completeness: In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class, which are defined on the schema level (each relation has one column), exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1/|H|) Σ_{(k,p)∈H} no_kp / no_k

We thereby let H = {(k, p) ∈ (K × P) | k ∈ C_g ∧ ∃(x, p, o): p ∈ P^imp_g ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k1, ..., kn} and considered relations P = {p1, ..., pm}.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring the Column completeness, we selected only those relations for the assessment for which a value typically exists for all given instances.
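A sketch of this averaging (the class membership map and the selected relations per class are invented; H is given here directly as the relations chosen per class):

```python
def m_c_col(instances, relations_per_class):
    """Average, over class-relation pairs, of the share of instances of the
    class that have a value for the relation (m_cCol)."""
    pairs = [(k, p) for k in instances for p in relations_per_class.get(k, [])]
    if not pairs:
        return 1.0
    ratios = []
    for k, p in pairs:
        ents = instances[k]  # entity -> set of relations used for it
        ratios.append(sum(p in rels for rels in ents.values()) / len(ents))
    return sum(ratios) / len(ratios)

people = {"ex:alice": {"ex:birthDate", "ex:name"},
          "ex:bob": {"ex:name"}}  # no birth date for ex:bob
print(m_c_col({"dbo:Person": people},
              {"dbo:Person": ["ex:birthDate", "ex:name"]}))  # 0.75
```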

Population completeness: The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called “short head”; e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called “long tail”; e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|
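Population completeness reduces to a set intersection (the gold-standard entities below are invented placeholders):

```python
def m_c_pop(kg_entities, gold_entities):
    """Share of gold-standard entities present in the KG (m_cPop)."""
    if not gold_entities:
        return 1.0
    return len(gold_entities & kg_entities) / len(gold_entities)

gold = {"ex:Tokyo", "ex:Delhi", "ex:Wadgassen"}  # short head + long tail
kg = {"ex:Tokyo", "ex:Delhi", "ex:Paris"}
print(m_c_pop(kg, gold))  # 2/3
```

In practice, the intersection requires the same entity-matching step as m_semTriple, since the gold standard and the KG use different identifiers.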

3.3.3. Timeliness
Definition of dimension: Timeliness is “the extent to which the age of the data is appropriate for the task at hand” [47].

Discussion: Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations, years are sufficient, while in other situations one may need days [39].

21 For an evaluation of the prediction which relations are of this nature, see [1].

Definition of metric: The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements.

The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG: The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately, but the RDF export files are available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1, if continuous updates,
  0.5, if discrete periodic updates,
  0.25, if discrete non-periodic updates,
  0, otherwise.

Specification of the validity period of statements: Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start dates and possibly end dates of statements by providing suitable forms of representation:

m_Validity(g) =
  1, if the specification of a validity period is supported,
  0, otherwise.

Specification of the modification date of statements: The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1, if the specification of modification dates for statements is supported,
  0, otherwise.


3.4. Representational Data Quality

Representational data quality “contains aspects related to the format of the data [...] and meaning of data” [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being a part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension: The ease of understanding is “the extent to which data are clear, without ambiguity, and easily comprehended” [47].

Discussion: This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here, a KG) can be improved by means such as descriptive labels and literals in multiple languages.

Definition of metric: The dimension understandability is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. the dimension Ease of understanding is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources: Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace:

m_Descr(g) = |{u | u ∈ U^local_g ∧ ∃(u, p, o) ∈ g: p ∈ P_lDesc}| / |{u | u ∈ U^local_g}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).

Note that the result of the evaluation on the basis of entities is revealing: DBpedia deviates considerably, since some entities (created by intermediate node mappings) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the same namespace), but perform the evaluation only on the basis of entities.
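A sketch of this ratio (P_lDesc here contains only rdfs:label and rdfs:comment; the resources and triples are invented, with one unlabeled intermediate node):

```python
def m_descr(triples, local_uris, label_props=("rdfs:label", "rdfs:comment")):
    """Share of local-namespace resources with at least one label or description."""
    if not local_uris:
        return 1.0
    described = {s for (s, p, _o) in triples if p in label_props}
    return len(local_uris & described) / len(local_uris)

triples = [("ex:Obama", "rdfs:label", "Barack Obama"),
           ("ex:Node123", "ex:child", "ex:Obama")]  # intermediate node, no label
print(m_descr(triples, {"ex:Obama", "ex:Node123"}))  # 0.5
```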

Labels in multiple languages: Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the “basic language.” The now introduced metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG:

m_Lang(g) =
  1, if labels are provided in English and in at least one other language,
  0, otherwise.

Understandable RDF serialization: RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion via the serialization formats supported during the dereferencing of resources:

m_uSer(h_g) =
  1, if RDF serializations other than RDF/XML are available,
  0, otherwise.

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs: Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and memorize for humans compared to opaque URIs24 such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources:

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

m_uURI(g) =
  1, if self-describing URIs are always used,
  0.5, if self-describing URIs are partly used,
  0, otherwise.

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension: We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is “the extent to which data are in appropriate language and units, and the data definitions are clear” [47].

– Representational consistency: Representational consistency is “the extent to which data are always presented in the same format and are compatible with previous data” [47].

– Concise representation: Concise representation is “the extent to which data are compactly represented without being overwhelming” [47].

Discussion regarding interpretability: In contrast to the dimension understandability, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary when creating one's own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even querying the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree with that, but we also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common, and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used:

m_Reif(g) =
  1,    if no blank nodes and no RDF reification are used,
  0.5,  if either blank nodes or RDF reification are used,
  0,    otherwise.
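A minimal sketch of this check (not the authors' implementation), assuming triples are given as 3-tuples of strings in N-Triples-style notation, where blank nodes carry the `_:` prefix and standard reification shows up via the rdf:subject/rdf:predicate/rdf:object vocabulary:

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REIFICATION_TERMS = {RDF + t for t in ("subject", "predicate", "object", "Statement")}

def uses_blank_nodes(triples):
    # In N-Triples, blank nodes are serialized with the "_:" prefix.
    return any(term.startswith("_:") for (s, p, o) in triples for term in (s, o))

def uses_rdf_reification(triples):
    # Standard reification shows up as rdf:subject/predicate/object
    # triples or as instances of rdf:Statement.
    return any(p in REIFICATION_TERMS or o in REIFICATION_TERMS
               for (_, p, o) in triples)

def m_reif(triples):
    # Each detected feature (blank nodes, reification) halves the score.
    penalties = uses_blank_nodes(triples) + uses_rdf_reification(triples)
    return {0: 1.0, 1: 0.5, 2: 0.0}[penalties]
```

A graph containing only plain URI triples scores 1.0; one that uses either feature scores 0.5; one that uses both scores 0.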

proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema/, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification when referring to the modeling of reification according to the RDF standard.

Provisioning of several serialization formats. The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing:

m_iSerial(h_g) =
  1,    if RDF/XML and further formats are supported,
  0.5,  if only RDF/XML is supported,
  0,    otherwise.

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows for comfortable data integration. We measure the criterion of using external vocabulary by setting the number of triples with external vocabulary in predicate position in relation to the number of all triples in the KG:

m_extVoc(g) = |{(s, p, o) ∈ g | p ∈ P_g^external}| / |{(s, p, o) ∈ g}|
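The ratio can be sketched as follows, assuming the set of external predicates P_g^external has already been determined (how to decide externality is not part of this snippet):

```python
def m_ext_voc(triples, external_predicates):
    """Share of triples whose predicate stems from an external vocabulary."""
    if not triples:
        return 0.0
    hits = sum(1 for (_, p, _) in triples if p in external_predicates)
    return hits / len(triples)
```

For a KG with one foaf:name triple and one triple using a proprietary predicate, the metric yields 0.5.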

Interoperability of proprietary vocabulary. Linking on the schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on the schema level by calculating the ratio of classes and relations which have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources:

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g: p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass}, and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility

Definition of dimension. Accessibility is “the extent to which data are available or easily and quickly retrievable” [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability “of a data source is the probability that a feasible query is correctly answered in a given time range” [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute a query. There can be different factors influencing the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query itself, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) via SPARQL endpoints, (ii) via RDF dumps (export files), and (iii) via Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of a public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned:

m_Deref(h_g) = |dereferenceable(U_g)| / |U_g|
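The per-URI check can be sketched as a pure classification of observed HTTP responses; the actual HTTP fetching (e.g., via urllib) is omitted here, and the listed RDF media types are an assumption of commonly served serializations, not an exhaustive list:

```python
RDF_CONTENT_TYPES = {
    "application/rdf+xml", "text/turtle",
    "application/n-triples", "application/ld+json",
}

def is_successful_dereferencing(status_code, content_type):
    # A resource counts as dereferenceable if the server answers with
    # HTTP 200 and an RDF serialization as content type.
    media_type = content_type.split(";")[0].strip().lower()
    return status_code == 200 and media_type in RDF_CONTENT_TYPES

def m_deref(responses):
    """responses: list of (status_code, content_type) pairs, one per sampled URI."""
    if not responses:
        return 0.0
    ok = sum(1 for sc, ct in responses if is_successful_dereferencing(sc, ct))
    return ok / len(responses)
```

Note that an HTTP 200 answer with `text/html` does not count as a successful dereferencing, since no RDF document is returned.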

Availability of the KG. The criterion Availability of the KG indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom26:

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of a public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially involving many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query. However,

26 See http://pingdom.com, requested on Mar 1, 2016.

we do not measure these restrictions here:

m_SPARQL(h_g) =
  1,  if a SPARQL endpoint is publicly available,
  0,  otherwise.

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user cannot work with it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available:

m_Export(h_g) =
  1,  if an RDF export is available,
  0,  otherwise.

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content types are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to serialized RDF data not being processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response:

m_Negot(h_g) =
  1,    if CN is supported and correct content types are returned,
  0.5,  if CN is supported but wrong content types are returned,
  0,    otherwise.
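The header comparison can be sketched as follows; this is a simplified illustration (ignoring quality values such as `;q=0.9` in the Accept header), with the probe representation being our own assumption:

```python
def content_type_matches(accept_header, response_content_type):
    """Check whether the returned Content-Type honors the Accept header."""
    requested = {part.split(";")[0].strip().lower()
                 for part in accept_header.split(",")}
    returned = response_content_type.split(";")[0].strip().lower()
    return "*/*" in requested or returned in requested

def m_negot(probes):
    """probes: list of (accept_header, returned_content_type) pairs,
    or None for a probe where content negotiation failed entirely."""
    if any(p is None for p in probes):
        return 0.0  # CN not supported for at least one probe
    if all(content_type_matches(acc, ct) for acc, ct in probes):
        return 1.0  # CN supported, correct content types returned
    return 0.5      # CN supported, but wrong content types returned
```

The `text/plain` example from the text would score 0.5, since the server negotiates but mislabels the returned RDF data.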

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described:

m_HTMLRDF(h_g) =
  1,  if the Autodiscovery pattern is used at least once,
  0,  otherwise.
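Detecting the Autodiscovery pattern in an HTML page can be sketched with the standard-library HTML parser; the set of accepted RDF media types is our own assumption:

```python
from html.parser import HTMLParser

class AutodiscoveryDetector(HTMLParser):
    """Detect <link rel="alternate" type="..." href="..."> tags pointing to RDF."""
    RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and (a.get("rel") or "").lower() == "alternate"
                and (a.get("type") or "").lower() in self.RDF_TYPES
                and a.get("href")):
            self.found = True

def uses_autodiscovery(html_text):
    detector = AutodiscoveryDetector()
    detector.feed(html_text)
    return detector.found
```

A page whose header contains `<link rel="alternate" type="application/rdf+xml" href="company.rdf">` (the example from footnote 27) is detected; a page without such a link is not.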

Provisioning of KG metadata. In the light of the Semantic Web vision, in which agents select and make use of appropriate data sources on the Web, also the meta-information about KGs needs to be available in a machine-readable format. The two important mechanisms for specifying metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered in the data quality dimension License later on:

m_Meta(g) =
  1,  if machine-readable metadata about g is available,
  0,  otherwise.

3.5.2. License

Definition of dimension. Licensing is defined as “the granting of permission for a consumer to re-use a dataset under defined conditions” [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void/.

29 See http://creativecommons.org, requested on Mar 1, 2016.

contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 declares the respective data as public domain and without any restrictions.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies point to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format:

m_macLicense(g) =
  1,  if machine-readable licensing information is available,
  0,  otherwise.
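As a sketch, the check amounts to scanning the KG (or its VoID file) for triples whose predicate is one of the mentioned license relations; the full predicate URIs below follow the cc and dcterms namespaces:

```python
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",   # cc:license
    "http://purl.org/dc/terms/license",        # dcterms:license
    "http://purl.org/dc/terms/rights",         # dcterms:rights
}

def m_mac_license(triples):
    """1 if any triple carries machine-readable licensing information, else 0."""
    has_license = any(p in LICENSE_PREDICATES for (_, p, _) in triples)
    return 1 if has_license else 0
```

A dataset description containing a dcterms:license triple pointing to a CC license URI fulfills the criterion.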

3.5.3. Interlinking

Definition of dimension. Interlinking is the extent “to which entities that represent the same concept are

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.

31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.

32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.

33 Using the namespace http://creativecommons.org/ns#.


linked to each other, be it within or between two or more data sources” [49].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking on the instance level is usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries at different levels of granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin, the capital,35

(ii) Berlin, the state,36 and (iii) Berlin, the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs; the number of incoming links of a KG also indicates the importance of that KG in the Linked Open Data cloud. We measure the interlinking on the instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs:

34 See http://www.geonames.org, requested on Dec 31, 2016.

35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

38 The interlinking on the schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
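A sketch of the computation, assuming the instance set I_g \ (P_g ∪ C_g) and the set of external URIs U_g^ext are given:

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def m_inst(instances, triples, external_uris):
    """Share of instances with at least one owl:sameAs link to an external KG.

    instances: entity URIs, i.e., I_g without classes and relations;
    external_uris: the URIs considered external to the KG (U_g^ext)."""
    if not instances:
        return 0.0
    linked = {s for (s, p, o) in triples
              if p == OWL_SAME_AS and o in external_uris and s in instances}
    return len(linked) / len(instances)
```

An instance with several owl:sameAs links counts only once, since the metric asks for at least one external link per instance.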

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled via owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations; Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may no longer be available. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx):

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g: p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext}, and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In the case of an empty set A, the metric evaluates to 1.
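The metric, including the empty-set convention, can be sketched as follows; the `resolvable` predicate stands in for an actual HTTP request returning status code 200:

```python
def m_uris(external_uris, resolvable):
    """Share of resolvable external URIs; 1.0 by definition if none exist.

    external_uris: the sample set A; resolvable: predicate returning True
    if dereferencing the URI yields HTTP status code 200."""
    if not external_uris:
        return 1.0  # empty set A: no external links can be broken
    ok = sum(1 for u in external_uris if resolvable(u))
    return ok / len(external_uris)
```

In practice, `resolvable` would issue HTTP requests and treat timeouts, 4xx, and 5xx responses as failures.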

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category

  ∗ Accuracy
    ∗ Syntactic validity of RDF documents
    ∗ Syntactic validity of literals
    ∗ Semantic validity of triples

  ∗ Trustworthiness
    ∗ Trustworthiness on KG level
    ∗ Trustworthiness on statement level
    ∗ Using unknown and empty values

  ∗ Consistency
    ∗ Check of schema restrictions during insertion of new statements
    ∗ Consistency of statements w.r.t. class constraints
    ∗ Consistency of statements w.r.t. relation constraints

– Contextual category

  ∗ Relevancy
    ∗ Creating a ranking of statements

  ∗ Completeness
    ∗ Schema completeness
    ∗ Column completeness
    ∗ Population completeness

  ∗ Timeliness
    ∗ Timeliness frequency of the KG
    ∗ Specification of the validity period of statements
    ∗ Specification of the modification date of statements

– Representational data quality

  ∗ Ease of understanding
    ∗ Description of resources
    ∗ Labels in multiple languages
    ∗ Understandable RDF serialization
    ∗ Self-describing URIs

  ∗ Interoperability
    ∗ Avoiding blank nodes and RDF reification
    ∗ Provisioning of several serialization formats
    ∗ Using external vocabulary
    ∗ Interoperability of proprietary vocabulary

– Accessibility category

  ∗ Accessibility
    ∗ Dereferencing possibility of resources
    ∗ Availability of the KG
    ∗ Provisioning of public SPARQL endpoint
    ∗ Provisioning of an RDF export
    ∗ Support of content negotiation
    ∗ Linking HTML sites to RDF serializations
    ∗ Provisioning of KG metadata

  ∗ License
    ∗ Provisioning machine-readable licensing information

  ∗ Interlinking
    ∗ Interlinking via owl:sameAs
    ∗ Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 UniProt,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but it has also become relevant in commercial settings: for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.

40 There is also DBpedia Live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia Live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia Live provides data for each hour, for other time ranges DBpedia Live data is only available once a month.

41 See http://umbel.org, requested on Dec 31, 2016.

42 See http://musicbrainz.org, requested on Dec 31, 2016.

43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.

44 See http://www.dblp.org, requested on Dec 31, 2016.

45 See https://www.gutenberg.org, requested on Dec 31, 2016.

46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.

47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.

48 See http://www.uniprot.org, requested on Dec 31, 2016.

49 See http://bio2rdf.org, requested on Dec 31, 2016.

50 See a complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase51 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase provided an interface that allowed end users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,52 FMD,53 and MusicBrainz.54 Freebase uses a proprietary graph model which also allows storing complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common sense facts, such as “every tree is a plant.” The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.

52 See http://www.nndb.com, requested on Dec 31, 2016.

53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.

54 See http://musicbrainz.org, requested on Dec 31, 2016.

55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.

56 See http://www.cyc.com, requested on Dec 31, 2016.

57 See http://www.opencyc.org, accessed on Nov 1, 2016.

58 See http://researchcyc.com, requested on Dec 31, 2016.

59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO60 (Yet Another Great Ontology) has been developed at the Max Planck Institute for Informatics in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.

61 See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples

Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG, with over 3.1B triples, while OpenCyc is the smallest KG, with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way a KG is built up and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes, in terms of triples, is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia; for representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form; in particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied for this purpose.

Influence of reification on the number of triples.

DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since the data is here provided as N-Quads.62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identifiable. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

high number of unique subjects concerning the set of all triples.

In the case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
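The two reification styles discussed above differ in their triple cost, which can be illustrated with a simplified sketch (the URIs are hypothetical): RDF standard reification spends four extra triples per statement, while the named-graph style used by YAGO identifies the statement via a fourth element and adds no triples:

```python
def reify_standard(triple, stmt_id):
    """Express one triple via RDF standard reification (four extra triples)."""
    RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    s, p, o = triple
    return [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
    ]

def to_nquad(triple, stmt_id):
    """Express the same triple in N-Quads style: the fourth element
    identifies the statement instead of extra triples (simplified;
    literals and escaping are ignored)."""
    s, p, o = triple
    return "<%s> <%s> <%s> <%s> ." % (s, p, o, stmt_id)
```

This is why reification inflates the triple counts of DBpedia, Freebase, and Wikidata but not of YAGO.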

5.1.2. Classes

Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class assignments or via rdfs:subClassOf relations.64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead only uses “subclass of” (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does this gap between DBpedia and YAGO with respect to the number of classes come about, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is built with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is called Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).

64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and, in the case of Wikidata, “instance of” (wdt:P31)) on the instance level into account. However, this would result only in a lower-bound estimation, as classes which have no instances would not be considered.


[Figure: bar chart over DBpedia, Freebase, OpenCyc, Wikidata, and YAGO; y-axis: coverage in %]

Fig. 1. Coverage of classes having at least one instance.

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are, like the DBpedia ontology classes, interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as an OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of a KG is necessary, so that the low share of instantiated classes is not necessarily an issue.
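The coverage value can be computed as the share of classes with at least one instance. A minimal sketch over toy data (in a real KG, instantiation would be read from rdf:type, or wdt:P31 in case of Wikidata):

```python
# Coverage = |classes with >= 1 instance| / |all classes|; toy example.
classes = {"ex:Person", "ex:City", "ex:Enzyme"}
type_triples = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Karlsruhe", "rdf:type", "ex:City"),
]

instantiated = {o for _, _, o in type_triples} & classes
coverage = len(instantiated) / len(classes)
print(f"{coverage:.1%}")  # 66.7%
```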

Correlation between number of classes and number of instances. Fig. 2 shows a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power-law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for the covered domains

                   DB    FB    OC    WD    YA
Reach of method    88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.[65] This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains (and hence the reach of our evaluation) includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.[66] If the ratio were at 100%, we would have been able to assign all entities of the KG to the chosen domains.
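The reach computation described above can be sketched as follows; the entity sets and numbers are illustrative only, and the union is used because an entity may belong to several domains:

```python
# Reach = |union of entities over the chosen domains| / |all entities|.
domain_entities = {
    "people": {"e1", "e2", "e3"},
    "media":  {"e3", "e4"},   # e3 belongs to both domains, counted once
}
all_entities = {"e1", "e2", "e3", "e4", "e5"}

covered = set().union(*domain_entities.values())
reach = len(covered) / len(all_entities)
print(f"{reach:.0%}")  # 80%
```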

Fig. 3 shows the number of entities per domain in the different KGs with a logarithmic scale. Fig. 4 presents

[65] See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).

[66] We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG (logarithmic scale on both axes).

Fig. 3. Number of entities per domain (persons, media, organizations, geography, biology; logarithmic scale).

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track alone accounts for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain (in percent).

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. One reason for this is the import of GeoNames data into YAGO.

Wikidata contains around 150K entities in the domain organization. This is relatively few, considering the total number of entities of around 18.7M and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times[67] and that about 16K classes were therefore not considered. It is possible that entities of the domain organization belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations, as a short term for explicitly defined relations, refers to the (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.

– In contrast, we use predicates to denote links used in the KG independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P^imp_g, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
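The distinction between P_g and P^imp_g can be sketched over toy triples with hypothetical names: relations are declared on the schema level, predicates are whatever occurs in the predicate position.

```python
# P_g: explicitly declared relations; P^imp_g: predicates actually used.
triples = [
    ("ex:knows", "rdf:type", "rdf:Property"),   # declared, used below
    ("ex:likes", "rdf:type", "rdf:Property"),   # declared, never used
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:nick", '"Ali"'),           # used, never declared
]

relations = {s for s, p, o in triples
             if p == "rdf:type" and o == "rdf:Property"}   # P_g
predicates = {p for s, p, o in triples}                    # P^imp_g

print(sorted(relations))   # ['ex:knows', 'ex:likes']
print(sorted(predicates))  # ['ex:knows', 'ex:nick', 'rdf:type']
```

Already in this toy graph the two sets overlap only partially, which is exactly the situation the key statistics are meant to expose.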

Evaluation results.
Relations.
Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 71K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.[68] Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, non-mapping-based properties are not instantiated in YAGO. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned[69] and may overlap until DBpedia version 2016-04.[70]

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace. So-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, although Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

[68] See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.

[69] For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.

[70] For instance, dbp:alias and dbo:alias.

criteria are met.[71] One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process is a likely reason why, in relative terms, more relations are actually used in Wikidata than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.

2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.

3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications (for instance, if only the year is known) are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.

4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.

5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In case of

[71] See https://www.wikidata.org/wiki/Wikidata:Property_proposal (requested on Dec 31, 2016).


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.
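The grouping used in Fig. 5 can be sketched as a simple bucketing of usage counts. The counts below are made up for illustration:

```python
from collections import Counter

# Bucket declared relations by usage: zero, 1-500, and >500 occurrences.
declared = ["r1", "r2", "r3", "r4"]
usage = Counter({"r1": 3, "r2": 800})   # r3, r4 are never used

buckets = {"0": 0, "1-500": 0, ">500": 0}
for rel in declared:
    n = usage.get(rel, 0)
    key = "0" if n == 0 else ("1-500" if n <= 500 else ">500")
    buckets[key] += 1

share = {k: v / len(declared) for k, v in buckets.items()}
print(share)  # {'0': 0.5, '1-500': 0.25, '>500': 0.25}
```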

Predicates.
Ranking regarding predicates. Freebase is, like in the case of the ranking regarding relations, ranked first here. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows:

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once, which puts the high number into perspective. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 165 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) via an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows referring to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].
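The splitting of a plain triple into two triples via an intermediate statement node can be sketched as follows. The node and suffix names follow the example above; the helper function itself is our own illustration, not part of any Wikidata tooling:

```python
# Hedged sketch of the Wikidata-style n-ary modelling: one plain triple
# becomes two triples through an intermediate statement node, using the
# "s" (statement) and "v" (value) relation extensions.
def reify(subject, prop, value, stmt_node):
    return [
        (subject, prop + "s", stmt_node),   # e.g. wdt:P31s
        (stmt_node, prop + "v", value),     # e.g. wdt:P31v
    ]

reified = reify("wdt:Q76", "wdt:P31", "wdt:Q5", "wdt:Q76S123")
for t in reified:
    print(t)
# ('wdt:Q76', 'wdt:P31s', 'wdt:Q76S123')
# ('wdt:Q76S123', 'wdt:P31v', 'wdt:Q5')
```

This also makes visible why the modelling inflates the triple and instance counts: every plain statement costs an extra node and an extra triple.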

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,[72] while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method. We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliations.

[72] The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG (logarithmic scale).

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.[73] In this way, abstract classes such as cych:ExistingObjectType are neglected.
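The instance/entity distinction can be sketched over toy data: instances are subjects of class-affiliation triples, entities are those instances typed with a KG-specific "top" class (owl:Thing here, as for DBpedia and YAGO). The names are illustrative:

```python
# Instances vs. entities, over a toy triple set.
TYPE_PREDICATES = {"rdf:type", "wdt:P31"}   # class-affiliation predicates

triples = [
    ("ex:Alice", "rdf:type", "owl:Thing"),
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:stmt1", "rdf:type", "wdo:Statement"),  # an instance, but no entity
]

instances = {s for s, p, o in triples if p in TYPE_PREDICATES}
entities = {s for s, p, o in triples
            if p in TYPE_PREDICATES and o == "owl:Thing"}

print(len(instances), len(entities))  # 2 1
```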

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of the KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M); OpenCyc is at the bottom with only about 41K entities.

Differences in the number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.[74]

[73] For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks in both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums by Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In the case of DBpedia, the English Wikipedia is the source of information. The English Wikipedia covers many albums and singles of English-speaking artists, such as the album "Thriller" and the single "Billie Jean". Rather unknown songs such as "The Lady in My Life" are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works of Helene Fischer ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.[75] Presumably, the YAGO extraction system was unable to extract any

[74] Those release tracks are expressed via freebase:music.release_track.

[75] See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive (requested on Dec 31, 2016).

Fig. 7. Average number of entities per class per KG (logarithmic scale).

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |E_g| / |C_g|. Obvious is the difference between DBpedia and YAGO (despite their similar numbers of entities). The reason is that the number of classes in the DBpedia ontology is small (as it is created manually), while it is large in YAGO (as it is created automatically).
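The DBpedia/YAGO contrast can be recomputed directly from the Table 2 values (rounded to one decimal):

```python
# |E_g| / |C_g| for DBpedia and YAGO, using the figures from Table 2.
kgs = {
    "DBpedia": (4_298_433, 736),        # (entities, classes)
    "YAGO":    (5_130_031, 569_751),
}
for name, (entities, classes) in kgs.items():
    print(name, round(entities / classes, 1))
# DBpedia 5840.3
# YAGO 9.0
```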

Comparing the number of instances with the number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. The reason is that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities but only a large fraction of them (more precisely, the classes with

Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

5.1.6. Subjects and Objects
Evaluation method. The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) on the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementarily, the number of literals is given as O_lit_g = {o | (s, p, o) ∈ g ∧ o ∈ L}.

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs. We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                            DBpedia       Freebase        OpenCyc      Wikidata          YAGO
Number of triples                       411,885,960  3,124,791,156      2,412,520   748,530,833 1,001,461,792
Number of classes |C_g|                         736         53,092        116,822       302,280       569,751
Number of relations |P_g|                     2,819         70,902         18,028         1,874           106
No. of unique predicates |P^imp_g|           60,231        784,977            165         4,839        88,736
Number of entities |E_g|                  4,298,433     49,947,799         41,029    18,697,897     5,130,031
Number of instances |I_g|                20,764,283    115,880,761        242,383   142,213,806    12,291,250
Avg. entities per class |E_g|/|C_g|         5,840.3          940.8           0.35          61.9           9.0
No. of unique subjects |S_g|             31,391,413    125,144,313        261,097   142,278,154   331,806,927
No. of unique non-literals in obj. pos.
|O_g|                                    83,284,634    189,466,866        423,432   101,745,685    17,438,196
No. of unique literals in obj. pos.
|O_lit_g|                               161,398,382  1,782,723,759      1,081,818   308,144,682   682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow making statements about statements (for instance, storing the provenance information of statements). To that end, IDs (instead of blank nodes), which identify the triples, are used on the first position of the N-Triples. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics
Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of the number of triples, while OpenCyc is the smallest. We notice a correlation between the way a KG is built up and its size: automatically created KGs are typically larger, as the burdens of integrating new knowledge are lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and facts in a KG. The way of modeling data also has a great impact on the number of triples: for instance, if n-ary relations are expressed in the N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, the domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs: MusicBrainz facts had been imported into Freebase, leading to a strong representation (77%) of the media domain in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations are used more than 500 times. For OpenCyc, 99.2% of the relations are not used; we assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy
The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3
Evaluation results for the KGs regarding the dimension Accuracy

              DB     FB     OC     WD     YA
m_synRDF      1      1      1      1      1
m_synLit      0.99   1      1      1      0.62
m_semTriple   0.99   <1     1      0.99   0.99

Syntactic validity of RDF documents (m_synRDF).
Evaluation method. For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, RDF/XML serializations of the resource are available, which can be validated by the official W3C RDF validator.[76] Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.[77]

Evaluation result. All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit).
Evaluation method. We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains (namely, people, cities, and books) and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

[76] See https://www.w3.org/RDF/Validator/ (requested on Mar 2, 2016).

[77] See https://jena.apache.org (requested on Mar 2, 2016).


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.[78] If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.
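A minimal stand-in for such a typed-literal check, using only the Python standard library rather than Jena's RDFDatatype.isValid, could test whether a string lies in the lexical space of xsd:date (YYYY-MM-DD); YAGO's wildcard dates fail such a check:

```python
from datetime import date

# Hedged sketch: validate a lexical value against xsd:date.
def is_valid_xsd_date(lexical):
    try:
        date.fromisoformat(lexical)
        return True
    except ValueError:
        return False

print(is_valid_xsd_date("1982-08-29"))  # True
print(is_valid_xsd_date("470-##-##"))   # False
```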

Evaluation results. All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking if xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0-9, periods, and commas.
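The Freebase check just described can be expressed as a one-line regular expression. The exact pattern used in the evaluation is not given in the text, so the variant below is an assumption that follows the stated rule (digits, periods, and commas only):

```python
import re

# Accepts only digits, periods, and commas, e.g. "1,787,408" or "1787408".
INHABITANTS_RE = re.compile(r"^[0-9.,]+$")

def is_plausible_population(value: str) -> bool:
    """Syntactic check for untyped population literals (Freebase case)."""
    return INHABITANTS_RE.match(value) is not None
```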

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena Framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We found the following for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
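A widely circulated pattern in the spirit of the regular expression referenced above (not necessarily the exact one used in the evaluation) accepts ISBN-10 and ISBN-13, with or without an "ISBN" prefix and delimiters:

```python
import re

# Commonly used ISBN-10/ISBN-13 validation pattern; an illustrative stand-in
# for the regular expression referenced in the text.
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?"
    r"(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$"
    r"|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"
)

def is_valid_isbn(value: str) -> bool:
    """Purely syntactic ISBN check; does not verify the checksum digit."""
    return ISBN_RE.match(value) is not None
```

Note that this rejects the 16-digit value mentioned in footnote 82, since neither the 10- nor the 13-digit alternative can match it.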

Semantic validity of triples (m_semTriple)

Evaluation method. The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file especially concerning persons and corporate bodies and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result. We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO, we encountered 3 errors each, and for Wikidata 4. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is in those cases hard to perform.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possibly wrong values within the interval are not detected.
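Such an interval-based test case can be sketched in a few lines. The relation name, the sample triples, and the interval bounds below are illustrative assumptions, not values from the cited work:

```python
def interval_outliers(triples, relation, low, high):
    """Test-driven check in the spirit of Kontokostas et al.: collect all
    triples whose numeric value for `relation` lies outside [low, high],
    so they can be handed over for manual review."""
    return [(s, v) for (s, p, v) in triples
            if p == relation and not (low <= v <= high)]

# Hypothetical sample data: person heights in meters.
triples = [
    ("ex:Socrates", "ex:height", 1.70),
    ("ex:Typo", "ex:height", 17.0),   # likely a misplaced decimal point
    ("ex:Infant", "ex:height", 0.52),
]
suspects = interval_outliers(triples, "ex:height", 0.4, 2.8)
```

As the text notes, a value such as 1.70 that is wrong but inside the interval would pass this test unnoticed.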

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method. Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4
Evaluation results for the KGs regarding the dimension Trustworthiness

           DB    FB    OC    WD    YA
m_graph    0.5   0.5   1     0.75  0.25
m_fact     0.5   1     0     1     1
m_NoVal    0     1     0     1     0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results. The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. the community involvement: Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki,88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level (m_fact)

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not regarded as sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements93 (74M), this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.
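The coverage figure follows from a simple ratio of the two counts reported above; note that it is only an upper bound, since a single statement may carry several references:

```python
def reference_coverage(num_references: int, num_statements: int) -> float:
    """Share of statements covered by a reference (upper bound, since one
    statement may carry more than one reference)."""
    return num_references / num_statements

# Counts reported in the text: ~971K references, ~74M statements.
coverage = reference_coverage(971_000, 74_000_000)
```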

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVTs), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5
Evaluation results for the KGs regarding the dimension Consistency

               DB    FB    OC    WD    YA
m_checkRestr   0     1     0     1     0
m_conClass     0.88  1     <1    1     0.33
m_conRelat     0.99  0.45  1     0.50  0.99

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

Freebase supports the representation of unknown values and empty values by providing explicit relations for such cases.95 In YAGO, inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known); note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements whether the user's input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method. For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.

95 E.g., freebase:freebase.valuenotation.has_no_value.

Evaluation results. We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.
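Checking direct instantiations against owl:disjointWith axioms reduces to a set intersection over pre-extracted type assignments. The resource names below are hypothetical; only the dbo:Agent/dbo:Place pair mirrors the example from the text:

```python
def disjointness_violations(instance_types, disjoint_pairs):
    """Return resources directly instantiated as two classes that are
    declared disjoint via owl:disjointWith."""
    violations = []
    for resource, types in instance_types.items():
        for a, b in disjoint_pairs:
            if a in types and b in types:
                violations.append((resource, a, b))
    return violations

# Hypothetical extract mirroring the dbo:Agent / dbo:Place example.
instance_types = {
    "dbr:Some_Radio_Station": {"dbo:Agent", "dbo:Place"},
    "dbr:Hamburg": {"dbo:Place"},
}
disjoint_pairs = [("dbo:Agent", "dbo:Place")]
violations = disjointness_violations(instance_types, disjoint_pairs)
```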

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method. Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistency checks regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.

Evaluation results. In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6
Evaluation results for the KGs regarding the dimension Relevancy

            DB   FB   OC   WD   YA
m_Ranking   0    1    0    1    0

DBpedia obtains the highest measured fulfillment score w.r.t. consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those cases, the data type xsd:gYear is used instead.

YAGO, Freebase, and OpenCyc contain range inconsistencies, primarily since they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase, 99.9% of the inconsistencies obtained here are caused by the usages of the relations freebase:type.object.name and freebase:common.notable_for.display_name.
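A functional-property check amounts to counting distinct values per subject-relation pair. The triples and the assumption that type.object.name is functional are illustrative, following the Freebase example above:

```python
from collections import defaultdict

def functional_property_violations(triples, functional_props):
    """Subject-relation pairs that use an owl:FunctionalProperty
    with more than one distinct value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional_props:
            values[(s, p)].add(o)
    return {sp: vals for sp, vals in values.items() if len(vals) > 1}

# Hypothetical triples; type.object.name assumed functional as in the text.
triples = [
    ("fb:m.0x1", "fb:type.object.name", "Hamburg"),
    ("fb:m.0x1", "fb:type.object.name", "Hamburg, Germany"),
    ("fb:m.0x2", "fb:type.object.name", "Bremen"),
]
violations = functional_property_violations(triples, {"fb:type.object.name"})
```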

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7
Evaluation results for the KGs regarding the dimension Completeness

                 DB    FB    OC    WD    YA
m_cSchema        0.91  0.76  0.92  1     0.95
m_cColumn        0.40  0.43  0     0.29  0.33
m_cPop           0.93  0.94  0.48  0.99  0.89
m_cPop (short)   1     1     0.82  1     0.90
m_cPop (long)    0.86  0.88  0.14  0.98  0.88

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; also DBpedia, OpenCyc, and YAGO exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness; its schema is mainly limited due to the characteristics of how information is stored in and extracted from Wikipedia.

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class "tree" but the class "ginkgo", which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete are logically subclasses of the people class freebase:people.person, but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as "tree"100 and "ginkgo".101 The ginkgo tree is not classified as a tree but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.
101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has considerably fewer relations than Freebase. Thus, the Wikidata methodology to let users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can be used reasonably both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes via the domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness (m_cColumn)

Evaluation method. For evaluating the KGs w.r.t. Column completeness, for each KG 25 class-relation combinations102 were created, based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8
Metric values of m_cCol for single class-relation pairs

Relation            DB    FB    OC    WD    YA
Person–birth date   0.48  0.48  0     0.70  0.77
Person–sex          –     0.57  0     0.94  0.64
Book–author         0.91  0.93  0     0.82  0.28
Book–ISBN           0.73  0.63  –     0.18  0.01
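The coverage underlying each cell of Table 8 is the fraction of class instances that carry at least one value for the relation. The mini-KG below is a hypothetical illustration of that computation:

```python
def column_completeness(instances, facts, cls, relation):
    """Fraction of instances of `cls` that have at least one value for
    `relation` (the class-relation-pair coverage underlying Table 8)."""
    members = [i for i, c in instances if c == cls]
    covered = {s for s, p, _ in facts if p == relation}
    return sum(1 for i in members if i in covered) / len(members)

# Hypothetical mini-KG: four persons, three of which have a birth date.
instances = [("ex:p1", "Person"), ("ex:p2", "Person"),
             ("ex:p3", "Person"), ("ex:p4", "Person")]
facts = [("ex:p1", "birthDate", "1980-01-01"),
         ("ex:p2", "birthDate", "1975-06-30"),
         ("ex:p3", "birthDate", "1990-12-24")]
score = column_completeness(instances, facts, "Person", "birthDate")
```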

Evaluation results. In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation pairs which are well represented on the instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs:

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of The Lord of the Rings (see freebase:m.07bz5). Also, the coverage of ISBNs for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation pairs depended on which class-relation pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation pairs were used if no 25 pairs were available in the respective KG.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness (m_cPop)

Evaluation method. In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called "short head") and two rather unknown entities (called "long tail") for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results. All KGs except OpenCyc show good evaluation results. Since also Wikidata exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This results from the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: A Wikidata entry is added as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measured that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Fig. 10: bar chart with one group of bars per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO); y-axis from 0 to 1; series: People, Media, Organizations, Geography, Biology.]

Fig. 10. Population completeness regarding the different domains per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

             DB   FB   OC    WD   YA
m_Freq       0.5  0    0.25  1    0.25
m_Validity   0    1    0     1    1
m_Change     0    1    0     0    0

Timeliness frequency of the KG (m_Freq)

Evaluation results. The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness:

DBpedia is created about once or twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.107 Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. Always the latest DBpedia version is published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its close-down and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc and OpenCyc, respectively, are developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements (m_Validity)

Evaluation results. Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.


Table 10. Evaluation results for the KGs regarding the dimension Ease of understanding

           DB    FB    OC    WD    YA
m_Descr    0.70  0.97  1     <1    1
m_Lang     1     1     0     1     1
m_uSer     1     1     0     1     1
m_uURI     1     0.5   1     0     1

office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

DBpedia and OpenCyc do not provide any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997."
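A data consumer can exploit such validity qualifiers when querying a KG. The following Python sketch (all entity and relation names are invented for illustration and do not stem from the evaluated KGs) models a statement with a validity period in the spirit of Wikidata's "start time"/"end time" qualifiers and checks whether it holds at a given point in time:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualifiedStatement:
    """A statement with an optional validity period, analogous to
    Wikidata's "start time" (wdt:P580) and "end time" (wdt:P582)."""
    subject: str
    predicate: str
    obj: str
    start_year: Optional[int] = None  # None means unbounded
    end_year: Optional[int] = None

    def valid_in(self, year: int) -> bool:
        # The statement holds in a year iff the year lies in [start, end].
        if self.start_year is not None and year < self.start_year:
            return False
        if self.end_year is not None and year > self.end_year:
            return False
        return True

# Hypothetical example: a president's term of office from 2009 to 2017.
term = QualifiedStatement("ex:Obama", "ex:holdsOffice", "ex:President",
                          start_year=2009, end_year=2017)
```

A statement without qualifiers is treated as always valid, which mirrors the fact that most KG triples carry no temporal scope at all.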

Specification of the modification date of statements (m_Change)

Evaluation results. The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (m_Descr)

Evaluation method. We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dcelements:description, while Freebase provides freebase:common.topic.description.112

Evaluation result. For all KGs, the rule applies that in case there is no label available, usually there is also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.114
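The m_Descr measurement itself is a simple coverage ratio. The following hedged Python sketch (the entity URIs and labels are invented; the intermediate-node example mimics the dbr:Nayim case from footnote 114) computes the fraction of entities that carry at least one rdfs:label:

```python
def description_coverage(entities, labels):
    """Fraction of entities with at least one label.

    entities: iterable of entity URIs
    labels:   dict mapping entity URI -> list of label strings
    """
    entities = list(entities)
    if not entities:
        return 0.0
    labeled = sum(1 for e in entities if labels.get(e))
    return labeled / len(entities)

# Made-up sample: the intermediate node has no label, the entity does.
sample_entities = ["dbr:Nayim", "dbr:Nayim__CareerStation__1"]
sample_labels = {"dbr:Nayim": ["Nayim@en"]}
coverage = description_coverage(sample_entities, sample_labels)
```

With one labeled entity out of two, the sketch yields a coverage of 0.5, illustrating how unlabeled intermediate nodes depress the overall score.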

Labels in multiple languages (m_Lang)

Evaluation method. Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.
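Detecting such language annotations in an RDF dump boils down to parsing the language tag of each literal. The following Python sketch (a simplified N-Triples literal parser, written for this illustration and not the evaluation code of the paper) extracts the tag so that per-language coverage can be counted:

```python
import re

# Matches a quoted N-Triples literal followed by a language tag,
# e.g. "Deutschland"@de . -> tag "de". Simplified for illustration.
LITERAL_LANG = re.compile(r'"(?:[^"\\]|\\.)*"@([A-Za-z][A-Za-z0-9-]*)\s*\.?\s*$')

def language_tag(ntriples_object: str):
    """Return the lowercased language annotation of a literal, or None."""
    m = LITERAL_LANG.search(ntriples_object)
    return m.group(1).lower() if m else None
```

Literals without an annotation yield None; as noted in footnote 115, such literals are counted as having no language information.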

Evaluation results. DBpedia provides labels in 13 languages. Further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG; therefore, it provides labels in 326 different languages. Freebase and Wikidata also provide a lot of languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages. We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but there is no further information available for an identification of the entity.

114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.


coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

Understandable RDF serialization (m_uSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (m_uURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116

5.2.8. Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (m_Reif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples and only relations of higher arity

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11. Evaluation results for the KGs regarding the dimension Interoperability

            DB    FB    OC    WD    YA
m_Reif      0.5   0.5   0.5   0     0.5
m_iSerial   1     0     0.5   1     1
m_extVoc    0.61  0.11  0.41  0.68  0.13
m_propVoc   0.15  0     0.51  >0    0

are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity of dealing with reification.

Blank nodes are non-dereferenceable anonymous resources. They are used by the Wikidata and OpenCyc data models.
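The difference between a plain triple and an n-ary (reified) statement can be sketched with ordinary triple tuples. The following Python example (all URIs and the population figure are invented for illustration; the pattern loosely follows the Freebase CVT and Wikidata statement-node style) shows how an intermediate statement node makes qualifiers such as a year attachable:

```python
# A plain triple: no place to attach context such as the year.
direct = [("ex:Vancouver", "ex:population", "500000")]

# N-ary modeling: an intermediate node carries the value plus qualifiers.
n_ary = [
    ("ex:Vancouver", "ex:hasPopulationMeasurement", "ex:stmt1"),
    ("ex:stmt1", "ex:population", "500000"),
    ("ex:stmt1", "ex:year", "1997"),  # qualifier enabled by the node
]

def qualifiers(triples, node):
    """All predicate/object pairs attached to an intermediate node."""
    return {(p, o) for s, p, o in triples if s == node}
```

The price of this expressiveness is visible in the triple counts: one fact becomes three triples, which is exactly the effect on KG size discussed in footnote 117.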

Provisioning of several serialization formats (m_iSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary (m_extVoc)

Evaluation method. This criterion indicates the extent to which external vocabulary is used. For that, for each KG, we divide the number of triples with external relations by the number of all triples in this KG.
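This ratio is straightforward to compute once a KG's own namespaces are known. The following hedged Python sketch (the triples and namespace prefixes are invented; the real evaluation operates on full URIs rather than prefixed names) illustrates the m_extVoc computation:

```python
def external_vocab_ratio(triples, internal_prefixes):
    """Share of triples whose predicate lies outside the KG's namespaces."""
    triples = list(triples)
    if not triples:
        return 0.0
    external = sum(
        1 for _, p, _ in triples
        if not any(p.startswith(pref) for pref in internal_prefixes)
    )
    return external / len(triples)

# Made-up sample: rdf:type is external, dbo:country is internal.
sample = [
    ("dbr:Karlsruhe", "rdf:type", "dbo:City"),
    ("dbr:Karlsruhe", "dbo:country", "dbr:Germany"),
]
ratio = external_vocab_ratio(sample, ("dbo:", "dbp:"))
```

On the sample, one of two predicates is external, so the ratio is 0.5.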

Evaluation results. DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for that fact: 1. information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals; 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e.,

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.


about half) are taken for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary (m_propVoc)

Evaluation method. This criterion determines the extent to which URIs of proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results. In general, we obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We made the following single findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external linking via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked.121

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL reference states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12. Evaluation results for the KGs regarding the dimension Accessibility

            DB    FB    OC    WD    YA
m_Deref     1     1     0.44  0.41  1
m_Avai      <1    0.73  <1    <1    1
m_SPARQL    1     1     0     1     0
m_Export    1     1     1     1     1
m_Negot     0.5   1     0     1     0
m_HTMLRDF   1     1     1     1     0
m_Meta      1     0     0     0     1

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (m_Deref)

Evaluation method. We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
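Building such a content-negotiating request can be done with the Python standard library. The following sketch only constructs the request object (no request is actually sent here, and the resource URI is just an example); in an actual check, the response status and content type would then be inspected:

```python
import urllib.request

def rdf_request(uri: str) -> urllib.request.Request:
    """Build a dereferencing request asking for RDF/XML via the
    HTTP Accept header, as done in the evaluation method above."""
    return urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"}
    )

req = rdf_request("http://dbpedia.org/resource/Karlsruhe")
```

Passing the request to urllib.request.urlopen would then trigger the content negotiation; a fulfilled dereferencing check expects an HTTP 200 response with an RDF payload.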

Evaluation results. In case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfill this criterion completely. For DBpedia, 45K URIs were analyzed, for OpenCyc only around 30K due to the small number of unique predicates. We observed almost

121Frequently used relations with stated equivalence to externalrelations are eg wdtP31 linked to rdftype and wdtP279linked to rdfssubClassOf


the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (m_Avai)

Evaluation method. We measured the availability of the officially hosted KGs with the monitoring service Pingdom.122 For each KG, an uptime test was set up which checked the availability of the resource Hamburg, as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200), every minute over a time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result. While the other KGs showed almost no outages and were, on average, online again after some minutes, YAGO outages took place frequently and lasted 3.5 hours on average.123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.
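From such minute-by-minute checks, the availability score is simply one minus the fraction of downtime over the monitoring window. The following hedged Python sketch illustrates the arithmetic (the outage durations below are invented sample values, not the measured ones):

```python
def availability(window_minutes: int, outage_minutes) -> float:
    """Uptime fraction over a monitoring window, given outage durations."""
    downtime = sum(outage_minutes)
    return max(0.0, 1.0 - downtime / window_minutes)

# One-minute checks over 60 days, with two invented sample outages.
sixty_days = 60 * 24 * 60
uptime = availability(sixty_days, [30, 210])
```

With 240 minutes of downtime over 86,400 monitored minutes, the sketch yields an uptime of roughly 0.997, i.e., a KG would be rated "<1" in Table 12.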

Availability of a public SPARQL endpoint (m_SPARQL)

The SPARQL endpoints of DBpedia and YAGO are

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

provided by a Virtuoso server,124 while the Wikidata SPARQL endpoint runs on Blazegraph.125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions. The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 1.5M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (m_Export)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG. Mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (m_Negot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase currently does not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (m_HTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate"

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.


Table 13. Evaluation results for the KGs regarding the dimension License

              DB  FB  OC  WD  YA
m_macLicense  1   0   0   1   0

type="[content type]" href="[URL]"> in the HTML header.

Provisioning of metadata about the KG (m_Meta)

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file.126 DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (m_macLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA128 and the GNU Free Documentation License (GNU FDL).129 Wikidata embeds licensing information during the dereferencing of resources in the RDF document by linking with cc:license to the license CC0.130 YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC-BY.131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.132

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14. Evaluation results for the KGs regarding the dimension Interlinking

         DB    FB    OC    WD       YA
m_Inst   0.25  0     0.38  0 (0.9)  0.31
m_URIs   0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (m_Inst)

Evaluation method. Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
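The core of this measurement is a namespace test on the object of each owl:sameAs triple. The following hedged Python sketch (the triples are invented; note how a link to a localized DBpedia version counts as internal, as stated below) collects the instances with at least one external link:

```python
def externally_linked(sameas_triples, internal_prefixes):
    """Instances with at least one owl:sameAs link leaving the KG."""
    return {
        s for s, _, o in sameas_triples
        if not any(o.startswith(pref) for pref in internal_prefixes)
    }

# Made-up sample data; localized DBpedia versions are treated as internal.
sameas = [
    ("dbr:Berlin", "owl:sameAs", "http://de.dbpedia.org/resource/Berlin"),
    ("dbr:Berlin", "owl:sameAs", "http://sws.geonames.org/2950159/"),
    ("dbr:Hamburg", "owl:sameAs", "http://dbpedia.org/resource/Hamburg"),
]
linked = externally_linked(
    sameas, ("http://dbpedia.org/", "http://de.dbpedia.org/")
)
```

Only dbr:Berlin survives the filter, since its GeoNames link is the only one leaving the (here invented) internal namespaces; m_Inst would then be the size of this set divided by the number of instances.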

Evaluation result. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided, nor is a corresponding proprietary relation available. Instead, Wikidata uses a proprietary relation (called "identifier") for each linked data set to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we viewed each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.


only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In case of OpenCyc, links to Cyc,134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now, we analyze the links to external URIs.

Evaluation method. External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
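Classifying the outcomes of such link checks is a matter of bucketing HTTP status codes. The following Python sketch (a simplified illustration, with None standing in for a timed-out request; the sample result list is invented) mirrors the error categories named in the evaluation method:

```python
def classify(status):
    """Map an HTTP status code (or None for a timeout) to an error class."""
    if status is None:
        return "timeout"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "ok"

# Invented sample outcomes of four link checks.
results = [200, 404, 503, None]
error_rate = sum(classify(s) != "ok" for s in results) / len(results)
```

The metric value m_URIs then corresponds to one minus this error rate over all checked external links.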

Evaluation result. The external links are, in most cases, valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc mainly contains external links to non-RDF-based Web resources at wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation "reference URL" (wdt:P854), which, among other relations, states provenance information, belongs to the links pointing to external

134 I.e., sw.cyc.com.

135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Web resources. Here, we were able to resolve around 95.5% without errors.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions, which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In case of Wikidata, some invalid literals, such as the ISBN, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as the ISBN) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can, in general, be used without concerns regarding the correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one-third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is in particular worthwhile in case of statements whose validity is temporally limited.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard exist in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of

each class are, on average, frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata achieves the full fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that: by means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We find mixed paradigms regarding URI generation: DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase use generic IDs, i.e., opaque URIs (in Freebase, classes and relations are in part identified with self-describing URIs).

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; the reason is that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferenceable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large amount of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, taking on average 3.5 hours each.
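As a back-of-the-envelope check, the outage figures above can be converted into an availability ratio; this assumes the outages are non-overlapping and is only an estimate, not the exact measurement behind Table 15 (which lists 0.7306 for YAGO):

```python
# Rough availability estimate for YAGO from the measurements above.
# Assumption: the ~100 outages do not overlap.
outages = 100
avg_outage_hours = 3.5
observation_hours = 8 * 7 * 24  # 8 weeks

downtime_hours = outages * avg_outage_hours
availability = 1 - downtime_hours / observation_hours
print(round(availability, 4))  # 0.7396
```

The estimate (about 0.74) is close to the measured value, which suggests the reported averages are consistent.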

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time per query of 30 seconds. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.
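Turning such an identifier literal into an owl:sameAs link can be sketched as a simple string rewrite. The MusicBrainz artist URL pattern is real; the concrete entity and identifier below are example values, and treating the MusicBrainz page URL as the identity target is our illustrative assumption:

```python
# Sketch: deriving an owl:sameAs triple (in N-Triples syntax) from a
# Wikidata external-identifier literal. Example values; whether the
# target URL is an appropriate owl:sameAs object is an assumption.
def sameas_triple(wikidata_uri: str, mbid: str) -> str:
    target = f"https://musicbrainz.org/artist/{mbid}"
    return f"<{wikidata_uri}> <http://www.w3.org/2002/07/owl#sameAs> <{target}> ."

triple = sameas_triple(
    "http://www.wikidata.org/entity/Q1299",      # example Wikidata entity
    "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d",      # example MusicBrainz artist ID
)
print(triple)
```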

34. Validity of external URIs: The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
– Identifying the preselection criteria P
– Assigning a weight w_i to each DQ criterion c_i ∈ C

Step 2: Preselection based on the Preselection Criteria
– Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
– Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
– Calculating the fulfillment degree h(g) for each KG g ∈ G_P
– Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
– Assessing the selected KG g w.r.t. qualitative aspects
– Comparing the selected KG g with the other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g_1, ..., g_n}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria, and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion. The license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are neglected which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics), and, if necessary, an alternative KG can be selected for the given scenario.
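The aggregation in Step 3 can be sketched as a normalized weighted mean. Whether Section 3.1 defines h(g) in exactly this form is our assumption, though the "Weighted Average" row of Table 15 is consistent with it; the two metric values and weights below are Wikidata's from Table 15:

```python
# Sketch of Step 3: fulfillment degree h(g) as a normalized weighted
# mean of DQ metric values. The exact formula is given in Section 3.1;
# this shape is an assumption consistent with Table 15.
def fulfillment_degree(metrics: dict, weights: dict) -> float:
    total_weight = sum(weights.values())
    return sum(weights[c] * metrics[c] for c in weights) / total_weight

# Toy example with two criteria (Wikidata values and weights, Table 15).
metrics = {"m_Freq": 1.0, "m_cPop": 0.99}
weights = {"m_Freq": 3, "m_cPop": 3}
print(round(fulfillment_degree(metrics, weights), 3))  # 0.995
```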

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography for each musician. To be able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.¹³⁸

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. When the criteria are weighted according to these constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

¹³⁸ We assume that in this use case the dereferencing of HTTP URIs rather than the execution of SPARQL queries is desired.


Table 15. Framework with an example weighting which would be reasonable for a user setting as given in [33].

Dimension               Metric          DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example of User Weighting w_i

Accuracy                m_synRDF        1        1         1        1         1       1
                        m_synLit        0.994    1         1        1         0.624   1
                        m_semTriple     0.990    0.995     1        0.993     0.993   1
Trustworthiness         m_graph         0.5      0.5       1        0.75      0.25    0
                        m_fact          0.5      1         0        1         1       1
                        m_NoVal         0        1         0        1         0       0
Consistency             m_checkRestr    0        1         0        1         0       0
                        m_conClass      0.875    1         0.999    1         0.333   0
                        m_conRelat      0.992    0.451     1        0.500     0.992   0
Relevancy               m_Ranking       0        1         0        1         0       1
Completeness            m_cSchema       0.905    0.762     0.921    1         0.952   1
                        m_cCol          0.402    0.425     0        0.285     0.332   2
                        m_cPop          0.93     0.94      0.48     0.99      0.89    3
Timeliness              m_Freq          0.5      0         0.25     1         0.25    3
                        m_Validity      0        1         0        1         1       0
                        m_Change        0        1         0        0         0       0
Ease of understanding   m_Descr         0.704    0.972     1        0.9999    1       1
                        m_Lang          1        1         0        1         1       0
                        m_uSer          1        1         0        1         1       0
                        m_uURI          1        0.5       1        0         1       1
Interoperability        m_Reif          0.5      0.5       0.5      0         0.5     0
                        m_iSerial       1        0         0.5      1         1       1
                        m_extVoc        0.61     0.108     0.415    0.682     0.134   1
                        m_propVoc       0.150    0         0.513    0.001     0       1
Accessibility           m_Deref         1        0.437     1        0.414     1       2
                        m_Avai          0.9961   0.9998    1        0.9999    0.7306  2
                        m_SPARQL        1        0         0        1         1       1
                        m_Export        1        1         1        1         1       0
                        m_Negot         0.5      0         0        1         1       0
                        m_HTMLRDF       1        1         0        1         1       0
                        m_Meta          1        0         1        0         0       0
Licensing               m_macLicense    1        0         0        1         0       0
Interlinking            m_Inst          0.251    0         0.382    0         0.310   3
                        m_URIs          0.929    0.908     0.894    0.957     0.956   1

Unweighted Average                      0.683    0.603     0.496    0.752     0.625
Weighted Average                        0.701    0.493     0.556    0.714     0.648


4. Qualitative assessment: The high population completeness in general, and the high coverage of entities in the media domain in particular, give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately.
The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and introduced criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (m_graph), Indicating unknown and empty values (m_NoVal), Check of schema restrictions during insertion of new statements (m_checkRestr), Creating a ranking of statements (m_Ranking), Timeliness frequency of the KG (m_Freq), Specification of the validity period of statements (m_Validity), and Availability of the KG (m_Avai), have not been proposed so far, to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, on the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (m_Descr) and Column completeness (m_cCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (m_SPARQL) and Provisioning of an RDF export (m_Export). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (m_propVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16. Overview of related work regarding data quality criteria for KGs.

DQ Metric       [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]
m_synRDF        X X
m_synLit        X X X X
m_semTriple     X X X X
m_fact          X X
m_conClass      X X X
m_conRelat      X X X X X X
m_cSchema       X X
m_cCol          X X X X
m_cPop          X X
m_Change        X X
m_Descr         X X X X
m_Lang          X
m_uSer          X
m_uURI          X
m_Reif          X X X
m_iSerial       X
m_extVoc        X X
m_propVoc       X
m_Deref         X X X X
m_SPARQL        X
m_Export        X X
m_Negot         X X X
m_HTMLRDF       X
m_Meta          X X X
m_macLicense    X X X
m_Inst          X X X
m_URIs          X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (m_Lang) and Validity of external URIs (m_URIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia, as well as four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce, with the system OntoQA, metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.
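The class richness metric can be sketched in a few lines; the toy schema and instance data below are illustrative, not taken from any of the surveyed KGs:

```python
# Sketch of OntoQA's class richness: the fraction of schema classes
# that actually have instances. Toy data for illustration only.
def class_richness(all_classes: set, typed_instances: dict) -> float:
    classes_with_instances = {c for cs in typed_instances.values()
                              for c in cs} & all_classes
    return len(classes_with_instances) / len(all_classes)

schema = {"Person", "Place", "Work", "Event"}
typed = {"e1": {"Person"}, "e2": {"Place"}, "e3": {"Person"}}
print(class_richness(schema, typed))  # 0.5  (2 of 4 classes are used)
```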

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and its subclasses. In our case, we cannot use this approach, since Freebase has no hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing, as a table, the most frequent classes with the highest number of instances. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
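The de-duplicated counting just described can be sketched as follows; the class-to-domain mapping is illustrative, not the mapping actually used in this article:

```python
# Sketch of the de-duplicated domain count: an instance counts once
# per domain even if several of its classes map to the same domain.
# The class-to-domain mapping below is illustrative.
from collections import Counter

CLASS_TO_DOMAIN = {
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_coverage(instance_classes: dict) -> Counter:
    counts = Counter()
    for classes in instance_classes.values():
        domains = {CLASS_TO_DOMAIN[c] for c in classes if c in CLASS_TO_DOMAIN}
        counts.update(domains)  # set membership: each domain counted once
    return counts

coverage = domain_coverage({
    "dbr:Karlsruhe": {"dbo:Place", "dbo:PopulatedPlace"},
    "dbr:The_Beatles": {"dbo:MusicalArtist"},
})
print(coverage["geography"])  # 1, not 2, despite two geography classes
```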

8. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3):16:1–16:52, July 2009.

[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016.]

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016.]

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016.]

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma Thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodríguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. Accessed July 20, 2015.

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737, Berlin, Heidelberg, 2009. Springer.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355, Berlin, Heidelberg, 2010. Springer.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. Retrieved on Aug 29, 2016.

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.



2. Important Definitions

We define the following sets that are used in formalizations throughout the article. If not otherwise stated, we use the prefixes listed in Listing 1 for indicating namespaces throughout the article.

– C_g denotes the set of classes in g:
C_g = {x | (x, rdfs:subClassOf, o) ∈ g ∨ (s, rdfs:subClassOf, x) ∈ g ∨ (x, wdt:P279, o) ∈ g ∨ (s, wdt:P279, x) ∈ g ∨ (x, rdf:type, rdfs:Class) ∈ g}

– An instance of a class is a resource which is a member of that class. This membership is given by a corresponding instantiation assignment.¹⁰ I_g denotes the set of instances in g:
I_g = {s | (s, rdf:type, o) ∈ g ∨ (s, wdt:P31, o) ∈ g}

– Entities are defined as instances which represent real-world objects. E_g denotes the set of entities in g:
E_g = {s | (s, rdf:type, owl:Thing) ∈ g ∨ (s, rdf:type, wdo:Item) ∈ g ∨ (s, rdf:type, freebase:common.topic) ∈ g ∨ (s, rdf:type, cych:Individual) ∈ g}

– Relations (interchangeably used with properties) are links between RDF terms¹¹ defined on the schema level (i.e., the T-Box). To emphasize this characterization, we also call them explicitly defined relations. P_g denotes the set of all those relations in g:
P_g = {s | (s, rdf:type, rdf:Property) ∈ g ∨ (s, rdf:type, rdfs:Property) ∈ g ∨ (s, rdf:type, wdo:Property) ∈ g ∨ (s, rdf:type, owl:FunctionalProperty) ∈ g ∨ (s, rdf:type, owl:InverseFunctionalProperty) ∈ g ∨ (s, rdf:type, owl:DatatypeProperty) ∈ g ∨ (s, rdf:type, owl:ObjectProperty) ∈ g ∨ (s, rdf:type, owl:SymmetricProperty) ∈ g ∨ (s, rdf:type, owl:TransitiveProperty) ∈ g}

– Implicitly defined relations embrace all links used in the KG, i.e., on instance and schema level. We also call them predicates. P_g^imp denotes the set of all implicitly defined relations in g:
P_g^imp = {p | (s, p, o) ∈ g}

¹⁰ See https://www.w3.org/TR/rdf-schema, requested on Aug 29, 2016.

¹¹ RDF terms comprise URIs, blank nodes, and literals.

– U_g denotes the set of all URIs used in g:
U_g = {x | ((x, p, o) ∈ g ∨ (s, x, o) ∈ g ∨ (s, p, x) ∈ g) ∧ x ∈ U}

– U_g^local denotes the set of all URIs in g with a local namespace, i.e., those URIs start with the prefix dedicated to the KG g (cf. Listing 1).

ndash Complementary Uextg consists of all URIs in Ug

which are external to the KG g which means thathg is not responsible for resolving those URIs

Note that knowledge about the KGs which were analyzed for this survey was taken into account when defining these sets. These definitions may not be appropriate for other KGs.

Furthermore, the sets' extensions would be different when assuming a certain semantics (e.g., RDF, RDFS, or OWL-LD). Under the assumption that all entailments under one of these semantics were added to a KG, the definition of each set could be simplified and the extensions would be of larger cardinality. However, for this article we did not derive entailments.
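The set definitions above can be sketched over a KG represented as a set of (subject, predicate, object) triples. The following Python fragment is a minimal illustration using invented toy triples; it covers only the rdf:/rdfs: patterns of the definitions (the Wikidata-specific patterns wdt:P279 and wdt:P31 are omitted for brevity, and prefixed names stand in for full URIs):

```python
# Toy illustration of C_g, I_g, and P_imp_g over a triple set.
# Triples and prefixed names are invented examples.

def classes(g):
    """C_g (simplified): subjects/objects of rdfs:subClassOf plus explicitly typed classes."""
    return ({s for (s, p, o) in g if p == "rdfs:subClassOf"} |
            {o for (s, p, o) in g if p == "rdfs:subClassOf"} |
            {s for (s, p, o) in g if p == "rdf:type" and o == "rdfs:Class"})

def instances(g):
    """I_g (simplified): resources with at least one rdf:type statement."""
    return {s for (s, p, o) in g if p == "rdf:type"}

def implicit_relations(g):
    """P_imp_g: every predicate that occurs in some triple."""
    return {p for (s, p, o) in g}

g = {
    ("dbo:City", "rdfs:subClassOf", "dbo:Place"),
    ("dbr:Karlsruhe", "rdf:type", "dbo:City"),
    ("dbr:Karlsruhe", "dbo:populationTotal", "307755"),
}
print(sorted(classes(g)))          # ['dbo:City', 'dbo:Place']
print(sorted(instances(g)))        # ['dbr:Karlsruhe']
print(len(implicit_relations(g)))  # 3
```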

3. Data Quality Assessment w.r.t. KGs

Everybody on the Web can publish information. Therefore, a data consumer does not only face the challenge of finding a suitable data source, but is also confronted with the issue that data on the Web can differ considerably in its quality. Data quality can thereby be viewed not only in terms of accuracy, but in multiple other dimensions. In the following, we introduce concepts regarding the data quality of KGs in the Linked Data context which are used in the following sections. The data quality dimensions are then presented in Sections 3.2–3.5.

Data quality (DQ) – in the following interchangeably used with information quality^12 – is defined by Juran et al. [32] as fitness for use. This means that data quality is dependent on the actual use case.

One of the most important and foundational works on data quality is that of Wang et al. [47]. They developed a framework for assessing the data quality of datasets in the database context. In this framework, Wang et al.

^12 As soon as data is considered w.r.t. usefulness, the data is seen in a specific context. It can thus already be regarded as information, leading to the term "information quality" instead of "data quality".


Listing 1: Default prefixes for namespaces used throughout this article

prefix cc:       <http://creativecommons.org/ns#>
prefix cyc:      <http://sw.opencyc.org/concept/>
prefix cych:     <http://sw.opencyc.org/2012/05/10/concept/en/>
prefix dbo:      <http://dbpedia.org/ontology/>
prefix dbp:      <http://dbpedia.org/property/>
prefix dbr:      <http://dbpedia.org/resource/>
prefix dby:      <http://dbpedia.org/class/yago/>
prefix dcterms:  <http://purl.org/dc/terms/>
prefix foaf:     <http://xmlns.com/foaf/0.1/>
prefix freebase: <http://rdf.freebase.com/ns/>
prefix owl:      <http://www.w3.org/2002/07/owl#>
prefix prov:     <http://www.w3.org/ns/prov#>
prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
prefix schema:   <http://schema.org/>
prefix umbel:    <http://umbel.org/umbel/sc/>
prefix void:     <http://www.w3.org/TR/void/>
prefix wdo:      <http://www.wikidata.org/ontology#>
prefix wdt:      <http://www.wikidata.org/entity/>
prefix xsd:      <http://www.w3.org/2001/XMLSchema#>
prefix yago:     <http://yago-knowledge.org/resource/>

distinguish between data quality criteria, data quality dimensions, and data quality categories.^13 In the following, we reuse these concepts for our own framework, which has a particular focus on the data quality of KGs in the context of Linked Open Data.

A data quality criterion (Wang et al. also call it "data quality attribute") is a particular characteristic of data w.r.t. its quality and can be either subjective or objective. An example of a subjectively measurable data quality criterion is Trustworthiness on KG level. An example of an objective data quality criterion is the Syntactic validity of RDF documents (see Section 3.2 and [46]).

In order to measure the degree to which a certain data quality criterion is fulfilled for a given KG, each criterion is formalized and expressed in terms of a function with the value range [0, 1]. We call this function the data quality metric of the respective data quality criterion.

A data quality dimension – in the following just called dimension – is a main aspect of how data quality can be viewed. A data quality dimension comprises one or several data quality criteria [47]. For instance, the

^13 The quality dimensions are defined in [47]; the sub-classification into parameters/indicators in [46, p. 354].

criteria Syntactic validity of RDF documents, Syntactic validity of literals, and Semantic validity of triples form the Accuracy dimension.

Data quality dimensions and their respective data quality criteria are further grouped into data quality categories. Based on empirical studies, Wang et al. specified four categories:

– Criteria of the category of intrinsic data quality focus on the fact that data has quality in its own right.

– Criteria of the category of contextual data quality cannot be considered in general, but must be assessed depending on the application context of the data consumer.

– Criteria of the category of representational data quality reveal in which form the data is available.

– Criteria of the category of accessibility data quality determine how the data can be accessed.

Since its publication, the presented framework of Wang et al. has been extensively used, either in its original version or in an adapted or extended version. Bizer [11] and Zaveri et al. [49] worked on data quality in the Linked Data context. They make the following adaptations to Wang et al.'s framework:


– Bizer [11] compared the work of Wang et al. [47] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.

– Zaveri et al. [49] follow Wang et al. [47], but introduce licensing and interlinking as new dimensions in the Linked Data context.

In this article, we use the DQ dimensions as defined by Wang et al. [47] and as extended by Bizer [11] and Zaveri et al. [49]. More precisely, we make the following adaptations to Wang et al.'s framework:

1. Consistency is treated by us as a separate DQ dimension.

2. Verifiability is incorporated within the DQ dimension Trustworthiness as the criterion Trustworthiness on statement level.

3. The Offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.

4. We extend the category of accessibility data quality by the dimensions License and Interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

3.1. Criteria Weighting

When applying our framework to compare KGs, the single DQ metrics can be weighted differently, so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a KG g, a set of criteria C = {c_1, ..., c_n}, a set of metrics M = {m_1, ..., m_n}, and a set of weights W = {w_1, ..., w_n}. Each metric m_i corresponds to the criterion c_i, and m_i(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a KG regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion c_i is weighted by w_i.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted, normalized sum of the fulfillment degrees w.r.t. the criteria c_1, ..., c_n:

h(g) = (Σ_{i=1}^{n} w_i · m_i(g)) / (Σ_{j=1}^{n} w_j)
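The weighted, normalized fulfillment degree can be sketched in a few lines of Python; the metric scores and weights below are invented example values:

```python
# Minimal sketch of the weighted, normalized fulfillment degree h(g).

def fulfillment_degree(scores, weights):
    """h(g) = sum(w_i * m_i(g)) / sum(w_j), with all m_i(g) in [0, 1]."""
    assert len(scores) == len(weights)
    return sum(w * m for m, w in zip(scores, weights)) / sum(weights)

scores  = [1.0, 0.5, 0.75]   # m_1(g), m_2(g), m_3(g) -- invented
weights = [2.0, 1.0, 1.0]    # user-chosen importance of each criterion
print(fulfillment_degree(scores, weights))  # 0.8125
```

Because the sum is normalized by the total weight, h(g) always stays in [0, 1] regardless of how the weights are scaled.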

Based on the quality dimensions introduced by Wang et al. [47], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 7.

Note also that our metrics are to be understood as possible ways of evaluating the DQ dimensions. Other definitions of the DQ metrics might be possible and reasonable. We defined the metrics along the characteristics of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, but kept the definitions as generic as possible. In the evaluations, we then used those metric definitions and applied them, e.g., on the basis of self-created gold standards.

3.2. Intrinsic Category

"Intrinsic data quality denotes that data have quality in their own right" [47]. This kind of data quality can therefore be assessed independently from the context. The intrinsic category embraces the three dimensions Accuracy, Trustworthiness, and Consistency, which are defined in the following subsections. The dimensions Believability, Objectivity, and Reputation, which are separate dimensions in Wang et al.'s classification system [47], are subsumed by us under the dimension Trustworthiness.

3.2.1. Accuracy

Definition of dimension: Accuracy is "the extent to which data are correct, reliable, and certified free of error" [47].

Discussion: Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [47]. Hence, accuracy has often been used as a synonym for data quality [39]. Bizer [11] highlights in this context that Accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish between syntactic and semantic accuracy. Syntactic accuracy describes the formal compliance with syntactic rules, without reviewing whether the value reflects reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for Accuracy as follows.

Definition of metric: The dimension Accuracy is determined by the criteria

– Syntactic validity of RDF documents,
– Syntactic validity of literals, and


– Semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows.

Syntactic validity of RDF documents: The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way, typically only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF validator.^14

m_synRDF(g) =
  1  if all RDF documents are valid,
  0  otherwise.
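A crude sketch of m_synRDF for RDF/XML documents can be written with the Python standard library. Note the simplification: the code below only checks XML well-formedness, which is a necessary but not sufficient condition for RDF/XML validity; a full check would use an RDF parser or the W3C RDF validator mentioned above. The sample documents are invented.

```python
# Approximate m_synRDF: 1 if all documents parse as well-formed XML, else 0.
import xml.etree.ElementTree as ET

def m_syn_rdf(documents):
    for doc in documents:
        try:
            ET.fromstring(doc)
        except ET.ParseError:
            return 0
    return 1

valid = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>"""
broken = "<rdf:RDF xmlns:rdf='...'"  # truncated document

print(m_syn_rdf([valid]))          # 1
print(m_syn_rdf([valid, broken]))  # 0
```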

Syntactic validity of literals: Assessing the syntactic validity of literals means determining to which degree the literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

m_synLit(g) = |{(s, p, o) ∈ g | o ∈ L ∧ synValid(o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
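As a sketch, m_synLit restricted to date literals can be computed with a regular expression for the ISO 8601 pattern YYYY-MM-DD. The triples and the synValid rule below are illustrative; a real check would dispatch on each literal's declared datatype.

```python
# Sketch of m_synLit with a single syntactic rule (ISO 8601 full dates).
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def m_syn_lit(triples, is_literal):
    literals = [(s, p, o) for (s, p, o) in triples if is_literal(o)]
    if not literals:
        return 1.0  # empty denominator: metric evaluates to 1
    valid = sum(1 for (s, p, o) in literals if ISO_DATE.match(o))
    return valid / len(literals)

g = [
    ("dbr:Barack_Obama", "dbo:birthDate", "1961-08-04"),   # syntactically valid
    ("dbr:Barack_Obama", "dbo:activeYearsStartDate", "1997"),  # not a full ISO 8601 date
]
print(m_syn_lit(g, is_literal=lambda o: not o.startswith("dbr:")))  # 0.5
```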

Semantic validity of triples: The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer et al. [11] note that a triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File), if it

^14 See http://www.w3.org/RDF/Validator/, requested on Feb 29, 2016.

is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has similar guidelines implemented to determine whether a fact needs to be sourced.^15

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as gold standard. We determine the fulfillment degree as the precision with which the triples that are both in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let no_g,GS = |{(s, p, o) | (s, p, o) ∈ g ∧ ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y) ∧ equi(o, z)}| be the number of triples in g for which semantically corresponding triples exist in the gold standard GS. Let no_g = |{(s, p, o) | (s, p, o) ∈ g ∧ ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y)}| be the number of triples in g whose subject-relation pairs (s, p) are semantically equivalent to subject-relation pairs (x, y) in the gold standard. Then we can state:

m_semTriple(g) = no_g,GS / no_g

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
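The precision-style computation of m_semTriple can be sketched as follows, with exact string match standing in for the equi(·,·) equivalence (real evaluations would map between vocabularies). All triples are invented examples.

```python
# Sketch of m_semTriple: precision of KG triples against a gold standard.

def m_sem_triple(kg, gold):
    gold_sp = {(x, y) for (x, y, z) in gold}
    # no_g: triples whose (subject, relation) pair is covered by the gold standard
    no_g = [(s, p, o) for (s, p, o) in kg if (s, p) in gold_sp]
    if not no_g:
        return 1.0  # empty denominator: metric evaluates to 1
    # no_g,GS: of those, triples whose value also agrees with the gold standard
    no_g_gs = sum(1 for t in no_g if t in gold)
    return no_g_gs / len(no_g)

gold = {("dbr:Berlin", "dbo:country", "dbr:Germany"),
        ("dbr:Paris", "dbo:country", "dbr:France")}
kg = [("dbr:Berlin", "dbo:country", "dbr:Germany"),   # agrees with gold standard
      ("dbr:Paris", "dbo:country", "dbr:Belgium"),    # wrong value
      ("dbr:Rome", "dbo:mayor", "dbr:Someone")]       # not covered by gold standard
print(m_sem_triple(kg, gold))  # 0.5
```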

3.2.2. Trustworthiness

Definition of dimension: Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is "the extent to which data are accepted or regarded as true, real, and credible" [47].

– Reputation: Reputation is "the extent to which data are trusted or highly regarded in terms of their source or content" [47].

– Objectivity: Objectivity is "the extent to which data are unbiased (unprejudiced) and impartial" [47].

– Verifiability: Verifiability is "the degree and ease with which the data can be checked for correctness" [39].

^15 See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.


Discussion: In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the "expected accuracy" of a data source.

– Reputation: The essential difference between believability and accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go that far, since biased statements could also be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric: We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level: The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between (1) whether the data is entered by experts (of a specific domain), (2) whether the knowledge comes from volunteers contributing in a community, and (3) whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_graph(h_g) =
  1     if manual data curation and manual data insertion in a closed system,
  0.75  if manual data curation and insertion, both by a community,
  0.5   if manual data curation, and data insertion by a community or by automated knowledge extraction,
  0.25  if automated data curation, and data insertion by automated knowledge extraction from structured data sources,
  0     if automated data curation, and data insertion by automated knowledge extraction from unstructured data sources.

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be used for defining the DQ metrics.

Trustworthiness on statement level: The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for easily assessing statements w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata Terms^16, with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology^17, with properties such as prov:wasDerivedFrom.

^16 See http://purl.org/dc/terms/, requested on Feb 4, 2017.

^17 See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


m_fact(g) =
  1    if provenance on statement level is used,
  0.5  if provenance on resource level is used,
  0    otherwise.

Indicating unknown and empty values: If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow to represent that a person has no children, and unknown values allow to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_NoVal(g) =
  1    if unknown and empty values are used,
  0.5  if either unknown or empty values are used,
  0    otherwise.

3.2.3. Consistency

Definition of dimension: Consistency implies that "two or more values [in a dataset] do not conflict with each other" [37].

Discussion: Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource. This means that the subject is the only resource linked to the given object via the given relation.

Definition of metric: We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether the statements already existing in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements: Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side, in the user interface; for instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the checks need to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradict each other (utilizing owl:disjointWith relations).

m_checkRestr(h_g) =
  1  if schema restrictions are checked,
  0  otherwise.

Consistency of statements w.r.t. class constraints: This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. That is, let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.^18 Furthermore, let c_g(e) be the set of all classes of instance e in g, defined as c_g(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_conClass(g) = |{(c1, c2) ∈ CC | ∄e : (c1 ∈ c_g(e) ∧ c2 ∈ c_g(e))}| / |CC|

In case of an empty set of class constraints CC, the metric should evaluate to 1.
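m_conClass can be sketched as follows: for each owl:disjointWith pair, check that no instance is typed with both classes. The toy schema and instances are invented examples.

```python
# Sketch of m_conClass: share of disjointness constraints without violations.

def m_con_class(disjoint_pairs, types_of):
    """types_of maps each instance e to its set of classes c_g(e)."""
    if not disjoint_pairs:
        return 1.0  # no class constraints: metric evaluates to 1
    ok = sum(1 for (c1, c2) in disjoint_pairs
             if not any(c1 in cs and c2 in cs for cs in types_of.values()))
    return ok / len(disjoint_pairs)

disjoint = [("dbo:Person", "dbo:Place"), ("dbo:Animal", "dbo:Plant")]
types_of = {
    "dbr:Karlsruhe": {"dbo:Place"},
    "dbr:Confused_Entity": {"dbo:Person", "dbo:Place"},  # violates the first pair
}
print(m_con_class(disjoint, types_of))  # 0.5
```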

Consistency of statements w.r.t. relation constraints: The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_conRelat(g) = (1/n) Σ_{i=1}^{n} m_conRelat_i(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,^19 we can state:

m_conRelat(g) = (m_conRelatRg(g) + m_conRelatFct(g)) / 2

Let R_r be the set of all rdfs:range constraints:

R_r = {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}

^18 Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

^19 We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

and let R_f be the set of all owl:FunctionalProperty constraints:

R_f = {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

m_conRelatRg(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ R_r : datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ R_r}|

m_conRelatFct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ R_f ∧ ∄(s, p, o2) ∈ g : o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ R_f}|

In case of an empty set of relation constraints (R_r or R_f), the respective metric should evaluate to 1.
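Both relation-constraint metrics can be sketched over a toy KG. The datatype(o) check is approximated by a user-supplied function, and all triples, range constraints, and functional properties below are invented examples.

```python
# Sketch of m_conRelatRg and m_conRelatFct over a triple list.

def m_con_relat_rg(g, ranges, datatype):
    """Share of range-constrained triples whose object has the expected datatype."""
    constrained = [(s, p, o) for (s, p, o) in g if p in ranges]
    if not constrained:
        return 1.0  # empty constraint set: metric evaluates to 1
    ok = sum(1 for (s, p, o) in constrained if datatype(o) == ranges[p])
    return ok / len(constrained)

def m_con_relat_fct(g, functional):
    """Share of functional-property triples without a conflicting second value."""
    constrained = [(s, p, o) for (s, p, o) in g if p in functional]
    if not constrained:
        return 1.0
    ok = sum(1 for (s, p, o) in constrained
             if not any(s2 == s and p2 == p and o2 != o for (s2, p2, o2) in g))
    return ok / len(constrained)

g = [("dbr:A", "dbo:birthDate", "1961-08-04"),
     ("dbr:A", "dbo:birthDate", "1962-01-01"),  # second value: functional constraint violated
     ("dbr:B", "dbo:birthDate", "1961")]        # wrong datatype (year only)
ranges = {"dbo:birthDate": "xsd:date"}
functional = {"dbo:birthDate"}
datatype = lambda o: "xsd:date" if o.count("-") == 2 else "xsd:string"  # crude stand-in

print(m_con_relat_rg(g, ranges, datatype))  # 2 of 3 triples have the expected datatype
print(m_con_relat_fct(g, functional))       # 1 of 3 triples has no conflicting value
```

Averaging the two scores then yields m_conRelat(g) as defined above.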

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy

Definition of dimension: Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion: According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric: The dimension Relevancy is determined by the criterion Creating a ranking of statements.^20 The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

^20 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements: By means of this criterion, one can determine whether the KG supports a ranking of statements by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions which he no longer holds are ranked with a normal rank (wdo:NormalRank).

m_Ranking(g) =
  1  if a ranking of statements is supported,
  0  otherwise.

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness

Definition of dimension: Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion: Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;

2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing; and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. Completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric: We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness: By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes, such as people and locations in different granularities, and (ii) basic relations, such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_cSchema(g) = noclat_g / noclat

Column completeness: In the traditional database area (with a fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1/|H|) Σ_{(k,p) ∈ H} no_kp / no_k

We thereby let H = {(k, p) ∈ (K × P) | k ∈ C_g ∧ ∃(x, p, o) ∈ g : p ∈ P^imp_g ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k_1, ..., k_n} and considered relations P = {p_1, ..., p_m}.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.^21 For measuring Column completeness, we selected only those relations for assessment for which a value of the relation typically exists for all given instances.
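m_cCol can be sketched as follows: for each assessed (class, relation) pair, take the share of the class's instances carrying that relation, then average over all pairs. The pair list standing in for H and the toy data are invented.

```python
# Sketch of m_cCol: average coverage of selected class-relation pairs.

def m_c_col(g, pairs):
    """pairs: iterable of (class k, relation p) combinations to assess."""
    ratios = []
    for k, p in pairs:
        members = {s for (s, q, o) in g if q == "rdf:type" and o == k}  # no_k
        if not members:
            continue
        with_value = {s for (s, q, o) in g if q == p and s in members}  # no_kp
        ratios.append(len(with_value) / len(members))
    return sum(ratios) / len(ratios) if ratios else 1.0

g = [("dbr:A", "rdf:type", "dbo:Person"), ("dbr:A", "dbo:birthDate", "1961-08-04"),
     ("dbr:B", "rdf:type", "dbo:Person"),  # person without a birth date
     ("dbr:K", "rdf:type", "dbo:City"),    ("dbr:K", "dbo:populationTotal", "307755")]
pairs = [("dbo:Person", "dbo:birthDate"), ("dbo:City", "dbo:populationTotal")]
print(m_c_col(g, pairs))  # (1/2 + 1/1) / 2 = 0.75
```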

Population completeness: The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (the "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (the "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|
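As a minimal sketch, m_cPop is the share of gold-standard entities found among the KG's entities E_g; the gold standard and entity set below are invented examples.

```python
# Sketch of m_cPop: coverage of a gold-standard population by the KG.

def m_c_pop(kg_entities, gold_standard):
    if not gold_standard:
        return 1.0  # empty gold standard: nothing can be missing
    return len(gold_standard & kg_entities) / len(gold_standard)

gold = {"dbr:Berlin", "dbr:Karlsruhe", "dbr:Wanfried"}  # short head + long tail
kg_entities = {"dbr:Berlin", "dbr:Karlsruhe"}           # small municipality missing
print(m_c_pop(kg_entities, gold))  # covers 2 of 3 gold-standard entities
```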

3.3.3. Timeliness

Definition of dimension: Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion: Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the ease of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is

^21 For an evaluation of predicting which relations are of this nature, see [1].

measured depends on the application context For somesituations years are sufficient while in other situationsone may need days [39]

Definition of metric The dimension timeliness isdetermined by the criteria Timeliness frequency of theKG Specification of the validity period and Specifica-tion of the modification date of statements

The fulfillment degree of a KG g wrt the dimen-sion Timeliness is measured by the metrics mFreqmV alidity and mChange which are defined as follows

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately but the RDF export files are available in discrete, varying updating intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1     if continuous updates
  0.5   if discrete periodic updates
  0.25  if discrete non-periodic updates
  0     otherwise

Specification of the validity period of statements. Specifying the validity period of statements enables temporally limiting the validity of statements. With this criterion, we measure whether the KG supports the specification of start and, possibly, end dates of statements by means of providing suitable forms of representation.

m_Validity(g) =
  1  if the specification of validity periods is supported
  0  otherwise

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1  if the specification of modification dates for statements is supported
  0  otherwise

12 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

3.4 Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1 Ease of Understanding
Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: KG) can be improved by means such as descriptive labels and literals in multiple languages.

Definition of metric. The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. this dimension is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

m_Descr(g) = |{u | u ∈ U_g^local ∧ ∃(u, p, o) ∈ g : p ∈ P_lDesc}| / |{u | u ∈ U_g^local}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).

Note that the evaluation result on the basis of entities is of particular interest here: DBpedia deviates considerably, since some entities (created via intermediate-node mapping) have no rdfs:label. Consequently, we keep the definition of the metric general (restricted to proprietary resources, i.e., those in the same namespace), but perform the evaluation only on entities.
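A sketch of this measurement over a set of triples; the example triples and the local-namespace prefix are illustrative assumptions:

```python
# P_lDesc: relations indicating that the object is a label or description.
P_LDESC = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://schema.org/description",
}

def description_coverage(triples, local_prefix):
    """m_Descr(g): share of local-namespace subject URIs that carry at least
    one label or description. triples are (subject, predicate, object) strings."""
    local = {s for (s, _, _) in triples if s.startswith(local_prefix)}
    described = {s for (s, p, _) in triples if s in local and p in P_LDESC}
    return len(described) / len(local)
```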

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

m_Lang(g) =
  1  if labels are provided in English and at least one other language
  0  otherwise
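Sketched over the set of language tags observed on label literals (the tags below are illustrative):

```python
def multilingual_labels(label_language_tags):
    """m_Lang(g): 1 if labels are provided in English and at least one
    other language, 0 otherwise."""
    tags = {t.lower() for t in label_language_tags}
    return 1.0 if "en" in tags and len(tags) > 1 else 0.0
```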

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion via the supported serialization formats during the dereferencing of resources.

m_uSer(h_g) =
  1  if RDF serializations other than RDF/XML are available
  0  otherwise

Note that conversions from one RDF serializationformat into another are easy to perform

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier for humans to understand and memorize compared to opaque URIs,24 such as wdt:Q1040. The criterion Self-describing URIs evaluates whether self-describing URIs or generic IDs are used for the identification of resources.

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

m_uURI(g) =
  1    if self-describing URIs are always used
  0.5  if self-describing URIs are partly used
  0    otherwise

3.4.2 Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].

– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].

– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability. In contrast to the dimension Ease of understanding, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may get complicated if RDF reification, RDF collections, and RDF containers are used. We agree with this, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. The use of RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common, and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

m_Reif(g) =
  1    if neither blank nodes nor RDF reification are used
  0.5  if either blank nodes or RDF reification are used
  0    otherwise
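A sketch of this check over a set of triples; blank nodes are assumed to be encoded with the conventional `_:` prefix, and standard reification is detected via rdf:type rdf:Statement:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_STATEMENT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"

def reification_score(triples):
    """m_Reif(g): 1 if neither blank nodes nor RDF standard reification
    occur, 0.5 if exactly one of the two occurs, 0 if both occur."""
    has_blank = any(term.startswith("_:")
                    for s, _, o in triples for term in (s, o))
    has_reif = any(p == RDF_TYPE and o == RDF_STATEMENT
                   for _, p, o in triples)
    if not has_blank and not has_reif:
        return 1.0
    return 0.5 if has_blank != has_reif else 0.0
```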

Provisioning of several serialization formats. The interpretability of RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, available online at http://www.w3.org/TR/rdf-schema, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term "reification" by default for the general sense, and "standard reification" or "RDF reification" for referring to the modeling of reification according to the RDF standard.

m_iSerial(h_g) =
  1    if RDF/XML and further formats are supported
  0.5  if only RDF/XML is supported
  0    otherwise

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows a comfortable data integration. We measure the criterion of using an external vocabulary by setting the number of triples with external vocabulary in predicate position in relation to the number of all triples in the KG.

m_extVoc(g) = |{(s, p, o) | (s, p, o) ∈ g ∧ p ∈ P_g^external}| / |{(s, p, o) ∈ g}|
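A sketch of this ratio, where the namespaces treated as proprietary are an illustrative assumption:

```python
def external_vocab_ratio(triples, local_namespaces):
    """m_extVoc(g): share of triples whose predicate belongs to an external
    (non-proprietary) vocabulary."""
    if not triples:
        return 0.0
    external = sum(1 for _, p, _ in triples
                   if not any(p.startswith(ns) for ns in local_namespaces))
    return external / len(triples)
```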

Interoperability of proprietary vocabulary. Linking on the schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on the schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources.

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g : p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5 Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension Access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following we go into details of the mentioneddata quality dimensions

3.5.1 Accessibility
Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

m_Deref(h_g) = |dereferenceable(U_g)| / |U_g|
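A sketch of this check, assuming network access to the sample URIs; the accepted RDF media types are an illustrative, non-exhaustive set:

```python
import urllib.request

# Content types counted as "RDF returned" (illustrative, not exhaustive).
RDF_CONTENT_TYPES = {"application/rdf+xml", "text/turtle",
                     "application/n-triples", "application/ld+json"}

def is_successful_dereferencing(status_code, content_type):
    """A dereferencing attempt counts as successful only if HTTP 200 and
    an RDF document are returned."""
    media_type = content_type.split(";")[0].strip().lower()
    return status_code == 200 and media_type in RDF_CONTENT_TYPES

def dereferencing_ratio(sample_uris, timeout=10):
    """m_Deref(h_g): fraction of sample URIs that dereference to RDF data."""
    successful = 0
    for uri in sample_uris:
        try:
            request = urllib.request.Request(
                uri, headers={"Accept": "application/rdf+xml"})
            with urllib.request.urlopen(request, timeout=timeout) as response:
                if is_successful_dereferencing(
                        response.status,
                        response.headers.get("Content-Type", "")):
                    successful += 1
        except Exception:
            pass  # timeouts and HTTP errors count as failed dereferencing
    return successful / len(sample_uris)
```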

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query, mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.26

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially involving many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions of this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however, we do not measure such restrictions here.

26 See http://pingdom.com, requested on Mar 1, 2016.

m_SPARQL(h_g) =
  1  if a SPARQL endpoint is publicly available
  0  otherwise

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user refrains from using it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

m_Export(h_g) =
  1  if an RDF export is available
  0  otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to serialized RDF data not being processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response.

m_Negot(h_g) =
  1    if CN is supported and correct content types are returned
  0.5  if CN is supported but wrong content types are returned
  0    otherwise

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource, in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain such links.

m_HTMLRDF(h_g) =
  1  if the Autodiscovery pattern is used at least once
  0  otherwise

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later, in the data quality dimension License.

m_Meta(g) =
  1  if machine-readable metadata about g is available
  0  otherwise

3.5.2 License
Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing contracts which define rights and obligations. These contracts are popular also in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 declares the respective data as public domain and free of any restrictions.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies point to uncertainties regarding these contracts.

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void.

29 See http://creativecommons.org, requested on Mar 1, 2016.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

m_macLicense(g) =
  1  if machine-readable licensing information is available
  0  otherwise
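A sketch checking a KG's triples (or its separate VoID description) for such licensing relations:

```python
# Relations that convey machine-readable licensing information.
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",
    "http://purl.org/dc/terms/license",
    "http://purl.org/dc/terms/rights",
}

def machine_readable_license(triples):
    """m_macLicense(g): 1 if licensing information is available in a
    machine-readable format, 0 otherwise."""
    return 1.0 if any(p in LICENSE_PREDICATES
                      for _, p, _ in triples) else 0.0
```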

3.5.3 Interlinking
Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [49].

30 See https://creativecommons.org/licenses/by/4.0, requested on Mar 1, 2016.

31 See https://creativecommons.org/licenses/by-sa/4.0, requested on Mar 1, 2016.

32 See http://creativecommons.org/publicdomain/zero/1.0, requested on Mar 3, 2016.

33 Using the namespace http://creativecommons.org/ns#.

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking on the instance level is usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35 (ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on the instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

34 See http://www.geonames.org, requested on Dec 31, 2016.

35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

38 The interlinking on the schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
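A sketch of this instance-level metric; the instance set and the approximation of "external" as "not in the KG's local namespace" are illustrative simplifications:

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def instance_interlinking(triples, instances, local_prefix):
    """m_Inst(g): fraction of instances (excluding classes and relations)
    with at least one owl:sameAs link pointing to an external KG."""
    linked = {s for (s, p, o) in triples
              if p == OWL_SAME_AS and s in instances
              and not o.startswith(local_prefix)}
    return len(linked) / len(instances)
```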

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may no longer be available. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g : p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext} and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is here the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric evaluates to 1.

3.6 Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category

  ∗ Accuracy
    · Syntactic validity of RDF documents
    · Syntactic validity of literals
    · Semantic validity of triples
  ∗ Trustworthiness
    · Trustworthiness on KG level
    · Trustworthiness on statement level
    · Using unknown and empty values
  ∗ Consistency
    · Check of schema restrictions during insertion of new statements
    · Consistency of statements w.r.t. class constraints
    · Consistency of statements w.r.t. relation constraints

– Contextual category

  ∗ Relevancy
    · Creating a ranking of statements
  ∗ Completeness
    · Schema completeness
    · Column completeness
    · Population completeness
  ∗ Timeliness
    · Timeliness frequency of the KG
    · Specification of the validity period of statements
    · Specification of the modification date of statements

– Representational data quality

  ∗ Ease of understanding
    · Description of resources
    · Labels in multiple languages
    · Understandable RDF serialization
    · Self-describing URIs
  ∗ Interoperability
    · Avoiding blank nodes and RDF reification
    · Provisioning of several serialization formats
    · Using external vocabulary
    · Interoperability of proprietary vocabulary

– Accessibility category

  ∗ Accessibility
    · Dereferencing possibility of resources
    · Availability of the KG
    · Provisioning of public SPARQL endpoint
    · Provisioning of an RDF export
    · Support of content negotiation
    · Linking HTML sites to RDF serializations
    · Provisioning of KG metadata
  ∗ License
    · Provisioning machine-readable licensing information
  ∗ Interlinking
    · Interlinking via owl:sameAs
    · Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 UniProt,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.

40 There is also DBpedia live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.

41 See http://umbel.org, requested on Dec 31, 2016.

42 See http://musicbrainz.org, requested on Dec 31, 2016.

43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.

44 See http://www.dblp.org, requested on Dec 31, 2016.

45 See https://www.gutenberg.org, requested on Dec 31, 2016.

46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.

47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.

48 See http://www.uniprot.org, requested on Dec 31, 2016.

49 See http://bio2rdf.org, requested on Dec 31, 2016.

50 See the complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase51 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase provided an interface that allowed end users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,52 FMD,53 and MusicBrainz.54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common sense facts such as "every tree is a plant". The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs be freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.

52 See http://www.nndb.com, requested on Dec 31, 2016.

53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.

54 See http://musicbrainz.org, requested on Dec 31, 2016.

55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.

56 See http://www.cyc.com, requested on Dec 31, 2016.

57 See http://www.opencyc.org, accessed on Nov 1, 2016.

58 See http://researchcyc.com, requested on Dec 31, 2016.

59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.
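Most of these key statistics reduce to simple set operations over the triple set. The following minimal sketch (plain Python; the toy triples and prefixes are invented for illustration, standing in for a real KG dump) shows how such counts could be derived:

```python
# Toy excerpt of a KG as (subject, predicate, object) triples.
triples = [
    ("dbr:Berlin",  "rdf:type",        "dbo:City"),
    ("dbr:Berlin",  "dbo:country",     "dbr:Germany"),
    ("dbr:Germany", "rdf:type",        "dbo:Country"),
    ("dbo:City",    "rdfs:subClassOf", "dbo:Place"),
]

num_triples       = len(triples)
unique_subjects   = {s for s, p, o in triples}   # basis for |S_g|
unique_predicates = {p for s, p, o in triples}   # basis for |P_g^imp|
unique_objects    = {o for s, p, o in triples}   # literals would still need filtering

print(num_triples, len(unique_subjects), len(unique_predicates), len(unique_objects))
# 4 3 3 4
```

On a real dump, the same set comprehensions would be run over a streamed N-Triples parse instead of an in-memory list.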

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.

61 See http://www.geonames.org, requested on Dec 31, 2016.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

5.1.1. Triples
Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs: Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets such as MusicBrainz have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefore.

Influence of reification on the number of triples. DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identified. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but that in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
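The effect of the two reification styles on the triple count can be illustrated with one annotated toy fact (plain Python; all identifiers, including the statement-node ID, are made up for illustration):

```python
# Named-graph style (YAGO): the fact is one quad whose ID names the triple;
# the provenance annotation is one further triple about that ID.
named_graph = [
    ("id_1", "yago:Berlin", "yago:locatedIn", "yago:Germany"),
    ("id_1", "yago:extractionSource", "yago:Wikipedia"),
]

# n-ary-relation style (Wikidata): an intermediate statement node connects
# subject, value, and annotation -- and is itself typed as wdo:Statement,
# which is the typing that produces the ~74M extra instances mentioned above.
n_ary = [
    ("wdt:Q64",   "wdt:P131s", "wdt:Q64S1"),      # subject -> statement node
    ("wdt:Q64S1", "rdf:type",  "wdo:Statement"),  # statement node typed
    ("wdt:Q64S1", "wdt:P131v", "wdt:Q183"),       # statement node -> value
    ("wdt:Q64S1", "wdt:P854r", "wdt:Wikipedia"),  # statement node -> reference
]

print(len(named_graph), len(n_ary))  # 2 4
```

The n-ary style thus costs extra triples (and one extra instance) per statement, which is consistent with the size effects discussed above.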

5.1.2. Classes
Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead uses only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does it come to this gap between DBpedia and YAGO with respect to the number of classes, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the mostly used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).

64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type, and "instance of" (wdt:P31) in case of Wikidata) on the instance level into account. However, this would result only in a lower-bound estimation, as those classes which have no instances are not considered.


Fig. 1. Coverage of classes having at least one instance (in %, per KG).

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 65%) and Wikidata (54%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2, we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                 DB    FB    OC    WD    YA
Reach of method  88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the mostly used classes to the domains people, media, organizations, geography, and biology.65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG, except for Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG, divided by the number of all entities of this KG.66 If the ratio were 100%, we would have been able to assign all entities of a KG to the chosen domains.
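Since an entity may belong to several domains, the reach must be computed over the union of the per-domain entity sets rather than by summing per-domain counts; a sketch with invented toy data:

```python
# Entities assigned to each considered domain (toy data).
domain_entities = {
    "people":    {"e1", "e2", "e3"},
    "media":     {"e3", "e4"},   # e3 belongs to two domains
    "geography": {"e5"},
}
all_entities = {"e1", "e2", "e3", "e4", "e5", "e6", "e7"}

covered = set().union(*domain_entities.values())  # unique entities, not a sum
reach = len(covered) / len(all_entities)

print(len(covered), round(reach, 2))  # 5 0.71
```

Summing the per-domain counts (3 + 2 + 1 = 6) would overstate the coverage, which is exactly the double-counting the union avoids.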

Fig. 3 shows the number of entities per domain in the different KGs, with a logarithmic scale.

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).

66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG (log-log scale).

Fig. 3. Number of entities per domain (persons, media, organizations, geography, biology; logarithmic scale).

Fig. 4 presents the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain (in percent, per KG).

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organization. This is relatively few, considering that the total number of entities is around 18.7M, and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has not so many organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times,67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organization belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as a short term for explicitly defined relations – refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.
– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P_g^imp, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
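The distinction can be made concrete as two set expressions: P_g is read off schema-level declarations, while P_g^imp is read off the predicate positions actually used (plain Python; identifiers are illustrative):

```python
triples = [
    ("dbo:author",     "rdf:type",   "rdf:Property"),  # declared and used
    ("dbo:deathPlace", "rdf:type",   "rdf:Property"),  # declared, never used
    ("dbr:Faust",      "dbo:author", "dbr:Goethe"),
    ("dbr:Faust",      "dbp:pages",  '"123"'),         # used, never declared
]

relations  = {s for s, p, o in triples
              if p == "rdf:type" and o == "rdf:Property"}  # P_g
predicates = {p for s, p, o in triples}                    # P_g^imp

print(sorted(relations))   # ['dbo:author', 'dbo:deathPlace']
print(sorted(predicates))  # ['dbo:author', 'dbp:pages', 'rdf:type']
```

dbo:deathPlace is counted in P_g but never in P_g^imp, while dbp:pages shows the opposite case, which is why the two statistics can diverge so strongly for a given KG.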

Evaluation results:
Relations.
Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 71K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned69 and may overlap until DBpedia version 2016-04.70

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations, so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain criteria are met.71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process can be mentioned as a likely reason why in Wikidata, in relative terms, more relations are actually used than in Freebase.

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69 For instance: the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70 For instance, dbp:alias and dbo:alias.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.
2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.
3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.
4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.
5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level.

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

In case of OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates.
Ranking regarding predicates. Freebase is here – like in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows.

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates varies considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlaps. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once. This relativizes the high number. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 165 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) by an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows to refer to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference, and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,72 while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method. We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples where the predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: In DBpedia and YAGO, entities are determined as being an instance of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata, instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.73 In this way, abstract classes such as cych:ExistingObjectType are neglected.
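Under these definitions, I_g and E_g can be approximated from the type triples; a sketch using the DBpedia/YAGO criterion (membership in owl:Thing), with invented identifiers:

```python
triples = [
    ("dbr:Berlin", "rdf:type", "owl:Thing"),
    ("dbr:Berlin", "rdf:type", "dbo:City"),
    ("wdt:Q64S1",  "rdf:type", "wdo:Statement"),  # an instance, but no real-world entity
]

instances = {s for s, p, o in triples if p == "rdf:type"}  # I_g
entities  = {s for s, p, o in triples
             if p == "rdf:type" and o == "owl:Thing"}      # E_g (DBpedia/YAGO criterion)

print(len(instances), len(entities))  # 2 1
```

For Freebase and Wikidata, the entity criterion in the second comprehension would be swapped for freebase:common.topic or wdo:Item, respectively; the reified statement node illustrates why E_g ⊆ I_g can be a strict subset.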

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total, and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M); OpenCyc is at the bottom with only about 41K entities.

Differences in the number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media – and especially song release tracks – are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" from Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered – such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO in addition imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") of Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014, and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG (logarithmic scale).

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |E_g| / |C_g|. Obvious is the difference between DBpedia and YAGO (despite their similar number of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually), while in YAGO it is large (as it is created automatically).

Comparing the number of instances with the number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities, but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

5.1.6. Subjects and Objects
Evaluation method. The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) on the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementarily, the number of literals is given as O_g^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.
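These three sets can be computed in one pass over the triples; in the sketch below (plain Python), literals are marked by surrounding quotes, a simplifying assumption standing in for a proper N-Triples parser:

```python
triples = [
    ("dbr:Berlin",  "dbo:country", "dbr:Germany"),
    ("dbr:Berlin",  "rdfs:label",  '"Berlin"'),
    ("dbr:Germany", "rdfs:label",  '"Germany"'),
]

def is_literal(term):
    # Simplification: a real parser would inspect the N-Triples term type.
    return term.startswith('"')

subjects = {s for s, p, o in triples}                       # S_g
objects  = {o for s, p, o in triples if not is_literal(o)}  # O_g (URIs/blank nodes)
literals = {o for s, p, o in triples if is_literal(o)}      # O_g^lit

print(len(subjects), len(objects), len(literals))  # 2 1 2
```

The same resource (here dbr:Germany) may of course appear in both the subject and the object set; the counts below compare the sizes of these sets per KG.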

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 8). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                           DBpedia        Freebase        OpenCyc     Wikidata       YAGO
Number of triples |{(s, p, o) ∈ g}|        411,885,960    3,124,791,156   2,412,520   748,530,833    1,001,461,792
Number of classes |C_g|                    736            53,092          116,822     302,280        569,751
Number of relations |P_g|                  2,819          70,902          18,028      1,874          106
No. of unique predicates |P_g^imp|         60,231         784,977         165         4,839          88,736
Number of entities |E_g|                   4,298,433      49,947,799      41,029      18,697,897     5,130,031
Number of instances |I_g|                  20,764,283     115,880,761     242,383     142,213,806    12,291,250
Avg. number of entities per class          5,840.3        940.8           0.35        61.9           9.0
No. of unique subjects |S_g|               31,391,413     125,144,313     261,097     142,278,154    331,806,927
No. of unique non-literals in obj. pos.    83,284,634     189,466,866     423,432     101,745,685    17,438,196
No. of unique literals in obj. pos.        161,398,382    1,782,723,759   1,081,818   308,144,682    682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing the provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used in the first position of N-Triples. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge become lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs. MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to the entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs, and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes, and that YAGO is concentrated on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3. Evaluation results for the KGs regarding the dimension Accuracy

                DB     FB     OC     WD     YA
m_synRDF         1      1      1      1      1
m_synLit      0.99      1      1      1   0.62
m_semTriple   0.99     <1      1   0.99   0.99

Syntactic validity of RDF documents (m_synRDF)

Evaluation method: For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.[76] Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.[77]

Evaluation result: All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and actually valid.

Syntactic validity of literals (m_synLit)

Evaluation method: We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains — namely people, cities, and books — and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

[76] See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

[77] See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.[78] If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains so many literals.

Evaluation results: All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.[79] For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.[80]
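Such a check can be sketched with a manually created regular expression for the lexical form of xsd:date, which rejects YAGO-style wildcard dates (the pattern below is our own illustration, not the one used in the paper):

```python
import re

# Sketch: validating the lexical form of xsd:date (YYYY-MM-DD, optional
# leading minus for years BC). Wildcard dates such as "1940-##-##" fail.
XSD_DATE = re.compile(r"^-?\d{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def is_valid_xsd_date(value: str) -> bool:
    return XSD_DATE.match(value) is not None

print(is_valid_xsd_date("1940-05-13"))   # True
print(is_valid_xsd_date("-0470-01-01"))  # True (year BC)
print(is_valid_xsd_date("1940-##-##"))   # False (wildcard)
```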

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the decimals 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta[81] provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

[78] In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

[79] Surprisingly, the Jena framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

[80] In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

[81] See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We made the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.[82] In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers[83] or together with the actual ISBN numbers as coherent strings.[84]
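An ISBN validation of this kind can be sketched with a commonly circulated community regex (used here as a stand-in; we are not reproducing the exact pattern from the paper's reference). It rejects both error patterns named above — over-long numbers and wrong prefixes:

```python
import re

# Sketch: validating ISBN-10/ISBN-13 strings in their various surface
# forms (with/without a leading "ISBN", with/without delimiters).
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?"
    r"(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$"
    r"|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"
)

def is_valid_isbn(value: str) -> bool:
    return ISBN_RE.match(value) is not None

print(is_valid_isbn("ISBN 978-0-596-52068-7"))  # True
print(is_valid_isbn("9789780307986931"))        # False: 16 digits, too long
print(is_valid_isbn("2940045143431"))           # False: prefix 294 instead of 978/979
```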

Semantic validity of triples (m_semTriple)

Evaluation method: The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG, and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),[85] which is an authority file especially concerning persons and corporate bodies, and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result: We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as death date of

[82] E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

[83] See dbr:Prince_Caspian.
[84] An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.
[85] See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO, we encountered 3 errors each, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is hard to perform in those cases.

2. Contrary to assumptions, often either no corresponding GND entry exists, or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possible wrong values within the interval are not detected.

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.[86]

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method: Regarding the trustworthiness of a KG in general, we differentiate between the method

[86] With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4. Evaluation results for the KGs regarding the dimension Trustworthiness

            DB     FB     OC     WD     YA
m_graph    0.5    0.5      1   0.75   0.25
m_fact     0.5      1      0      1      1
m_NoVal      0      1      0      1      0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results: The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.[87] However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki[88] and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level (m_fact)

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,[89] this provenance information is trivial, and the fulfillment degree is hence of rather formal nature.

[87] Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots).

[88] See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M, and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).[90] Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not accepted as sourced ("data is not sourced").[91] To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata[92] is around 971K. Based on the number of all statements,[93] 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses a proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVT), data of higher arity can be expressed [44].[94]

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

[89] E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

[90] All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

[91] See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

[92] This is the number of instances of wdo:Reference.
[93] This is the number of instances of wdo:Statement.
[94] E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5. Evaluation results for the KGs regarding the dimension Consistency

                 DB     FB     OC     WD     YA
m_checkRestr      0      1      0      1      0
m_conClass     0.88      1     <1      1   0.33
m_conRelat     0.99   0.45      1   0.50   0.99

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases.[95] Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method: For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.

[95] E.g., freebase:freebase.valuenotation.has_no_value.
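A check over direct instantiations can be sketched as follows (all class assignments and disjointness pairs below are illustrative, not taken from the evaluated KGs):

```python
# Sketch: checking direct instantiations against owl:disjointWith
# constraints. `types` maps each resource to its directly asserted classes;
# `disjoint_pairs` holds (class, class) constraints.
disjoint_pairs = [("dbo:Plant", "dbo:Animal"), ("dbo:Agent", "dbo:Place")]

types = {
    "dbr:Ginkgo": {"dbo:Plant"},
    "dbr:Socrates": {"dbo:Agent"},
    "dbr:Hamburg": {"dbo:Place", "dbo:Agent"},  # violates the second constraint
}

def disjointness_violations(types, disjoint_pairs):
    return [
        resource
        for resource, classes in types.items()
        if any(a in classes and b in classes for a, b in disjoint_pairs)
    ]

print(disjointness_violations(types, disjoint_pairs))  # ['dbr:Hamburg']
```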

Evaluation results: We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.[96]

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method: Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistencies regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.
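The owl:FunctionalProperty part of this check — at most one value per subject and functional relation — can be sketched as follows (triples and relation names are illustrative):

```python
from collections import defaultdict

# Sketch: checking owl:FunctionalProperty usage, i.e. that a functional
# relation appears at most once per subject.
functional = {"dbo:birthDate"}

triples = [
    ("dbr:Socrates", "dbo:birthDate", "-0469-01-01"),
    ("dbr:Socrates", "dbo:birthDate", "-0470-01-01"),  # second value: violation
    ("dbr:Plato", "dbo:birthDate", "-0428-01-01"),
]

counts = defaultdict(int)
for s, p, o in triples:
    if p in functional:
        counts[(s, p)] += 1

violations = [sp for sp, n in counts.items() if n > 1]
print(violations)  # [('dbr:Socrates', 'dbo:birthDate')]
```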

Evaluation results: In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.[97] Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

[96] Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy

              DB     FB     OC     WD     YA
m_Ranking      0      1      0      1      0

DBpedia obtains the highest measured fulfillment score w.r.t. consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used, though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase, 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.[98]

[97] See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7. Evaluation results for the KGs regarding the dimension Completeness

                    DB     FB     OC     WD     YA
m_cSchema         0.91   0.76   0.92      1   0.95
m_cColumn         0.40   0.43      0   0.29   0.33
m_cPop            0.93   0.94   0.48   0.99   0.89
m_cPop (short)       1      1   0.82      1   0.90
m_cPop (long)     0.86   0.88   0.14   0.98   0.88

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method: Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.[99] It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.
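Schema completeness then amounts to the fraction of gold-standard classes and relations covered by a KG's schema, which can be sketched as follows (the gold standard and KG schema below are illustrative stand-ins, not the paper's actual data):

```python
# Sketch: schema completeness as the fraction of gold-standard classes
# and relations that appear in a KG schema. All sets are illustrative.
gold_classes = {"person", "book", "city", "company", "tree"}
gold_relations = {"birthDate", "author", "population"}

kg_classes = {"person", "book", "city", "company"}   # "tree" is missing
kg_relations = {"birthDate", "author", "population"}

def schema_completeness(gold_classes, gold_relations, kg_classes, kg_relations):
    covered = len(gold_classes & kg_classes) + len(gold_relations & kg_relations)
    total = len(gold_classes) + len(gold_relations)
    return covered / total

print(schema_completeness(gold_classes, gold_relations,
                          kg_classes, kg_relations))  # 0.875
```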

Evaluation results: Generally, Wikidata performs optimally; also DBpedia, OpenCyc, and YAGO exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness; its schema is mainly limited

[98] See https://developers.google.com/freebase/v1/search-cookbook/scoring-and-ranking, requested on Mar 4, 2016.

[99] See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

due to the characteristics of how information is stored in and extracted from Wikipedia.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete are logically subclasses of the class freebase:people.person, but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree[100] and ginkgo.[101] The ginkgo tree is not classified as a tree, but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

[100] Freebase ID freebase:m.07j7r.
[101] Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove the relations seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can be used reasonably both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes, using the domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness (m_cColumn)

Evaluation method: For evaluating KGs w.r.t. Column completeness, for each KG 25 class-relation-combinations[102] were created based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8. Metric values of m_cCol for single class-relation-pairs

Relation              DB     FB     OC     WD     YA
Person–birthdate    0.48   0.48      0   0.70   0.77
Person–sex             –   0.57      0   0.94   0.64
Book–author         0.91   0.93      0   0.82   0.28
Book–ISBN           0.73   0.63      –   0.18   0.01
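For one class-relation pair, the metric boils down to the fraction of instances of the class that have at least one value for the relation, which can be sketched as follows (instances and triples are illustrative):

```python
# Sketch: Column completeness for a single class-relation pair, computed
# as the share of class instances with at least one value for the relation.
instances_of_person = {"dbr:Socrates", "dbr:Plato", "dbr:Aristotle", "dbr:Hypatia"}

triples = [
    ("dbr:Socrates", "dbo:birthDate", "-0469-01-01"),
    ("dbr:Plato", "dbo:birthDate", "-0428-01-01"),
    ("dbr:Hamburg", "dbo:foundingDate", "0808-01-01"),  # not a person instance
]

def column_completeness(instances, triples, relation):
    covered = {s for s, p, o in triples if p == relation and s in instances}
    return len(covered) / len(instances)

print(column_completeness(instances_of_person, triples, "dbo:birthDate"))  # 0.5
```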

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs.

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25 (about 5K people). We can hence note that the extraction of data from the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also, the coverage of ISBN for books is quite high (63.4%).

36 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation-pairs depended on which class-relation-pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation-pairs were used if no 25 pairs were available in the respective KG.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.[103]

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness (mcPop)

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,[104] was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called short head) and two rather unknown entities (called long tail) for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of Olympic medals won; to select the most popular mountains, we ranked the mountains by their height.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.[105][106]

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
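Given such a gold standard, the metric itself is simply the share of gold-standard entities found in a KG, assuming that entity matching between the gold standard and the KG has been resolved. A minimal sketch (entity names and KG contents are illustrative):

```python
def population_completeness(kg_entities, gold_standard):
    """Fraction of gold-standard entities present in the KG --
    a sketch of mcPop. Entity matching is assumed to be resolved."""
    found = sum(1 for entity in gold_standard if entity in kg_entities)
    return found / len(gold_standard)

# Illustrative: 3 of 4 gold entities (short head and long tail) found.
gold = {"Usain Bolt", "Mount Everest", "Maria Höfl-Riesch", "Zugspitze"}
kg = {"Usain Bolt", "Mount Everest", "Zugspitze", "Albert Einstein"}

print(population_completeness(kg, gold))  # 0.75
```

In the evaluation, the same computation is applied per domain (cf. Fig. 10) as well as separately for the short-head and long-tail subsets.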

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: an entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measure that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs, Cyc and ResearchCyc, are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


Fig. 10. Population completeness regarding the different domains (People, Media, Organizations, Geography, Biology) per KG

Table 9. Evaluation results for the KGs regarding the dimension Timeliness

          DB    FB    OC    WD    YA

mFreq     0.5   0     0.25  1     0.25
mValidity 0     1     0     1     1
mChange   0     1     0     0     0

Timeliness frequency of the KG (mFreq)

Evaluation results: The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.[107] Besides the static DBpedia, DBpedia live[108] has been continuously updated by tracking changes in Wikipedia in real-time. However, it does not provide the full range of relations as DBpedia does.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its close-down and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.[109] To the best of our knowledge, Cyc and OpenCyc, respectively, are developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via browser and via HTTP URI dereferencing. Hence, Wikidata falls in the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage[110] or via own processing using the Wikidata Toolkit[111]).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements (mValidity)

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.


Table 10. Evaluation results for the KGs regarding the dimension Ease of understanding

        DB    FB    OC    WD    YA

mDescr  0.70  0.97  1     <1    1
mLang   1     1     0     1     1
muSer   1     1     0     1     1
muURI   1     0.5   1     0     1

office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

DBpedia and OpenCyc do not provide any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997".
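The common pattern behind Wikidata qualifiers and Freebase CVTs is that temporal context attaches to the statement, not to the subject. The following sketch contrasts the two shapes in plain Python data structures; the field names and the population figure are illustrative, not taken from any of the KGs:

```python
# A plain triple has no place to put a validity period:
plain_triple = ("Vancouver", "population", 514000)  # illustrative figure

# A qualified statement (cf. Wikidata qualifiers / Freebase CVTs)
# reifies the statement so that context can be attached to it:
qualified_statement = {
    "subject": "Vancouver",
    "predicate": "population",
    "object": 514000,
    "qualifiers": {
        "point_in_time": "1997",  # cf. Wikidata "point in time"
    },
}

def valid_at(statement, year):
    """Check whether a qualified statement refers to the given year."""
    return statement["qualifiers"].get("point_in_time") == year

print(valid_at(qualified_statement, "1997"))  # True
```

Querying a KG that lacks such a mechanism can only return the bare triple, with no way to distinguish the 1997 value from the current one.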

Specification of the modification date of statements (mChange)

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (mDescr)

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.[112]

Evaluation result: For all KGs, the rule applies that if there is no label available, there is usually also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used.[113]

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.[114]

Labels in multiple languages (mLang)

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.
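The per-language coverage used below can be sketched by extracting the language tag from each label literal and relating the tag counts to the number of entities. A minimal sketch (the toy labels are illustrative, and one label per entity and language is assumed):

```python
import re
from collections import Counter

LANG_TAG = re.compile(r'@([a-zA-Z-]+)$')

def language_coverage(labels, total_entities):
    """Per-language label coverage: for each language tag found in
    label literals (e.g. '"Berlin"@de'), the fraction of entities
    having a label in that language. Assumes at most one label per
    entity and language."""
    counts = Counter()
    for entity, literal in labels:
        match = LANG_TAG.search(literal)
        if match:
            counts[match.group(1).lower()] += 1
    return {lang: n / total_entities for lang, n in counts.items()}

# Illustrative: one entity labeled in en and de, one only in en.
labels = [
    ("ex:berlin", '"Berlin"@en'),
    ("ex:berlin", '"Berlin"@de'),
    ("ex:munich", '"Munich"@en'),
]

print(language_coverage(labels, 2))  # {'en': 1.0, 'de': 0.5}
```

Literals without a language tag are skipped, matching the assumption (footnote 115) that no language information is available for them.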

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.[115] Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.


coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

Understandable RDF serialization (muSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (muURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In the case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.[116]

5.2.8. Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (mReif)

Reification allows representing further information about single statements. In summary, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification, but none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In the case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples and only relations of higher arity

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11. Evaluation results for the KGs regarding the dimension Interoperability

         DB    FB    OC    WD    YA

mReif    0.5   0.5   0.5   0     0.5
miSerial 1     0     0.5   1     1
mextVoc  0.61  0.11  0.41  0.68  0.13
mpropVoc 0.15  0     0.51  >0    0

are stored via n-ary relations.[117] YAGO stores facts as N-Quads in order to be able to store meta information of facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.
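The quad-to-triple conversion described for YAGO can be sketched as dropping the statement-ID component of each quad; the example quads below are illustrative, not actual YAGO data:

```python
def quads_to_triples(quads):
    """Drop the statement-ID component of each N-Quad, keeping the
    plain triple -- a sketch of how YAGO quads load into a triple
    store when the statement IDs are ignored."""
    return [(s, p, o) for statement_id, s, p, o in quads]

# Illustrative quads: the first component identifies the statement,
# so that meta facts (e.g. provenance) can reference it.
quads = [
    ("#1", "yago:Berlin", "yago:isLocatedIn", "yago:Germany"),
    ("#2", "#1", "yago:extractionSource", "yago:Wikipedia"),
]

print(quads_to_triples(quads))
```

After conversion, the first statement remains fully usable as a plain triple; only meta facts that reference a statement ID (like the second one) lose their anchor.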

Blank nodes are non-dereferencable, anonymous resources. They are used by the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (miSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary (mextVoc)

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG we divide the number of triples with external relations by the number of all triples in this KG.
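This ratio can be sketched by classifying each triple's predicate by namespace prefix; the toy triples and the set of internal prefixes below are illustrative:

```python
def external_vocab_ratio(triples, internal_prefixes):
    """Fraction of triples whose predicate does not belong to the
    KG's own namespace(s) -- a sketch of the mextVoc computation."""
    if not triples:
        return 0.0
    external = sum(
        1 for _, p, _ in triples
        if not any(p.startswith(prefix) for prefix in internal_prefixes)
    )
    return external / len(triples)

# Illustrative DBpedia-style triples; rdf:, rdfs: and foaf: count
# as external, dbo:/dbp: as the KG's own vocabulary.
toy = [
    ("dbr:Berlin", "rdf:type", "dbo:City"),
    ("dbr:Berlin", "rdfs:label", '"Berlin"@en'),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "foaf:homepage", "http://www.berlin.de"),
]

print(external_vocab_ratio(toy, ("dbo:", "dbp:")))  # 0.75
```

In practice, full namespace URIs rather than prefixes would be compared, but the classification logic is the same.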

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for this: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e.,

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.


about half) are used for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary (mpropVoc)

Evaluation method: This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,[118] owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We made the following single findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.[119] Regarding its relations, DBpedia links to Wikidata and schema.org.[120] Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, but these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equiva-

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL primer states "The built-in OWL property owl:sameAs links an individual to an individual", as well as "The owl:sameAs statements are often used in defining mappings between ontologies". See https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12. Evaluation results for the KGs regarding the dimension Accessibility

         DB    FB    OC    WD    YA

mDeref   1     1     0.44  0.41  1
mAvai    <1    0.73  <1    <1    1
mSPARQL  1     1     0     1     0
mExport  1     1     1     1     1
mNegot   0.5   1     0     1     0
mHTMLRDF 1     1     1     1     0
mMeta    1     0     0     0     1

lent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves a linking coverage of 2.1% here. Although this is low, frequently used relations are linked.[121]

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (mDeref)

Evaluation method: We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object positions of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
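The request setup can be sketched with Python's standard library; the example URI is illustrative, and actually sending the request (via urllib.request.urlopen) is omitted here to keep the sketch offline:

```python
import urllib.request

def rdf_request(uri):
    """Build an HTTP request that asks for RDF/XML via content
    negotiation, as used for the dereferencing checks (sketch)."""
    request = urllib.request.Request(uri)
    request.add_header("Accept", "application/rdf+xml")
    return request

request = rdf_request("http://dbpedia.org/resource/Karlsruhe")
print(request.get_header("Accept"))  # application/rdf+xml
```

A dereferencing check would then issue the request and count a URI as successfully dereferenced if the response has status 200 and an RDF content type.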

Evaluation results: In the case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that they fulfilled this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K due to the small number of unique predicates. We observed almost

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.


the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s, used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferencable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (mAvai)

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom.[122] For each KG, an uptime test was set up which checked the availability of the resource Hamburg, as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200), every minute over a time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result: While the other KGs showed almost no outages and were online again after some minutes on average, YAGO outages took place frequently and lasted 3.5 hours on average.[123] In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (mSPARQL)

The SPARQL endpoints of DBpedia and YAGO are

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

provided by a Virtuoso server,[124] the Wikidata SPARQL endpoint via Blazegraph.[125] Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in the case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (mExport)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (mNegot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation, and only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (mHTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate"

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.


Table 13. Evaluation results for the KGs regarding the dimension License

            DB  FB  OC  WD  YA

mmacLicense 1   0   0   1   0

type="[content type]" href="[URL]"> in the HTML header.

Provisioning of metadata about the KG (mMeta)

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file.[126] DBpedia integrates the VoID vocabulary directly in its KG[127] and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (mmacLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA[128] and the GNU Free Documentation License (GNU FDL).[129] Wikidata embeds licensing information in the RDF document during the dereferencing of resources, by linking with cc:license to the license CC0.[130] YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC-BY.[131] OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.[132]

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14. Evaluation results for the KGs regarding the dimension Interlinking

       DB    FB    OC    WD       YA

mInst  0.25  0     0.38  0 (0.9)  0.31
mURIs  0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (mInst)

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,[133] and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.

Evaluation result: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided, nor is a corresponding proprietary relation available. Instead, Wikidata uses a proprietary relation (called identifier) for each linked data set to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we viewed each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.


only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc,[134] the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.[135]

Validity of external URIs (mURIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
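The error classification and the resulting validity ratio can be sketched as follows; the list of check results is illustrative:

```python
def is_valid_response(status, timed_out=False):
    """Classify one link check: timeouts, client errors (4xx), and
    server errors (5xx) count as invalid (sketch of the mURIs checks)."""
    if timed_out or status is None:
        return False
    return not (400 <= status < 600)

def external_uri_validity(results):
    """Fraction of checked external URIs that resolved without error."""
    if not results:
        return 0.0
    valid = sum(1 for status, timed_out in results
                if is_valid_response(status, timed_out))
    return valid / len(results)

# Illustrative check results as (HTTP status, timed_out) pairs.
checks = [(200, False), (301, False), (404, False), (503, False), (None, True)]

print(external_uri_validity(checks))  # 0.4
```

Redirects (3xx) are counted as valid here, since the target can still be resolved; whether to follow them before judging is a design choice of the checker.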

Evaluation result: The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources on wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation "reference URL" (wdt:P854), which states provenance information, belongs, among other relations, to the links pointing to external Web resources. Here, we were able to resolve around 95.5% without errors.

^134 I.e., sw.cyc.com.
^135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.^136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values), due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBN numbers, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN numbers) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around 1/3 of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference, and that it is hard to evaluate which statements lack such a reference.

^136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is particularly worthwhile in the case of statements whose validity is temporally limited.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard exist in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used across all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Wikidata is the only KG achieving the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.^137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as not being easily understandable for humans.

^137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variants.

20. Self-describing URIs: We can find mixed paradigms regarding URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase use generic IDs, i.e., opaque URIs (in part: classes and relations are identified with self-describing URIs).

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; the reason is that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferenceable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large number of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, lasting on average 3.5 hours.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning of machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
- Identifying the preselection criteria P
- Assigning a weight w_i to each DQ criterion c_i ∈ C

Step 2: Preselection based on the Preselection Criteria
- Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
- Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
- Calculating the fulfillment degree h(g) for each KG g ∈ G_P
- Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
- Assessing the selected KG g w.r.t. qualitative aspects
- Comparing the selected KG g with other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.

6 KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria, and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion. The license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are discarded which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessment using the DQ metrics) and, if necessary, an alternative KG can be selected for the given scenario.
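The preselection and scoring steps (Steps 2 and 3) can be sketched in a few lines of Python. The KG names, metric names, values, and weights below are hypothetical placeholders, and we assume that the fulfillment degree h(g) is the weight-normalized sum of the metric values:

```python
# Hedged sketch of Steps 2 and 3 of the recommendation process; all data
# below is illustrative, not taken from our evaluation.
def fulfillment(metrics, weights):
    """Weighted fulfillment degree h(g) = sum(w_i * m_i(g)) / sum(w_i)."""
    total = sum(weights.values())
    return sum(weights[c] * metrics[c] for c in weights) / total

def recommend(kgs, preselect, weights):
    """Filter KGs by the preselection predicate, then rank them by h(g)."""
    candidates = {name: m for name, m in kgs.items() if preselect(name, m)}
    return sorted(candidates,
                  key=lambda name: fulfillment(candidates[name], weights),
                  reverse=True)

kgs = {
    "kg_a": {"timeliness": 1.0, "population_completeness": 0.9},
    "kg_b": {"timeliness": 0.5, "population_completeness": 0.95},
    "kg_c": {"timeliness": 0.0, "population_completeness": 0.4},
}
weights = {"timeliness": 3, "population_completeness": 3}
# Preselection: only KGs that are still updated (timeliness > 0).
ranking = recommend(kgs, lambda name, m: m["timeliness"] > 0, weights)
# ranking == ["kg_a", "kg_b"]; kg_c is dropped in the preselection step.
```

The qualitative assessment of Step 4 remains a manual activity on top of this ranking.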

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz in the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. For being able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians can be expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and that the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.^138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative Assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is thus recommended by the framework.

^138 We assume that in this use case the dereferencing of HTTP URIs, rather than the execution of SPARQL queries, is desired.


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension               Metric        DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example user weighting w_i
Accuracy                msynRDF       1        1         1        1         1       1
                        msynLit       0.994    1         1        1         0.624   1
                        msemTriple    0.990    0.995     1        0.993     0.993   1
Trustworthiness         mgraph        0.5      0.5       1        0.75      0.25    0
                        mfact         0.5      1         0        1         1       1
                        mNoVal        0        1         0        1         0       0
Consistency             mcheckRestr   0        1         0        1         0       0
                        mconClass     0.875    1         0.999    1         0.333   0
                        mconRelat     0.992    0.451     1        0.500     0.992   0
Relevancy               mRanking      0        1         0        1         0       1
Completeness            mcSchema      0.905    0.762     0.921    1         0.952   1
                        mcCol         0.402    0.425     0        0.285     0.332   2
                        mcPop         0.93     0.94      0.48     0.99      0.89    3
Timeliness              mFreq         0.5      0         0.25     1         0.25    3
                        mValidity     0        1         0        1         1       0
                        mChange       0        1         0        0         0       0
Ease of understanding   mDescr        0.704    0.972     1        0.9999    1       1
                        mLang         1        1         0        1         1       0
                        muSer         1        1         0        1         1       0
                        muURI         1        0.5       1        0         1       1
Interoperability        mReif         0.5      0.5       0.5      0         0.5     0
                        miSerial      1        0         0.5      1         1       1
                        mextVoc       0.61     0.108     0.415    0.682     0.134   1
                        mpropVoc      0.150    0         0.513    0.001     0       1
Accessibility           mDeref        1        0.437     1        0.414     1       2
                        mAvai         0.9961   0.9998    1        0.9999    0.7306  2
                        mSPARQL       1        0         0        1         1       1
                        mExport       1        1         1        1         1       0
                        mNegot        0.5      0         0        1         1       0
                        mHTMLRDF      1        1         0        1         1       0
                        mMeta         1        0         1        0         0       0
Licensing               mmacLicense   1        0         0        1         0       0
Interlinking            mInst         0.251    0         0.382    0         0.310   3
                        mURIs         0.929    0.908     0.894    0.957     0.956   1

Unweighted average                    0.683    0.603     0.496    0.752     0.625
Weighted average                      0.701    0.493     0.556    0.714     0.648


4. Qualitative Assessment: The high population completeness in general, and the high coverage of entities in the media domain in particular, give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately. The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discographies. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.
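The weighted scores from the quantitative assessment can be reproduced directly from Table 15. The sketch below recomputes the weighted averages for DBpedia and Wikidata, assuming (as in Section 3.1) that the fulfillment degree is the weight-normalized sum of the metric values; metrics with weight 0 are omitted, since they contribute nothing to either the numerator or the denominator:

```python
# Metric values and example weights transcribed from Table 15 for the two
# top-ranked KGs; zero-weighted metrics are left out.
weights = {
    "msynRDF": 1, "msynLit": 1, "msemTriple": 1, "mfact": 1, "mRanking": 1,
    "mcSchema": 1, "mcCol": 2, "mcPop": 3, "mFreq": 3, "mDescr": 1,
    "muURI": 1, "miSerial": 1, "mextVoc": 1, "mpropVoc": 1, "mDeref": 2,
    "mAvai": 2, "mSPARQL": 1, "mInst": 3, "mURIs": 1,
}
dbpedia = {
    "msynRDF": 1, "msynLit": 0.994, "msemTriple": 0.990, "mfact": 0.5,
    "mRanking": 0, "mcSchema": 0.905, "mcCol": 0.402, "mcPop": 0.93,
    "mFreq": 0.5, "mDescr": 0.704, "muURI": 1, "miSerial": 1,
    "mextVoc": 0.61, "mpropVoc": 0.150, "mDeref": 1, "mAvai": 0.9961,
    "mSPARQL": 1, "mInst": 0.251, "mURIs": 0.929,
}
wikidata = {
    "msynRDF": 1, "msynLit": 1, "msemTriple": 0.993, "mfact": 1,
    "mRanking": 1, "mcSchema": 1, "mcCol": 0.285, "mcPop": 0.99,
    "mFreq": 1, "mDescr": 0.9999, "muURI": 0, "miSerial": 1,
    "mextVoc": 0.682, "mpropVoc": 0.001, "mDeref": 0.414, "mAvai": 0.9999,
    "mSPARQL": 1, "mInst": 0, "mURIs": 0.957,
}

def weighted_average(metrics):
    """Weight-normalized sum of the metric values (weight sum is 28 here)."""
    total = sum(weights.values())
    return sum(weights[c] * metrics[c] for c in weights) / total

scores = {"DBpedia": round(weighted_average(dbpedia), 3),
          "Wikidata": round(weighted_average(wikidata), 3)}
# scores == {"DBpedia": 0.701, "Wikidata": 0.714}, matching Table 15.
```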

The use case shows that our KG recommendation framework enables users to find the most suitable KG, and that it is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7 Related Work

7.1 Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions and is extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in the existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (mgraph), Indicating unknown and empty values (mNoVal), Check of schema restrictions during insertion of new statements (mcheckRestr), Creating a ranking of statements (mRanking), Timeliness frequency of the KG (mFreq), Specification of the validity period of statements (mValidity), and Availability of the KG (mAvai), have not been proposed so far, to the best of our knowledge. In the following, we present single existing approaches for Linked Data quality criteria in more detail.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (mDescr) and Column completeness (mcCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (mSPARQL) and Provisioning of an RDF export (mExport). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (mpropVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]

msynRDF X X

msynLit X X X X

msemTriple X X X X

mfact X X

mconClass X X X

mconRelat X X X X X X

mcSchema X X

mcCol X X X X

mcPop X X

mChange X X

mDescr X X X X

mLang X

muSer X

muURI X

mReif X X X

miSerial X

mextV oc X X

mpropV oc X

mDeref X X X X

mSPARQL X

mExport X X

mNegot X X X

mHTMLRDF X

mMeta X X X

mmacLicense X X X

mInst X X X

mURIs X X

Flemming [20] introduces a framework for the assessment of Linked Data quality. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (mLang) and Validity of external URIs (mURIs) for the first time. The framework is evaluated on a sample of RDF documents from DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia, as well as four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction but, in addition, distinguish between RDF documents, RDF triples, and RDF literals when evaluating Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates for tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2 Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce, with the system OntoQA, metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.
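A minimal sketch of the class richness metric, under the simplifying assumption that an ontology is just a mapping from class names to instance sets (the class and instance names below are illustrative):

```python
def class_richness(instances_per_class):
    """Ratio of classes that have at least one instance to all classes."""
    if not instances_per_class:
        return 0.0
    populated = sum(1 for insts in instances_per_class.values() if insts)
    return populated / len(instances_per_class)

ontology = {
    "dbo:Place": {"ex:Karlsruhe", "ex:Berlin"},
    "dbo:Person": {"ex:Einstein"},
    "dbo:Sound": set(),          # schema-only class without instances
    "dbo:Holiday": set(),
}
richness = class_richness(ontology)   # 2 populated classes / 4 classes = 0.5
```

A low value indicates that large parts of the schema are unused on instance level.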

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both the schema and the instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and its subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances in a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of the KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.

8 Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma thesis, Humboldt University of Berlin, 2011. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. The Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Accessed July 20, 2015].

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009, Heraklion, pages 723–737. Springer Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D Kontokostas A Zaveri S Auer and J LehmannTripleCheckMate A Tool for Crowdsourcing the QualityAssessment of Linked Data In Knowledge Engineering andthe Semantic Web ndash 4th International Conference KESW 2013St Petersburg Russia October 7-9 2013 Proceedings pages265ndash272 Springer 2013

[36] C Matuszek J Cabral M J Witbrock and J DeOliveira AnIntroduction to the Syntax and Content of Cyc In AAAI SpringSymposium Formalizing and Compiling Background

52 M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO

Knowledge and Its Applications to Knowledge Representationand Question Answering pages 44ndash49 AAAI - Association forthe Advancement of Artificial Intelligence 2006

[37] M Mecella M Scannapieco A Virgillito R BaldoniT Catarci and C Batini Managing data quality in cooperativeinformation systems In On the Move to Meaningful InternetSystems 2002 CoopIS DOA and ODBASE pages 486ndash502Springer 2002

[38] O Medelyan and C Legg Integrating Cyc and WikipediaFolksonomy meets rigorously defined common-sense InWikipedia and Artificial Intelligence An Evolving SynergyPapers from the 2008 AAAI Workshop page 65 2008

[39] F Naumann Quality-Driven Query Answering for IntegratedInformation Systems volume 2261 Springer Science ampBusiness Media 2002

[40] L L Pipino Y W Lee and R Y Wang Data QualityAssessment Communications of the ACM 45(4)211ndash2182002

[41] E Sandhaus Semantic Technology at the New York TimesLessons Learned and Future Directions In Proceedings of the9th International Semantic Web Conference on The SemanticWeb - Volume Part II ISWCrsquo10 pages 355ndash355 BerlinHeidelberg 2010 Springer

[42] A Singhal Introducing the Knowledge Graph things notstrings httpsgoogleblogblogspotde201205introducing-knowledge-graph-things-nothtml retrieved on Aug 29 2016 2012

[43] F M Suchanek G Kasneci and G Weikum YAGO A LargeOntology from Wikipedia and WordNet Web SemanticsScience Services and Agents on the World Wide Web6(3)203ndash217 2008

[44] T P Tanon D Vrandecic S Schaffert T Steiner andL Pintscher From Freebase to Wikidata The Great MigrationIn Proceedings of the 25th International Conference on WorldWide Web WWW 2016 pages 1419ndash1428 2016

[45] S Tartir I B Arpinar M Moore A P Sheth andB Aleman-meza OntoQA Metric-Based Ontology QualityAnalysis In IEEE Workshop on Knowledge Acquisition fromDistributed Autonomous Semantically Heterogeneous Dataand Knowledge Sources 2005

[46] R Y Wang M P Reddy and H B Kon Toward quality dataAn attribute-based approach Decision Support Systems13(3)349ndash372 1995

[47] R Y Wang and D M Strong Beyond Accuracy What DataQuality Means to Data Consumers Journal of managementinformation systems 12(4)5ndash33 1996

[48] A Zaveri D Kontokostas M A Sherif L BuumlhmannM Morsey S Auer and J Lehmann User-driven qualityevaluation of dbpedia In Proceedings of the 9th InternationalConference on Semantic Systems pages 97ndash104 ACM 2013

[49] A Zaveri A Rula A Maurino R Pietrobon J Lehmann andS Auer Quality Assessment for Linked Data A SurveySemantic Web 7(1)63ndash93 2015



Listing 1. Default prefixes for namespaces used throughout this article.

prefix cc: <http://creativecommons.org/ns#>
prefix cyc: <http://sw.opencyc.org/concept/>
prefix cych: <http://sw.opencyc.org/2012/05/10/concept/en/>
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbp: <http://dbpedia.org/property/>
prefix dbr: <http://dbpedia.org/resource/>
prefix dby: <http://dbpedia.org/class/yago/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix freebase: <http://rdf.freebase.com/ns/>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix prov: <http://www.w3.org/ns/prov#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix schema: <http://schema.org/>
prefix umbel: <http://umbel.org/umbel/sc/>
prefix void: <http://www.w3.org/TR/void/>
prefix wdo: <http://www.wikidata.org/ontology#>
prefix wdt: <http://www.wikidata.org/entity/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix yago: <http://yago-knowledge.org/resource/>

distinguish between data quality criteria, data quality dimensions, and data quality categories.¹³ In the following, we reuse these concepts for our own framework, which has a particular focus on the data quality of KGs in the context of Linked Open Data.

A data quality criterion (Wang et al. also call it "data quality attribute") is a particular characteristic of data w.r.t. its quality and can be either subjective or objective. An example of a subjectively measurable data quality criterion is Trustworthiness on KG level. An example of an objective data quality criterion is the Syntactic validity of RDF documents (see Section 3.2 and [46]).

In order to measure the degree to which a certain data quality criterion is fulfilled for a given KG, each criterion is formalized and expressed in terms of a function with the value range of [0, 1]. We call this function the data quality metric of the respective data quality criterion.

A data quality dimension – in the following just called dimension – is a main aspect of how data quality can be viewed. A data quality dimension comprises one or several data quality criteria [47]. For instance, the criteria Syntactic validity of RDF documents, Syntactic validity of literals, and Semantic validity of triples form the Accuracy dimension.

¹³The quality dimensions are defined in [47]; the sub-classification into parameters/indicators in [46, p. 354].

Data quality dimensions and their respective data quality criteria are further grouped into data quality categories. Based on empirical studies, Wang et al. specified four categories:

– Criteria of the category of intrinsic data quality focus on the fact that data has quality in its own right.

– Criteria of the category of contextual data quality cannot be considered in general, but must be assessed depending on the application context of the data consumer.

– Criteria of the category of representational data quality reveal in which form the data is available.

– Criteria of the category of accessibility data quality determine how the data can be accessed.

Since its publication, the presented framework of Wang et al. has been extensively used, either in its original version or in an adapted or extended version. Bizer [11] and Zaveri et al. [49] worked on data quality in the Linked Data context. They make the following adaptations to Wang et al.'s framework:

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 5

– Bizer [11] compared the work of Wang et al. [47] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.

– Zaveri et al. [49] follow Wang et al. [47], but introduce licensing and interlinking as new dimensions in the Linked Data context.

In this article, we use the DQ dimensions as defined by Wang et al. [47] and as extended by Bizer [11] and Zaveri et al. [49]. More precisely, we make the following adaptations to Wang et al.'s framework:

1. Consistency is treated by us as a separate DQ dimension.

2. Verifiability is incorporated within the DQ dimension Trustworthiness as the criterion Trustworthiness on statement level.

3. The Offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.

4. We extend the category of accessibility data quality by the dimensions License and Interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

3.1. Criteria Weighting

When applying our framework to compare KGs, the single DQ metrics can be weighted differently, so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a KG g, a set of criteria C = {c1, ..., cn}, a set of metrics M = {m1, ..., mn}, and a set of weights W = {w1, ..., wn}. Each metric mi corresponds to the criterion ci, and mi(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a KG regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion ci is weighted by wi.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted, normalized sum of the fulfillment degrees w.r.t. the criteria c1, ..., cn:

$$h(g) = \frac{\sum_{i=1}^{n} w_i \cdot m_i(g)}{\sum_{j=1}^{n} w_j}$$

Based on the quality dimensions introduced by Wang et al. [47], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 7.

Note also that our metrics are to be understood as possible ways of how to evaluate the DQ dimensions. Other definitions of the DQ metrics might be possible and reasonable. We defined the metrics along the characteristics of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, but kept the definitions as generic as possible. In the evaluations, we then used those metric definitions and applied them, e.g., on the basis of own-created gold standards.

3.2. Intrinsic Category

"Intrinsic data quality denotes that data have quality in their own right" [47]. This kind of data quality can therefore be assessed independently from the context. The intrinsic category embraces the three dimensions Accuracy, Trustworthiness, and Consistency, which are defined in the following subsections. The dimensions Believability, Objectivity, and Reputation, which are separate dimensions in Wang et al.'s classification system [47], are subsumed by us under the dimension Trustworthiness.

3.2.1. Accuracy
Definition of dimension. Accuracy is "the extent to which data are correct, reliable, and certified free of error" [47].

Discussion. Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [47]. Hence, accuracy has often been used as a synonym for data quality [39]. Bizer [11] highlights in this context that Accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish between syntactic and semantic accuracy. Syntactic accuracy describes the formal compliance to syntactic rules, without reviewing whether the value reflects the reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for Accuracy as follows.

Definition of metric. The dimension Accuracy is determined by the criteria:

– Syntactic validity of RDF documents,
– Syntactic validity of literals, and


– Semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows.

Syntactic validity of RDF documents. The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF validator:¹⁴

$$m_{synRDF}(g) = \begin{cases} 1 & \text{if all RDF documents are valid} \\ 0 & \text{otherwise} \end{cases}$$

Syntactic validity of literals. Assessing the syntactic validity of literals means to determine to which degree literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

$$m_{synLit}(g) = \frac{|\{(s, p, o) \in g \mid o \in L \wedge synValid(o)\}|}{|\{(s, p, o) \in g \mid o \in L\}|}$$

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
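As a sketch of how such a rule-based check could look, the following Python snippet scores date literals against a simple YYYY-MM-DD pattern; the triples and the pattern (which covers only one ISO 8601 form, not the full specification) are assumptions for illustration:

```python
import re

# Illustrative check of m_synLit for xsd:date literals via a regular expression.
# The triples and the simple YYYY-MM-DD pattern are assumptions for this sketch.

DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def m_syn_lit(literal_triples):
    """Fraction of literal objects that match their syntactic rule."""
    if not literal_triples:
        return 1.0  # empty denominator: metric evaluates to 1
    valid = sum(1 for (_, _, o) in literal_triples if DATE_PATTERN.match(o))
    return valid / len(literal_triples)

triples = [
    ("dbr:Barack_Obama", "dbo:birthDate", "1961-08-04"),  # valid ISO 8601 date
    ("dbr:Some_Person", "dbo:birthDate", "04.08.1961"),   # syntactically invalid
]
print(m_syn_lit(triples))  # 0.5
```

In practice, one regular expression per data type (date, integer, coordinate, etc.) would be applied, as done by the rule-based frameworks cited above.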

Semantic validity of triples. The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer [11] notes that a triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File), if it is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has similar guidelines implemented to determine whether a fact needs to be sourced.¹⁵

¹⁴See http://www.w3.org/RDF/Validator/, requested on Feb 29, 2016.

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as a gold standard. We determine the fulfillment degree as the precision that the triples which are in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let no_{g∧GS} = |{(s, p, o) | (s, p, o) ∈ g ∧ ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y) ∧ equi(o, z)}| be the number of triples in g to which semantically corresponding triples in the gold standard GS exist. Let no_g = |{(s, p, o) | (s, p, o) ∈ g ∧ ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y)}| be the number of triples in g where the subject-relation pairs (s, p) are semantically equivalent to subject-relation pairs (x, y) in the gold standard. Then we can state:

$$m_{semTriple}(g) = \frac{no_{g \wedge GS}}{no_g}$$

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
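A minimal sketch of this precision computation, with exact string matching standing in for the equi(·, ·) equivalence relation and invented triples:

```python
# Sketch of m_semTriple: precision of KG triples against a gold standard GS.
# Exact equality stands in for the equi(·,·) equivalence used in the definition.

def m_sem_triple(kg, gold):
    gold_sp = {(s, p) for (s, p, _) in gold}
    # triples in g whose (s, p) pair also occurs in the gold standard
    comparable = [(s, p, o) for (s, p, o) in kg if (s, p) in gold_sp]
    if not comparable:
        return 1.0  # empty denominator: metric evaluates to 1
    matching = sum(1 for t in comparable if t in gold)
    return matching / len(comparable)

kg = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "dbo:populationTotal", "123"),          # deviates from gold
    ("dbr:Berlin", "dbo:mayor", "dbr:Somebody"),           # (s, p) not in gold
]
gold = {
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "dbo:populationTotal", "3500000"),
}
print(m_sem_triple(kg, gold))  # 0.5
```

A real evaluation would additionally need entity and relation alignment between KG and gold standard, which exact string matching deliberately glosses over here.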

3.2.2. Trustworthiness
Definition of dimension. Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is "the extent to which data are accepted or regarded as true, real, and credible" [47].

– Reputation: Reputation is "the extent to which data are trusted or highly regarded in terms of their source or content" [47].

– Objectivity: Objectivity is "the extent to which data are unbiased (unprejudiced) and impartial" [47].

– Verifiability: Verifiability is "the degree and ease with which the data can be checked for correctness" [39].

¹⁵See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.


Discussion. In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the "expected accuracy" of a data source.

– Reputation: The essential difference between believability and accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since also biased statements could be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric. We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level. The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between (1) whether the data is entered by experts (of a specific domain), (2) whether the knowledge comes from volunteers contributing in a community, and (3) whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

$$m_{graph}(h_g) = \begin{cases} 1 & \text{manual data curation, manual data insertion in a closed system} \\ 0.75 & \text{manual data curation and insertion, both by a community} \\ 0.5 & \text{manual data curation, data insertion by community or by automated knowledge extraction} \\ 0.25 & \text{automated data curation, data insertion by automated knowledge extraction from structured data sources} \\ 0 & \text{automated data curation, data insertion by automated knowledge extraction from unstructured data sources} \end{cases}$$

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be used for defining the DQ metrics.

Trustworthiness on statement level. The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for assessing statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms¹⁶, with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology¹⁷, with properties such as prov:wasDerivedFrom:

¹⁶See http://purl.org/dc/terms/, requested on Feb 4, 2017.

¹⁷See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


$$m_{fact}(g) = \begin{cases} 1 & \text{provenance on statement level is used} \\ 0.5 & \text{provenance on resource level is used} \\ 0 & \text{otherwise} \end{cases}$$
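The check behind m_fact can be sketched as follows; the statement identifier stmt:1 and the flat triple layout are hypothetical, since how provenance is attached to statements differs between the considered KGs:

```python
# Sketch of m_fact: 1.0 if provenance is attached on statement level,
# 0.5 if only on resource level, 0 otherwise. The statement identifier
# "stmt:1" and the triple layout are invented for illustration.

PROVENANCE_PREDICATES = {"prov:wasDerivedFrom", "dcterms:provenance", "dcterms:source"}

def m_fact(triples, statement_ids):
    """statement_ids: subjects denoting (reified) statements rather than resources."""
    prov_subjects = {s for (s, p, o) in triples if p in PROVENANCE_PREDICATES}
    if prov_subjects & statement_ids:
        return 1.0   # provenance on statement level
    if prov_subjects:
        return 0.5   # provenance on resource level only
    return 0.0

triples = [
    ("wdt:Q76", "wdt:P39", "wdt:Q11696"),
    ("stmt:1", "prov:wasDerivedFrom", "http://example.org/source"),  # hypothetical
]
print(m_fact(triples, statement_ids={"stmt:1"}))  # 1.0
```

Distinguishing statement-level from resource-level provenance hinges on knowing which subjects denote statements, which is why the sketch takes that set as an explicit argument.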

Indicating unknown and empty values. If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow one to represent that a person has no children, and unknown values allow one to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG:

$$m_{NoVal}(g) = \begin{cases} 1 & \text{unknown and empty values are used} \\ 0.5 & \text{either unknown or empty values are used} \\ 0 & \text{otherwise} \end{cases}$$

3.2.3. Consistency
Definition of dimension. Consistency implies that "two or more values [in a dataset] do not conflict each other" [37].

Discussion. Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource. This means that the subject is the only resource linked to the given object via the given relation.

Definition of metric. We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG, and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements. Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side, in the user interface. For instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations):

$$m_{checkRestr}(h_g) = \begin{cases} 1 & \text{schema restrictions are checked} \\ 0 & \text{otherwise} \end{cases}$$

Consistency of statements w.r.t. class constraints. This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. That is, let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.¹⁸ Furthermore, let c_g(e) be the set of all classes of instance e in g, defined as c_g(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

$$m_{conClass}(g) = \frac{|\{(c_1, c_2) \in CC \mid \nexists e : c_1 \in c_g(e) \wedge c_2 \in c_g(e)\}|}{|\{(c_1, c_2) \in CC\}|}$$

In case of an empty set of class constraints CC, the metric should evaluate to 1.
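A possible implementation sketch of m_conClass over owl:disjointWith pairs, with invented class assignments:

```python
# Sketch of m_conClass: share of owl:disjointWith pairs that no instance violates.
# The class assignments below are invented for illustration.

def m_con_class(disjoint_pairs, classes_of):
    """disjoint_pairs: set of (c1, c2); classes_of: entity -> set of its classes."""
    if not disjoint_pairs:
        return 1.0  # no class constraints: metric evaluates to 1
    def violated(c1, c2):
        return any(c1 in cs and c2 in cs for cs in classes_of.values())
    satisfied = sum(1 for (c1, c2) in disjoint_pairs if not violated(c1, c2))
    return satisfied / len(disjoint_pairs)

disjoint = {("dbo:Person", "dbo:Place"), ("dbo:Animal", "dbo:Plant")}
classes_of = {
    "dbr:Paris": {"dbo:Place"},
    "dbr:Oddity": {"dbo:Person", "dbo:Place"},  # violates the first constraint
}
print(m_con_class(disjoint, classes_of))  # 0.5
```

As noted in footnote 18, a full check would also have to propagate disjointness along the class hierarchy, which this sketch omits.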

Consistency of statements w.r.t. relation constraints. The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

$$m_{conRelat}(g) = \frac{1}{n} \sum_{i=1}^{n} m_{conRelat_i}(g)$$

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,¹⁹ we can state:

$$m_{conRelat}(g) = \frac{m_{conRelatRg}(g) + m_{conRelatFct}(g)}{2}$$

Let R_r be the set of all rdfs:range constraints:

$$R_r = \{(p, d) \mid (p, \text{rdfs:range}, d) \in g \wedge isDatatype(d)\}$$

¹⁸Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

¹⁹We chose those relations (and not, for instance, owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

and R_f be the set of all owl:FunctionalProperty constraints:

$$R_f = \{(p, d) \mid (p, \text{rdf:type}, \text{owl:FunctionalProperty}) \in g \wedge (p, \text{rdfs:range}, d) \in g \wedge isDatatype(d)\}$$

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

$$m_{conRelatRg}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r : datatype(o) = d\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r\}|}$$

$$m_{conRelatFct}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f \wedge \nexists (s, p, o_2) \in g : o \neq o_2\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f\}|}$$

In case of an empty set of relation constraints (R_r or R_f), the respective metric should evaluate to 1.
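Both relation-constraint metrics can be sketched over plain triple tuples; the datatype function is mimicked by a crude stand-in, and all data is invented for illustration:

```python
# Sketch of m_conRelatRg and m_conRelatFct over plain triple tuples.
# datatype(o) is mimicked by a crude stand-in; all data here is invented.

def m_con_relat_rg(triples, ranges, datatype):
    """ranges: relation -> expected datatype; share of range-conforming triples."""
    constrained = [(s, p, o) for (s, p, o) in triples if p in ranges]
    if not constrained:
        return 1.0
    ok = sum(1 for (s, p, o) in constrained if datatype(o) == ranges[p])
    return ok / len(constrained)

def m_con_relat_fct(triples, functional):
    """functional: set of functional relations; share of triples without a second value."""
    constrained = [(s, p, o) for (s, p, o) in triples if p in functional]
    if not constrained:
        return 1.0
    ok = sum(1 for (s, p, o) in constrained
             if not any(s2 == s and p2 == p and o2 != o for (s2, p2, o2) in constrained))
    return ok / len(constrained)

triples = [
    ("dbr:A", "dbo:birthDate", "1961-08-04"),
    ("dbr:A", "dbo:birthDate", "1962-01-01"),  # second value: violates functionality
    ("dbr:B", "dbo:birthDate", "not-a-date"),  # violates the range constraint
]
dt = lambda o: "xsd:date" if o[:4].isdigit() else "xsd:string"  # crude stand-in
print(m_con_relat_rg(triples, {"dbo:birthDate": "xsd:date"}, dt))  # 2/3
print(m_con_relat_fct(triples, {"dbo:birthDate"}))                 # 1/3
```

Averaging the two scores, as in the combined formula above, would give the overall m_conRelat value of 0.5 for this toy data.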

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension. Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion. According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension Relevancy is determined by the criterion Creating a ranking of statements.²⁰ The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

²⁰We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions which he no longer holds are ranked with a normal rank (wdo:NormalRank):

$$m_{Ranking}(g) = \begin{cases} 1 & \text{ranking of statements supported} \\ 0 & \text{otherwise} \end{cases}$$

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing,

2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing, and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. The completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities, and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

$$m_{cSchema}(g) = \frac{noclat_g}{noclat}$$

Column completeness. In the traditional database area (with a fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class, which are defined on the schema level (each relation has one column), exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 11

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation r, nokp, to the number of all instances having class k, nok. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1 / |H|) · Σ_{(k,p) ∈ H} nokp / nok

We thereby let H = {(k, p) ∈ (K × P) | k ∈ C_g ∧ ∃(x, p, o) ∈ g: p ∈ P_g^imp ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k1, ..., kn} and considered relations P = {p1, ..., pm}.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring the Column completeness, we selected only those relations for an assessment where a value of the relation typically exists for all given instances.
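The averaging over class-relation pairs can be sketched as follows. This toy version assumes one type per instance and uses hypothetical prefixed names and data instead of full URIs:

```python
from collections import defaultdict

def column_completeness(type_of: dict, triples: list) -> float:
    """m_cCol sketch: average, over all class-relation pairs seen on instance
    level, of the fraction of a class's instances carrying that relation."""
    instances_of = defaultdict(set)            # class -> its instances
    for inst, cls in type_of.items():
        instances_of[cls].add(inst)
    has_rel = defaultdict(set)                 # (class, relation) -> instances with a value
    for s, p, _ in triples:
        if s in type_of:
            has_rel[(type_of[s], p)].add(s)
    if not has_rel:
        return 1.0
    ratios = [len(insts) / len(instances_of[cls])
              for (cls, _), insts in has_rel.items()]
    return sum(ratios) / len(ratios)

# Toy KG: two people, both have a name, only one has a birthDate
types = {"alice": "Person", "bob": "Person"}
facts = [("alice", "birthDate", "1990-01-01"),
         ("alice", "name", "Alice"),
         ("bob", "name", "Bob")]
print(column_completeness(types, facts))  # mean of 1/2 and 2/2 -> 0.75
```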

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (the "short head," e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (the "long tail," e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|
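A minimal sketch of m_cPop, again with a hypothetical gold standard mixing short-head and long-tail entities:

```python
def population_completeness(gold_entities: set, kg_entities: set) -> float:
    """m_cPop: share of gold-standard entities contained in the KG."""
    if not gold_entities:
        return 1.0
    return len(gold_entities & kg_entities) / len(gold_entities)

# Hypothetical gold standard: two well-known cities, two small municipalities
gold = {"Berlin", "Tokyo", "Wermelskirchen", "Eichenzell"}
kg = {"Berlin", "Tokyo", "Wermelskirchen", "Paris"}
print(population_completeness(gold, kg))  # 3 of 4 covered -> 0.75
```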

3.3.3 Timeliness

Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional, isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

21 For an evaluation about predicting which relations are of this nature, see [1].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements.

The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows:

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately, but the RDF export files are made available at discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1,    if continuous updates
  0.5,  if discrete periodic updates
  0.25, if discrete non-periodic updates
  0,    otherwise

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start dates, and possibly end dates, of statements by providing suitable forms of representation.

m_Validity(g) =
  1, if the specification of validity periods is supported
  0, otherwise

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1, if the specification of modification dates for statements is supported
  0, otherwise


3.4 Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding human-readability) and (ii) Interoperability (i.e., regarding machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1 Ease of Understanding

Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: a KG) can be improved by, for instance, descriptive labels and literals in multiple languages.

Definition of metric. The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. this dimension is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows:

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

m_Descr(g) = |{u | u ∈ U_g^local ∧ ∃(u, p, o) ∈ g: p ∈ P_lDesc}| / |{u | u ∈ U_g^local}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).
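The measurement can be sketched as a simple coverage computation over a sample of local resources; the prefixed names and toy data below are hypothetical:

```python
def description_coverage(local_uris, triples,
                         desc_relations=frozenset({"rdfs:label", "rdfs:comment"})):
    """m_Descr sketch: fraction of local resources with at least one
    label or description (P_lDesc given as prefixed names here)."""
    described = {s for s, p, _ in triples if p in desc_relations}
    sample = set(local_uris)
    if not sample:
        return 1.0
    return len(sample & described) / len(sample)

uris = ["ex:alice", "ex:bob", "ex:carol"]
facts = [("ex:alice", "rdfs:label", '"Alice"@en'),
         ("ex:bob", "rdfs:comment", '"A person."@en'),
         ("ex:carol", "ex:knows", "ex:alice")]
print(description_coverage(uris, facts))  # 2 of 3 resources described
```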

Moreover, the result of the evaluation on the basis of entities is of interest: DBpedia deviates noticeably here, since some entities (intermediate-node mappings) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., those in the same namespace), but perform the evaluation only on entities.

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language." The now introduced metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

m_Lang(g) =
  1, if labels are provided in English and in at least one other language
  0, otherwise

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard for humans to read. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats, such as N3, N-Triples, and Turtle. We measure this criterion by means of the serialization formats supported during the dereferencing of resources.

m_uSer(h_g) =
  1, if RDF serializations other than RDF/XML are available
  0, otherwise

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and memorize for humans than opaque URIs,24

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.


such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources.

m_uURI(g) =
  1,   if self-describing URIs are always used
  0.5, if self-describing URIs are partly used
  0,   otherwise

3.4.2 Interoperability

Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].

– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].

– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability. In contrast to the dimension Understandability, which focuses on the understandability of RDF KG data for the user as data consumer, Interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. Those features should be avoided, according to Heath et al., in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree with that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows:

Avoiding blank nodes and RDF reification. Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

m_Reif(g) =
  1,   if neither blank nodes nor RDF reification are used
  0.5, if either blank nodes or RDF reification are used
  0,   otherwise
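A rough sketch of how such a check could look on a set of triples. The detection heuristics (the `_:` blank-node naming convention and the RDF standard-reification vocabulary terms, written as prefixed names) are simplifications for illustration:

```python
def reification_score(triples) -> float:
    """m_Reif sketch: 1 if neither blank nodes nor RDF standard reification
    appear, 0.5 if exactly one of the two appears, 0 otherwise."""
    def is_blank(term):
        return isinstance(term, str) and term.startswith("_:")
    has_blank = any(is_blank(t) for triple in triples for t in triple)
    reif_terms = {"rdf:subject", "rdf:predicate", "rdf:object", "rdf:Statement"}
    has_reif = any(p in reif_terms or o in reif_terms for _, p, o in triples)
    if not has_blank and not has_reif:
        return 1.0
    return 0.5 if has_blank != has_reif else 0.0

facts = [("ex:s1", "rdf:type", "rdf:Statement"),   # standard reification ...
         ("ex:s1", "rdf:subject", "ex:alice")]     # ... but no blank nodes
print(reification_score(facts))  # 0.5
```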

Provisioning of several serialization formats. The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

25 proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

m_iSerial(h_g) =
  1,   if RDF/XML and further formats are supported
  0.5, if only RDF/XML is supported
  0,   otherwise

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows resources and relations between resources in the Web of Data to be represented in a unified way. This increases the interoperability of data [30,26] and allows a comfortable data integration. We measure the criterion of using an external vocabulary by relating the number of triples with external vocabulary in predicate position to the number of all triples in the KG:

m_extVoc(g) = |{(s, p, o) | (s, p, o) ∈ g ∧ p ∈ P_g^external}| / |{(s, p, o) | (s, p, o) ∈ g}|
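A sketch of the ratio, assuming that proprietary terms can be recognized by their namespace prefix (here the hypothetical local prefix `ex:`):

```python
def external_vocab_ratio(triples, local_prefixes=("ex:",)):
    """m_extVoc sketch: share of triples whose predicate stems from an
    external vocabulary, identified by namespace prefix."""
    if not triples:
        return 0.0
    external = sum(1 for _, p, _ in triples if not p.startswith(local_prefixes))
    return external / len(triples)

facts = [("ex:alice", "rdfs:label", '"Alice"@en'),  # external predicate
         ("ex:alice", "foaf:knows", "ex:bob"),      # external predicate
         ("ex:alice", "ex:shoeSize", "42")]         # proprietary predicate
print(external_vocab_ratio(facts))  # 2 of 3 triples use external vocabulary
```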

Interoperability of proprietary vocabulary. Linking on schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources:

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g: p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5 Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang et al.'s dimension Access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1 Accessibility

Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request, which are defined as follows:

1. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTML_RDF, and m_Meta, which are defined as follows:

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

m_Deref(h_g) = |dereferenceable(U_g)| / |U_g|
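A possible shape of such a check in Python. The HTTP fetching is abstracted behind a `fetch` callable so the sketch stays testable offline (in practice it could wrap `urllib.request` with an RDF `Accept` header), and the accepted media types are an illustrative, non-exhaustive set:

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def is_dereferenceable(uri, fetch):
    """True if fetching `uri` yields HTTP 200 and an RDF content type.
    `fetch(uri)` must return a (status_code, content_type) pair."""
    try:
        status, content_type = fetch(uri)
    except Exception:                      # timeouts, connection errors, ...
        return False
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type in RDF_TYPES

def deref_metric(uris, fetch):
    """m_Deref sketch: fraction of sample URIs that dereference successfully."""
    if not uris:
        return 1.0
    return sum(is_dereferenceable(u, fetch) for u in uris) / len(uris)

# Offline stand-in for a real HTTP fetcher (hypothetical URIs):
responses = {"http://example.org/a": (200, "text/turtle; charset=utf-8"),
             "http://example.org/b": (404, "text/html")}
fake_fetch = lambda uri: responses[uri]
print(deref_metric(list(responses), fake_fetch))  # 0.5
```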

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data since, in the case of an integrated or federated query, mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.26

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (including potentially many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions of this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query. However, we do not measure these restrictions here.

26 See http://pingdom.com, requested on Mar 1, 2016.

m_SPARQL(h_g) =
  1, if a SPARQL endpoint is publicly available
  0, otherwise

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local private SPARQL endpoint. The criterion here indicates whether an RDF export dataset is officially available.

m_Export(h_g) =
  1, if an RDF export is available
  0, otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to serialized RDF data not being processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response.

m_Negot(h_g) =
  1,   if CN is supported and correct content types are returned
  0.5, if CN is supported but wrong content types are returned
  0,   otherwise
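One way to operationalize the comparison of requested and returned content types. The scoring rules below are our reading of the metric (all exact matches -> 1; RDF served under a generic type such as text/plain -> 0.5), applied to hypothetical request/response pairs:

```python
def negotiation_score(exchanges) -> float:
    """m_Negot sketch: score a list of (accept_header, returned_content_type)
    pairs collected while dereferencing resources."""
    def base(ct):
        # Strip parameters such as "; charset=utf-8" before comparing
        return ct.split(";")[0].strip().lower()
    if all(base(acc) == base(ret) for acc, ret in exchanges):
        return 1.0
    # e.g. Turtle declared as text/plain: CN answered, but type is wrong [26]
    if any(base(ret) in {"text/plain", "application/octet-stream"}
           for _, ret in exchanges):
        return 0.5
    return 0.0

probes = [("application/rdf+xml", "application/rdf+xml"),
          ("text/turtle", "text/plain")]   # RDF declared as plain text
print(negotiation_score(probes))  # 0.5
```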

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

m_HTML_RDF(h_g) =
  1, if the autodiscovery pattern is used at least once
  0, otherwise
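Detecting the autodiscovery pattern can be sketched with Python's standard html.parser; the example page below is hypothetical:

```python
from html.parser import HTMLParser

class AutodiscoveryFinder(HTMLParser):
    """Collects <link rel="alternate" type="application/rdf+xml" href="...">
    style tags, i.e. the autodiscovery pattern in an HTML header."""
    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and "rdf" in a.get("type", "")):
            self.rdf_links.append(a.get("href"))

def links_to_rdf(html: str) -> bool:
    """m_HTML_RDF sketch: does the HTML page advertise an RDF representation?"""
    finder = AutodiscoveryFinder()
    finder.feed(html)
    return bool(finder.rdf_links)

page = ('<html><head><link rel="alternate" '
        'type="application/rdf+xml" href="company.rdf"></head></html>')
print(links_to_rdf(page))  # True
```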

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later on in the data quality dimension License.

m_Meta(g) =
  1, if machine-readable metadata about g is available
  0, otherwise

3.5.2 License

Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void/.
29 See http://creativecommons.org, requested on Mar 1, 2016.

contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 defines the respective data as public domain and without any restrictions.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows:

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to become aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

m_macLicense(g) =
  1, if machine-readable licensing information is available
  0, otherwise

3.5.3 Interlinking

Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.
31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.
32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.
33 Using the namespace http://creativecommons.org/ns#.


linked to each other, be it within or between two or more data sources" [49].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking on the instance level is usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35 (ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows:

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs:

34 See http://www.geonames.org, requested on Dec 31, 2016.
35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.
36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.
37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.
38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations; Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may no longer be available. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g: p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext} and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric should evaluate to 1.
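A sketch of the final aggregation, assuming the HTTP status codes (or None for timeouts) for a sampled URI set A have already been collected by requesting each URI:

```python
def uri_validity(statuses: dict) -> float:
    """m_URIs sketch: share of external URIs that resolve with HTTP 200.
    `statuses` maps each sampled URI to its HTTP status code, or None for
    a timeout; 4xx/5xx codes count as client/server errors."""
    if not statuses:            # empty sample set A: metric evaluates to 1
        return 1.0
    ok = [s for s in statuses.values() if s == 200]
    return len(ok) / len(statuses)

# Hypothetical sample of external URIs and their observed status:
sample = {"http://example.org/ok": 200,
          "http://example.org/gone": 404,     # client error
          "http://example.org/down": None}    # timeout
print(uri_validity(sample))  # 1 of 3 URIs resolvable
```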

3.6 Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category
  ∗ Accuracy
    · Syntactic validity of RDF documents
    · Syntactic validity of literals
    · Semantic validity of triples
  ∗ Trustworthiness
    · Trustworthiness on KG level
    · Trustworthiness on statement level
    · Using unknown and empty values
  ∗ Consistency
    · Check of schema restrictions during insertion of new statements
    · Consistency of statements w.r.t. class constraints
    · Consistency of statements w.r.t. relation constraints

– Contextual category
  ∗ Relevancy
    · Creating a ranking of statements
  ∗ Completeness
    · Schema completeness
    · Column completeness
    · Population completeness
  ∗ Timeliness
    · Timeliness frequency of the KG
    · Specification of the validity period of statements
    · Specification of the modification date of statements

– Representational data quality
  ∗ Ease of understanding
    · Description of resources
    · Labels in multiple languages
    · Understandable RDF serialization
    · Self-describing URIs
  ∗ Interoperability
    · Avoiding blank nodes and RDF reification
    · Provisioning of several serialization formats
    · Using external vocabulary
    · Interoperability of proprietary vocabulary

– Accessibility category
  ∗ Accessibility
    · Dereferencing possibility of resources
    · Availability of the KG
    · Provisioning of public SPARQL endpoint
    · Provisioning of an RDF export
    · Support of content negotiation
    · Linking HTML sites to RDF serializations
    · Provisioning of KG metadata
  ∗ License
    · Provisioning machine-readable licensing information
  ∗ Interlinking
    · Interlinking via owl:sameAs
    · Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 UniProt,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.
40 There is also DBpedia live, which started in 2009 and which gets updated whenever Wikipedia is updated; see http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.

41See httpumbelorg requested on Dec 31 201642See httpmusicbrainzorg requested on Dec 31

201643See httpswwwciagovlibrary

publicationsthe-world-factbook requested on Dec31 2016

44See httpwwwdblporg requested on Dec 31 201645See httpswwwgutenbergorg requested on Dec

31 201646See httpdbtuneorgjamendo requested on Dec

31 201647See httpeurostatlinked-statisticsorg

requested on Dec 31 201648See httpwwwuniprotorg requested on Dec 31

201649See httpbio2rdforg requested on Dec 31 201650See a complete list of the links on the websites describing the sin-

gle DBpedia versions such as httpdownloadsdbpediaorg2016-04links (requested on Nov 1 2016)

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

– Freebase: Freebase^51 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,^52 FMD,^53 and MusicBrainz.^54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.^55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc^56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common-sense facts such as "every tree is a plant." The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG, called OpenCyc,^57 was released under the open-source Apache License, Version 2. In July 2006, ResearchCyc^58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs be freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata^59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51. See http://freebase.com, requested on Nov 1, 2016.
52. See http://www.nndb.com, requested on Dec 31, 2016.
53. See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
54. See http://musicbrainz.org, requested on Dec 31, 2016.
55. See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
56. See http://www.cyc.com, requested on Dec 31, 2016.
57. See http://www.opencyc.org, accessed on Nov 1, 2016.
58. See http://researchcyc.com, requested on Dec 31, 2016.
59. See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO^60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.^61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.
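Most of these key statistics can be derived in a single pass over an N-Triples dump. The following Python sketch is a minimal illustration of this (it is not the tooling used in this survey); the three toy triples and their URIs are hypothetical, and the naive whitespace split would need to be replaced by a proper N-Triples parser for real dumps.

```python
# Minimal single-pass computation of some key statistics over N-Triples
# lines. Toy data; a real dump would be streamed line by line from a file.
lines = [
    '<http://ex.org/Berlin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ex.org/City> .',
    '<http://ex.org/Berlin> <http://ex.org/population> "3500000" .',
    '<http://ex.org/Paris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ex.org/City> .',
]

num_triples = 0
subjects, predicates, objects = set(), set(), set()
for line in lines:
    # Drop the trailing " ." and split into the three term positions.
    s, p, o = line.rstrip().rstrip('.').strip().split(' ', 2)
    num_triples += 1
    subjects.add(s)
    predicates.add(p)
    objects.add(o)

print(num_triples, len(subjects), len(predicates), len(objects))
```

The sets directly yield the numbers of unique subjects, unique predicates, and unique objects listed above; counting entities, instances, and classes additionally requires inspecting rdf:type statements.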

60. See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
61. See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples
Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG, with over 3.1B triples, while OpenCyc is the smallest KG, with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention the following reasons: YAGO integrates the statements from different language versions of Wikipedia into one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied for this purpose.

Influence of reification on the number of triples. DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.^62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement by which the triple becomes identified. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a

62. The idea of N-Quads is based on the assignment of triples to different graphs; YAGO uses N-Quads to identify statements per ID.

high number of unique subjects concerning the set of all triples.

In the case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.^63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but, in addition, each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
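The effect of this modeling choice on the triple and instance counts can be sketched as follows. This is a toy illustration, not actual dump data; the statement-node ID follows the abbreviated example used elsewhere in this article.

```python
# Toy illustration: one fact stated as a direct triple vs. as a
# Wikidata-style n-ary statement with an intermediate statement node.
direct = [
    ('wdt:Q76', 'wdt:P31', 'wdt:Q5'),  # Barack Obama is a human
]
reified = [
    ('wdt:Q76',     'wdt:P31s', 'wdt:Q76S123'),    # entity -> statement node
    ('wdt:Q76S123', 'wdt:P31v', 'wdt:Q5'),         # statement node -> value
    ('wdt:Q76S123', 'rdf:type', 'wdo:Statement'),  # statement instantiation
]

# The n-ary form needs two extra triples per fact, and the statement node
# itself becomes one additional instance (of wdo:Statement).
extra_triples = len(reified) - len(direct)
extra_instances = len({s for s, p, o in reified
                       if p == 'rdf:type' and o == 'wdo:Statement'})
print(extra_triples, extra_instances)
```

Scaled over the whole KG, this per-fact overhead is what produces the roughly 74M additional wdo:Statement instances mentioned above.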

5.1.2. Classes
Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class assignments, or via rdfs:subClassOf relations.^64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead uses only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.
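The different counting methods can be contrasted on a toy graph. The following sketch (hypothetical URIs, simplified prefixes) shows why the methods yield different results: a class that is only mentioned in the hierarchy, or one that has no instances, is visible to some methods but not to others.

```python
# Three ways to estimate the set of classes in a toy graph.
g = [
    ('ex:Berlin', 'rdf:type',        'ex:City'),
    ('ex:City',   'rdf:type',        'owl:Class'),
    ('ex:City',   'rdfs:subClassOf', 'ex:Place'),  # ex:Place has no instances
]
CLASS_TYPES = ('owl:Class', 'rdfs:Class')

# (1) Classes explicitly typed as classes.
explicit = {s for s, p, o in g if p == 'rdf:type' and o in CLASS_TYPES}
# (2) Classes participating in the rdfs:subClassOf hierarchy.
via_hierarchy = ({s for s, p, o in g if p == 'rdfs:subClassOf'}
                 | {o for s, p, o in g if p == 'rdfs:subClassOf'})
# (3) Lower bound via instance typing (misses classes without instances).
lower_bound = {o for s, p, o in g
               if p == 'rdf:type' and o not in CLASS_TYPES}

print(explicit, via_hierarchy, lower_bound)
```

Here ex:Place is only found via the hierarchy, illustrating why the instance-level method of footnote 64 yields only a lower bound.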

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does this gap between DBpedia and YAGO with respect to the number of classes come about, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63. In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is called Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).
64. The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type, and "instance of" (wdt:P31) in the case of Wikidata) on the instance level into account. However, this would result only in a lower-bound estimation, as those classes which have no instances would not be considered.


Fig. 1. Coverage of classes having at least one instance.

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power-law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                  DB    FB    OC    WD    YA
Reach of method   88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains is covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.^65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs might be assigned to different domains. Moreover, in some KGs classes might otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.^66 If the ratio were at 100%, we would have been able to assign all entities of a KG to the chosen domains.
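The reach computation described above can be sketched in a few lines of Python. The entity sets below are hypothetical placeholders; the deduplication via a set union corresponds to counting unique entities across domains rather than summing per-domain counts.

```python
# Toy recomputation of the "reach" ratio: unique entities over all
# considered domains, divided by all entities of the KG.
domains = {
    'people': {'e1', 'e2', 'e3'},
    'media':  {'e3', 'e4'},   # e3 appears in two domains at once
}
all_entities = {'e1', 'e2', 'e3', 'e4', 'e5'}

# Union deduplicates entities that fall into several domains.
covered = set().union(*domains.values())
reach = len(covered) / len(all_entities)
print(reach)
```

With these toy sets, 4 of the 5 entities fall into a considered domain, giving a reach of 0.8; summing per-domain counts instead would have double-counted e3.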

Fig. 3 shows the number of entities per domain in the different KGs, with a logarithmic scale. Fig. 4 presents

65. See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).
66. We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG.

Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% would mean that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track alone accounts for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organizations. This is relatively few, considering that the total number of entities is around 18.7M and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times^67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organizations belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as a short term for explicitly defined relations – refers to the (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments to classes (for instance, with rdf:Property). In Section 2, we used P_g to denote this set.
– In contrast, we use predicates to denote the links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P_g^imp, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
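This distinction can be made concrete with a minimal sketch (toy URIs, simplified prefixes): P_g collects relations explicitly typed as properties, whereas P_g^imp collects whatever actually occurs in the predicate position, and the two sets need not coincide.

```python
# Toy graph: one relation declared but never used, one predicate used
# but never declared on the schema level.
g = [
    ('ex:unusedRel', 'rdf:type',      'rdf:Property'),  # declared, never used
    ('ex:Berlin',    'ex:population', '"3500000"'),     # used, never declared
]

# P_g: explicitly defined relations (schema level).
P_g = {s for s, p, o in g if p == 'rdf:type' and o == 'rdf:Property'}
# P_g^imp: unique predicates actually used (instance level).
P_g_imp = {p for s, p, o in g}

print(P_g, P_g_imp)
```

Here ex:unusedRel appears only in P_g and ex:population only in P_g^imp, which is exactly the divergence the two key statistics are meant to capture.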

Evaluation results.
Relations.
Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.^68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and, hence, without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned^69 and may overlap until DBpedia version 2016-04.^70

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse to other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations, so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68. See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69. For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70. For instance, dbp:alias and dbo:alias.

criteria are met.^71 One of those criteria is that each new relation is presumably used at least 100 times. This relation-proposal process can be mentioned as a likely reason why in Wikidata, in relative terms, more relations are actually used than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite a number of special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.
2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.
3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.
4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.
5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In the case of

71. See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In the case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates.
Ranking regarding predicates. Freebase is here – as in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows.

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which, hence, occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias along with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once, which puts the high number into perspective. Most of these predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 164 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) via an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows one to refer to a value (in Wikidata terminology). Besides those extensions, there is the "r" extension to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from the different language versions of Wikipedia are aggregated into one KG,^72 while for DBpedia separate localized KG versions are offered for the non-English languages.

5.1.5. Instances and Entities
Evaluation method. We distinguish between the instances I_g and the entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliation.

72. The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata, instances of wdo:Item. In OpenCyc, cyc:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.^73 In this way, abstract classes such as cyc:ExistingObjectType are neglected.

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of the KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M); OpenCyc is at the bottom, with only about 41K entities.

Differences in the number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.^74

Due to the large size and the worldwide coverage of entities in MusicBrainz, Freebase contains albums and release tracks in both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life." Regarding non-English languages, Freebase contains, for instance, songs and albums by Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs, such as "Hab' den Himmel berührt," can be found.

73. For instance, cyc:Individual, cyc:Movie_CW, and cyc:City.

2. In the case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English-speaking artists are covered, such as the album "Thriller" and the single "Billie Jean." Rather unknown songs, such as "The Lady in My Life," are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and the localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs, such as "Hab' den Himmel berührt."

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works ("Lass' mich in dein Leben," "Zaubermond," and "Hab' den Himmel berührt") by Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.^75 Presumably, the YAGO extraction system was unable to extract any

74. Those release tracks are expressed via freebase:music.release_track.
75. See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG.

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life." Note, however, that Wikidata does not provide all of an artist's works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |E_g|/|C_g|. Obvious is the difference between DBpedia and YAGO (despite their similar numbers of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually), while in YAGO it is large (as it is created automatically).

Comparing the number of instances with the number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities, but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see beginning of Section 5.1.5).

5.1.6. Subjects and Objects

Evaluation method. The number of unique subjects

and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) in the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources in the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g and o ∈ U ∪ B}. Complementarily, the number of literals is given as O_g^lit = {o | (s, p, o) ∈ g and o ∈ L}.
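Under these definitions, the counts can be obtained in a single pass over the triples; a minimal sketch on a toy graph (a stand-in for the actual N-Triples dumps):

```python
# Toy triples standing in for a KG's N-Triples dump; object values
# starting with '"' are literals, everything else is a URI or blank node.
triples = [
    ("<http://ex.org/s1>", "<http://ex.org/p>", "<http://ex.org/o1>"),
    ("<http://ex.org/s1>", "<http://ex.org/p>", '"a literal"'),
    ("<http://ex.org/o1>", "<http://ex.org/p>", "<http://ex.org/s1>"),
]

subjects = {s for s, p, o in triples}                          # S_g
objects = {o for s, p, o in triples if not o.startswith('"')}  # O_g
literals = {o for s, p, o in triples if o.startswith('"')}     # O_g^lit

print(len(subjects), len(objects), len(literals))  # 2 2 1
```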

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 8). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                                  DBpedia       Freebase    OpenCyc      Wikidata           YAGO
Number of triples |{(s, p, o) ∈ g}|           411,885,960  3,124,791,156  2,412,520   748,530,833  1,001,461,792
Number of classes |C_g|                               736         53,092    116,822       302,280        569,751
Number of relations |P_g|                           2,819         70,902     18,028         1,874            106
No. of unique predicates |P_g^imp|                 60,231        784,977        165         4,839         88,736
Number of entities |E_g|                        4,298,433     49,947,799     41,029    18,697,897      5,130,031
Number of instances |I_g|                      20,764,283    115,880,761    242,383   142,213,806     12,291,250
Avg. number of entities per class |E_g|/|C_g|     5,840.3          940.8       0.35          61.9            9.0
No. of unique subjects |S_g|                   31,391,413    125,144,313    261,097   142,278,154    331,806,927
No. of unique non-literals in obj. pos. |O_g|  83,284,634    189,466,866    423,432   101,745,685     17,438,196
No. of unique literals in obj. pos. |O_g^lit| 161,398,382  1,782,723,759  1,081,818   308,144,682    682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing the provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used in the first position. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to comply with the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.
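The effect can be illustrated with a toy example (the fact id and all URIs below are hypothetical, not actual YAGO identifiers):

```python
# Sketch of YAGO-style reification: each fact carries an id (the first
# position of the N-Quad), and meta-facts use that id as their subject.
# All identifiers here are made up for illustration.
fact_id, s, p, o = ("<id_abc>", "<Socrates>", "<wasBornIn>", "<Athens>")

triples = [
    (s, p, o),  # the fact itself; its id is commented out in the N-Triples export
    (fact_id, "<extractionSource>", "<wikipedia/Socrates>"),  # meta-fact about the fact
]
subjects = {t[0] for t in triples}
print(sorted(subjects))  # ['<Socrates>', '<id_abc>']
```

Every reified fact thus contributes its id as an additional unique subject.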

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of number of triples, while OpenCyc is the smallest. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge become lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also, the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs: MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as the data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3
Evaluation results for the KGs regarding the dimension Accuracy

              DB     FB     OC     WD     YA
m_synRDF      1      1      1      1      1
m_synLit      0.99   1      1      1      0.62
m_semTriple   0.99   <1     1      0.99   0.99

Syntactic validity of RDF documents (m_synRDF)

Evaluation method. For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.77

Evaluation result. All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit)

Evaluation method. We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains (namely people, cities, and books) and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a manually created regular expression (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.

Evaluation results. All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80
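The wildcard problem can be reproduced with a minimal date validator (a simplified stand-in for Jena's RDFDatatype.isValid, assuming plain xsd:date lexical rules):

```python
import re
from datetime import date

def valid_xsd_date(lit: str) -> bool:
    # xsd:date requires a complete (optionally signed) YYYY-MM-DD value,
    # so wildcard strings such as "470-##-##" are rejected up front.
    if not re.fullmatch(r"-?\d{4,}-\d{2}-\d{2}", lit):
        return False
    try:
        y, m, d = lit.lstrip("-").split("-")
        date(int(y), int(m), int(d))  # also rejects impossible calendar dates
        return True
    except ValueError:
        return False

print(valid_xsd_date("1940-05-17"))  # True
print(valid_xsd_date("470-##-##"))   # False
```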

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena framework assessed date values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We made the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN; seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
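A simplified validator in the spirit of the regular-expression check described above (the normalization and the ISBN-13 check-digit rule are standard, but this is not Gupta's exact expression, and ISBN-10 is omitted for brevity):

```python
import re

def valid_isbn13(raw: str) -> bool:
    # Normalize: drop an optional "ISBN" prefix and any delimiters.
    digits = re.sub(r"[^0-9Xx]", "", raw.replace("ISBN", ""))
    # ISBN-13: exactly 13 digits whose 1,3,1,3,... weighted sum is 0 mod 10.
    if len(digits) != 13 or not digits.isdigit():
        return False
    checksum = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return checksum % 10 == 0

print(valid_isbn13("ISBN 978-3-16-148410-0"))  # True
print(valid_isbn13("9789780307986931"))        # False (16 digits)
```

Such a check catches the too-long numbers mentioned above, but not semantically wrong yet well-formed identifiers.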

Semantic validity of triples (m_semTriple)

Evaluation method. The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file especially concerning persons and corporate bodies, and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result. We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as the death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.
84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.
85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3 errors each, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is hard to perform in those cases.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be found easily, but possibly wrong values within the interval are not detected.
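A minimal sketch of such an interval-based test case (the bounds are illustrative, not those used by Kontokostas et al.):

```python
# Person heights in meters; values outside a plausible interval are
# flagged for manual review. The interval bounds are illustrative.
def flag_outliers(heights_m, lo=0.4, hi=2.8):
    return [h for h in heights_m if not lo <= h <= hi]

print(flag_outliers([1.72, 1.85, 18.5, 0.02]))  # [18.5, 0.02]
```

A wrong but plausible value such as 1.95 for a short person would pass this test unnoticed, which is exactly the limitation noted above.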

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method. Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4
Evaluation results for the KGs regarding the dimension Trustworthiness

           DB     FB     OC    WD     YA
m_graph    0.5    0.5    1     0.75   0.25
m_fact     0.5    1      0     1      1
m_NoVal    0      1      0     1      0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results. The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. the community involvement: any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level (m_fact)

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed.

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their state-

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.


ments. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of a rather formal nature.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not considered sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVTs), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.
93 This is the number of instances of wdo:Statement.
94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5
Evaluation results for the KGs regarding the dimension Consistency

               DB     FB     OC     WD     YA
m_checkRestr   0      1      0      1      0
m_conClass     0.88   1      <1     1      0.33
m_conRelat     0.99   0.45   1      0.50   0.99

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

Freebase supports the representation of unknown and empty values by providing explicit relations for such cases.95 In YAGO, inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known); note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data for the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks of schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method. For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the consid-

95 E.g., freebase:freebase.valuenotation.has_no_value.


ered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.

Evaluation results. We obtained mixed results here: only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.
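The check itself reduces to an intersection test over the direct type assignments; a sketch with a hypothetical inconsistent resource (dbo:Plant/dbo:Animal mirror the example in the text, the ex:* resources are made up):

```python
# Disjointness axioms and direct type assignments (toy data).
disjoint = {("dbo:Plant", "dbo:Animal")}
types = {
    "ex:oak": {"dbo:Plant"},
    "ex:odd": {"dbo:Plant", "dbo:Animal"},  # instantiated as both -> inconsistent
}

violations = [r for r, ts in types.items()
              for a, b in disjoint if a in ts and b in ts]
print(violations)  # ['ex:odd']
```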

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method. Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistencies regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.

Evaluation results. In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:prop-

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6
Evaluation results for the KGs regarding the dimension Relevancy

            DB    FB    OC    WD    YA
m_Ranking   0     1     0     1     0

ertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction by setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.
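Detecting such violations amounts to counting (subject, predicate) pairs for the functional datatype properties; a sketch with hypothetical resources:

```python
from collections import Counter

# A functional property may hold at most one value per subject; a second
# value is an inconsistency. All ex:* identifiers are made up.
triples = [
    ("ex:alice", "ex:birthDate", '"1970-01-01"'),
    ("ex:bob", "ex:birthDate", '"1980-02-02"'),
    ("ex:bob", "ex:birthDate", '"1981-02-02"'),  # second value -> violation
]
functional = {"ex:birthDate"}

uses = Counter((s, p) for s, p, o in triples if p in functional)
violations = [sp for sp, n in uses.items() if n > 1]
print(violations)  # [('ex:bob', 'ex:birthDate')]
```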

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "pre-

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.


Table 7
Evaluation results for the KGs regarding the dimension Completeness

                 DB     FB     OC     WD     YA
m_cSchema        0.91   0.76   0.92   1      0.95
m_cColumn        0.40   0.43   0      0.29   0.33
m_cPop           0.93   0.94   0.48   0.99   0.89
m_cPop (short)   1      1      0.82   1      0.90
m_cPop (long)    0.86   0.88   0.14   0.98   0.88

ferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness, and its schema is mainly limited

98 See https://developers.google.com/freebase/v1/search-cookbook (scoring and ranking), requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

due to the characteristics of how information is stored in and extracted from Wikipedia.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree but the class ginkgo, which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are covered considerably well in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase rather targets the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete are logically subclasses of the class for people, freebase:people.person, but are not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as a tree but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.
101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can be used reasonably both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes via the domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness (mcCol)

Evaluation method: For evaluating the KGs w.r.t. Column completeness, for each KG 25 class-relation-combinations[102] were created based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8. Metric values of mcCol for single class-relation-pairs

Relation          DB    FB    OC    WD    YA
Person–birthdate  0.48  0.48  0     0.70  0.77
Person–sex        –     0.57  0     0.94  0.64
Book–author       0.91  0.93  0     0.82  0.28
Book–ISBN         0.73  0.63  –     0.18  0.01
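As a minimal illustration of this per-pair computation, the metric for one class-relation pair can be written as the share of class instances having at least one value for the relation. The following sketch uses invented toy triples, not the authors' actual tooling; the prefixes and entities are made up:

```python
# Sketch of the Column completeness metric for one class-relation pair:
# the fraction of instances of a class that have at least one value
# for a given relation. Toy data; not the paper's implementation.

def mc_col(triples, cls, relation):
    """Fraction of instances of `cls` having >= 1 value for `relation`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    if not instances:
        return 0.0
    covered = {s for s, p, o in triples if p == relation and s in instances}
    return len(covered) / len(instances)

toy = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:birthDate", "1980-01-01"),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:Carol", "rdf:type", "ex:Person"),
    ("ex:Carol", "ex:birthDate", "1975-05-23"),
]

print(mc_col(toy, "ex:Person", "ex:birthDate"))  # 2 of 3 persons covered
```

Averaging this value over all selected class-relation pairs of a KG yields the overall mcCol score reported in the text.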

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We notice the following observations with respect to the single KGs:

DBpedia: DBpedia fails regarding the relation "sex" for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25 (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also the coverage of ISBN numbers for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.[103]

YAGO: YAGO obtains a coverage of 63.5% for gender relations as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

102 The selection of class-relation-pairs depended on which class-relation-pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation-pairs were used if 25 pairs were not available in the respective KG.

Population completeness (mcPop)

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,[104] was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called "short head") and two rather unknown entities (called "long tail") for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements: for instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated to both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.[105][106]

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
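The metric itself then reduces to a set intersection between the gold standard and the KG's entities. A sketch with invented entity names (not the actual 100-entity gold standard):

```python
# Sketch of the Population completeness metric: the share of gold-standard
# entities that are present in a KG. Entity sets are illustrative only.

def mc_pop(gold_entities, kg_entities):
    """Fraction of gold-standard entities found in the KG."""
    gold = set(gold_entities)
    if not gold:
        return 0.0
    return len(gold & set(kg_entities)) / len(gold)

gold = {"Albert Einstein", "Mount Everest", "Maria Höfl-Riesch", "Great White Shark"}
kg   = {"Albert Einstein", "Mount Everest", "Maria Höfl-Riesch"}

print(mc_pop(gold, kg))  # 0.75: one long-tail entity is missing
```

In practice the matching would be done on KG identifiers (after entity linking) rather than on plain labels as in this toy version.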

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since also Wikidata exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains of each KG. In the following, we first present our findings for the well-known entities before going into the details of the rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that those Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: while most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: an entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measure that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs, Cyc and ResearchCyc, are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness
The evaluation results concerning the dimension Timeliness are presented in Table 9.

Fig. 10. Population completeness regarding the different domains (People, Media, Organizations, Geography, Biology) per KG.

Table 9. Evaluation results for the KGs regarding the dimension Timeliness

           DB    FB   OC    WD   YA
mFreq      0.5   0    0.25  1    0.25
mValidity  0     1    0     1    1
mChange    0     1    0     0    0

Timeliness frequency of the KG (mFreq)

Evaluation results: The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.[107] Besides the static DBpedia, DBpedia Live[108] has been continuously updated by tracking changes in Wikipedia in real-time. However, it does not provide the full range of relations of DBpedia.

Freebase had been updated continuously until its close-down and is not updated anymore.

OpenCyc has been updated less than once per year; the last OpenCyc version dates from May 2012.[109] To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible, both via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage[110] or via own processing using the Wikidata Toolkit[111]).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Specification of the validity period of statements (mValidity)

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports/, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

Table 10. Evaluation results for the KGs regarding the dimension Ease of understanding

         DB    FB    OC   WD   YA
mDescr   0.70  0.97  1    <1   1
mLang    1     1     0    1    1
muSer    1     1     0    1    1
muURI    1     0.5   1    0    1

DBpedia and OpenCyc do not offer any such specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997."
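The Wikidata-style qualifier mechanism can be sketched as a small data structure. The statement shown below (item Q76, property P39, with P580/P582 qualifiers) is an illustrative reconstruction, not actual KG content:

```python
# Sketch of temporally qualified statements in the style of Wikidata's
# "start time" (wdt:P580) / "end time" (wdt:P582) qualifiers.
# Data model and example values are illustrative, not actual KG content.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    qualifiers: dict = field(default_factory=dict)  # e.g. {"P580": ..., "P582": ...}

    def valid_on(self, day: date) -> bool:
        # A missing qualifier means "unbounded" on that side.
        start = self.qualifiers.get("P580", date.min)
        end = self.qualifiers.get("P582", date.max)
        return start <= day <= end

term = Statement("Q76", "P39", "President of the United States",
                 {"P580": date(2009, 1, 20), "P582": date(2017, 1, 20)})

print(term.valid_on(date(2012, 6, 1)))  # True: within the qualified period
print(term.valid_on(date(2020, 6, 1)))  # False: after the end qualifier
```

Freebase CVTs and YAGO's yago:occursSince/yago:occursUntil express essentially the same information through different reification mechanisms.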

Specification of the modification date of statements (mChange)

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding
Description of resources (mDescr)

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.[112]

Evaluation result: For all KGs, the rule applies that if there is no label available, there is usually also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used.[113]

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.[114]

Labels in multiple languages (mLang)

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages We also measured the cov-erage of selected languages in the KGs ie the extentto which entities have an rdfslabel with a specificlanguage annotation115 Our evaluation shows that DB-pedia YAGO and Freebase achieve a high coveragewith more than 90 regarding the English language Incontrast to those KGs Wikidata shows a relative low

112Human-readable resource descriptions may also be representedby other relations [15] However we focused on those relations whichare commonly used in the considered KGs

113For instance wdtQ5127809 represents a game fo the Nin-tendo Entertainment System but there is no further information foran identification of the entity available

114Eg dbrNayim links via dboCareerStation to 10entities of his carrier stations

115Note that literals such as rdfslabel do not necessarily havelanguage annotations In those cases we assume that no languageinformation is available

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 39

coverage regarding the English language of only 546but a coverage of over 30 for further languages suchas German and French Wikidata is hence not only themost diverse KG in terms of languages but has also thehighest coverage regarding non-English languages
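A sketch of how such a per-language coverage could be computed over (entity, language-tag) pairs of rdfs:label literals; the data below is invented:

```python
# Sketch of measuring label-language coverage: the share of entities that
# have an rdfs:label with a given language tag. Toy data; not the paper's code.

def language_coverage(labels, lang):
    """labels: iterable of (entity, language_tag) pairs for rdfs:label."""
    entities = {e for e, _ in labels}
    if not entities:
        return 0.0
    with_lang = {e for e, tag in labels if tag == lang}
    return len(with_lang) / len(entities)

labels = [
    ("wd:Q5", "en"), ("wd:Q5", "de"), ("wd:Q5", "fr"),
    ("wd:Q271669", "de"),          # an entity without an English label
    ("wd:Q64", "en"), ("wd:Q64", "de"),
]

print(language_coverage(labels, "en"))  # 2 of 3 entities have an English label
```

Applied per language tag, this yields exactly the kind of per-language coverage figures reported above (e.g., English coverage versus German or French coverage).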

Understandable RDF serialization (muSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (muURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.[116]

5.2.8. Interoperability
The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (mReif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity are stored via n-ary relations.[117] YAGO stores facts as N-Quads in order to be able to store meta information of facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity of dealing with reification.

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation day of birth.

Table 11. Evaluation results for the KGs regarding the dimension Interoperability

          DB    FB    OC    WD   YA
mReif     0.5   0.5   0.5   0    0.5
miSerial  1     0     0.5   1    1
mextVoc   0.61  0.11  0.41  0.68 0.13
mpropVoc  0.15  0     0.51  >0   0

Blank nodes are non-dereferencable, anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (miSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in the Turtle format.

Using external vocabulary (mextVoc)

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG we divide the number of triples with external relations by the number of all triples in this KG.

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata also reveals a high external vocabulary ratio. We can mention two obvious reasons for this: (1) information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals; (2) Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are used for instantiations of statements, i.e., for reification.

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.
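A sketch of this ratio over toy triples; the triples and the namespace prefixes treated as "own" are assumptions for illustration, not the paper's actual data:

```python
# Sketch of the "using external vocabulary" ratio: triples whose predicate
# comes from a namespace outside the KG's own, divided by all triples.
# Namespaces and triples are illustrative.

def m_ext_voc(triples, own_prefixes):
    if not triples:
        return 0.0
    external = [t for t in triples
                if not any(t[1].startswith(p) for p in own_prefixes)]
    return len(external) / len(triples)

toy = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),      # proprietary predicate
    ("dbr:Berlin", "rdfs:label", "Berlin"),            # external (RDFS)
    ("dbr:Berlin", "owl:sameAs", "wd:Q64"),            # external (OWL)
    ("dbr:Berlin", "dbo:populationTotal", "3500000"),  # proprietary predicate
]

print(m_ext_voc(toy, ("dbo:", "dbp:")))  # 0.5
```

This makes the Wikidata observation above concrete: masses of rdfs:label, schema:description, and rdf:type triples push the external-vocabulary share upwards regardless of how many proprietary relations exist.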

Interoperability of proprietary vocabulary (mpropVoc)

Evaluation method: This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,[118] owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. We made the following single findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.[119] Regarding its relations, DBpedia links to Wikidata and schema.org.[120] Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org, and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked.[121]

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL reference states: "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12. Evaluation results for the KGs regarding the dimension Accessibility

          DB   FB    OC    WD    YA
mDeref    1    1     0.44  0.41  1
mAvai     <1   0.73  <1    <1    1
mSPARQL   1    1     0     1     0
mExport   1    1     1     1     1
mNegot    0.5  1     0     1     0
mHTMLRDF  1    1     1     1     0
mMeta     1    0     0     0     1

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links.

5.2.9. Accessibility
The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (mDeref)

Evaluation method: We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
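The request construction used for such a check can be sketched with Python's standard library. The URI below is a regular DBpedia resource URI; the snippet only builds the request, since illustrating the Accept header needs no network traffic:

```python
# Sketch of the dereferencing check: request a resource URI with an HTTP
# Accept header asking for RDF/XML, as described in the evaluation method.
import urllib.request

def rdf_request(uri):
    """Build a dereferencing request that asks for an RDF/XML representation."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

req = rdf_request("http://dbpedia.org/resource/Karlsruhe")
print(req.get_header("Accept"))  # application/rdf+xml

# An actual dereferencing test would then check for HTTP 200 and an RDF payload:
#   with urllib.request.urlopen(req) as resp:
#       ok = resp.status == 200 and "rdf" in resp.headers.get("Content-Type", "")
```

Repeating this for sampled subject, predicate, and object URIs and counting the successful responses yields the mDeref scores.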

Evaluation results: In case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that they fulfilled this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K, due to the small number of unique predicates. We observed almost the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s," as in wdt:P1024s, is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferencable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

Availability of the KG (mAvai)

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom.[122] For each KG, an uptime test was set up which checked the availability of the resource "Hamburg," as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200), every minute over the time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result: While the other KGs showed almost no outages and were on average back online after some minutes, YAGO outages took place frequently and lasted 3.5 hours on average.[123] In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (mSPARQL)

The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server,[124] the Wikidata SPARQL endpoint via Blazegraph.[125] Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: the maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 1.5M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

Provisioning of an RDF export (mExport)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in the N-Triples and Turtle formats.

Support of content negotiation (mNegot)

We measured the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that for the N-Triples serialization, YAGO and DBpedia require the accept header text/plain instead of application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (mHTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate" type="[content type]" href="[URL]"> in the HTML header.

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13. Evaluation results for the KGs regarding the dimension License

             DB   FB   OC   WD   YA
mmacLicense  1    0    0    1    0
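Checking for such a header link can be sketched with Python's standard-library HTML parser; the HTML fragment below is invented for illustration:

```python
# Sketch of checking whether an HTML page links its RDF serialization via
# <link rel="alternate"> in the header, as described above. Toy HTML input.
from html.parser import HTMLParser

class AlternateLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.alternates = []  # collected (content type, URL) pairs

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates.append((a.get("type"), a.get("href")))

html = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="http://example.org/data/Berlin.rdf"/></head></html>')

finder = AlternateLinkFinder()
finder.feed(html)
print(finder.alternates)  # the one (type, href) pair from the header
```

A page passes the mHTMLRDF check if at least one such alternate link points to an RDF serialization of the resource.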

Provisioning of metadata about the KG (mMeta)

For this criterion, we analyzed whether KG metadata is available, for instance in the form of a VoID file.[126] DBpedia integrates the VoID vocabulary directly in its KG[127] and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License
The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (mmacLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC BY-SA[128] and the GNU Free Documentation License (GNU FDL).[129] Wikidata embeds licensing information in the RDF document during the dereferencing of resources, linking with cc:license to the license CC0.[130] YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC BY.[131] OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.[132]

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14. Evaluation results for the KGs regarding the dimension Interlinking

       DB    FB    OC    WD       YA
mInst  0.25  0     0.38  0 (0.9)  0.31
mURIs  0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking
The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (mInst)

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects which are instances, but neither classes nor relations,[133] and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
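A sketch of this metric over toy triples; the instance/class distinction and the namespace handling are simplified here, and the data is invented:

```python
# Sketch of the interlinking metric: the fraction of subjects having at
# least one owl:sameAs link to a resource outside the KG's own namespace.
# Toy triples; the instance/class filtering of the paper is omitted.

def m_inst(triples, own_prefix):
    subjects = {s for s, p, o in triples}
    if not subjects:
        return 0.0
    linked = {s for s, p, o in triples
              if p == "owl:sameAs" and not o.startswith(own_prefix)}
    return len(linked) / len(subjects)

toy = [
    ("dbr:Berlin", "owl:sameAs", "wd:Q64"),         # external link
    ("dbr:Berlin", "owl:sameAs", "dbr-de:Berlin"),  # internal (localized version)
    ("dbr:Paris", "rdf:type", "dbo:City"),          # no owl:sameAs link
]

print(m_inst(toy, "dbr"))  # 0.5: one of two subjects is externally linked
```

Note how the prefix test mirrors the evaluation decision described below: links into localized versions of the same KG count as internal and do not contribute to the score.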

Evaluation result OpenCyc and YAGO achieve thebest results wrt this metric but DBpedia has by farthe most instances with at least one owlsameAs linkWe can therefore confirm the statement by Bizer et al[12] that DBpedia has established itself as a hub in theLinked Data cloud

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided nor is a corresponding proprietary relation available. Instead, Wikidata uses for each linked data set a proprietary relation (called identifier) to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as literal value (e.g., /m/01x3gpk). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided in the browser interface as hyperlinks, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO    43

only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.
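Such equivalence literals can nevertheless be rewritten mechanically into owl:sameAs triples. The sketch below assumes the namespace convention of the Freebase RDF dumps (an M-ID "/m/01x3gpk" maps to ".../ns/m.01x3gpk"); both this convention and the Wikidata entity URI used here are assumptions for illustration, not part of Wikidata itself.

```python
# Sketch: rewriting a Wikidata "Freebase identifier" (wdt:P646) literal into
# an owl:sameAs triple. The target namespace follows the convention of the
# Freebase RDF dumps ("/m/01x3gpk" -> ".../ns/m.01x3gpk"); the convention and
# the entity URI below are assumptions for illustration.
FREEBASE_NS = "http://rdf.freebase.com/ns/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def freebase_id_to_sameas(entity_uri, mid_literal):
    """Map an M-ID literal such as '/m/01x3gpk' to an owl:sameAs triple."""
    if not mid_literal.startswith("/m/"):
        raise ValueError("not a Freebase M-ID: %r" % mid_literal)
    return (entity_uri, OWL_SAME_AS, FREEBASE_NS + "m." + mid_literal[3:])

# Placeholder Wikidata entity URI, not a real mapping:
triple = freebase_id_to_sameas("http://www.wikidata.org/entity/Q1234",
                               "/m/01x3gpk")
```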

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In case of OpenCyc, links to Cyc,134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.135

Validity of external URIs (mURIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
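The classification of link-check results described above can be sketched as follows; the status codes and counts in the sample are made up for illustration.

```python
# Sketch of the error classification behind m_URIs: each link check yields an
# HTTP status code, or None for a timeout / connection failure. The sample of
# status codes below is made up.
def classify(status):
    """Bucket one link-check result."""
    if status is None:
        return "timeout"
    if 400 <= status <= 499:
        return "client error"   # e.g., 404 Not Found
    if 500 <= status <= 599:
        return "server error"
    return "valid"              # 2xx and 3xx responses resolve

def m_uris(statuses):
    """Fraction of external links resolving without error."""
    return sum(classify(s) == "valid" for s in statuses) / len(statuses)

# Toy sample: eight resolvable links, one 404, one timeout.
sample = [200, 200, 301, 200, 404, 200, None, 200, 200, 200]
score = m_uris(sample)  # 8/10 = 0.8
```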

Evaluation result: The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources on wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation reference URL (wdt:P854), which states provenance information among other relations, belongs to the links linking to external Web resources. Here, we were able to resolve around 95.5% without errors.

134 I.e., sw.cyc.com.

135 See Interoperability of proprietary vocabulary in Sec. 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In case of Wikidata, some invalid literals, such as the ISBN, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as the ISBN) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around a third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is in particular worthwhile in case of statements which are only valid for a limited period of time.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard were existing in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains. Hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Wikidata is the only KG achieving the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., the term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as not being easily understandable for humans.

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We can find mixed paradigms regarding URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (in part; classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]. DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We can mention as a reason for that the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferenceable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large amount of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, taking on average 3.5 hours each.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation. While OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form. DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on the resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are for all KGs valid in most cases. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
- Identifying the preselection criteria P
- Assigning a weight wi to each DQ criterion ci ∈ C

Step 2: Preselection based on the Preselection Criteria
- Manually selecting the KGs GP that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
- Calculating the DQ metric mi(g) for each DQ criterion ci ∈ C
- Calculating the fulfillment degree h(g) for each KG g ∈ GP
- Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
- Assessing the selected KG g w.r.t. qualitative aspects
- Comparing the selected KG g with the other KGs in GP

Fig. 11. Proposed process for using our KG recommendation framework.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion. The license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are discarded which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics), and, if necessary, an alternative KG can be selected for the given scenario.
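Assuming the weighted-average form of the fulfillment degree from Section 3.1, Step 3 can be sketched as follows; the criterion names, metric values, and weights below are made up for illustration.

```python
# Sketch of Step 3: the fulfillment degree h(g) as the weighted average of the
# DQ metric values m_i(g), normalized by the sum of the weights (cf. the
# formula in Section 3.1). All values below are made up for illustration.
def fulfillment_degree(metric_values, weights):
    """metric_values and weights are dicts keyed by DQ criterion name."""
    total_weight = sum(weights.values())
    return sum(weights[c] * metric_values[c] for c in weights) / total_weight

metrics_g = {"mFreq": 1.0, "mcPop": 0.99, "mDeref": 0.414}
weights   = {"mFreq": 3,   "mcPop": 3,    "mDeref": 2}
h_g = fulfillment_degree(metrics_g, weights)  # (3*1.0 + 3*0.99 + 2*0.414) / 8

# With equal weights, h(g) degenerates to the unweighted average of the metrics.
```

Step 3 then picks the KG with the highest h(g) among the preselected KGs.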

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography for each musician. For being able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

138 We assume that in this use case rather the dereferencing of HTTP URIs than the execution of SPARQL queries is desired.


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension               Metric        DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example of User Weighting wi

Accuracy                msynRDF       1        1         1        1         1       1
                        msynLit       0.994    1         1        1         0.624   1
                        msemTriple    0.990    0.995     1        0.993     0.993   1

Trustworthiness         mgraph        0.5      0.5       1        0.75      0.25    0
                        mfact         0.5      1         0        1         1       1
                        mNoVal        0        1         0        1         0       0

Consistency             mcheckRestr   0        1         0        1         0       0
                        mconClass     0.875    1         0.999    1         0.333   0
                        mconRelat     0.992    0.451     1        0.500     0.992   0

Relevancy               mRanking      0        1         0        1         0       1

Completeness            mcSchema      0.905    0.762     0.921    1         0.952   1
                        mcCol         0.402    0.425     0        0.285     0.332   2
                        mcPop         0.93     0.94      0.48     0.99      0.89    3

Timeliness              mFreq         0.5      0         0.25     1         0.25    3
                        mValidity     0        1         0        1         1       0
                        mChange       0        1         0        0         0       0

Ease of understanding   mDescr        0.704    0.972     1        0.9999    1       1
                        mLang         1        1         0        1         1       0
                        muSer         1        1         0        1         1       0
                        muURI         1        0.5       1        0         1       1

Interoperability        mReif         0.5      0.5       0.5      0         0.5     0
                        miSerial      1        0         0.5      1         1       1
                        mextVoc       0.61     0.108     0.415    0.682     0.134   1
                        mpropVoc      0.150    0         0.513    0.001     0       1

Accessibility           mDeref        1        0.437     1        0.414     1       2
                        mAvai         0.9961   0.9998    1        0.9999    0.7306  2
                        mSPARQL       1        0         0        1         1       1
                        mExport       1        1         1        1         1       0
                        mNegot        0.5      0         0        1         1       0
                        mHTMLRDF      1        1         0        1         1       0
                        mMeta         1        0         1        0         0       0

Licensing               mmacLicense   1        0         0        1         0       0

Interlinking            mInst         0.251    0         0.382    0         0.310   3
                        mURIs         0.929    0.908     0.894    0.957     0.956   1

Unweighted Average                    0.683    0.603     0.496    0.752     0.625
Weighted Average                      0.701    0.493     0.556    0.714     0.648


4. Qualitative assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately. The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. the discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data, based on quality criteria and metrics which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (mgraph), Indicating unknown and empty values (mNoVal), Check of schema restrictions during insertion of new statements (mcheckRestr), Creating a ranking of statements (mRanking), Timeliness frequency of the KG (mFreq), Specification of the validity period of statements (mValidity), and Availability of the KG (mAvai), have not been proposed so far, to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (mDescr) and Column completeness (mcCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (mSPARQL) and Provisioning of an RDF export (mExport). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (mpropVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]

msynRDF X X

msynLit X X X X

msemTriple X X X X

mfact X X

mconClass X X X

mconRelat X X X X X X

mcSchema X X

mcCol X X X X

mcPop X X

mChange X X

mDescr X X X X

mLang X

muSer X

muURI X

mReif X X X

miSerial X

mextV oc X X

mpropV oc X

mDeref X X X X

mSPARQL X

mExport X X

mNegot X X X

mHTMLRDF X

mMeta X X X

mmacLicense X X X

mInst X X X

mURIs X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (mLang) and Validity of external URIs (mURIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia, as well as four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of the data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce with the system OntoQA metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on the instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both the schema and the instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and its subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances in a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of the KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once, in the domain geography.

8. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M Acosta E Simperl F Floumlck and M Vidal HARE AHybrid SPARQL Engine to Enhance Query Answers viaCrowdsourcing In Proceedings of the 8th InternationalConference on Knowledge Capture K-CAP 2015 pages111ndash118 ACM 2015

[2] M Acosta A Zaveri E Simperl D Kontokostas S Auer andJ Lehmann Crowdsourcing linked data quality assessment InThe Semantic WebndashISWC 2013 pages 260ndash276 Springer 2013

[3] M Acosta A Zaveri E Simperl D Kontokostas F Floumlckand J Lehmann Detecting Linked Data Quality Issues viaCrowdsourcing A DBpedia Study Semantic Web 2016

[4] S Auer C Bizer G Kobilarov J Lehmann R Cyganiak andZ Ives DBpedia A Nucleus for a Web of Open Data InProceedings of the 6th International Semantic Web Conferenceand 2nd Asian Semantic Web Conference ISWC 2007ASWC2007 pages 722ndash735 Springer 2007

[5] S Auer J Lehmann A-C Ngonga Ngomo and A ZaveriIntroduction to Linked Data and Its Lifecycle on the Web InReasoning Web Semantic Technologies for Intelligent DataAccess volume 8067 of Lecture Notes in Computer Sciencepages 1ndash90 Springer Berlin Heidelberg 2013

[6] C Batini C Cappiello C Francalanci and A MaurinoMethodologies for Data Quality Assessment and ImprovementACM Comput Surv 41(3)161ndash1652 July 2009

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 51

[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Online; accessed 20-Jul-2015].

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009, Heraklion, pages 723–737. Springer, Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355. Springer, Berlin Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. [Online; accessed 29-Aug-2016].

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.


– Bizer [11] compared the work of Wang et al. [47] with other works in the area of data quality. He thereby complements the framework with the dimensions consistency, verifiability, and offensiveness.

– Zaveri et al. [49] follow Wang et al. [47], but introduce licensing and interlinking as new dimensions in the Linked Data context.

In this article, we use the DQ dimensions as defined by Wang et al. [47] and as extended by Bizer [11] and Zaveri et al. [49]. More precisely, we make the following adaptations to Wang et al.'s framework:

1. Consistency is treated by us as a separate DQ dimension.

2. Verifiability is incorporated within the DQ dimension Trustworthiness as the criterion Trustworthiness on statement level.

3. The Offensiveness of KG facts is not considered by us, as it is hard to make an objective evaluation in this regard.

4. We extend the accessibility data quality category by the dimensions License and Interlinking, as those data quality dimensions become additionally relevant in the Linked Data context.

3.1. Criteria Weighting

When applying our framework to compare KGs, the single DQ metrics can be weighted differently, so that the needs and requirements of the users can be taken into account. In the following, we first formalize the idea of weighting the different metrics. We then present the criteria and the corresponding metrics of our framework.

Given are a KG g, a set of criteria C = {c_1, ..., c_n}, a set of metrics M = {m_1, ..., m_n}, and a set of weights W = {w_1, ..., w_n}. Each metric m_i corresponds to the criterion c_i, and m_i(g) ∈ [0, 1], where a value of 0 defines the minimum fulfillment degree of a KG regarding a quality criterion and a value of 1 the maximum fulfillment degree. Furthermore, each criterion c_i is weighted by w_i.

The fulfillment degree h(g) ∈ [0, 1] of a KG g is then the weighted, normalized sum of the fulfillment degrees w.r.t. the criteria c_1, ..., c_n:

h(g) = \frac{\sum_{i=1}^{n} w_i \cdot m_i(g)}{\sum_{j=1}^{n} w_j}
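A minimal sketch of this computation (Python; the scores and weights are invented for illustration):

```python
# Weighted, normalized fulfillment degree h(g): metric scores m_i(g) in [0, 1]
# are combined using user-chosen weights w_i.
def fulfillment_degree(scores, weights):
    assert len(scores) == len(weights) and sum(weights) > 0
    return sum(w * m for w, m in zip(weights, scores)) / sum(weights)

scores = [1.0, 0.5, 0.75]   # e.g., three metric values for one KG
weights = [1, 2, 1]          # the user emphasizes the second criterion
h = fulfillment_degree(scores, weights)
print(h)  # (1*1.0 + 2*0.5 + 1*0.75) / 4 = 0.6875
```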

Based on the quality dimensions introduced by Wang et al. [47], we now present the DQ criteria and metrics as used in our KG comparison. Note that some of the criteria have already been introduced by others, as outlined in Section 7.

Note also that our metrics are to be understood as possible ways of how to evaluate the DQ dimensions. Other definitions of the DQ metrics might be possible and reasonable. We defined the metrics along the characteristics of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, but kept the definitions as generic as possible. In the evaluations, we then used those metric definitions and applied them, e.g., on the basis of self-created gold standards.

3.2. Intrinsic Category

"Intrinsic data quality denotes that data have quality in their own right" [47]. This kind of data quality can therefore be assessed independently from the context. The intrinsic category embraces the three dimensions Accuracy, Trustworthiness, and Consistency, which are defined in the following subsections. The dimensions Believability, Objectivity, and Reputation, which are separate dimensions in Wang et al.'s classification system [47], are subsumed by us under the dimension Trustworthiness.

3.2.1. Accuracy
Definition of dimension: Accuracy is "the extent to which data are correct, reliable, and certified free of error" [47].

Discussion: Accuracy is intuitively an important dimension of data quality. Previous work on data quality has mainly analyzed only this aspect [47]. Hence, accuracy has often been used as a synonym for data quality [39]. Bizer [11] highlights in this context that Accuracy is an objective dimension and can only be applied to verifiable statements.

Batini et al. [6] distinguish between syntactic and semantic accuracy. Syntactic accuracy describes the formal compliance to syntactic rules, without reviewing whether the value reflects the reality. Semantic accuracy determines whether the value is semantically valid, i.e., whether the value is true. Based on the classification of Batini et al., we can define the metric for Accuracy as follows.

Definition of metric: The dimension Accuracy is determined by the criteria

– Syntactic validity of RDF documents,
– Syntactic validity of literals, and
– Semantic validity of triples.

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows.

Syntactic validity of RDF documents: The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF validator.14

m_{synRDF}(g) =
\begin{cases}
1 & \text{if all RDF documents are valid} \\
0 & \text{otherwise}
\end{cases}

Syntactic validity of literals: Assessing the syntactic validity of literals means determining to which degree the literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

m_{synLit}(g) = \frac{|\{(s,p,o) \in g \mid o \in L \wedge synValid(o)\}|}{|\{(s,p,o) \in g \mid o \in L\}|}

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
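A possible implementation sketch for date literals (Python; triples are simplified to tuples, literals are tagged values, and only a simplified YYYY-MM-DD pattern of ISO 8601 is checked):

```python
import re

# Sketch: syntactic validity of literals, here checked only against a
# simplified ISO 8601 date pattern (YYYY-MM-DD); a real check would apply
# one rule per literal data type. Literals are tagged as ("lit", value)
# to tell them apart from resource objects.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def m_syn_lit(triples):
    literals = [o[1] for _, _, o in triples
                if isinstance(o, tuple) and o[0] == "lit"]
    if not literals:              # empty denominator -> metric evaluates to 1
        return 1.0
    valid = sum(1 for v in literals if ISO_DATE.match(v))
    return valid / len(literals)

triples = [
    ("dbr:Barack_Obama", "dbo:birthDate", ("lit", "1961-08-04")),  # valid
    ("dbr:X", "dbo:birthDate", ("lit", "4 Aug 1961")),             # invalid
    ("dbr:X", "dbo:spouse", "dbr:Y"),                              # no literal
]
print(m_syn_lit(triples))  # 1 valid literal out of 2 -> 0.5
```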

Semantic validity of triples: The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer [11] notes that a triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File), if it is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has implemented similar guidelines to determine whether a fact needs to be sourced.15

14 See http://www.w3.org/RDF/Validator/, requested on Feb 29, 2016.

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as gold standard. We determine the fulfillment degree as the precision with which the triples that appear both in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let no_{g,GS} = |{(s,p,o) | (s,p,o) ∈ g ∧ ∃(x,y,z) ∈ GS : equi(s,x) ∧ equi(p,y) ∧ equi(o,z)}| be the number of triples in g to which semantically corresponding triples in the gold standard GS exist. Let no_g = |{(s,p,o) | (s,p,o) ∈ g ∧ ∃(x,y,z) ∈ GS : equi(s,x) ∧ equi(p,y)}| be the number of triples in g where the subject-relation pairs (s,p) are semantically equivalent to subject-relation pairs (x,y) in the gold standard. Then we can state:

m_{semTriple}(g) = \frac{no_{g,GS}}{no_g}

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
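The precision computation can be sketched as follows (Python; equi(·,·) is simplified here to plain identifier equality, which is an assumption — in practice it would have to resolve owl:sameAs links and differing identifiers):

```python
# Sketch of m_semTriple: precision of KG triples against a gold standard GS.
# equi(...) is simplified to identifier equality.
def m_sem_triple(kg, gold):
    gold_sp = {(x, y) for x, y, _ in gold}
    no_g = [t for t in kg if (t[0], t[1]) in gold_sp]   # matching (s, p) pairs
    if not no_g:                                        # empty denominator -> 1
        return 1.0
    no_g_gs = [t for t in no_g if t in gold]            # value matches as well
    return len(no_g_gs) / len(no_g)

kg = [("q1", "birthDate", "1961-08-04"), ("q2", "birthDate", "1900-01-01")]
gold = [("q1", "birthDate", "1961-08-04"), ("q2", "birthDate", "1955-02-24")]
print(m_sem_triple(kg, gold))  # one of two comparable triples is correct -> 0.5
```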

3.2.2. Trustworthiness
Definition of dimension: Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is "the extent to which data are accepted or regarded as true, real, and credible" [47].

– Reputation: Reputation is "the extent to which data are trusted or highly regarded in terms of their source or content" [47].

– Objectivity: Objectivity is "the extent to which data are unbiased (unprejudiced) and impartial" [47].

– Verifiability: Verifiability is "the degree and ease with which the data can be checked for correctness" [39].

15 See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.


Discussion: In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the "expected accuracy" of a data source.

– Reputation: The essential difference between believability and accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since also biased statements could be verifiable.

– Verifiability: Heath and Bizer [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric: We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level: The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between (1) whether the data is entered by experts (of a specific domain), (2) whether the knowledge comes from volunteers contributing in a community, and (3) whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_{graph}(h_g) =
\begin{cases}
1 & \text{manual data curation; manual data insertion in a closed system} \\
0.75 & \text{manual data curation and insertion, both by a community} \\
0.5 & \text{manual data curation; data insertion by community or by automated knowledge extraction} \\
0.25 & \text{automated data curation; data insertion by automated knowledge extraction from structured data sources} \\
0 & \text{automated data curation; data insertion by automated knowledge extraction from unstructured data sources}
\end{cases}

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be used for defining the DQ metrics.

Trustworthiness on statement level: The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition to assess statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms16 with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology17 with properties such as prov:wasDerivedFrom.

16 See http://purl.org/dc/terms/, requested on Feb 4, 2017.

17 See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


m_{fact}(g) =
\begin{cases}
1 & \text{provenance on statement level is used} \\
0.5 & \text{provenance on resource level is used} \\
0 & \text{otherwise}
\end{cases}

Indicating unknown and empty values: If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow representing that a person has no children, and unknown values allow representing that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_{NoVal}(g) =
\begin{cases}
1 & \text{unknown and empty values are used} \\
0.5 & \text{either unknown or empty values are used} \\
0 & \text{otherwise}
\end{cases}

3.2.3. Consistency
Definition of dimension: Consistency implies that "two or more values [in a dataset] do not conflict with each other" [37].

Discussion: Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource. This means that the subject is the only resource linked to the given object via the given relation.

Definition of metric: We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements: Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side in the user interface. For instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

m_{checkRestr}(h_g) =
\begin{cases}
1 & \text{schema restrictions are checked} \\
0 & \text{otherwise}
\end{cases}

Consistency of statements w.r.t. class constraints: This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. That is, let CC be the set of all class constraints, defined as CC = {(c_1, c_2) | (c_1, owl:disjointWith, c_2) ∈ g}.18 Furthermore, let cg(e) be the set of all classes of instance e in g, defined as cg(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_{conClass}(g) = \frac{|\{(c_1,c_2) \in CC \mid \nexists e : c_1 \in cg(e) \wedge c_2 \in cg(e)\}|}{|CC|}

In case of an empty set of class constraints CC, the metric should evaluate to 1.
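A sketch of this check (Python; instances are represented by a hypothetical mapping from instance to its set of classes):

```python
# Sketch of m_conClass: share of owl:disjointWith constraints that no
# instance violates by being typed with both classes of the pair.
def m_con_class(type_of, disjoint_pairs):
    """type_of: dict instance -> set of classes; disjoint_pairs: set of (c1, c2)."""
    if not disjoint_pairs:        # no class constraints -> metric evaluates to 1
        return 1.0
    satisfied = sum(
        1 for c1, c2 in disjoint_pairs
        if not any(c1 in cls and c2 in cls for cls in type_of.values())
    )
    return satisfied / len(disjoint_pairs)

type_of = {
    "e1": {"dbo:Person"},
    "e2": {"dbo:Person", "dbo:Place"},   # violates the first constraint below
}
constraints = {("dbo:Person", "dbo:Place"), ("dbo:Event", "dbo:Place")}
print(m_con_class(type_of, constraints))  # 1 of 2 constraints satisfied -> 0.5
```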

Consistency of statements w.r.t. relation constraints: The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from the single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_{conRelat}(g) = \frac{1}{n} \sum_{i=1}^{n} m_{conRelat_i}(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,19 we can state:

m_{conRelat}(g) = \frac{m_{conRelatRg}(g) + m_{conRelatFct}(g)}{2}

Let R_r be the set of all rdfs:range constraints:

R_r = \{(p, d) \mid (p, \text{rdfs:range}, d) \in g \wedge isDatatype(d)\}

18 Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

19 We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

and let R_f be the set of all owl:FunctionalProperty constraints:

R_f = \{(p, d) \mid (p, \text{rdf:type}, \text{owl:FunctionalProperty}) \in g \wedge (p, \text{rdfs:range}, d) \in g \wedge isDatatype(d)\}

Then we can define the metrics m_{conRelatRg}(g) and m_{conRelatFct}(g) as follows:

m_{conRelatRg}(g) = \frac{|\{(s,p,o) \in g \mid \exists (p,d) \in R_r : datatype(o) \neq d\}|}{|\{(s,p,o) \in g \mid \exists (p,d) \in R_r\}|}

m_{conRelatFct}(g) = \frac{|\{(s,p,o) \in g \mid \exists (p,d) \in R_f \wedge \nexists (s,p,o_2) \in g : o \neq o_2\}|}{|\{(s,p,o) \in g \mid \exists (p,d) \in R_f\}|}

In case of an empty set of relation constraints (R_r or R_f), the respective metric should evaluate to 1.
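The functional-property part can be sketched as follows (Python; toy triples, with birthDate assumed to be declared functional):

```python
from collections import defaultdict

# Sketch of m_conRelatFct: for functional relations, a subject may have at
# most one object; the metric is the share of such triples that have no
# conflicting second value.
def m_con_relat_fct(triples, functional_relations):
    relevant = [t for t in triples if t[1] in functional_relations]
    if not relevant:                      # empty set of constraints -> 1
        return 1.0
    values = defaultdict(set)             # (s, p) -> set of objects
    for s, p, o in relevant:
        values[(s, p)].add(o)
    consistent = sum(1 for s, p, o in relevant if len(values[(s, p)]) == 1)
    return consistent / len(relevant)

triples = [
    ("q1", "birthDate", "1961-08-04"),
    ("q2", "birthDate", "1900-01-01"),
    ("q2", "birthDate", "1900-01-02"),    # second value -> inconsistent
]
print(m_con_relat_fct(triples, {"birthDate"}))  # 1 of 3 triples consistent
```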

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension: Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion: According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric: The dimension Relevancy is determined by the criterion Creating a ranking of statements.20 The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

20 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements: By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions, which he no longer holds, are ranked with normal rank (wdo:NormalRank).

m_{Ranking}(g) =
\begin{cases}
1 & \text{ranking of statements supported} \\
0 & \text{otherwise}
\end{cases}

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension: Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion: Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing,

2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing, and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. Completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric: We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness: By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_{cSchema}(g) = \frac{noclat_g}{noclat}

Column completeness: In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class, which are defined on the schema level (each relation has one column), exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of relations used for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

$$ m_{cCol}(g) = \frac{1}{|H|} \sum_{(k,p) \in H} \frac{no_{kp}}{no_k} $$

We thereby let $H = \{(k,p) \in (K \times P) \mid \exists k \in C_g \wedge \exists (x,p,o): p \in P^{imp}_g \wedge (x, \text{rdf:type}, k)\}$ be the set of all combinations of the considered classes $K = \{k_1, \ldots, k_n\}$ and considered relations $P = \{p_1, \ldots, p_m\}$.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring the Column completeness, we selected only those relations for an assessment where a value of the relation typically exists for all given instances.
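To make the averaging concrete, m_cCol can be sketched over a toy triple list; the class names, relations, and instances below are invented for illustration:

```python
# Toy sketch of the Column completeness metric m_cCol: for each
# class-relation pair (k, p) in H, compute the share of instances of k
# that have a value for p (no_kp / no_k), then average over |H|.

def m_c_col(triples, pairs):
    total = 0.0
    for k, p in pairs:  # pairs plays the role of the set H
        instances = {s for (s, pred, o) in triples
                     if pred == "rdf:type" and o == k}       # no_k
        with_value = {s for (s, pred, o) in triples
                      if pred == p and s in instances}        # no_kp
        total += len(with_value) / len(instances)
    return total / len(pairs)

# Invented example: one of two persons has a birthDate.
triples = [
    ("alice", "rdf:type", "Person"), ("alice", "birthDate", "1984"),
    ("bob", "rdf:type", "Person"),   # bob has no birthDate value
]
pairs = {("Person", "birthDate")}
print(m_c_col(triples, pairs))  # 1 of 2 persons -> 0.5
```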

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

$$ m_{cPop}(g) = \frac{|\{e \mid e \in GS \wedge e \in E_g\}|}{|\{e \mid e \in GS\}|} $$

3.3.3. Timeliness
Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional, isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

21 For an evaluation about predicting which relations are of this nature, see [1].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period of statements, and Specification of the modification date of statements.

The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately, but the RDF export files are available in discrete, varying updating intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable:

$$ m_{Freq}(g) = \begin{cases} 1 & \text{continuous updates} \\ 0.5 & \text{discrete periodic updates} \\ 0.25 & \text{discrete non-periodic updates} \\ 0 & \text{otherwise} \end{cases} $$

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start and possibly end dates of statements by means of providing suitable forms of representation:

$$ m_{Validity}(g) = \begin{cases} 1 & \text{specification of validity period supported} \\ 0 & \text{otherwise} \end{cases} $$

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

$$ m_{Change}(g) = \begin{cases} 1 & \text{specification of modification dates for statements supported} \\ 0 & \text{otherwise} \end{cases} $$


3.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: a KG) can be improved by features such as descriptive labels and literals in multiple languages.

Definition of metric. The dimension understandability is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. the dimension Ease of understanding is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace:

$$ m_{Descr}(g) = \frac{|\{u \mid u \in U^{local}_g \wedge \exists (u,p,o) \in g: p \in P_{lDesc}\}|}{|\{u \mid u \in U^{local}_g\}|} $$

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).
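A minimal sketch of m_Descr over an invented triple list (all URIs and the local namespace below are placeholders):

```python
# Toy sketch of the Description of resources metric m_Descr:
# the share of local resources carrying at least one label/description.

LABEL_RELATIONS = {"rdfs:label", "rdfs:comment", "schema:description"}

def m_descr(triples, local_uris):
    described = {s for (s, p, o) in triples
                 if s in local_uris and p in LABEL_RELATIONS}
    return len(described) / len(local_uris)

# Invented example data:
local = {"ex:Berlin", "ex:Paris", "ex:node123"}
triples = [
    ("ex:Berlin", "rdfs:label", "Berlin"),
    ("ex:Paris", "schema:description", "Capital of France"),
    # ex:node123 (e.g. an intermediate node) has neither label nor comment
]
print(m_descr(triples, local))  # 2 of 3 local resources described
```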

(Note: DBpedia deviates noticeably here, since some entities (created by the intermediate-node mapping) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the same namespace), but perform the evaluation on the entities only.)

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG:

$$ m_{Lang}(g) = \begin{cases} 1 & \text{labels provided in English and at least one other language} \\ 0 & \text{otherwise} \end{cases} $$

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard for humans to read. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats, such as N3, N-Triples, and Turtle. We measure this criterion via the serialization formats supported during the dereferencing of resources:

$$ m_{uSer}(h_g) = \begin{cases} 1 & \text{other RDF serializations than RDF/XML available} \\ 0 & \text{otherwise} \end{cases} $$

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, since these are easier for humans to understand and remember than opaque URIs24 such as wdt:Q1040. The criterion Self-describing URIs evaluates whether self-describing URIs or generic IDs are used for the identification of resources:

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

$$ m_{uURI}(g) = \begin{cases} 1 & \text{self-describing URIs always used} \\ 0.5 & \text{self-describing URIs partly used} \\ 0 & \text{otherwise} \end{cases} $$

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].

– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].

– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability. In contrast to the dimension understandability, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked from resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may get complicated if RDF reification, RDF collections, and RDF containers are used. We agree on that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema/, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered as ambivalent. On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used:

$$ m_{Reif}(g) = \begin{cases} 1 & \text{no blank nodes and no RDF reification} \\ 0.5 & \text{either blank nodes or RDF reification} \\ 0 & \text{otherwise} \end{cases} $$

Provisioning of several serialization formats. The interpretability of RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing:

$$ m_{iSerial}(h_g) = \begin{cases} 1 & \text{RDF/XML and further formats are supported} \\ 0.5 & \text{only RDF/XML is supported} \\ 0 & \text{otherwise} \end{cases} $$

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows for comfortable data integration. We measure the criterion of using external vocabulary by setting the number of triples with external vocabulary in predicate position in relation to the number of all triples in the KG:

$$ m_{extVoc}(g) = \frac{|\{(s,p,o) \mid (s,p,o) \in g \wedge p \in P^{external}_g\}|}{|\{(s,p,o) \in g\}|} $$

Interoperability of proprietary vocabulary. Linking on schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources:

$$ m_{propVoc}(g) = \frac{|\{x \in P_g \cup C_g \mid \exists (x,p,o) \in g: (p \in P_{eq} \wedge o \in U \wedge o \in U^{ext}_g)\}|}{|P_g \cup C_g|} $$

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U^ext_g consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.
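As a toy sketch of m_propVoc (all identifiers below are invented placeholders):

```python
# Toy sketch of the Interoperability of proprietary vocabulary metric
# m_propVoc: the share of proprietary classes/relations that carry at
# least one equivalency link to an external vocabulary term.

EQ_RELATIONS = {"owl:sameAs", "owl:equivalentProperty", "owl:equivalentClass"}

def m_prop_voc(triples, proprietary_terms, external_uris):
    linked = {s for (s, p, o) in triples
              if s in proprietary_terms and p in EQ_RELATIONS
              and o in external_uris}
    return len(linked) / len(proprietary_terms)

# Invented example: one of two proprietary terms is linked externally.
terms = {"ex:Person", "ex:birthDate"}
external = {"foaf:Person"}
triples = [("ex:Person", "owl:equivalentClass", "foaf:Person")]
print(m_prop_voc(triples, terms, external))  # 1 of 2 terms linked -> 0.5
```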

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility
Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. The availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], the availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. The response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) via Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:

– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of resources is given if HTTP status code 200 and an RDF document are returned:

$$ m_{Deref}(h_g) = \frac{|dereferencable(U_g)|}{|U_g|} $$

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.26

$$ m_{Avai}(h_g) = \frac{\text{Number of successful requests}}{\text{Number of all requests}} $$

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially including many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however, we do not measure these restrictions here:

$$ m_{SPARQL}(h_g) = \begin{cases} 1 & \text{SPARQL endpoint publicly available} \\ 0 & \text{otherwise} \end{cases} $$

26 See http://pingdom.com, requested on Mar 1, 2016.

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available:

$$ m_{Export}(h_g) = \begin{cases} 1 & \text{RDF export available} \\ 0 & \text{otherwise} \end{cases} $$

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to the fact that serialized RDF data is not processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content types and by comparing the accept header of the HTTP request with the content type of the HTTP response:

$$ m_{Negot}(h_g) = \begin{cases} 1 & \text{CN supported and correct content types returned} \\ 0.5 & \text{CN supported but wrong content types returned} \\ 0 & \text{otherwise} \end{cases} $$
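The matching step behind m_Negot can be illustrated with a simplified sketch. The MIME types are standard; the matching logic is a deliberately reduced illustration, not a full implementation of HTTP content negotiation:

```python
# Simplified sketch of the check behind m_Negot: did the server answer
# with a content type that was both requested and an RDF serialization?

RDF_TYPES = ["application/rdf+xml", "text/turtle", "application/n-triples"]

def negotiation_ok(accept_header, returned_content_type):
    # Strip quality parameters like ";q=0.9" for this toy comparison.
    requested = [t.split(";")[0].strip() for t in accept_header.split(",")]
    returned = returned_content_type.split(";")[0].strip()
    return returned in requested and returned in RDF_TYPES

# Server honored the request:
print(negotiation_ok("text/turtle, application/rdf+xml", "text/turtle"))  # True
# Server answered RDF data with a wrong content type (e.g. text/plain):
print(negotiation_ok("text/turtle", "text/plain"))  # False
```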

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource, in order to make the discovery of corresponding RDF data easier (for Linked Data aware applications). For that reason, the so-called autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described:

$$ m_{HTMLRDF}(h_g) = \begin{cases} 1 & \text{autodiscovery pattern used at least once} \\ 0 & \text{otherwise} \end{cases} $$
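Detecting the autodiscovery pattern can be sketched with Python's standard HTML parser; the HTML snippet below is an invented example:

```python
# Sketch of the check behind m_HTMLRDF: scan an HTML page for a
# <link rel="alternate" type="...rdf..."> autodiscovery element.
from html.parser import HTMLParser

class AutodiscoveryFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and "rdf" in a.get("type", "")):
            self.found = True

html = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="company.rdf"/></head><body></body></html>')
finder = AutodiscoveryFinder()
finder.feed(html)
print(finder.found)  # True: autodiscovery pattern present
```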

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered in the data quality dimension License later on.

$$ m_{Meta}(g) = \begin{cases} 1 & \text{machine-readable metadata about } g \text{ available} \\ 0 & \text{otherwise} \end{cases} $$
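A minimal VoID description of a KG might look as follows; the dataset URI, endpoint URL, and dump URL are invented placeholders:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/> .

# Hypothetical dataset description satisfying m_Meta:
ex:MyKG a void:Dataset ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:dataDump <http://example.org/dumps/mykg.nt.gz> .
```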

3.5.2. License
Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 defines the respective data as public domain and without any restrictions.

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void/.

29 See http://creativecommons.org, requested on Mar 1, 2016.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format:

$$ m_{macLicense}(g) = \begin{cases} 1 & \text{machine-readable licensing information available} \\ 0 & \text{otherwise} \end{cases} $$
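As an illustration, machine-readable licensing information could be attached to a KG as follows; the dataset URI is an invented placeholder:

```turtle
@prefix cc:      <http://creativecommons.org/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Hypothetical licensing facts satisfying m_macLicense:
ex:MyKG cc:license <https://creativecommons.org/licenses/by-sa/4.0/> ;
        dcterms:license <https://creativecommons.org/licenses/by-sa/4.0/> .
```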

3.5.3. Interlinking
Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [49].

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.

31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.

32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.

33 Using the namespace http://creativecommons.org/ns#.

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is usually established on the instance level via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35 (ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it not only connects otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs:

34 See http://www.geonames.org, requested on Dec 31, 2016.

35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

$$ m_{Inst}(g) = \frac{|\{x \in I_g \setminus (P_g \cup C_g) \mid \exists (x, \text{owl:sameAs}, y) \in g \wedge y \in U^{ext}_g\}|}{|I_g \setminus (P_g \cup C_g)|} $$

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may not be available anymore. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx):

$$ m_{URIs}(g) = \frac{|\{x \in A \mid resolvable(x)\}|}{|A|} $$

where $A = \{y \mid \exists (x,p,y) \in g: (p \in P_{eq} \wedge x \in U_g \setminus (C_g \cup P_g) \wedge x \in U^{local}_g \wedge y \in U^{ext}_g)\}$ and $resolvable(x)$ returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric should evaluate to 1.
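m_URIs can be sketched with an injected status-lookup function, so the idea is illustrated without network access; the URIs and status codes below are invented:

```python
# Toy sketch of the Validity of external URIs metric m_URIs. A real
# check would issue HTTP requests; here get_status is injected so the
# logic is testable offline.

def m_uris(external_uris, get_status):
    if not external_uris:            # empty set A evaluates to 1
        return 1.0
    ok = [u for u in external_uris if get_status(u) == 200]
    return len(ok) / len(external_uris)

# Invented example data:
statuses = {
    "http://example.org/alive": 200,
    "http://example.org/gone": 404,    # client error
    "http://example.org/broken": 500,  # server error
}
print(m_uris(set(statuses), statuses.get))  # 1 of 3 URIs resolvable
print(m_uris(set(), statuses.get))          # empty A -> 1.0
```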

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category
  ∗ Accuracy
    · Syntactic validity of RDF documents
    · Syntactic validity of literals
    · Semantic validity of triples
  ∗ Trustworthiness
    · Trustworthiness on KG level
    · Trustworthiness on statement level
    · Using unknown and empty values
  ∗ Consistency
    · Check of schema restrictions during insertion of new statements
    · Consistency of statements w.r.t. class constraints
    · Consistency of statements w.r.t. relation constraints

– Contextual category
  ∗ Relevancy
    · Creating a ranking of statements
  ∗ Completeness
    · Schema completeness
    · Column completeness
    · Population completeness
  ∗ Timeliness
    · Timeliness frequency of the KG
    · Specification of the validity period of statements
    · Specification of the modification date of statements

– Representational data quality
  ∗ Ease of understanding
    · Description of resources
    · Labels in multiple languages
    · Understandable RDF serialization
    · Self-describing URIs
  ∗ Interoperability
    · Avoiding blank nodes and RDF reification
    · Provisioning of several serialization formats
    · Using external vocabulary
    · Interoperability of proprietary vocabulary

– Accessibility category
  ∗ Accessibility
    · Dereferencing possibility of resources
    · Availability of the KG
    · Provisioning of public SPARQL endpoint
    · Provisioning of an RDF export
    · Support of content negotiation
    · Linking HTML sites to RDF serializations
    · Provisioning of KG metadata
  ∗ License
    · Provisioning machine-readable licensing information
  ∗ Interlinking
    · Interlinking via owl:sameAs
    · Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation.

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 UniProt,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.

40 There is also DBpedia Live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia Live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia Live provides data for each hour, for other time ranges DBpedia Live data is only available once a month.

41 See http://umbel.org, requested on Dec 31, 2016.

42 See http://musicbrainz.org, requested on Dec 31, 2016.

43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.

44 See http://www.dblp.org, requested on Dec 31, 2016.

45 See https://www.gutenberg.org, requested on Dec 31, 2016.

46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.

47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.

48 See http://www.uniprot.org, requested on Dec 31, 2016.

49 See http://bio2rdf.org, requested on Dec 31, 2016.

50 See the complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

– Freebase: Freebase^51 is a KG announced by Metaweb Technologies, Inc. in 2007, which was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,^52 FMD,^53 and MusicBrainz.^54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.^55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc^56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common-sense facts such as "every tree is a plant." The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG, called OpenCyc,^57 was released under the open-source Apache license, Version 2. In July 2006, ResearchCyc^58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata^59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.
52 See http://www.nndb.com, requested on Dec 31, 2016.
53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
54 See http://musicbrainz.org, requested on Dec 31, 2016.
55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
56 See http://www.cyc.com, requested on Dec 31, 2016.
57 See http://www.opencyc.org, accessed on Nov 1, 2016.
58 See http://researchcyc.com, requested on Dec 31, 2016.
59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO^60 (Yet Another Great Ontology) has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymy relations), and GeoNames.^61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
61 See http://www.geonames.org, requested on Dec 31, 2016.
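Several of these key statistics (the number of triples and of unique subjects, predicates, and objects) can be obtained by streaming once over a KG's N-Triples export. The following is a minimal Python sketch under the assumption of well-formed, line-based N-Triples input; a real evaluation would use a proper RDF parser, since the naive whitespace split below can mishandle exotic literals.

```python
def key_statistics(lines):
    """Count triples and unique subjects/predicates/objects of an N-Triples stream."""
    triples = 0
    subjects, predicates, objects = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # N-Triples: subject, predicate, then the rest is "object ."
        s, p, rest = line.split(None, 2)
        o = rest.rstrip().rstrip(".").strip()  # drop the terminating " ."
        triples += 1
        subjects.add(s)
        predicates.add(p)
        objects.add(o)
    return {
        "triples": triples,
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
    }

sample = [
    '<http://example.org/Bob> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .',
    '<http://example.org/Alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .',
    '<http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> .',
]
stats = key_statistics(sample)
```

Streaming over the dump line by line keeps the memory footprint bounded by the size of the term sets, which matters for dumps in the billion-triple range.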


5.1.1. Triples
Ranking of KGs w.r.t. number of triples: The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG, with over 3.1B triples, while OpenCyc is the smallest KG, with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way a KG is built up and its size.

Size differences between DBpedia and YAGO: As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes (in terms of triples) is particularly noteworthy. We can mention the following reasons: YAGO integrates the statements from different language versions of Wikipedia into one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia; for representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form; in particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied for this purpose.

Influence of reification on the number of triples: DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification, in general, describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.^62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identified. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In the case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.^63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
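The effect of n-ary reification on the triple count can be made concrete with a small sketch: one plain fact expands into three triples once an intermediate statement node is introduced and typed. All identifiers below are illustrative abbreviations, not actual Wikidata URIs.

```python
# Illustrative sketch of Wikidata-style reification via n-ary relations.

def reify(subject, prop, value, statement_id):
    """Expand one plain fact into its n-ary (reified) triple representation."""
    stmt = f"{subject}S{statement_id}"          # intermediate statement node
    return [
        (subject, prop + "s", stmt),            # link subject to the statement
        (stmt, prop + "v", value),              # link statement to its value
        (stmt, "rdf:type", "wdo:Statement"),    # each statement is instantiated
    ]

plain = [("wdt:Q76", "wdt:P31", "wdt:Q5")]               # 1 triple
reified = reify("wdt:Q76", "wdt:P31", "wdt:Q5", "123")   # 3 triples
```

This also shows where the roughly 74M additional Wikidata instances come from: every reified statement contributes one wdo:Statement instantiation.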

5.1.2. Classes
Methods for counting classes: The number of classes can be calculated in different ways. Classes can be identified via rdfs:Class and owl:Class assignments, or via rdfs:subClassOf relations.^64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes, but instead only uses "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.
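The counting methods just described can be sketched on a toy triple set: counting explicit class declarations, counting terms that occur in subclass relations, and the lower-bound estimate via type assertions on the instance level (cf. footnote 64). The data is illustrative.

```python
triples = [
    ("ex:Person", "rdf:type", "owl:Class"),
    ("ex:Artist", "rdf:type", "owl:Class"),
    ("ex:Artist", "rdfs:subClassOf", "ex:Person"),
    ("ex:Building", "rdf:type", "owl:Class"),   # declared but never instantiated
    ("ex:Alice", "rdf:type", "ex:Artist"),
]

# (a) classes identified via explicit rdfs:Class / owl:Class declarations
declared = {s for s, p, o in triples
            if p == "rdf:type" and o in ("owl:Class", "rdfs:Class")}

# (b) classes identified as participants of rdfs:subClassOf relations
via_hierarchy = {t for s, p, o in triples if p == "rdfs:subClassOf" for t in (s, o)}

# (c) lower bound: classes that actually have instances (footnote 64)
lower_bound = {o for s, p, o in triples
               if p == "rdf:type" and o not in ("owl:Class", "rdfs:Class")}
```

On this toy data, method (a) finds three classes, method (b) two, and the instance-level estimate (c) only one, since ex:Building and ex:Person have no direct instances; this is exactly why (c) is only a lower bound.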

Ranking of KGs w.r.t. number of classes: Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia: How does this gap between DBpedia and YAGO with respect to the number of classes arise, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is called Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).
64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type, and "instance of" (wdt:P31) in the case of Wikidata) on the instance level into account. However, this would result only in a lower-bound estimation, as those classes which have no instances are not considered.


[Figure: bar chart over DBpedia, Freebase, OpenCyc, Wikidata, YAGO; y-axis: coverage in %]
Fig. 1. Coverage of classes having at least one instance.

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are, like the DBpedia ontology classes, interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as an OWL file.

Coverage of classes with at least one instance: Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of classes by instances is not necessarily an issue.

Correlation between number of classes and number of instances: In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power-law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                       DB    FB    OC    WD    YA
Reach of method (%)    88    92    81    41    82

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.^65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs, classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains, and hence the reach of our evaluation, includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.^66 If the ratio were 100%, we would have been able to assign all entities of a KG to the chosen domains.
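The "reach of method" computation can be sketched as follows. Note that the union of the per-domain entity sets is used, not the sum of their sizes, since an entity may belong to several domains at once (cf. footnote 66). The domain assignments and entity IDs below are illustrative.

```python
def reach_of_method(domain_entities, all_entities):
    """Share of a KG's entities covered by the manually assigned domains."""
    covered = set().union(*domain_entities.values())  # union, not sum of sizes
    return len(covered) / len(all_entities)

domains = {
    "people":    {"e1", "e2"},
    "media":     {"e2", "e3"},   # e2 appears in two domains
    "geography": {"e4"},
}
reach = reach_of_method(domains, {"e1", "e2", "e3", "e4", "e5"})  # 4/5 = 0.8
```

Summing the per-domain counts instead would give 5/5 here and overstate the reach, which is exactly the double-counting the union avoids.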

Fig. 3 shows the number of entities per domain in the different KGs, with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).
66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


[Figure: log-log plot; x-axis: classes, y-axis: number of instances; one curve per KG]
Fig. 2. Distribution of classes w.r.t. the number of instances per KG.

[Figure: bar chart per domain (persons, media, organizations, geography, biology), log scale, one bar per KG]
Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


[Figure: bar chart per domain (persons, media, organizations, geography, biology); y-axis: relative number of entities in percent; one bar per KG]
Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia; as one reason for that, we can point out the import of GeoNames data into YAGO.

Wikidata contains around 150K entities in the domain of organizations. This is relatively few, considering that the total number of entities is around 18.7M and considering the number of organizations in other KGs; note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times,^67 and that about 16K classes were therefore not considered. It is possible that entities of the domain of organizations belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method: In this article, we differentiate between relations and predicates (see also Section 2):

– Relations (as a short term for explicitly defined relations) refers to the (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments to classes (for instance, with rdf:Property). In Section 2, we used P_g to denote this set.
– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P_g^imp, is nothing else than the set of unique RDF terms in the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.

Evaluation results:
Relations. Ranking regarding relations: As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia: Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.^68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments; therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned^69 and may overlap, until DBpedia version 2016-04.^70

Freebase: The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc: For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata: In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69 For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70 For instance, dbp:alias and dbo:alias.

criteria are met.^71 One of those criteria is that each new relation will presumably be used at least 100 times. This relation proposal process can be mentioned as a likely reason why, in relative terms, more relations are actually used in Wikidata than in Freebase.

YAGO: For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.
2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.
3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications (for instance, if only the year is known) are specified in YAGO by wildcards, so that no multiple relations are needed.
4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.
5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations: Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In the case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.

[Figure: grouped bar chart; x-axis: number of relation usages (0, 1–500, >500); y-axis: relative occurrences in percent; one bar group per KG]
Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In the case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can again mention the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates. Ranking regarding predicates: Freebase is here, as in the ranking regarding relations, ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows.

DBpedia: DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates we consider as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase: We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once, which puts this high number into perspective. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc: In contrast to the 18,028 unique relations, we measure only 164 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata: We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) by means of an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows one to refer to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference, and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations; for that, intermediate nodes are used which represent statements [16].

YAGO: YAGO contains more predicates than DBpedia, since infobox attributes from the different language versions of Wikipedia are aggregated into one KG,^72 while for DBpedia separate localized KG versions are offered for the non-English languages.

5.1.5. Instances and Entities
Evaluation method: We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

[Figure: bar chart, log scale; y-axis: number of instances; one bar per KG]
Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky. In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata, instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.^73 In this way, abstract classes such as cych:ExistingObjectType are neglected.
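The two determinations above can be sketched on a toy triple set: instances are the subjects of class-affiliation triples, while entities are those instances that are typed with the KG-specific "entity" class (here owl:Thing, as used for DBpedia and YAGO). The data is illustrative.

```python
triples = [
    ("ex:Alice", "rdf:type", "owl:Thing"),
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:stmt1", "rdf:type", "wdo:Statement"),  # an instance, but not an entity
]

# I_g: subjects of all class-affiliation triples
instances = {s for s, p, o in triples if p == "rdf:type"}

# E_g: instances of the KG-specific entity class
entities = {s for s, p, o in triples if p == "rdf:type" and o == "owl:Thing"}
```

The reified statement node ex:stmt1 counts as an instance but not as an entity, which is exactly the distinction driving the instance/entity ratios discussed below.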

Ranking w.r.t. the number of instances: Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total, and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities: Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M); OpenCyc is at the bottom, with only about 41K entities.

Differences in the number of entities: The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.^74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life." Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In the case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English-language artists are covered, such as the album "Thriller" and the single "Billie Jean." Rather unknown songs, such as "The Lady in My Life," are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and the localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs, such as "Hab' den Himmel berührt."

3 For YAGO the same situation as for DBpediaholds with the difference that YAGO in additionimports entities also from the different languageversions of Wikipedia and imports also data fromsources such as GeoNames However the abovementioned works (ldquoLassrsquo mich in dein LebenrdquoldquoZaubermondrdquo and ldquoHabrsquo den Himmel beruumlhrtrdquo)of Helene Fischer are not in the YAGO althoughthe song ldquoLassrsquo mich in dein Lebenrdquo exists inthe German Wikipedia since May 2014 and al-though the used YAGO version 3 is based on theWikipedia dump of June 201475 Presumably theYAGO extraction system was unable to extract any

74Those release tracks are expressed via freebasemusicrelease_track

75See httpwwwmpi-infmpgdededepartmentsdatabases-and-information-

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 27DBpe

diaFre

ebas

eOpe

nCyc

Wiki

data

YAGO

10 0

10 1

10 2

10 3

10 4

Ave

rage

num

ber

of e

ntiti

es

Fig 7 Average number of entities per class per KG

types for those entities so that those entities werediscarded

4 Wikidata is supported by the community and con-tains music albums of English and non-Englishartists even if they do not exist in Wikipedia Anexample is the song ldquoThe Lady in My Liferdquo Notehowever that Wikidata does not provide all artistrsquosworks such as from Helene Fischer

5 OpenCyc contains only very few entities in themusic domain The reason is that OpenCyc has itsfocus mainly on common-sense knowledge andnot so much on facts about entities

Average number of entities per class Fig 7 showsthe average number of entities per class which can bewritten as |Eg||Cg| Obvious is the difference betweenDBpedia and YAGO (despite the similar number of en-tities) The reason for that is that the number of classesin the DBpedia ontology is small (as created manually)and in YAGO large (as created automatically)

Comparing number of instances with number ofentities Comparing the ratio of the number of instancesto the number of entities for each KG Wikidata ex-poses the highest difference As reason for that we canstate that each statement in Wikidata is modeled as aninstance of wdoStatement leading to 74M addi-tional instances In other KGs such as DBpedia state-ments are modeled without any dedicated statementassignment OpenCyc exposes also a high ratio sinceit contains mainly common sense knowledge and notas many entities as the other KGs Furthermore for ouranalysis we do not regard 100 of the entities but onlya large fraction of it (more precisely the classes with

systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

[Figure 8: bar chart with y-axis from 0 to 8 and one bar per KG: DBpedia, Freebase, OpenCyc, Wikidata, YAGO.]

Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see beginning of Section 5.1.5).

5.1.6. Subjects and Objects

Evaluation method. The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) in the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources in the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementary, the number of unique literals is given as O^lit_g = {o | (s, p, o) ∈ g ∧ o ∈ L}.
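These set definitions translate directly into code. The following sketch uses a toy in-memory triple list with an explicit literal flag (invented example data, not an actual KG extract or the tooling used in this survey):

```python
# Toy triple set: (subject, predicate, object, object_is_literal).
# In a real evaluation these would be parsed from an N-Triples dump.
triples = [
    ("dbr:Hamburg", "rdf:type", "dbo:City", False),
    ("dbr:Hamburg", "rdfs:label", "Hamburg", True),
    ("dbr:Berlin", "rdf:type", "dbo:City", False),
    ("dbr:Berlin", "dbo:country", "dbr:Germany", False),
]

subjects = {s for s, p, o, is_lit in triples}               # S_g
objects = {o for s, p, o, is_lit in triples if not is_lit}  # O_g (URIs, blank nodes)
literals = {o for s, p, o, is_lit in triples if is_lit}     # O^lit_g

print(len(subjects), len(objects), len(literals))  # 2 2 1
```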

Ranking of KGs regarding number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of number of unique subjects to number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 8). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2. Summary of key statistics.

Metric | DBpedia | Freebase | OpenCyc | Wikidata | YAGO
Number of triples |(s, p, o) ∈ g| | 411,885,960 | 3,124,791,156 | 2,412,520 | 748,530,833 | 1,001,461,792
Number of classes |C_g| | 736 | 53,092 | 116,822 | 302,280 | 569,751
Number of relations |P_g| | 2,819 | 70,902 | 18,028 | 1,874 | 106
No. of unique predicates |P^imp_g| | 60,231 | 784,977 | 165 | 4,839 | 88,736
Number of entities |E_g| | 4,298,433 | 49,947,799 | 41,029 | 18,697,897 | 5,130,031
Number of instances |I_g| | 20,764,283 | 115,880,761 | 242,383 | 142,213,806 | 12,291,250
Avg. number of entities per class |E_g| / |C_g| | 5,840.3 | 940.8 | 0.35 | 61.9 | 9.0
No. of unique subjects |S_g| | 31,391,413 | 125,144,313 | 261,097 | 142,278,154 | 331,806,927
No. of unique non-literals in obj. pos. |O_g| | 83,284,634 | 189,466,866 | 423,432 | 101,745,685 | 17,438,196
No. of unique literals in obj. pos. |O^lit_g| | 161,398,382 | 1,782,723,759 | 1,081,818 | 308,144,682 | 682,313,508
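The averages row of Table 2 follows directly from the entity and class counts. A small script (values copied from the table above) reproduces it:

```python
# Entity and class counts per KG, taken from Table 2.
counts = {
    "DBpedia":  (4_298_433,     736),
    "Freebase": (49_947_799, 53_092),
    "OpenCyc":  (41_029,    116_822),
    "Wikidata": (18_697_897, 302_280),
    "YAGO":     (5_130_031, 569_751),
}

# Average number of entities per class, |E_g| / |C_g|.
for kg, (entities, classes) in counts.items():
    print(f"{kg}: {entities / classes:.2f}")
```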

[Figure 9: grouped bar chart with bars for unique subjects and unique objects per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO); y-axis from 10^0 to 10^12.]

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing the provenance information of statements). To that end, IDs (instead of blank nodes), which identify the triples, are used in the first position. They lead to 308M unique subjects such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to comply with the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.
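The effect can be sketched as follows (schematic Python with invented fact IDs and sources; not YAGO's actual export code): each base fact receives an ID, and every provenance statement reuses that ID as its subject, inflating the subject count.

```python
# Each base fact has a fact ID and an extraction source (invented examples).
base_facts = [
    ("<id_1>", ("yago:Socrates", "rdf:type", "yago:Philosopher"), "wiki:Socrates"),
    ("<id_2>", ("yago:Hamburg", "yago:isLocatedIn", "yago:Germany"), "wiki:Hamburg"),
]

triples = []
for fact_id, (s, p, o), source in base_facts:
    triples.append((s, p, o))                                   # the fact itself
    triples.append((fact_id, "yago:extractionSource", source))  # fact ID as subject

unique_subjects = {t[0] for t in triples}
# 2 entity subjects + 2 fact-ID subjects: reification doubles the count here
print(len(unique_subjects))  # 4
```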

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (290M vs. 38M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge are lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also the way of modeling data has a great impact on the number of triples: for instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs: MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) of the domain media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3. Evaluation results for the KGs regarding the dimension Accuracy.

Metric | DB | FB | OC | WD | YA
m_synRDF | 1 | 1 | 1 | 1 | 1
m_synLit | 0.99 | 1 | 1 | 1 | 0.62
m_semTriple | 0.99 | <1 | 1 | 0.99 | 0.99

Syntactic validity of RDF documents (m_synRDF)

Evaluation method. For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena Framework.77
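As a minimal illustration of such a check (a first-pass sketch using Python's standard library, not the W3C validator or Jena), one can at least test that an RDF/XML document is well-formed XML with an rdf:RDF root element:

```python
import xml.etree.ElementTree as ET

# A crude first-pass check in the spirit of the validation described above:
# an RDF/XML document must at least parse as XML and have an rdf:RDF root.
# Full RDF/XML validation (striping, attribute semantics) needs a real parser.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def basic_rdfxml_check(doc: str) -> bool:
    try:
        root = ET.fromstring(doc)
    except ET.ParseError:
        return False
    return root.tag == f"{{{RDF_NS}}}RDF"

doc = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Hamburg">
    <rdfs:label xml:lang="de">Hamburg</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""
print(basic_rdfxml_check(doc))                # True
print(basic_rdfxml_check("<rdf:RDF>broken"))  # False: not well-formed
```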

Evaluation result. All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and actually valid.

Syntactic validity of literals (m_synLit)

Evaluation method. We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains, namely people, cities, and books, and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena Framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org/, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains so many literals.
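The described procedure can be sketched as follows (a simplified Python stand-in for Jena's RDFDatatype.isValid; the regular expression for inhabitant counts is our own illustration of a manually created pattern):

```python
import re
from datetime import date

# Typed literals: an xsd:date value is checked by actually parsing it.
def valid_xsd_date(value: str) -> bool:
    try:
        date.fromisoformat(value)  # accepts YYYY-MM-DD
        return True
    except ValueError:
        return False

# Untyped literals: a manually created regex, here allowing digits,
# periods, and commas for inhabitant counts (as described in the text).
INHABITANTS = re.compile(r"^[0-9.,]+$")

print(valid_xsd_date("1952-03-11"))           # True
print(valid_xsd_date("470-##-##"))            # False: wildcard date
print(bool(INHABITANTS.match("1,787,408")))   # True
```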

Evaluation results. All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth. For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants. The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, or xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0-9, periods, and commas.

ISBN. The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data of the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena Framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We found the following for the single KGs: In Freebase, around 699K ISBN numbers were available; out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN; seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
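Gupta's regular expression is not reproduced here; as an illustrative alternative (a sketch, not the validation actually used in this survey), an ISBN-10/13 check can be implemented via normalization plus checksum verification, which also catches errors like the 16-digit number mentioned below:

```python
import re

# Normalize an ISBN string (strip "ISBN", hyphens, spaces) and verify
# the ISBN-10 or ISBN-13 checksum.
def valid_isbn(raw: str) -> bool:
    digits = re.sub(r"(?i)isbn|[-\s]", "", raw)
    if len(digits) == 10 and re.fullmatch(r"\d{9}[\dXx]", digits):
        # ISBN-10: weighted sum with weights 10..1 must be divisible by 11.
        total = sum((10 - i) * (10 if c in "Xx" else int(c))
                    for i, c in enumerate(digits))
        return total % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        total = sum((1 if i % 2 == 0 else 3) * int(c)
                    for i, c in enumerate(digits))
        return total % 10 == 0
    return False

print(valid_isbn("ISBN 978-3-16-148410-0"))  # True
print(valid_isbn("9789780307986931"))        # False: 16 digits
```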

Semantic validity of triples (m_semTriple)

Evaluation method. The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file especially concerning persons and corporate bodies, and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result. We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is in those cases hard to perform.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possibly wrong values within the interval are not detected.

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method. Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted average of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4. Evaluation results for the KGs regarding the dimension Trustworthiness.

Metric | DB | FB | OC | WD | YA
m_graph | 0.5 | 0.5 | 1 | 0.75 | 0.25
m_fact | 0.5 | 1 | 0 | 1 | 1
m_NoVal | 0 | 1 | 0 | 1 | 0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results. The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. the community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level (m_fact)

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their state-

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org/, requested on Mar 3, 2016.


ments. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of a rather formal nature.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not accepted as being sourced ("data is not sourced").91 To source data, the other relations, "stated in" and "reference URL", can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVT), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5. Evaluation results for the KGs regarding the dimension Consistency.

Metric | DB | FB | OC | WD | YA
m_checkRestr | 0 | 1 | 0 | 1 | 0
m_conClass | 0.88 | 1 | <1 | 1 | 0.33
m_conRelat | 0.99 | 0.45 | 1 | 0.50 | 0.99

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases.95 Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method. For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the consid-

95 E.g., freebase:freebase.valuenotation.has_no_value.


ered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.
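Such a check over direct instantiations can be sketched as follows (invented toy data; the dbo:Agent/dbo:Place pair echoes the inconsistency reported for DBpedia in the evaluation results):

```python
# Direct class instantiations per resource (toy data, not a KG extract).
types = {
    "x:Rose":   {"dbo:Plant"},
    "x:Lassie": {"dbo:Animal"},
    "x:Oddity": {"dbo:Agent", "dbo:Place"},  # violates a disjointness axiom
}
# owl:disjointWith axioms as pairs of classes.
disjoint = [("dbo:Plant", "dbo:Animal"), ("dbo:Agent", "dbo:Place")]

# A resource is inconsistent if it is typed with both classes of any pair.
violations = [r for r, cs in types.items()
              if any(a in cs and b in cs for a, b in disjoint)]
print(violations)  # ['x:Oddity']
```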

Evaluation results. We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method. Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistencies regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.
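The owl:FunctionalProperty part of this check can be sketched as follows (illustrative data; a functional datatype property may have at most one value per subject, so two or more distinct values constitute a violation):

```python
from collections import defaultdict

# Properties declared as owl:FunctionalProperty (toy set).
functional = {"fb:type.object.name"}

# Toy triples, not an actual KG extract.
triples = [
    ("fb:m.1", "fb:type.object.name", "Hamburg"),
    ("fb:m.1", "fb:type.object.name", "Hamburg, Germany"),  # second value
    ("fb:m.2", "fb:type.object.name", "Berlin"),
]

# Collect the distinct values per (subject, functional property) pair.
values = defaultdict(set)
for s, p, o in triples:
    if p in functional:
        values[(s, p)].add(o)

# Any pair with more than one value violates the constraint.
violations = [sp for sp, vs in values.items() if len(vs) > 1]
print(violations)  # [('fb:m.1', 'fb:type.object.name')]
```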

Evaluation results. In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:prop

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy.

Metric | DB | FB | OC | WD | YA
m_Ranking | 0 | 1 | 0 | 1 | 0

ertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used, though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "pre-

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.


Table 7. Evaluation results for the KGs regarding the dimension Completeness.

Metric | DB | FB | OC | WD | YA
m_cSchema | 0.91 | 0.76 | 0.92 | 1 | 0.95
m_cColumn | 0.40 | 0.43 | 0 | 0.29 | 0.33
m_cPop | 0.93 | 0.94 | 0.48 | 0.99 | 0.89
m_cPop (short) | 1 | 1 | 0.82 | 1 | 0.90
m_cPop (long) | 0.86 | 0.88 | 0.14 | 0.98 | 0.88

ferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut down Freebase Search API provided a ranking for resources.98

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; also DBpedia, OpenCyc, and YAGO exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness, and its schema is mainly limited

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison/, requested on Jan 29, 2017.

due to the characteristics of how information is stored in and extracted from Wikipedia:

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete are logically subclasses of the class freebase:people.person, but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as tree, but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.

101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discussing their scope, and finally approving or disapproving them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we measure a full completeness score for the YAGO classes.

2. Relations: The YAGO schema does not contain many unique, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie; DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes, using the domain and range of the relations. Expanding the YAGO schema with further, more fine-grained relations appears reasonable.

Column completeness (mcColumn)

Evaluation method: For evaluating the KGs w.r.t. Column completeness, for each KG 25 class-relation combinations102 were created based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8
Metric values of mcCol for single class-relation pairs

Relation           DB    FB    OC    WD    YA
Person–birth date  0.48  0.48  0     0.70  0.77
Person–sex         –     0.57  0     0.94  0.64
Book–author        0.91  0.93  0     0.82  0.28
Book–ISBN          0.73  0.63  –     0.18  0.01
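The per-pair metric behind Table 8 can be sketched as follows. This is a minimal illustration, assuming the KG is given as a set of (subject, predicate, object) triples; the function name and the toy data are not from the paper.

```python
# Sketch of the Column completeness computation for one class-relation
# pair: the share of instances of a class that have at least one value
# for a given relation. Toy triples; illustrative names only.

def column_completeness(triples, cls, relation):
    """Fraction of instances of `cls` with at least one value for `relation`."""
    instances = {s for (s, p, o) in triples if p == "rdf:type" and o == cls}
    if not instances:
        return 0.0
    covered = {s for (s, p, o) in triples if p == relation and s in instances}
    return len(covered) / len(instances)

triples = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:birthDate", "1984-01-01"),
}

print(column_completeness(triples, "ex:Person", "ex:birthDate"))  # 0.5
```

In practice this would be computed with two SPARQL COUNT queries per pair (instances of the class, and instances having the relation) rather than in memory.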

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation pairs which are well represented on instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We make the following observations with respect to the single KGs.

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of The Lord of the Rings (see freebase:m.07bz5). Also the coverage of ISBNs for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation pairs depended on which class-relation pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation pairs were used if no 25 pairs were available in the respective KG.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness (mcPop)

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called "short head") and two rather unknown entities (called "long tail") for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their height.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
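The resulting metric is simply the share of gold-standard entities found in a KG. A toy sketch, where matching by label is a simplification (the actual evaluation works on KG resources) and the entity sets are illustrative:

```python
# Sketch of the Population completeness metric: the share of
# gold-standard entities present in a KG. Matching by label string is an
# assumption for illustration; the gold list mimics the paper's setup.

def population_completeness(gold_entities, kg_entities):
    found = sum(1 for e in gold_entities if e in kg_entities)
    return found / len(gold_entities)

gold = ["Usain Bolt", "Maria Höfl-Riesch", "Mount Everest", "Hochfrottspitze"]
kg = {"Usain Bolt", "Mount Everest", "Hochfrottspitze"}  # hypothetical KG content

print(population_completeness(gold, kg))  # 0.75
```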

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities before we go into the details of the rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO as entities. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: while most of the KGs obtain a score of about 0.88, Wikidata deviates upwards and OpenCyc deviates strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: an entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measured that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6 Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


(Figure: bar chart with values between 0 and 1 per KG, one bar per domain: People, Media, Organizations, Geography, Biology.)

Fig. 10. Population completeness regarding the different domains per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

           DB   FB  OC    WD  YA
mFreq      0.5  0   0.25  1   0.25
mValidity  0    1   0     1   1
mChange    0    1   0     0   0

Timeliness frequency of the KG (mFreq)

Evaluation results: The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.107 Besides the static DBpedia, DBpedia live108 has been updated continuously by tracking changes in Wikipedia in real time. However, it does not provide the full range of relations that DBpedia provides.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its shutdown and is not updated anymore.

OpenCyc has been updated less than once per year; the last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date for the next release has not been published.

Specification of the validity period of statements (mValidity)

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily realized.

109 See http://sw.opencyc.org, requested on Nov 8, 2016.
110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.
111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

Table 10
Evaluation results for the KGs regarding the dimension Ease of understanding

        DB    FB    OC  WD  YA
mDescr  0.70  0.97  1   <1  1
mLang   1     1     0   1   1
muSer   1     1     0   1   1
muURI   1     0.5   1   0   1

DBpedia and OpenCyc do not realize any specifi-cation possibility In YAGO Freebase and Wikidatathe temporal validity period of statements can be spec-ified In YAGO this modeling possibility is madeavailable via the relations yagooccursSinceyagooccursUntil and yagooccursOnDateWikidata provides the relations ldquostart timerdquo (wdtP580)and ldquoend timerdquo (wdtP582) In Freebase CompoundValue Types (CVTs) are used to represent relations withhigher arity [44] As part of this representation validityperiods of statements can be specified An example isldquoVancouverrsquos population in 1997rdquo

Specification of the modification date of statements (mChange)

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7 Ease of Understanding

Description of resources (mDescr)

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.112

Evaluation result: For all KGs, the rule applies that if no label is available, usually no description is available either. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.114

Labels in multiple languages (mLang)

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language tags of literals, such as "de" for literals in German.
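This measurement boils down to tallying language tags over all label literals. A toy sketch, where the (text, tag) literal encoding is an assumption for illustration:

```python
# Sketch of the language measurement: count rdfs:label literals per
# language tag (e.g. "Berlin"@de). Literals without a tag are counted
# separately, matching the assumption that no language info is available.
from collections import Counter

def language_distribution(labels):
    """labels: list of (text, language_tag_or_None) literal pairs."""
    return Counter(tag if tag else "untagged" for _, tag in labels)

labels = [("Berlin", "en"), ("Berlin", "de"), ("Berlín", "es"), ("Berlin", None)]
dist = language_distribution(labels)

print(dist["de"])        # 1
print(dist["untagged"])  # 1
```

The number of distinct keys in the resulting counter (minus "untagged") gives the number of languages a KG provides labels in.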

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language tag.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.
113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.
114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.
115 Note that literals such as rdfs:label values do not necessarily have language tags. In those cases, we assume that no language information is available.

Understandable RDF serialization (muSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable for humans.

Self-describing URIs (muURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116

5.2.8 Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (mReif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity of dealing with reification.

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11
Evaluation results for the KGs regarding the dimension Interoperability

          DB    FB    OC    WD    YA
mReif     0.5   0.5   0.5   0     0.5
miSerial  1     0     0.5   1     1
mextVoc   0.61  0.11  0.41  0.68  0.13
mpropVoc  0.15  0     0.51  >0    0
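YAGO's quad handling can be sketched as follows: the component carrying the fact identifier is simply dropped when loading into a triple store. The tab-separated line format with a leading fact ID is an assumption modeled on YAGO's TSV-style exports, not an exact reproduction of it.

```python
# Sketch of converting YAGO-style quads to plain triples: the fact
# identifier (used only to attach meta information such as provenance)
# is dropped. Line format "<id>\t<s>\t<p>\t<o> ." is assumed here.

def quads_to_triples(quad_lines):
    triples = []
    for line in quad_lines:
        parts = line.rstrip(" .").split("\t")
        fact_id, s, p, o = parts  # fact_id refers to the statement itself
        triples.append((s, p, o))
    return triples

quads = ["#1\tyago:Elvis_Presley\tyago:wasBornOnDate\t\"1935-01-08\" ."]
print(quads_to_triples(quads))
```

After this conversion, ordinary triple-based queries work, at the cost of losing the per-fact meta information.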

Blank nodes are non-dereferenceable anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (miSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in the Turtle format.

Using external vocabulary (mextVoc)

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG we divide the number of occurrences of triples with external relations by the number of occurrences of all relations in this KG.
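The ratio can be sketched as follows, with hypothetical namespace prefixes standing in for a KG's own vocabulary:

```python
# Sketch of the mextVoc measurement: the share of triples whose
# predicate comes from a vocabulary outside the KG's own namespace(s).
# Prefixes and toy triples are illustrative assumptions.

def external_vocab_ratio(triples, own_namespaces):
    external = [t for t in triples if not t[1].startswith(tuple(own_namespaces))]
    return len(external) / len(triples)

triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "rdfs:label", "Berlin"),          # external (RDFS)
    ("dbr:Berlin", "foaf:name", "Berlin"),           # external (FOAF)
    ("dbr:Berlin", "dbo:populationTotal", "3500000"),
]

print(external_vocab_ratio(triples, ["dbo:", "dbp:"]))  # 0.5
```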

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for this fact: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification: out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are used for instantiations of statements, i.e., for reification.

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.

Interoperability of proprietary vocabulary (mpropVoc)

Evaluation method: This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. Our individual findings are as follows.

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked.121

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL reference states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).
119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.
120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

          DB   FB    OC    WD    YA
mDeref    1    1     0.44  0.41  1
mAvai     <1   0.73  <1    <1    1
mSPARQL   1    1     0     1     0
mExport   1    1     1     1     1
mNegot    0.5  1     0     1     0
mHTMLRDF  1    1     1     1     0
mMeta     1    0     0     0     1

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links as external links for YAGO.

5.2.9 Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (mDeref)

Evaluation method: We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
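Such a check can be sketched with the standard library: request the URI with an RDF accept header and inspect the response status and Content-Type. The helper names and the set of accepted media types are assumptions for illustration; the `dereference` call needs network access and is therefore left commented out.

```python
# Sketch of a dereferencing check with content negotiation: send an
# Accept: application/rdf+xml header and see whether an RDF
# serialization comes back. Media-type list is an illustrative choice.
import urllib.request

RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def accepts_rdf(content_type):
    """True if the Content-Type header denotes a known RDF serialization."""
    return content_type.split(";")[0].strip().lower() in RDF_TYPES

def dereference(uri):
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status, resp.headers.get("Content-Type", "")

print(accepts_rdf("application/rdf+xml; charset=UTF-8"))  # True
# status, ctype = dereference("http://dbpedia.org/resource/Hamburg")  # requires network
```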

Evaluation results: In case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfill this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K, due to the small number of unique predicates. We observed almost the same picture for YAGO, namely no notable errors during dereferencing.

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, nearly all URIs in the subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (mAvai)

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom.122 For each KG, an uptime test was set up which checked the availability of the resource "Hamburg" as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200) every minute over a time range of 60 days (Dec 18, 2015 to Feb 15, 2016).
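From such a minute-by-minute check log, the availability score is simply the fraction of successful checks. A minimal sketch with an assumed log format:

```python
# Sketch of the mAvai computation: the fraction of per-minute uptime
# checks that returned HTTP 200. The list-of-status-codes log format is
# an assumption for illustration.

def availability(check_log):
    """check_log: list of HTTP status codes, one per minute."""
    ok = sum(1 for status in check_log if status == 200)
    return ok / len(check_log)

log = [200] * 9 + [503]  # nine successful checks, one outage minute

print(availability(log))  # 0.9
```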

Evaluation result: While the other KGs showed almost no outages and were on average back online after a few minutes, YAGO outages took place frequently and lasted 3.5 hours on average.123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (mSPARQL)

The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server,124 the Wikidata SPARQL endpoint via Blazegraph.125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.
123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: the maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (mExport)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (mNegot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (mHTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate" type="[content type]" href="[URL]"> in the HTML header.

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.
125 See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13
Evaluation results for the KGs regarding the dimension License

             DB  FB  OC  WD  YA
mmacLicense  1   0   0   1   0

Provisioning of metadata about the KG (mMeta)

For this criterion, we analyzed whether KG metadata is available, for instance in the form of a VoID file.126 DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10 License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (mmacLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA128 and the GNU Free Documentation License (GNU FDL).129 Wikidata embeds licensing information during the dereferencing of resources in the RDF document by linking with cc:license to the license CC0.130 YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC-BY.131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.132

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

       DB    FB    OC    WD       YA
mInst  0.25  0     0.38  0 (0.9)  0.31
mURIs  0.93  0.91  0.89  0.96     0.96

5.2.11 Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (mInst)

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
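The counting step can be sketched as follows; the internal-prefix handling mirrors the treatment of localized DBpedia versions described below, and the example URIs are illustrative:

```python
# Sketch of the mInst measurement: the share of instances with at least
# one owl:sameAs link whose target lies outside the KG's own namespaces
# (localized versions such as de.dbpedia.org count as internal here).

def instances_with_external_sameas(triples, instances, internal_prefixes):
    linked = {
        s for (s, p, o) in triples
        if p == "owl:sameAs" and s in instances
        and not o.startswith(tuple(internal_prefixes))
    }
    return len(linked) / len(instances)

instances = {"dbr:Berlin", "dbr:Hamburg"}
triples = [
    ("dbr:Berlin", "owl:sameAs", "http://sws.geonames.org/2950159/"),
    ("dbr:Hamburg", "owl:sameAs", "http://de.dbpedia.org/resource/Hamburg"),
]

print(instances_with_external_sameas(triples, instances, ["http://de.dbpedia.org/"]))  # 0.5
```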

Evaluation result: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided, nor is a corresponding proprietary relation available. Instead, Wikidata uses for each linked data set a proprietary relation (called identifier) to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., /m/01x3gpk). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided in the Browser interface as hyperlinks, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

^133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.
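Such identifier literals can in principle be rewritten into owl:sameAs triples when a URI pattern for the target data set is known; the following sketch illustrates this, where the Freebase RDF URI scheme used is an assumption made for illustration:

```python
# Sketch: turning a Wikidata identifier statement into an owl:sameAs triple.
# Assumption: each identifier property has a known URI pattern; the Freebase
# RDF URI scheme below ("/m/01x3gpk" -> ns/m.01x3gpk) is used for illustration.

def freebase_uri(mid):
    # e.g. "/m/01x3gpk" -> "http://rdf.freebase.com/ns/m.01x3gpk"
    return "http://rdf.freebase.com/ns/" + mid.strip("/").replace("/", ".")

URI_BUILDERS = {"wdt:P646": freebase_uri}  # "Freebase identifier"

def to_same_as(subject, prop, literal):
    build = URI_BUILDERS.get(prop)
    if build is None:
        return None  # no pattern known -> no owl:sameAs link can be derived
    return (subject, "owl:sameAs", build(literal))

print(to_same_as("wd:Q64", "wdt:P646", "/m/01x3gpk"))
```

As the text notes, such a rewriting is only safe when the linked resource is actually an RDF document rather than an HTML page.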

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links were excluded, YAGO would contain mostly links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc,^134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.^135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
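The resulting score can be sketched as a simple ratio over the observed HTTP outcomes (a minimal sketch; it assumes the status codes have already been collected, e.g. via HTTP HEAD requests, with a timeout recorded as None):

```python
# Sketch of the m_URIs scoring: a link counts as valid unless the request
# timed out (None) or returned an HTTP 4xx/5xx response; 2xx and 3xx count
# as resolvable.

def m_uris(statuses):
    if not statuses:
        return 0.0
    valid = [s for s in statuses if s is not None and s < 400]
    return len(valid) / len(statuses)

# 9 resolvable links (2xx/3xx) and one client error:
print(m_uris([200] * 8 + [301, 404]))  # 0.9
```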

Evaluation result: The external links are in most of the cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources on wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation reference URL (wdt:P854), which states provenance information among other relations, belongs to the links linking to external

^134 I.e., sw.cyc.com.
^135 See Interoperability of proprietary vocabulary in Sec. 5.2.8.

Web resources. Here, we were able to resolve around 95.5% without errors.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.^136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions, which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values), due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBN values, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way how data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one-third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

^136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is in particular worthwhile in the case of statements whose validity is only temporally limited.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard were existing in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of

each class are on average frequently used by all of those class instances. We can name data imports as one reason for it.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata obtains the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., the term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.^137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

^137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We can find mixed paradigms regarding the URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (where, in part, classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]. DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We can mention as a reason for that the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferencable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large amount of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO within a time interval of 8 weeks, lasting on average 3.5 hours.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation. While OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form. DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on the resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many


Step 1: Requirements Analysis
- Identifying the preselection criteria P
- Assigning a weight w_i to each DQ criterion c_i ∈ C

Step 2: Preselection based on the Preselection Criteria
- Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
- Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
- Calculating the fulfillment degree h(g) for each KG g ∈ G_P
- Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
- Assessing the selected KG g w.r.t. qualitative aspects
- Comparing the selected KG g with the other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.

owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria, and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion. The license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are neglected which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics) and, if necessary, an alternative KG can be selected for being applied in the given scenario.

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography for each musician. For being able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

- Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

- Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.^138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

^138 We assume that in this use case rather the dereferencing of HTTP URIs than the execution of SPARQL queries is desired.
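The weighted score computed in Step 3 can be reproduced directly from Table 15. The sketch below assumes that h(g) is the weight-normalized sum from Section 3.1; it uses the DBpedia column together with the example weights:

```python
# Fulfillment degree h(g) = sum(w_i * m_i(g)) / sum(w_i) -- assumed form of
# the formula from Section 3.1. Metric values: DBpedia column of Table 15, in
# row order from m_synRDF down to m_URIs; weights: the example weighting column.

dbpedia_metrics = [
    1, 0.994, 0.990,              # Accuracy
    0.5, 0.5, 0,                  # Trustworthiness
    0, 0.875, 0.992,              # Consistency
    0,                            # Relevancy
    0.905, 0.402, 0.93,           # Completeness
    0.5, 0, 0,                    # Timeliness
    0.704, 1, 1, 1,               # Ease of understanding
    0.5, 1, 0.61, 0.150,          # Interoperability
    1, 0.9961, 1, 1, 0.5, 1, 1,   # Accessibility
    1,                            # Licensing
    0.251, 0.929,                 # Interlinking
]
weights = [
    1, 1, 1,
    0, 1, 0,
    0, 0, 0,
    1,
    1, 2, 3,
    3, 0, 0,
    1, 0, 0, 1,
    0, 1, 1, 1,
    2, 2, 1, 0, 0, 0, 0,
    0,
    3, 1,
]

def fulfillment_degree(metrics, weights):
    return sum(w * m for w, m in zip(weights, metrics)) / sum(weights)

print(round(fulfillment_degree(dbpedia_metrics, weights), 3))  # 0.701, cf. Table 15
print(round(sum(dbpedia_metrics) / len(dbpedia_metrics), 3))   # 0.683 unweighted
```

Both values match the DBpedia averages reported in the last two rows of Table 15.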


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension               Metric          DBpedia  Freebase  OpenCyc  Wikidata  YAGO     Example user weighting w_i
Accuracy                m_synRDF        1        1         1        1         1        1
                        m_synLit        0.994    1         1        1         0.624    1
                        m_semTriple     0.990    0.995     1        0.993     0.993    1
Trustworthiness         m_graph         0.5      0.5       1        0.75      0.25     0
                        m_fact          0.5      1         0        1         1        1
                        m_NoVal         0        1         0        1         0        0
Consistency             m_checkRestr    0        1         0        1         0        0
                        m_conClass      0.875    1         0.999    1         0.333    0
                        m_conRelat      0.992    0.451     1        0.500     0.992    0
Relevancy               m_Ranking       0        1         0        1         0        1
Completeness            m_cSchema       0.905    0.762     0.921    1         0.952    1
                        m_cCol          0.402    0.425     0        0.285     0.332    2
                        m_cPop          0.93     0.94      0.48     0.99      0.89     3
Timeliness              m_Freq          0.5      0         0.25     1         0.25     3
                        m_Validity      0        1         0        1         1        0
                        m_Change        0        1         0        0         0        0
Ease of understanding   m_Descr         0.704    0.972     1        0.9999    1        1
                        m_Lang          1        1         0        1         1        0
                        m_uSer          1        1         0        1         1        0
                        m_uURI          1        0.5       1        0         1        1
Interoperability        m_Reif          0.5      0.5       0.5      0         0.5      0
                        m_iSerial       1        0         0.5      1         1        1
                        m_extVoc        0.61     0.108     0.415    0.682     0.134    1
                        m_propVoc       0.150    0         0.513    0.001     0        1
Accessibility           m_Deref         1        0.437     1        0.414     1        2
                        m_Avai          0.9961   0.9998    1        0.9999    0.7306   2
                        m_SPARQL        1        0         0        1         1        1
                        m_Export        1        1         1        1         1        0
                        m_Negot         0.5      0         0        1         1        0
                        m_HTMLRDF       1        1         0        1         1        0
                        m_Meta          1        0         1        0         0        0
Licensing               m_macLicense    1        0         0        1         0        0
Interlinking            m_Inst          0.251    0         0.382    0         0.310    3
                        m_URIs          0.929    0.908     0.894    0.957     0.956    1

Unweighted average                      0.683    0.603     0.496    0.752     0.625
Weighted average                        0.701    0.493     0.556    0.714     0.648


4. Qualitative assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately. The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (m_graph), Indicating unknown and empty values (m_NoVal), Check of schema restrictions during insertion of new statements (m_checkRestr), Creating a ranking of statements (m_Ranking), Timeliness frequency of the KG (m_Freq), Specification of the validity period of statements (m_Validity), and Availability of the KG (m_Avai), have not been proposed so far, to the best of our knowledge. In the following, we present single existing approaches for Linked Data quality criteria in more detail.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (m_Descr) and Column completeness (m_cCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (m_SPARQL) and Provisioning of an RDF export (m_Export). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (m_propVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric       [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]
m_synRDF        X X
m_synLit        X X X X
m_semTriple     X X X X
m_fact          X X
m_conClass      X X X
m_conRelat      X X X X X X
m_cSchema       X X
m_cCol          X X X X
m_cPop          X X
m_Change        X X
m_Descr         X X X X
m_Lang          X
m_uSer          X
m_uURI          X
m_Reif          X X X
m_iSerial       X
m_extVoc        X X
m_propVoc       X
m_Deref         X X X X
m_SPARQL        X
m_Export        X X
m_Negot         X X X
m_HTMLRDF       X
m_Meta          X X X
m_macLicense    X X X
m_Inst          X X X
m_URIs          X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (m_Lang) and Validity of external URIs (m_URIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia, and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinctions, but in addition distinguish between RDF documents, RDF triples, and RDF literals when evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates for tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of the data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce, with the system OntoQA, metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class

and their subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances in a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of the KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means that if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
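The counting scheme just described can be sketched as follows (a minimal version; the class-to-domain mapping shown is an illustrative excerpt, not the full mapping used in the article):

```python
# Sketch of the domain-coverage counting described above: classes are mapped
# to domains by hand, and each instance is counted at most once per domain,
# even if it is typed with several classes of that domain.

CLASS_TO_DOMAIN = {
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_counts(type_triples):
    per_domain = {}  # domain -> set of distinct instances
    for instance, cls in type_triples:
        domain = CLASS_TO_DOMAIN.get(cls)
        if domain is not None:
            per_domain.setdefault(domain, set()).add(instance)
    return {d: len(insts) for d, insts in per_domain.items()}

typed = [
    ("dbr:Karlsruhe", "dbo:Place"),
    ("dbr:Karlsruhe", "dbo:PopulatedPlace"),  # same instance, counted once
    ("dbr:Madonna", "dbo:MusicalArtist"),
]
print(domain_counts(typed))  # {'geography': 1, 'media': 1}
```

Using a set per domain is what removes the duplicates: Karlsruhe appears under two geography classes but contributes only once to that domain.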

8. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M Acosta E Simperl F Floumlck and M Vidal HARE AHybrid SPARQL Engine to Enhance Query Answers viaCrowdsourcing In Proceedings of the 8th InternationalConference on Knowledge Capture K-CAP 2015 pages111ndash118 ACM 2015

[2] M Acosta A Zaveri E Simperl D Kontokostas S Auer andJ Lehmann Crowdsourcing linked data quality assessment InThe Semantic WebndashISWC 2013 pages 260ndash276 Springer 2013

[3] M Acosta A Zaveri E Simperl D Kontokostas F Floumlckand J Lehmann Detecting Linked Data Quality Issues viaCrowdsourcing A DBpedia Study Semantic Web 2016

[4] S Auer C Bizer G Kobilarov J Lehmann R Cyganiak andZ Ives DBpedia A Nucleus for a Web of Open Data InProceedings of the 6th International Semantic Web Conferenceand 2nd Asian Semantic Web Conference ISWC 2007ASWC2007 pages 722ndash735 Springer 2007

[5] S Auer J Lehmann A-C Ngonga Ngomo and A ZaveriIntroduction to Linked Data and Its Lifecycle on the Web InReasoning Web Semantic Technologies for Intelligent DataAccess volume 8067 of Lecture Notes in Computer Sciencepages 1ndash90 Springer Berlin Heidelberg 2013

[6] C Batini C Cappiello C Francalanci and A MaurinoMethodologies for Data Quality Assessment and ImprovementACM Comput Surv 41(3)161ndash1652 July 2009

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 51

[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-02-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma Thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. Accessed July 20, 2015.

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737. Springer, Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI, Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355, Berlin Heidelberg, 2010. Springer.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. Retrieved on Aug 29, 2016.

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.



– Semantic validity of triples

The fulfillment degree of a KG g w.r.t. the dimension Accuracy is measured by the metrics m_synRDF, m_synLit, and m_semTriple, which are defined as follows:

Syntactic validity of RDF documents. The syntactic validity of RDF documents is an important requirement for machines to interpret an RDF document completely and correctly. Hogan et al. [29] suggest using standardized tools for creating RDF data. The authors state that in this way normally only few syntax errors occur, despite the complex syntactic representation of RDF/XML.

RDF data can be validated by an RDF validator such as the W3C RDF Validator.¹⁴

m_synRDF(g) =
  1  if all RDF documents are valid
  0  otherwise

Syntactic validity of literals. Assessing the syntactic validity of literals means to determine to which degree literal values stored in the KG are syntactically valid. The syntactic validity of literal values depends on the data types of the literals and can be automatically assessed via rules [22,34]. Syntactic rules can be written in the form of regular expressions. For instance, it can be verified whether a literal representing a date follows the ISO 8601 specification. Assuming that L is the infinite set of literals, we can state:

m_synLit(g) = |{(s, p, o) ∈ g | o ∈ L ∧ synValid(o)}| / |{(s, p, o) ∈ g | o ∈ L}|

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
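As an illustration, m_synLit can be sketched as follows. The triple representation (subject, predicate, object, datatype), the toy data, and the single implemented rule (xsd:date as ISO 8601 YYYY-MM-DD) are assumptions for this sketch.

```python
import re

# Minimal sketch of m_synLit: the share of literal objects that pass a
# syntactic rule for their datatype. Only one illustrative regular-expression
# rule is implemented; literals of other datatypes count as valid here.
RULES = {"xsd:date": re.compile(r"^\d{4}-\d{2}-\d{2}$")}

def m_syn_lit(triples):
    """triples: iterable of (s, p, o, datatype); datatype is None for IRI objects."""
    literals = [(o, dt) for _, _, o, dt in triples if dt is not None]
    if not literals:
        return 1.0  # empty denominator -> metric evaluates to 1
    valid = sum(1 for o, dt in literals
                if dt not in RULES or RULES[dt].match(o))
    return valid / len(literals)

triples = [
    ("dbr:Einstein", "dbo:birthDate", "1879-03-14", "xsd:date"),
    ("dbr:Einstein", "dbo:birthDate", "14.03.1879", "xsd:date"),  # not ISO 8601
    ("dbr:Einstein", "rdf:type", "dbo:Person", None),             # not a literal
]
print(m_syn_lit(triples))  # 0.5
```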

Semantic validity of triples. The criterion Semantic validity of triples is introduced to evaluate whether the statements expressed by the triples (with or without literals) hold true. Determining whether a statement is true or false is, strictly speaking, impossible (see the field of epistemology in philosophy). For evaluating the Semantic validity of statements, Bizer et al. [11] note that a triple is semantically correct if it is also available from a trusted source (e.g., the Name Authority File), if it is common sense, or if the statement can be measured or perceived by the user directly. Wikidata has similar guidelines implemented to determine whether a fact needs to be sourced.¹⁵

¹⁴ See http://www.w3.org/RDF/Validator/, requested on Feb 29, 2016.

We measure the Semantic validity of triples based on empirical evidence, i.e., based on a reference data set serving as gold standard. We determine the fulfillment degree as the precision that the triples which are in the KG g and in the gold standard GS have the same values. Note that this measurement heavily depends on the truthfulness of the reference data set.

Formally, let no_g,GS = |{(s, p, o) ∈ g | ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y) ∧ equi(o, z)}| be the number of triples in g for which semantically corresponding triples in the gold standard GS exist. Let no_g = |{(s, p, o) ∈ g | ∃(x, y, z) ∈ GS : equi(s, x) ∧ equi(p, y)}| be the number of triples in g where the subject-relation-pairs (s, p) are semantically equivalent to subject-relation-pairs (x, y) in the gold standard. Then we can state:

m_semTriple(g) = no_g,GS / no_g

In case of an empty set in the denominator of the fraction, the metric should evaluate to 1.
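A minimal sketch of this precision computation, with equi(...) simplified to plain identifier equality and toy KG and gold-standard triples as assumptions:

```python
# Sketch of m_semTriple: precision of KG triples against a gold standard GS.
# equi(...) is reduced to string equality on identifiers; the definition above
# allows any semantic-equivalence check instead.
def m_sem_triple(kg, gold):
    # triples in g whose (s, p) pair also occurs in the gold standard
    comparable = [(s, p, o) for (s, p, o) in kg
                  if any(s == x and p == y for (x, y, z) in gold)]
    if not comparable:
        return 1.0  # empty denominator -> metric evaluates to 1
    correct = sum(1 for t in comparable if t in gold)
    return correct / len(comparable)

gold = {("dbr:Einstein", "dbo:birthDate", "1879-03-14"),
        ("dbr:Planck", "dbo:birthDate", "1858-04-23")}
kg = [("dbr:Einstein", "dbo:birthDate", "1879-03-14"),  # agrees with gold
      ("dbr:Planck", "dbo:birthDate", "1900-01-01")]    # wrong value
print(m_sem_triple(kg, gold))  # 0.5
```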

3.2.2. Trustworthiness

Definition of dimension. Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real, and credible [49]. We define it as a collective term for believability, reputation, objectivity, and verifiability. These aspects were defined by Wang et al. [47] and Naumann [39] as follows:

– Believability: Believability is "the extent to which data are accepted or regarded as true, real and credible" [47].

– Reputation: Reputation is "the extent to which data are trusted or highly regarded in terms of their source or content" [47].

– Objectivity: Objectivity is "the extent to which data are unbiased (unprejudiced) and impartial" [47].

– Verifiability: Verifiability is "the degree and ease with which the data can be checked for correctness" [39].

¹⁵ See https://www.wikidata.org/wiki/Help:Sources, requested on Sep 8, 2016.

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 7

Discussion. In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Trustworthiness has been discussed as follows:

– Believability: According to Naumann [39], believability is the "expected accuracy" of a data source.

– Reputation: The essential difference between believability and accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since also biased statements could be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

Definition of metric. We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows:

Trustworthiness on KG level. The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we can differentiate between 1. whether the data is entered by experts (of a specific domain), 2. whether the knowledge comes from volunteers contributing in a community, and 3. whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into a system, is less vulnerable to harmful behavior of users than an open system, where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_graph(h_g) =
  1     manual data curation, manual data insertion in a closed system
  0.75  manual data curation and insertion, both by a community
  0.5   manual data curation; data insertion by a community or by automated knowledge extraction
  0.25  automated data curation; data insertion by automated knowledge extraction from structured data sources
  0     automated data curation; data insertion by automated knowledge extraction from unstructured data sources

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be taken for defining the DQ metrics.

Trustworthiness on statement level. The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for assessing statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms¹⁶ with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology¹⁷ with properties such as prov:wasDerivedFrom.

¹⁶ See http://purl.org/dc/terms/, requested on Feb 4, 2017.

¹⁷ See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


m_fact(g) =
  1    provenance on statement level is used
  0.5  provenance on resource level is used
  0    otherwise
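A possible operationalization of m_fact can be sketched as follows. Which predicates indicate statement-level versus resource-level provenance is an assumption made here for illustration, as is the toy triple.

```python
# Sketch of m_fact: score 1 if statement-level provenance is present, 0.5 if
# only resource-level provenance is found, 0 otherwise. The assignment of
# predicates to the two levels below is an illustrative assumption.
STATEMENT_LEVEL = {"prov:wasDerivedFrom"}                 # attached to statement nodes
RESOURCE_LEVEL = {"dcterms:source", "dcterms:provenance"}  # attached to resources

def m_fact(triples):
    predicates = {p for _, p, _ in triples}
    if predicates & STATEMENT_LEVEL:
        return 1.0
    if predicates & RESOURCE_LEVEL:
        return 0.5
    return 0.0

# toy statement node with a Wikidata-style reference
triples = [("wds:Q76-abc", "prov:wasDerivedFrom", "wdref:123")]
print(m_fact(triples))  # 1.0
```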

Indicating unknown and empty values. If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow one to represent that a person has no children, and unknown values allow one to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_NoVal(g) =
  1    unknown and empty values are used
  0.5  either unknown or empty values are used
  0    otherwise

3.2.3. Consistency

Definition of dimension. Consistency implies that "two or more values [in a dataset] do not conflict each other" [37].

Discussion. Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource. This means that the subject is the only resource linked to the given object via the given relation.

Definition of metric. We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows:

Check of schema restrictions during insertion of new statements. Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side in the user interface; for instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

m_checkRestr(h_g) =
  1  schema restrictions are checked
  0  otherwise

Consistency of statements w.r.t. class constraints. This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. I.e., let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.¹⁸ Furthermore, let c_g(e) be the set of all classes of instance e in g, defined as c_g(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_conClass(g) = |{(c1, c2) ∈ CC | ¬∃e : (c1 ∈ c_g(e) ∧ c2 ∈ c_g(e))}| / |{(c1, c2) ∈ CC}|

In case of an empty set of class constraints CC, the metric should evaluate to 1.
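The metric m_conClass can be sketched over toy data as follows; the constraint set and the type statements are assumptions for this sketch.

```python
# Sketch of m_conClass: the share of owl:disjointWith constraints that are
# not violated by any instance. cg(e) is derived from rdf:type pairs, as in
# the definition above. Toy constraints and type statements.
def m_con_class(constraints, type_triples):
    """constraints: set of (c1, c2) pairs; type_triples: (instance, class) pairs."""
    if not constraints:
        return 1.0  # empty set of class constraints -> metric evaluates to 1
    classes_of = {}
    for e, c in type_triples:
        classes_of.setdefault(e, set()).add(c)
    satisfied = sum(
        1 for (c1, c2) in constraints
        if not any(c1 in cs and c2 in cs for cs in classes_of.values())
    )
    return satisfied / len(constraints)

constraints = {("dbo:Person", "dbo:Place"), ("dbo:Person", "dbo:Work")}
types = [("dbr:X", "dbo:Person"), ("dbr:X", "dbo:Place")]  # violates the 1st constraint
print(m_con_class(constraints, types))  # 0.5
```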

Consistency of statements w.r.t. relation constraints. The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_conRelat(g) = (1/n) Σ_{i=1}^{n} m_conRelat_i(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,¹⁹ we can state:

m_conRelat(g) = (m_conRelatRg(g) + m_conRelatFct(g)) / 2

Let R_r be the set of all rdfs:range constraints,

R_r = {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)},

¹⁸ Implicit restrictions which can be deduced from the class hierarchy (e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal) are not considered by us here.

¹⁹ We chose those relations (and not, for instance, owl:InverseFunctionalProperty) as only those relations are used by more than half of the considered KGs.

and let R_f be the set of all owl:FunctionalProperty constraints,

R_f = {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}.

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

m_conRelatRg(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ R_r : datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ R_r}|

m_conRelatFct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ R_f ∧ ¬∃(s, p, o2) ∈ g : o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ R_f}|

In case of an empty set of relation constraints (R_r or R_f), the respective metric should evaluate to 1.
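Both relation-constraint metrics can be sketched over toy triples of the form (s, p, o, datatype); the contents of R_r, R_f, and the example data are assumptions for this sketch.

```python
# Sketch of m_conRelatRg and m_conRelatFct. r_r maps range-constrained
# relations to their expected datatype; r_f is the set of functional
# relations. Toy data throughout.
def m_con_relat_rg(triples, r_r):
    scoped = [(o, dt, p) for _, p, o, dt in triples if p in r_r]
    if not scoped:
        return 1.0  # empty set of constraints -> metric evaluates to 1
    ok = sum(1 for o, dt, p in scoped if dt == r_r[p])
    return ok / len(scoped)

def m_con_relat_fct(triples, r_f):
    scoped = [(s, p, o) for s, p, o, _ in triples if p in r_f]
    if not scoped:
        return 1.0
    values = {}
    for s, p, o in scoped:
        values.setdefault((s, p), set()).add(o)
    # a triple is consistent if no second, differing object exists for (s, p)
    ok = sum(1 for s, p, o in scoped if len(values[(s, p)]) == 1)
    return ok / len(scoped)

triples = [
    ("dbr:A", "dbo:birthDate", "1879-03-14", "xsd:date"),
    ("dbr:A", "dbo:birthDate", "1900", "xsd:gYear"),  # wrong datatype, 2nd value
]
r_r = {"dbo:birthDate": "xsd:date"}
r_f = {"dbo:birthDate"}
print(m_con_relat_rg(triples, r_r))   # 0.5
print(m_con_relat_fct(triples, r_f))  # 0.0
```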

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy

Definition of dimension. Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion. According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension Relevancy is determined by the criterion Creating a ranking of statements.²⁰ The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows:

²⁰ We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions which he no longer holds are ranked as normal rank (wdo:NormalRank).

m_Ranking(g) =
  1  ranking of statements supported
  0  otherwise

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness

Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;

2. Column completeness, i.e., the extent to which values of relations on instance level (i.e., facts) are not missing; and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. Completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows:

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_cSchema(g) = noclat_g / noclat

Column completeness. In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation-pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1/|H|) Σ_{(k,p)∈H} no_kp / no_k

We thereby let H = {(k, p) ∈ (K × P) | k ∈ C_g ∧ ∃(x, p, o) : p ∈ P_g^imp ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k1, ..., kn} and considered relations P = {p1, ..., pm}.

Note that there are also relations which are dedicated to the instances of a specific class but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.²¹ For measuring the Column completeness, we selected only those relations for assessment where a value of the relation typically exists for all given instances.
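A sketch of m_cCol, assuming toy class memberships, facts, and class-relation pairs; in the study, the pairs H are derived from the KG itself.

```python
# Sketch of m_cCol: for each considered (class, relation) pair, the share of
# the class's instances that have a value for the relation, averaged over
# all pairs. All input data below is a toy assumption.
def m_c_col(members, facts, pairs):
    """members: class -> set of instances; facts: set of (instance, relation)
    pairs; pairs: considered (class, relation) combinations H."""
    if not pairs:
        return 1.0  # no pairs to assess -> metric evaluates to 1
    ratios = []
    for k, p in pairs:
        instances = members.get(k, set())
        if not instances:
            continue  # skip classes without instances
        with_value = sum(1 for e in instances if (e, p) in facts)
        ratios.append(with_value / len(instances))
    if not ratios:
        return 1.0
    return sum(ratios) / len(ratios)

members = {"dbo:Person": {"dbr:A", "dbr:B"}}
facts = {("dbr:A", "dbo:birthDate")}        # dbr:B lacks a birth date
pairs = [("dbo:Person", "dbo:birthDate")]
print(m_c_col(members, facts, pairs))  # 0.5
```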

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called "short head"; e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called "long tail"; e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|

3.3.3. Timeliness

Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

²¹ For an evaluation of predicting which relations are of this nature, see [1].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements.

The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows:

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately but the RDF export files are available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1     continuous updates
  0.5   discrete periodic updates
  0.25  discrete non-periodic updates
  0     otherwise

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start dates and possibly end dates of statements by means of providing suitable forms of representation.

m_Validity(g) =
  1  specification of validity period supported
  0  otherwise

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

mChange(g) =
  1  if the specification of modification dates for statements is supported
  0  otherwise
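The three Timeliness metrics can be sketched as a small scoring function. This is an illustrative toy, not the authors' implementation; the record field names ("update_mode", "supports_validity_period", "supports_modification_date") are our own assumptions.

```python
# Hypothetical scoring of the Timeliness criteria for a KG described by a
# small metadata record (field names are invented for illustration).
FREQ_SCORES = {
    "continuous": 1.0,              # edits visible immediately
    "discrete-periodic": 0.5,       # dumps at regular intervals
    "discrete-non-periodic": 0.25,  # dumps at irregular intervals
}

def m_freq(kg):
    return FREQ_SCORES.get(kg.get("update_mode"), 0.0)

def m_validity(kg):
    return 1.0 if kg.get("supports_validity_period") else 0.0

def m_change(kg):
    return 1.0 if kg.get("supports_modification_date") else 0.0

wikidata_like = {
    "update_mode": "continuous",
    "supports_validity_period": True,
    "supports_modification_date": True,
}
print(m_freq(wikidata_like), m_validity(wikidata_like), m_change(wikidata_like))
```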

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

3.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding human-readability) and (ii) Interoperability (i.e., regarding machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: a KG) can be improved, for instance, by descriptive labels and by literals in multiple languages.

Definition of metric. The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. this dimension is measured by the metrics mDescr, mLang, muSer, and muURI, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

mDescr(g) = |{u | u ∈ U_g^local ∧ ∃(u, p, o) ∈ g: p ∈ PlDesc}| / |{u | u ∈ U_g^local}|

PlDesc is the set of relations implicitly used in g to indicate that the value is a label or description (e.g., PlDesc = {rdfs:label, rdfs:comment}).
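The metric can be sketched as a toy computation over an in-memory triple list. This is an illustrative sketch only; the namespace, the example triples, and the predicate set are invented.

```python
# Illustrative computation of mDescr over a toy graph, with triples given as
# (subject, predicate, object) tuples. All URIs and prefixes are made up.
LOCAL = "http://example.org/"
P_LDESC = {"rdfs:label", "rdfs:comment", "schema:description"}

g = [
    (LOCAL + "e1", "rdfs:label", "Entity 1"),
    (LOCAL + "e1", "rdfs:comment", "A described entity"),
    (LOCAL + "e2", "dbo:birthPlace", LOCAL + "e1"),  # e2 has no label/description
]

def m_descr(graph, local_ns, p_ldesc):
    # Resources in the local namespace (subject position).
    local = {s for s, _, _ in graph if s.startswith(local_ns)}
    # Of those, the ones with at least one label or description.
    described = {s for s, p, _ in graph if s in local and p in p_ldesc}
    return len(described) / len(local) if local else 1.0

print(m_descr(g, LOCAL, P_LDESC))  # e1 is described, e2 is not -> 0.5
```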

(Note: the result of the evaluation on the basis of entities is also of interest. DBpedia deviates considerably here, since some entities (intermediate-node mappings) have no rdfs:label. Consequently, the definition of the metric is kept general (restricted to proprietary resources, i.e., those in the KG's own namespace), while the evaluation is carried out only on the entities.)

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language." The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

mLang(g) =
  1  if labels are provided in English and in at least one other language
  0  otherwise
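A toy check of this criterion can inspect the language tags of label literals. The row format (subject, predicate, literal, language tag) and the example data are our own assumptions.

```python
# Illustrative sketch of mLang: label triples carry a language tag, modeled
# here as (subject, predicate, literal, language_tag) rows (invented format).
labels = [
    ("ex:Berlin", "rdfs:label", "Berlin", "en"),
    ("ex:Berlin", "rdfs:label", "Berlino", "it"),
]

def m_lang(label_rows):
    langs = {lang for _, _, _, lang in label_rows if lang}
    # 1.0 iff English plus at least one further language is present.
    return 1.0 if "en" in langs and len(langs) > 1 else 0.0

print(m_lang(labels))  # English and Italian present -> 1.0
```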

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion via the serialization formats supported during the dereferencing of resources.

muSer(hg) =
  1  if RDF serializations other than RDF/XML are available
  0  otherwise

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier for humans to understand and memorize than opaque URIs,24

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.


such as wdt:Q1040. The criterion Self-describing URIs evaluates whether self-describing URIs or generic IDs are used for the identification of resources.

muURI(g) =
  1    if self-describing URIs are always used
  0.5  if self-describing URIs are partly used
  0    otherwise

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].

– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].

– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability. In contrast to the dimension Ease of understanding, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used: according to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of one's own RDF data. In this way, less data needs to be prepared for publication as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] observed that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even querying the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree with that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is indispensable for making statements about statements.

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics mReif, miSerial, mextVoc, and mpropVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. The use of RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

mReif(g) =
  1    if neither blank nodes nor RDF reification are used
  0.5  if either blank nodes or RDF reification are used
  0    otherwise
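A toy detector for both features can be sketched as follows. This is an illustrative assumption-laden sketch: blank nodes are modeled with the conventional `_:` prefix, and RDF standard reification is recognized via `rdf:type rdf:Statement` triples; the example graphs are invented.

```python
# Illustrative sketch of mReif on toy (s, p, o) tuples: blank nodes written
# as "_:bN", RDF standard reification recognized by rdf:type rdf:Statement.
def m_reif(graph):
    has_blank = any(t.startswith("_:") for s, _, o in graph for t in (s, o))
    has_reif = any(p == "rdf:type" and o == "rdf:Statement"
                   for _, p, o in graph)
    if not has_blank and not has_reif:
        return 1.0
    if has_blank != has_reif:  # exactly one of the two features is used
        return 0.5
    return 0.0

clean = [("ex:s", "ex:p", "ex:o")]
with_blank = clean + [("_:b0", "ex:p", "ex:o")]
print(m_reif(clean), m_reif(with_blank))  # 1.0 0.5
```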

Provisioning of several serialization formats. The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

miSerial(hg) =
  1    if RDF/XML and further formats are supported
  0.5  if only RDF/XML is supported
  0    otherwise

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows for comfortable data integration. We measure the criterion of using an external vocabulary by relating the number of triples with external vocabulary in predicate position to the number of all triples in the KG.

mextVoc(g) = |{(s, p, o) ∈ g | p ∈ P_g^external}| / |{(s, p, o) ∈ g}|

Interoperability of proprietary vocabulary. Linking on the schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on the schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources.

mpropVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g: p ∈ Peq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where Peq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that hg is not responsible for resolving these URIs.
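Both vocabulary metrics can be sketched on a toy graph. This is an illustrative computation under our own assumptions: the set of external predicates and the schema terms are passed in explicitly, and all namespaces are invented.

```python
# Illustrative toy computation of mextVoc and mpropVoc on (s, p, o) tuples.
EQ = {"owl:sameAs", "owl:equivalentProperty", "owl:equivalentClass"}

def m_ext_voc(graph, external_preds):
    # Fraction of triples whose predicate stems from an external vocabulary.
    ext = [t for t in graph if t[1] in external_preds]
    return len(ext) / len(graph) if graph else 1.0

def m_prop_voc(graph, schema_terms, local_ns):
    # Fraction of proprietary classes/relations with an equivalency link
    # to a URI outside the KG's own namespace.
    linked = {s for s, p, o in graph
              if s in schema_terms and p in EQ and not o.startswith(local_ns)}
    return len(linked) / len(schema_terms) if schema_terms else 1.0

g = [
    ("ex:Person", "owl:equivalentClass", "http://xmlns.com/foaf/0.1/Person"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "foaf:name", "Alice"),
]
schema = {"ex:Person", "ex:knows"}  # one of two schema terms is linked
print(m_ext_voc(g, {"rdf:type", "foaf:name"}), m_prop_voc(g, schema, "ex:"))
```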

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility
Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different factors influencing the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when a query is submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query itself, the size of the indexed data, the data structure, the triple store used, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) via SPARQL endpoints, (ii) via RDF dumps (export files), and (iii) via Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:

– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics mDeref, mAvai, mSPARQL, mExport, mNegot, mHTMLRDF, and mMeta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

mDeref(hg) = |dereferencable(U_g)| / |U_g|
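The check can be sketched without live HTTP traffic by classifying pre-recorded dereferencing results. This is an illustrative sketch under our own assumptions: each result is a (status code, returned content type) pair, and the set of RDF media types is a non-exhaustive selection.

```python
# Illustrative sketch of mDeref over pre-recorded dereferencing outcomes,
# each given as (status_code, returned_content_type); no live requests.
RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def dereferencable(status, content_type):
    # Successful dereferencing: HTTP 200 and an RDF document returned.
    return status == 200 and content_type in RDF_TYPES

def m_deref(results):
    ok = sum(1 for st, ct in results if dereferencable(st, ct))
    return ok / len(results) if results else 1.0

sample = [(200, "text/turtle"), (200, "text/html"),
          (404, None), (200, "application/rdf+xml")]
print(m_deref(sample))  # 2 of 4 resources dereference successfully -> 0.5
```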

Availability of the KG. The criterion Availability of the KG indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.26

mAvai(hg) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially including many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however,

26 See http://pingdom.com, requested on Mar 1, 2016.

we do not measure these restrictions here.

mSPARQL(hg) =
  1  if a SPARQL endpoint is publicly available
  0  otherwise

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user refrains from using it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

mExport(hg) =
  1  if an RDF export is available
  0  otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return, during the dereferencing of resources, RDF documents in the desired RDF serialization format. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content types are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, the server may return an incorrect content type, which can prevent serialized RDF data from being processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content types and by comparing the Accept header of the HTTP request with the content type of the HTTP response.

mNegot(hg) =
  1    if CN is supported and correct content types are returned
  0.5  if CN is supported but wrong content types are returned
  0    otherwise
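The comparison of requested and returned content types can be sketched over simulated request/response records. This is an illustrative sketch under our own assumptions about the record format (accept header, status code, returned content type); it is not the authors' measurement tooling.

```python
# Illustrative sketch of mNegot over simulated dereferencing records, each
# given as (accept_header, status_code, returned_content_type).
def m_negot(records):
    if not all(status == 200 for _, status, _ in records):
        return 0.0  # CN failed for at least one requested format
    if all(accept == returned for accept, _, returned in records):
        return 1.0  # CN supported and correct content types returned
    return 0.5      # CN works, but wrong content types are declared

good = [("text/turtle", 200, "text/turtle"),
        ("application/rdf+xml", 200, "application/rdf+xml")]
mislabeled = [("text/turtle", 200, "text/plain")]  # RDF declared text/plain
print(m_negot(good), m_negot(mislabeled))  # 1.0 0.5
```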

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

mHTMLRDF(hg) =
  1  if the Autodiscovery pattern is used at least once
  0  otherwise

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms for specifying metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later on in the data quality dimension License.

mMeta(g) =
  1  if machine-readable metadata about g is available
  0  otherwise

3.5.2. License
Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void/.

29 See http://creativecommons.org, requested on Mar 1, 2016.

contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 additionally requires that, if the data is published, it is published under the same legal conditions. CC032 declares the respective data as public domain and without any restrictions.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies point to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric mmacLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

mmacLicense(g) =
  1  if machine-readable licensing information is available
  0  otherwise
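A toy check of this criterion can scan both the KG and its VoID description for license predicates. This is an illustrative sketch; the triple lists and the example license URL are invented, and the predicate set is a non-exhaustive selection.

```python
# Illustrative sketch of mmacLicense: look for a machine-readable license
# triple either in the KG itself or in its separate VoID description.
LICENSE_PREDS = {"cc:license", "dcterms:license", "dcterms:rights"}

def m_mac_license(kg_triples, void_triples=()):
    triples = list(kg_triples) + list(void_triples)
    return 1.0 if any(p in LICENSE_PREDS for _, p, _ in triples) else 0.0

void_file = [("ex:dataset", "dcterms:license",
              "http://creativecommons.org/publicdomain/zero/1.0/")]
print(m_mac_license([], void_file))  # license stated in the VoID file -> 1.0
```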

3.5.3. Interlinking
Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.

31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.

32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.

33 Using the namespace http://creativecommons.org/ns#.


linked to each other, be it within or between two or more data sources" [49].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is usually established on the instance level via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries at different levels of granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35

(ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics mInst and mURIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it not only connects otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on the instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

34 See http://www.geonames.org, requested on Dec 31, 2016.

35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

38 The interlinking on the schema level is already measured via the criterion Interoperability of proprietary vocabulary.

mInst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
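The metric can be sketched on a toy graph. This is an illustrative computation under our own assumptions: the instance set and the local namespace prefix are passed in explicitly, and all identifiers are invented.

```python
# Illustrative toy computation of mInst: fraction of instances with at least
# one owl:sameAs link to a URI outside the KG's own namespace.
def m_inst(graph, instances, local_ns):
    linked = {s for s, p, o in graph
              if s in instances and p == "owl:sameAs"
              and not o.startswith(local_ns)}
    return len(linked) / len(instances) if instances else 1.0

g = [("ex:Berlin", "owl:sameAs", "http://sws.geonames.org/2950159/")]
print(m_inst(g, {"ex:Berlin", "ex:Karlsruhe"}, "ex:"))  # 1 of 2 -> 0.5
```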

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or to Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations; Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may no longer be available. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

mURIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g: p ∈ Peq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext} and resolvable(x) returns true if HTTP status code 200 is returned. Peq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In the case of an empty set A, the metric evaluates to 1.
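The classification of resolution outcomes can be sketched without live requests. This is an illustrative sketch; outcomes are modeled (as our own assumption) as either an HTTP status code or the string "timeout".

```python
# Illustrative sketch of mURIs over pre-recorded resolution outcomes: only
# HTTP 200 counts as resolvable; timeouts, 4xx, and 5xx responses do not.
def resolvable(outcome):
    return outcome == 200  # outcome is a status code or the string "timeout"

def m_uris(outcomes):
    if not outcomes:
        return 1.0  # an empty sample set A evaluates to 1 by definition
    return sum(1 for o in outcomes if resolvable(o)) / len(outcomes)

print(m_uris([200, 200, 404, "timeout"]))  # 2 of 4 resolvable -> 0.5
print(m_uris([]))                          # empty set -> 1.0
```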

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied, in the form of DQ metrics, to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions, and these dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category
  ∗ Accuracy
    ∗ Syntactic validity of RDF documents
    ∗ Syntactic validity of literals
    ∗ Semantic validity of triples
  ∗ Trustworthiness
    ∗ Trustworthiness on KG level
    ∗ Trustworthiness on statement level
    ∗ Using unknown and empty values
  ∗ Consistency
    ∗ Check of schema restrictions during insertion of new statements
    ∗ Consistency of statements w.r.t. class constraints
    ∗ Consistency of statements w.r.t. relation constraints

– Contextual category
  ∗ Relevancy
    ∗ Creating a ranking of statements
  ∗ Completeness
    ∗ Schema completeness
    ∗ Column completeness
    ∗ Population completeness
  ∗ Timeliness
    ∗ Timeliness frequency of the KG
    ∗ Specification of the validity period of statements
    ∗ Specification of the modification date of statements

– Representational data quality
  ∗ Ease of understanding
    ∗ Description of resources
    ∗ Labels in multiple languages
    ∗ Understandable RDF serialization
    ∗ Self-describing URIs
  ∗ Interoperability
    ∗ Avoiding blank nodes and RDF reification
    ∗ Provisioning of several serialization formats
    ∗ Using external vocabulary
    ∗ Interoperability of proprietary vocabulary

– Accessibility category
  ∗ Accessibility
    ∗ Dereferencing possibility of resources
    ∗ Availability of the KG
    ∗ Provisioning of public SPARQL endpoint
    ∗ Provisioning of an RDF export
    ∗ Support of content negotiation
    ∗ Linking HTML sites to RDF serializations
    ∗ Provisioning of KG metadata
  ∗ License
    ∗ Provisioning machine-readable licensing information
  ∗ Interlinking
    ∗ Interlinking via owl:sameAs
    ∗ Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 Uniprot,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.

40 There is also DBpedia live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.

41 See http://umbel.org, requested on Dec 31, 2016.

42 See http://musicbrainz.org, requested on Dec 31, 2016.

43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.

44 See http://www.dblp.org, requested on Dec 31, 2016.

45 See https://www.gutenberg.org, requested on Dec 31, 2016.

46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.

47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.

48 See http://www.uniprot.org, requested on Dec 31, 2016.

49 See http://bio2rdf.org, requested on Dec 31, 2016.

50 See the complete list of links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase51 is a KG announced by Metaweb Technologies, Inc. in 2007 and acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,52 FMD,53 and MusicBrainz.54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store, in a machine-processable way, millions of common sense facts such as "Every tree is a plant." The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.

52 See http://www.nndb.com, requested on Dec 31, 2016.

53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.

54 See http://musicbrainz.org, requested on Dec 31, 2016.

55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.

56 See http://www.cyc.com, requested on Dec 31, 2016.

57 See http://www.opencyc.org, accessed on Nov 1, 2016.

58 See http://researchcyc.com, requested on Dec 31, 2016.

59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata stores not only facts but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO60 (Yet Another Great Ontology) has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.
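Several of these key statistics can be computed directly from a KG's triple set. The following minimal sketch illustrates this on a handful of hypothetical toy triples; the ex: names are invented for illustration and are not taken from the analyzed KGs.

```python
# Toy triples for illustration only; literals are marked by surrounding
# quotes, mirroring the N-Triples distinction between resources and literals.
triples = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:population", '"3500000"'),
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:City", "rdfs:subClassOf", "ex:Place"),
]

num_triples = len(triples)
unique_subjects = {s for s, _, _ in triples}
unique_predicates = {p for _, p, _ in triples}
# For the unique-object count, literals are excluded (cf. Section 5.1.6).
unique_objects = {o for _, _, o in triples if not o.startswith('"')}

print(num_triples, len(unique_subjects),
      len(unique_predicates), len(unique_objects))
```

On the real KGs, the same counts are obtained analogously from the full N-Triples dumps.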

60. See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
61. See http://www.geonames.org, requested on Dec 31, 2016.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

5.1.1. Triples
Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets such as MusicBrainz have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefore.

Influence of reification on the number of triples. DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification, in general, describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.⁶² This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement by which the triple becomes identified. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a

62. The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

high number of unique subjects concerning the set of all triples.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.⁶³ Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but, in addition, each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.

5.1.2. Classes
Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.⁶⁴ Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead uses only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.
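The two counting methods can be sketched as follows; the data is a hypothetical toy example (invented ex: names), and which method applies depends on the KG, as explained above.

```python
# Toy schema triples; names are invented for illustration.
triples = [
    ("ex:City", "rdf:type", "owl:Class"),         # explicitly typed class
    ("ex:Capital", "rdfs:subClassOf", "ex:City"),  # class only via hierarchy
    ("ex:Berlin", "rdf:type", "ex:Capital"),
]

# Method 1: classes explicitly typed via rdfs:Class / owl:Class.
explicit_classes = {s for s, p, o in triples
                    if p == "rdf:type" and o in ("rdfs:Class", "owl:Class")}

# Method 2: every resource appearing on either side of rdfs:subClassOf.
hierarchy_classes = {x for s, p, o in triples
                     if p == "rdfs:subClassOf" for x in (s, o)}

print(sorted(explicit_classes | hierarchy_classes))
```

Note that ex:Capital is found only by the second method, which is why the applicable method must be chosen per KG.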

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does this gap between DBpedia and YAGO with respect to the number of classes arise, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63. In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping, see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).
64. The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type, and "instance of" (wdt:P31) in case of Wikidata) on the instance level into account. However, this would result only in a lower-bound estimation, as those classes which have no instances are not considered.


Fig. 1. Coverage of classes having at least one instance (in %, per KG).

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2, we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line de-

Table 1
Percentage of considered entities per KG for covered domains

                   DB    FB    OC    WD    YA
Reach of method    88%   92%   81%   41%   82%

creases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.
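The distribution underlying Fig. 2 can be derived by counting, for each class, the instances typed with it. A minimal sketch on invented toy data:

```python
from collections import Counter

# Toy rdf:type assertions; in Wikidata, "instance of" (wdt:P31)
# would play the role of rdf:type.
type_triples = [
    ("ex:e1", "rdf:type", "ex:City"),
    ("ex:e2", "rdf:type", "ex:City"),
    ("ex:e3", "rdf:type", "ex:Person"),
]

instances_per_class = Counter(cls for _, _, cls in type_triples)
# Classes ranked by descending instance count, as on the x-axis of Fig. 2.
ranked = instances_per_class.most_common()
print(ranked)
```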

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.⁶⁵ This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.⁶⁶ If the ratio were at 100%, we would be able to assign all entities of a KG to the chosen domains.
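The reach computation can be sketched as follows, on hypothetical entity sets; the union, not the sum, avoids double-counting entities that fall into several domains (footnote 66).

```python
# Invented toy data: entity sets per considered domain.
domain_entities = {
    "people": {"e1", "e2", "e3"},
    "media": {"e3", "e4"},  # e3 belongs to both domains
}
all_entities = {"e1", "e2", "e3", "e4", "e5"}

# Unique entities covered by any considered domain.
covered = set().union(*domain_entities.values())
reach = len(covered) / len(all_entities)
print(f"{reach:.0%}")
```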

Fig. 3 shows the number of entities per domain in the different KGs with a logarithmic scale. Fig. 4 presents

65. See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).
66. We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG.

Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organization. This is relatively few, considering the total amount of entities being around 18.7M and considering the number of organizations in other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has not so many organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times⁶⁷ and that about 16K classes were therefore not considered. It is possible that entities of the domain organization belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as short term for explicitly defined relations – refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which

67. This number is based on heuristics: we focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.

are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.

– In contrast, we use predicates to denote links used in the KG independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P_g^imp, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on schema level but not used on instance level.
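The difference between P_g and P_g^imp can be made concrete with a small sketch on invented toy triples: a relation may be declared on the schema level yet never occur as a predicate, and vice versa.

```python
# Toy triples; ex: names are invented for illustration.
triples = [
    ("ex:birthPlace", "rdf:type", "rdf:Property"),  # declared, never used
    ("ex:name", "rdf:type", "rdf:Property"),        # declared and used
    ("ex:Berlin", "ex:name", '"Berlin"'),
    ("ex:Berlin", "ex:population", '"3500000"'),    # used, never declared
]

# P_g: explicitly defined relations.
relations = {s for s, p, o in triples
             if p == "rdf:type" and o == "rdf:Property"}
# P_g^imp: predicates actually used in the data.
predicates = {p for _, p, _ in triples}

print("unused relations:", sorted(relations - predicates))
print("undeclared predicates:", sorted(predicates - relations))
```

The "unused relations" set corresponds to the zero-occurrence group discussed below for Fig. 5.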

Evaluation results.

Relations.

Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.⁶⁸ Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned⁶⁹ and may overlap until DBpedia version 2016-04.⁷⁰

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace. So-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot insert arbitrary new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68. See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69. For instance: the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70. For instance, dbp:alias and dbo:alias.

criteria are met.⁷¹ One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process can be mentioned as a likely reason why in Wikidata, in relative terms, more relations are actually used than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.

2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.

3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.

4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.

5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In case of

71. See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates.

Ranking regarding predicates. Freebase is here – like in case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows.

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once. This relativizes the high number. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 165 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) by means of an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows to refer to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,⁷² while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method. We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples where the predicates indicate class affiliations.

72. The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: In DBpedia and YAGO, entities are determined as being an instance of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances, including at least one entity.⁷³ In this way, abstract classes such as cych:ExistingObjectType are neglected.

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom, with only about 41K entities.

Differences in number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.⁷⁴

73. For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" from Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs, such as "Hab' den Himmel berührt", can be found.

2. In case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered – such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs, such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") of Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.⁷⁵ Presumably, the YAGO extraction system was unable to extract any

74. Those release tracks are expressed via freebase:music.release_track.
75. See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG.

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, as in the case of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |E_g|/|C_g|. Obvious is the difference between DBpedia and YAGO (despite the similar number of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as created manually) and in YAGO large (as created automatically).

Comparing number of instances with number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities, but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see beginning of Section 5.1.5).

5.1.6. Subjects and Objects
Evaluation method. The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) on the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementarily, the number of literals is given as O_g^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.
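The three sets can be computed per the definitions above; a minimal sketch with toy triples (literals marked by surrounding quotes, invented ex: names):

```python
# Toy triples distinguishing resources from literals.
triples = [
    ("ex:Berlin", "ex:locatedIn", "ex:Germany"),
    ("ex:Berlin", "rdfs:label", '"Berlin"'),
    ("ex:Germany", "rdfs:label", '"Germany"'),
]

S = {s for s, _, _ in triples}                           # S_g
O = {o for _, _, o in triples if not o.startswith('"')}  # O_g (URIs, blank nodes)
O_lit = {o for _, _, o in triples if o.startswith('"')}  # O_g^lit

print(len(S), len(O), len(O_lit))
```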

Ranking of KGs regarding number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of number of unique subjects to number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 8). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                                   DBpedia      Freebase       OpenCyc    Wikidata     YAGO
Number of triples |(s, p, o) ∈ g|                  411,885,960  3,124,791,156  2,412,520  748,530,833  1,001,461,792
Number of classes |C_g|                            736          53,092         116,822    302,280      569,751
Number of relations |P_g|                          2,819        70,902         18,028     1,874        106
No. of unique predicates |P_g^imp|                 60,231       784,977        165        4,839        88,736
Number of entities |E_g|                           4,298,433    49,947,799     41,029     18,697,897   5,130,031
Number of instances |I_g|                          20,764,283   115,880,761    242,383    142,213,806  12,291,250
Avg. number of entities per class |E_g|/|C_g|      5,840.3      940.8          0.35       61.9         9.0
No. of unique subjects |S_g|                       31,391,413   125,144,313    261,097    142,278,154  331,806,927
No. of unique non-literals in obj. pos. |O_g|      83,284,634   189,466,866    423,432    101,745,685  17,438,196
No. of unique literals in obj. pos. |O_g^lit|      161,398,382  1,782,723,759  1,081,818  308,144,682  682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO. Facts are stored as N-Quads in order to allow making statements about statements (for instance, storing the provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used on the first position of N-Triples. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge become lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes is highly varying among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs: MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to the entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO is concentrated on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in the Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3. Evaluation results for the KGs regarding the dimension Accuracy

             DB    FB   OC  WD    YA
m_synRDF     1     1    1   1     1
m_synLit     0.99  1    1   1     0.62
m_semTriple  0.99  <1   1   0.99  0.99

Syntactic validity of RDF documents m_synRDF

Evaluation method: For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as resource sample in each KG. In case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying if the document can be loaded into an RDF model of the Apache Jena Framework.77

Evaluation result: All considered KGs provide syntactically valid RDF documents. In case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and actually valid.

Syntactic validity of literals m_synLit

Evaluation method: We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains (namely people, cities, and books) and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains so many literals.

Evaluation results: All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking if xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the decimals 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The lowest fulfillment degree was obtained for DBpedia. We found the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals on the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena Framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.
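The manually created regular expressions described above can be approximated as follows (our own sketch; these are not the exact patterns used in the evaluation, and the ISBN pattern is a simplified stand-in for Gupta's expression, so it does not catch all length errors):

```python
import re

# Approximations of the three checks (illustrative, not the original patterns).
DATE = re.compile(r"-?\d{3,4}-\d{2}-\d{2}")     # xsd:date-like; rejects wildcards
INHABITANTS = re.compile(r"[0-9.,]+")           # digits, periods, and commas only
ISBN = re.compile(
    r"(?:ISBN(?:-1[03])?:?\s*)?"                # optional "ISBN" prefix
    r"(?:97[89][- ]?)?"                         # optional ISBN-13 prefix
    r"\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dX]"        # digit groups + check digit
)

def valid(literal, pattern):
    return pattern.fullmatch(literal) is not None

print(valid("470-##-##", DATE))         # False: YAGO wildcard date
print(valid("1940-05-01", DATE))        # True
print(valid("1,721,300", INHABITANTS))  # True
print(valid("ISBN 0755111974", ISBN))   # True: ISBN-10 with "ISBN" prefix
```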

Semantic validity of triples m_semTriple

Evaluation method: The semantic validity can be reliably measured by means of a reference data set which (i) contains at least to some degree the same facts as in the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file especially concerning persons and corporate bodies and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match with the values in the KG.

Evaluation result: We evaluated up to 400 facts per KG and observed only for a few facts some discrepancies. For instance, Wikidata states as death date of "Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.

During evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is in those cases hard to perform.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.
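Issue 3 can be addressed with a granularity-tolerant comparison; the following sketch (our own, not the matching procedure used in the study) treats a year-only value as compatible with any exact date in that year:

```python
def dates_compatible(a: str, b: str) -> bool:
    """Compare two date strings of possibly different granularity,
    e.g. '1857-04-24' vs. '1857'. Only the components present in both
    values are compared (assumes positive, ISO-ordered dates)."""
    pa = [int(x) for x in a.split("-")]
    pb = [int(x) for x in b.split("-")]
    n = min(len(pa), len(pb))
    return pa[:n] == pb[:n]

print(dates_compatible("1857-04-24", "1857"))        # True: year agrees
print(dates_compatible("1857-04-24", "1857-04-25"))  # False: day differs
```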

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possible wrong values within the interval are not detected.
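Such an interval-based test case can be sketched as follows (the threshold values are our own and purely illustrative):

```python
# Test-driven check in the spirit of Kontokostas et al.: flag person heights
# outside a plausible interval for manual inspection (bounds illustrative).
MIN_HEIGHT_M, MAX_HEIGHT_M = 0.4, 2.8

def outliers(heights):
    """Return entities whose height (in meters) lies outside the interval."""
    return [entity for entity, h in heights.items()
            if not (MIN_HEIGHT_M <= h <= MAX_HEIGHT_M)]

sample = {"PersonA": 1.78, "PersonB": 17.8, "PersonC": 1.60}  # 17.8: likely a unit error
print(outliers(sample))  # ['PersonB']
```

Note that a wrong value lying inside the interval (e.g., 1.60 instead of a true 1.75) goes unnoticed, which mirrors the limitation stated above.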

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level m_graph

Evaluation method: Regarding the trustworthiness of a KG in general, we differentiate between the method of how new data is inserted into the KG and the method of how existing data is curated.

Table 4. Evaluation results for the KGs regarding the dimension Trustworthiness

          DB    FB   OC  WD    YA
m_graph   0.5   0.5  1   0.75  0.25
m_fact    0.5   1    0   1     1
m_NoVal   0     1    0   1     0

86 With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Evaluation results: The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, which are listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. the community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level m_fact

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not accepted as sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements93 (74M), this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would be presumably highly subjective.

Freebase uses proprietary vocabulary for representing provenance via n-ary relations, which are in Freebase called Compound Value Types (CVTs); by means of CVTs, data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5. Evaluation results for the KGs regarding the dimension Consistency

              DB    FB    OC   WD    YA
m_checkRestr  0     1     0    1     0
m_conClass    0.88  1     <1   1     0.33
m_conRelat    0.99  0.45  1    0.50  0.99

Indicating unknown and empty values m_NoVal

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases.95 Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements m_checkRestr

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, are varying among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints m_conClass

Evaluation method: For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.

95 E.g., freebase:freebase.valuenotation.has_no_value.
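The direct-instantiation check described above amounts to testing whether any resource carries both classes of a disjoint pair (a sketch over made-up type assertions):

```python
# Hypothetical direct type assertions: resource -> set of asserted classes.
types = {
    "ex:x1": {"dbo:Plant"},
    "ex:x2": {"dbo:Animal"},
    "ex:x3": {"dbo:Plant", "dbo:Animal"},  # violates the disjointness axiom
}
disjoint_pairs = [("dbo:Plant", "dbo:Animal")]

def disjointness_violations(types, pairs):
    """Resources directly instantiated as both classes of a disjoint pair."""
    return [(r, c1, c2)
            for (c1, c2) in pairs
            for r, classes in types.items()
            if c1 in classes and c2 in classes]

violations = disjointness_violations(types, disjoint_pairs)
print(violations)  # [('ex:x3', 'dbo:Plant', 'dbo:Animal')]
```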

Evaluation results: We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three out of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

Consistency of statements w.r.t. relation constraints m_conRelat

Evaluation method: Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance on the object position of a triple, while owl:FunctionalProperty indicates that a relation should only be used at most once per resource. We only took datatype properties into account for this evaluation, since consistencies regarding object properties would require to distinguish the Open World assumption and the Closed World assumption.
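For owl:FunctionalProperty, the check reduces to counting subjects that use such a datatype property more than once (a sketch with toy triples; the relation name is illustrative):

```python
from collections import defaultdict

# Toy datatype-property triples: (subject, predicate, literal).
triples = [
    ("ex:p1", "ex:birthDate", "1857-04-24"),
    ("ex:p1", "ex:birthDate", "1858-01-01"),  # second value -> violation
    ("ex:p2", "ex:birthDate", "1901-02-03"),
]
functional_properties = {"ex:birthDate"}

def functional_violations(triples, functional_properties):
    """(subject, predicate) pairs where a functional property has >1 value."""
    counts = defaultdict(int)
    for s, p, _ in triples:
        if p in functional_properties:
            counts[(s, p)] += 1
    return sorted(sp for sp, n in counts.items() if n > 1)

print(functional_violations(triples, functional_properties))
# [('ex:p1', 'ex:birthDate')]
```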

Evaluation results: In the following, we consider the fulfillment degree for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the numbers of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy

           DB  FB  OC  WD  YA
m_Ranking  0   1   0   1   0

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example for a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used, though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily since they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase, 99.9% of the inconsistencies obtained here are caused by the usages of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements m_Ranking

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut down Freebase Search API provided a ranking for resources.98

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7. Evaluation results for the KGs regarding the dimension Completeness

                DB    FB    OC    WD    YA
m_cSchema       0.91  0.76  0.92  1     0.95
m_cColumn       0.40  0.43  0     0.29  0.33
m_cPop          0.93  0.94  0.48  0.99  0.89
m_cPop (short)  1     1     0.82  1     0.90
m_cPop (long)   0.86  0.88  0.14  0.98  0.88
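A consumer can exploit the Wikidata ranks described above to select the "best" value per relation; the following is our own sketch of the usual preferred-over-normal selection (deprecated statements are never returned):

```python
RANK_ORDER = {"normal": 1, "preferred": 2}

def best_values(statements):
    """Pick preferred-rank values if any exist, otherwise normal-rank ones;
    deprecated statements are skipped. `statements`: list of (value, rank)."""
    usable = [(v, r) for v, r in statements if r in RANK_ORDER]
    if not usable:
        return []
    top = max(RANK_ORDER[r] for _, r in usable)
    return [v for v, r in usable if RANK_ORDER[r] == top]

# Hypothetical population statements for one city:
stmts = [("1700000", "normal"), ("1721300", "preferred"), ("1500000", "deprecated")]
print(best_values(stmts))  # ['1721300']
```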

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness m_cSchema

Evaluation method: Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.
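Given such a gold standard, m_cSchema reduces to the fraction of gold-standard classes and relations found in a KG's schema (our own sketch with illustrative names, not the real 41 classes and 22 relations):

```python
def schema_completeness(gold_classes, gold_relations, kg_classes, kg_relations):
    """Fraction of gold-standard schema elements covered by the KG."""
    total = len(gold_classes) + len(gold_relations)
    found = len(gold_classes & kg_classes) + len(gold_relations & kg_relations)
    return found / total

# Illustrative mini gold standard and KG schema.
gold_c = {"Person", "City", "Book"}
gold_r = {"birthDate", "author"}
kg_c = {"Person", "City"}
kg_r = {"birthDate", "author"}
print(schema_completeness(gold_c, gold_r, kg_c, kg_r))  # 0.8 (4 of 5 found)
```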

Evaluation results: Generally, Wikidata performs optimally; also DBpedia, OpenCyc, and YAGO exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness, and its schema is mainly limited due to the characteristics of how information is stored in and extracted from Wikipedia.

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. We can mention as reason for such gaps in the modeling the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist and freebase:sports.pro_athlete are logically subclasses of the class people (freebase:people.person), but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as tree, but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.

101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes a quite high Schema completeness scoring. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology to let users propose new relations, to discuss about their outreach, and finally to approve or disapprove the relations seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since also our gold standard is already aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer the meaning. The relation yago:wasCreatedOnDate, for instance, can be used reasonably both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes using domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness m_cColumn

Evaluation method: For evaluating KGs w.r.t. Column completeness, for each KG 25 class-relation-combinations102 were created, based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8. Metric values of m_cCol for single class-relation-pairs

Relation          DB    FB    OC  WD    YA
Person-birthdate  0.48  0.48  0   0.70  0.77
Person-sex        -     0.57  0   0.94  0.64
Book-author       0.91  0.93  0   0.82  0.28
Book-ISBN         0.73  0.63  -   0.18  0.01
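For one class-relation pair, m_cCol is simply the share of the class's instances that have at least one value for the relation (a sketch with toy data; the pair name follows Table 8):

```python
def column_completeness(members, facts, relation):
    """Fraction of class members with at least one value for `relation`."""
    covered = sum(1 for e in members if (e, relation) in facts)
    return covered / len(members)

# Toy KG: four persons, three of them with a birth date.
persons = ["ex:p1", "ex:p2", "ex:p3", "ex:p4"]
facts = {("ex:p1", "birthDate"), ("ex:p2", "birthDate"), ("ex:p3", "birthDate")}
print(column_completeness(persons, facts, "birthDate"))  # 0.75
```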

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on the instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We notice the following observations with respect to the single KGs:

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can note, hence, that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that there are not only books modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also, the coverage of ISBNs for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

^102 The selection of class-relation pairs depended on which class-relation pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation pairs were used if 25 pairs were not available in the respective KG.

36 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.^103

YAGO: YAGO obtains a coverage of 63.5% for gender relations since, in contrast to DBpedia, it extracts this implicit information from Wikipedia.

Population completeness (m_cPop)

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,^104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called short head) and two rather unknown entities (called long tail) for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.^105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
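Given such a gold standard, Population completeness reduces to the fraction of gold-standard entities found in a KG. A minimal sketch (the entity labels below are hypothetical examples, not our actual gold standard):

```python
def population_completeness(kg_entities, gold_standard):
    """m_cPop: share of gold-standard entities contained in the KG."""
    found = [e for e in gold_standard if e in kg_entities]
    return len(found) / len(gold_standard)

# Toy example: short-head and long-tail athletes (illustrative labels).
gold = ["Michael Phelps", "Usain Bolt", "Maria Höfl-Riesch", "Some Local Athlete"]
kg = {"Michael Phelps", "Usain Bolt", "Maria Höfl-Riesch"}
print(population_completeness(kg, gold))  # 0.75
```

In practice, matching gold-standard entities to KG resources requires entity resolution (e.g., via labels and sameAs links) rather than plain string membership.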

^103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

^104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

^105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

^106 Note that selecting entities by their importance or popularity is hard in general and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since also Wikidata exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that those Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This results from the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: an entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measure that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness
The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Fig. 10. Population completeness regarding the different domains per KG. Bar chart (y-axis: 0–1) per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO), with bars for the domains People, Media, Organizations, Geography, and Biology.]

Table 9. Evaluation results for the KGs regarding the dimension Timeliness

            DB    FB   OC    WD   YA
m_Freq      0.5   0    0.25  1    0.25
m_Validity  0     1    0     1    1
m_Change    0     1    0     0    0

Timeliness frequency of the KG (m_Freq)

Evaluation results: The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.^107 Besides the static DBpedia, DBpedia Live^108 has been continuously updated by tracking changes in Wikipedia in real-time. However, it does not provide the full range of relations of DBpedia.

^107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. Always the latest DBpedia version is published online for dereferencing.

^108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until itsclose-down and is not updated anymore

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.^109 To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage^110 or via own processing using the Wikidata Toolkit^111).

YAGO has been updated less than once per year. YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements (m_Validity)

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

^109 See http://sw.opencyc.org, requested on Nov 8, 2016.
^110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.
^111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

Table 10. Evaluation results for the KGs regarding the dimension Ease of understanding

         DB    FB    OC   WD   YA
m_Descr  0.70  0.97  1    <1   1
m_Lang   1     1     0    1    1
m_uSer   1     1     0    1    1
m_uURI   1     0.5   1    0    1

DBpedia and OpenCyc do not provide any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997".
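The effect of such validity qualifiers can be illustrated with a small sketch that checks whether a reified statement holds at a given date; the item and property IDs below are chosen for illustration only:

```python
from datetime import date

# A reified statement with validity qualifiers, mirroring Wikidata's
# "start time" (wdt:P580) and "end time" (wdt:P582); IDs illustrative.
statement = {
    "subject": "wd:Q23",       # e.g., a person
    "property": "wdt:P39",     # e.g., a "position held" relation
    "value": "wd:Q11696",
    "start": date(1789, 4, 30),
    "end": date(1797, 3, 4),
}

def valid_at(stmt, when):
    """Check whether a statement holds at a given date."""
    return stmt["start"] <= when <= stmt["end"]

print(valid_at(statement, date(1790, 1, 1)))  # True
print(valid_at(statement, date(1800, 1, 1)))  # False
```

Without such qualifiers (as in DBpedia and OpenCyc), temporally limited facts can only be stated as if they held indefinitely.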

Specification of the modification date of statements (m_Change)

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (m_Descr)

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.^112

Evaluation result: For all KGs, the rule applies that in case there is no label available, usually there is also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of experimental nature and are most likely not used.^113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.^114

Labels in multiple languages (m_Lang)

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.^115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

^112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.
^113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.
^114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities of his career stations.
^115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.
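Language coverage of this kind can be computed directly from the language tags of rdfs:label literals. A minimal sketch over (entity, tagged-literal) pairs with toy data (the simple suffix check stands in for proper RDF literal parsing):

```python
# Toy (entity, label-literal) pairs; literals carry optional language tags.
labels = [
    ("ex:Berlin", '"Berlin"@en'),
    ("ex:Berlin", '"Berlin"@de'),
    ("ex:Paris", '"Paris"@en'),
    ("ex:Paris", '"Paris"@fr'),
    ("ex:Kyoto", '"Kyoto"'),        # no language tag
]

def language_coverage(pairs, lang):
    """Share of entities having at least one label in the given language."""
    entities = {s for s, _ in pairs}
    tagged = {s for s, lit in pairs if lit.endswith("@" + lang)}
    return len(tagged) / len(entities)

print(language_coverage(labels, "en"))  # 2 of 3 entities have an @en label
```

Untagged literals, as in the third entity above, are counted as carrying no language information, matching the assumption of footnote 115.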

Understandable RDF serialization (m_uSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (m_uURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.^116

5.2.8. Interoperability
The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (m_Reif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity are stored via n-ary relations.^117 YAGO stores facts as N-Quads in order to be able to store meta information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.

^116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation day of birth.

Table 11. Evaluation results for the KGs regarding the dimension Interoperability

           DB    FB    OC    WD    YA
m_Reif     0.5   0.5   0.5   0     0.5
m_iSerial  1     0     0.5   1     1
m_extVoc   0.61  0.11  0.41  0.68  0.13
m_propVoc  0.15  0     0.51  >0    0
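Loading quads as triples can be sketched as follows; this is a deliberately simplified N-Quads reader that assumes URI-only terms (no literals containing spaces) and hypothetical prefixed names:

```python
def quads_to_triples(nquad_lines):
    """Drop the fourth component (the statement/graph ID) of each quad,
    keeping only subject, predicate, and object -- as triple stores do
    when loading quad dumps as plain triples."""
    triples = []
    for line in nquad_lines:
        parts = line.strip().rstrip(" .").split()  # simplification: no spaces inside terms
        s, p, o = parts[0], parts[1], parts[2]
        triples.append((s, p, o))
    return triples

quads = [
    "<y:Einstein> <y:wasBornIn> <y:Ulm> <y:factId_1> .",
    "<y:Einstein> <rdf:type> <y:Physicist> <y:factId_2> .",
]
print(quads_to_triples(quads))
```

A production loader would use a proper N-Quads parser instead of whitespace splitting, but the principle of discarding the statement ID is the same.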

Blank nodes are non-dereferenceable anonymous resources. They are used by the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (m_iSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary (m_extVoc)

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG, we divide the number of occurrences of triples with external relations by the number of all relation occurrences (i.e., triples) in this KG.

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for that: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification: Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are taken for instantiations of statements, i.e., for reification.

^117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.
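The ratio m_extVoc can be sketched as the share of triples whose predicate does not belong to the KG's own namespaces; the prefixes and toy triples below are illustrative:

```python
def external_vocab_ratio(triples, internal_prefixes):
    """Share of triples whose predicate stems from an external vocabulary."""
    preds = [p for (_, p, _) in triples]
    external = [p for p in preds if not p.startswith(tuple(internal_prefixes))]
    return len(external) / len(preds)

# Toy DBpedia-style triples: two proprietary predicates, two external ones.
toy = [
    ("dbr:Karlsruhe", "dbo:country", "dbr:Germany"),
    ("dbr:Karlsruhe", "rdfs:label", '"Karlsruhe"@de'),
    ("dbr:Karlsruhe", "foaf:homepage", "http://www.karlsruhe.de"),
    ("dbr:Karlsruhe", "dbo:populationTotal", "307755"),
]
print(external_vocab_ratio(toy, ["dbo:", "dbp:"]))  # 0.5
```

As the Wikidata discussion above shows, the score is sensitive to which triples dominate a KG (labels, descriptions, reification), not only to how many external vocabularies are reused.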

Interoperability of proprietary vocabulary (m_propVoc)

Evaluation method: This criterion determines the extent to which URIs of proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,^118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. Our single findings are as follows:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.^119 Regarding its relations, DBpedia links to Wikidata and schema.org.^120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, but these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external linking via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked.^121

^118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL reference states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).
^119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.
^120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12. Evaluation results for the KGs regarding the dimension Accessibility

            DB    FB    OC    WD    YA
m_Deref     1     1     0.44  0.41  1
m_Avai      <1    0.73  <1    <1    1
m_SPARQL    1     1     0     1     0
m_Export    1     1     1     1     1
m_Negot     0.5   1     0     1     0
m_HTML_RDF  1     1     1     1     0
m_Meta      1     0     0     0     1

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links as external links for YAGO.

5.2.9. Accessibility
The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (m_Deref)

Evaluation method: We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP Accept header field set to application/rdf+xml in order to perform content negotiation.
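Content negotiation of this kind boils down to setting the Accept header on each request. The following sketch builds (but does not send) such a request with the Python standard library, so no network access is needed; the resource URI is an example:

```python
import urllib.request

# Build a dereferencing request with content negotiation, as done in the
# evaluation; calling urllib.request.urlopen(req) would actually send it.
req = urllib.request.Request(
    "http://dbpedia.org/resource/Karlsruhe",
    headers={"Accept": "application/rdf+xml"},
)
print(req.get_header("Accept"))  # application/rdf+xml
```

In the actual evaluation, one would additionally follow redirects, set a timeout, and record the HTTP status code and returned content type per URI.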

Evaluation results: In case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfill this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K due to the small number of unique predicates. We observed almost the same picture for YAGO, namely no notable errors during dereferencing.

^121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503, e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary; in our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (m_Avai)

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom.^122 For each KG, an uptime test was set up, which checked the availability of the resource Hamburg as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200) every minute over the time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result: While the other KGs showed almost no outages and were on average back online after some minutes, YAGO outages took place frequently and lasted on average 3.5 hours.^123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (m_SPARQL)

The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server,^124 the Wikidata SPARQL endpoint via Blazegraph.^125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

^122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.
^123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 1.5M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (m_Export)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (m_Negot)

We measured the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints of DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (m_HTML_RDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate" type="[content type]" href="[URL]"> in the HTML header.

^124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.
^125 See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13. Evaluation results for the KGs regarding the dimension License

              DB   FB   OC   WD   YA
m_macLicense  1    0    0    1    0
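Such alternate links can be discovered with a few lines using the standard-library HTML parser; the markup below is a hypothetical example page, not taken from any of the evaluated KGs:

```python
from html.parser import HTMLParser

class AlternateLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> targets from an HTML header."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.links.append((a.get("type"), a.get("href")))

page = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="/data/Karlsruhe.rdf"/></head><body></body></html>')
finder = AlternateLinkFinder()
finder.feed(page)
print(finder.links)  # [('application/rdf+xml', '/data/Karlsruhe.rdf')]
```

Checking this criterion for a KG then amounts to fetching the HTML representation of some resources and testing whether such a link element with an RDF content type is present.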

Provisioning of metadata about the KG (m_Meta)

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file.^126 DBpedia integrates the VoID vocabulary directly in its KG^127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License
The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (m_macLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC BY-SA^128 and the GNU Free Documentation License (GNU FDL).^129 Wikidata embeds licensing information in the RDF document during the dereferencing of resources by linking with cc:license to the license CC0.^130 YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC BY.^131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.^132

^126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

^127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

^128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

^129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

^130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

^131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

^132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14. Evaluation results for the KGs regarding the dimension Interlinking

        DB    FB    OC    WD       YA
m_Inst  0.25  0     0.38  0 (0.9)  0.31
m_URIs  0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking
The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (m_Inst)

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances but neither classes nor relations,^133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.

Evaluation result: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided, nor is a corresponding proprietary relation available. Instead, Wikidata uses a proprietary relation (called identifier) for each linked data set to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but often HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

^133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.
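Generating owl:sameAs candidates from such identifier literals is mostly a matter of URI templating. For example, a Freebase identifier literal can be mapped into the Freebase RDF namespace; the conversion rule below follows the namespace's m.xxx naming convention and is a sketch, not part of our evaluation:

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"

def freebase_id_to_uri(fb_id):
    """Map a Wikidata 'Freebase identifier' literal (wdt:P646),
    e.g. '/m/01x3gpk', to a Freebase RDF URI."""
    return FREEBASE_NS + fb_id.strip("/").replace("/", ".")

print(freebase_id_to_uri("/m/01x3gpk"))  # http://rdf.freebase.com/ns/m.01x3gpk
```

As noted above, such a mapping only yields a defensible owl:sameAs statement when the target identifier actually denotes an RDF resource rather than an arbitrary web page.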

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO mostly contains links to GeoNames and would be evaluated with just 0.01.

In case of OpenCyc, links to Cyc,^134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.^135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
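The error classes can be distinguished from the HTTP status code alone; the following sketch shows the bookkeeping (the actual check additionally performs the HTTP requests with a timeout):

```python
def classify_response(status, timed_out=False):
    """Bucket one link-check result as used for m_URIs."""
    if timed_out:
        return "timeout"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "ok"

def uri_validity(results):
    """m_URIs: share of external links that resolved without error.
    `results` is a list of (status_code, timed_out) pairs."""
    ok = [r for r in results if classify_response(*r) == "ok"]
    return len(ok) / len(results)

# Toy sample: two valid links, one 404, one 503, one timeout.
checks = [(200, False), (301, False), (404, False), (503, False), (0, True)]
print(uri_validity(checks))  # 0.4
```

Redirects (3xx) are counted as valid here, since they typically still lead to the intended resource.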

Evaluation result: The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources on wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation "reference URL" (wdt:P854), which, among other relations, states provenance information, belongs to the links linking to external Web resources. Here, we were able to resolve around 95.5% without errors.

^134 I.e., sw.cyc.com.
^135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.^136 One solution for such invalid links might be to remove them once they have been invalid for a certain time span.

5.2.12. Summary of Results
We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values; obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In case of Wikidata, some invalid literals, such as ISBNs, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBNs) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that the KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around 1/3 of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.
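The functional-property check can be sketched as a single scan over the triples. The helper below (an illustration with invented IRIs, not the article's evaluation code) reports subjects that bind more than one distinct object to a property declared functional:

```python
from collections import defaultdict

def functional_property_violations(triples, functional_props):
    """Return (subject, property) pairs that violate the
    owl:FunctionalProperty constraint, i.e. that have more than one
    distinct object for a functional property."""
    objects = defaultdict(set)
    for s, p, o in triples:
        if p in functional_props:
            objects[(s, p)].add(o)
    return {sp for sp, objs in objects.items() if len(objs) > 1}

# Toy example with hypothetical IRIs:
triples = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:birthDate", "1981-05-05"),  # second value -> violation
    ("ex:bob", "ex:birthDate", "1975-03-03"),
]
print(functional_property_violations(triples, {"ex:birthDate"}))
# {('ex:alice', 'ex:birthDate')}
```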

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is particularly worthwhile in the case of statements that are only valid for a limited period of time.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard existed in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.
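This notion of column completeness can be made concrete in a few lines. The sketch below (our own illustration; class and predicate names are made up) averages, over all predicates used by at least one instance of a class, the fraction of class instances using that predicate:

```python
from collections import defaultdict

def column_completeness(class_instances, triples):
    """class_instances: set of instances of one class;
    triples: iterable of (subject, predicate) pairs.
    Averages, over each predicate appearing on the class's instances,
    the fraction of instances that use it."""
    used_by = defaultdict(set)
    for s, p in triples:
        if s in class_instances:
            used_by[p].add(s)
    if not used_by:
        return 0.0
    n = len(class_instances)
    return sum(len(subs) / n for subs in used_by.values()) / len(used_by)

# Two instances; ex:name is used by both, ex:birthDate by one -> (1.0 + 0.5) / 2
score = column_completeness(
    {"ex:a", "ex:b"},
    [("ex:a", "ex:name"), ("ex:b", "ex:name"), ("ex:a", "ex:birthDate")],
)
print(score)  # 0.75
```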

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata provides the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as not easily understandable for humans.

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We can find mixed paradigms regarding URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (where, in part, classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; we can mention as a reason for that the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferencable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large number of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, lasting on average 3.5 hours each.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in the N-Triples and Turtle formats.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.
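Content negotiation can be probed by requesting a resource URI with an RDF media type in the Accept header and inspecting the returned Content-Type. A minimal standard-library sketch (the media-type list is our assumption, not the article's exact test set):

```python
from urllib.request import Request, urlopen

RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
}

def is_rdf_media_type(content_type: str) -> bool:
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in RDF_MEDIA_TYPES

def negotiates_rdf(url: str, accept: str = "text/turtle", timeout: int = 10) -> bool:
    """True if the server answers an RDF Accept header with an RDF
    Content-Type rather than, e.g., text/plain or text/html."""
    request = Request(url, headers={"Accept": accept})
    with urlopen(request, timeout=timeout) as response:
        return is_rdf_media_type(response.headers.get("Content-Type", ""))
```

A server returning text/plain for such a request, as observed for Freebase, would fail this check.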

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning of machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are, for all KGs, valid in most cases. DBpedia and OpenCyc contain many


Step 1: Requirements Analysis
– Identifying the preselection criteria P
– Assigning a weight wi to each DQ criterion ci ∈ C

Step 2: Preselection based on the Preselection Criteria
– Manually selecting the KGs GP that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
– Calculating the DQ metric mi(g) for each DQ criterion ci ∈ C
– Calculating the fulfillment degree h(g) for each KG g ∈ GP
– Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
– Assessing the selected KG g w.r.t. qualitative aspects
– Comparing the selected KG g with the other KGs in G \ GP

Fig. 11. Proposed process for using our KG recommendation framework.

owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria, and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion; the license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are discarded which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessment using the DQ metrics) and, if necessary, an alternative KG can be selected for the given scenario.
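Assuming that h(g) is the weight-normalized sum of the metric values mi(g) (the exact formula is given in Section 3.1), Step 3 boils down to a weighted average:

```python
def fulfillment_degree(metric_values, weights):
    """Weighted average of the DQ metric values m_i(g), assuming the
    weight-normalized form of h(g); criterion ids key both dicts."""
    total = sum(weights.values())
    return sum(weights[c] * metric_values[c] for c in weights) / total

# Toy example with two criteria (values and weights are illustrative):
h = fulfillment_degree({"mFreq": 1.0, "mcPop": 0.99}, {"mFreq": 3, "mcPop": 3})
print(round(h, 3))  # 0.995
```

With the weights set to the rightmost column of Table 15, this reproduces the kind of weighted averages shown in the table's last row.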

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about a musician, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. To be able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag each article based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate number of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians can be expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and that the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is thus recommended by the framework.

138 We assume that, in this use case, the dereferencing of HTTP URIs rather than the execution of SPARQL queries is desired.


Table 15. Framework with an example weighting which would be reasonable for a user setting as given in [33].

Dimension | Metric | DBpedia | Freebase | OpenCyc | Wikidata | YAGO | Example user weighting wi
Accuracy | msynRDF | 1 | 1 | 1 | 1 | 1 | 1
 | msynLit | 0.994 | 1 | 1 | 1 | 0.624 | 1
 | msemTriple | 0.990 | 0.995 | 1 | 0.993 | 0.993 | 1
Trustworthiness | mgraph | 0.5 | 0.5 | 1 | 0.75 | 0.25 | 0
 | mfact | 0.5 | 1 | 0 | 1 | 1 | 1
 | mNoVal | 0 | 1 | 0 | 1 | 0 | 0
Consistency | mcheckRestr | 0 | 1 | 0 | 1 | 0 | 0
 | mconClass | 0.875 | 1 | 0.999 | 1 | 0.333 | 0
 | mconRelat | 0.992 | 0.451 | 1 | 0.500 | 0.992 | 0
Relevancy | mRanking | 0 | 1 | 0 | 1 | 0 | 1
Completeness | mcSchema | 0.905 | 0.762 | 0.921 | 1 | 0.952 | 1
 | mcCol | 0.402 | 0.425 | 0 | 0.285 | 0.332 | 2
 | mcPop | 0.93 | 0.94 | 0.48 | 0.99 | 0.89 | 3
Timeliness | mFreq | 0.5 | 0 | 0.25 | 1 | 0.25 | 3
 | mValidity | 0 | 1 | 0 | 1 | 1 | 0
 | mChange | 0 | 1 | 0 | 0 | 0 | 0
Ease of understanding | mDescr | 0.704 | 0.972 | 1 | 0.9999 | 1 | 1
 | mLang | 1 | 1 | 0 | 1 | 1 | 0
 | muSer | 1 | 1 | 0 | 1 | 1 | 0
 | muURI | 1 | 0.5 | 1 | 0 | 1 | 1
Interoperability | mReif | 0.5 | 0.5 | 0.5 | 0 | 0.5 | 0
 | miSerial | 1 | 0 | 0.5 | 1 | 1 | 1
 | mextVoc | 0.61 | 0.108 | 0.415 | 0.682 | 0.134 | 1
 | mpropVoc | 0.150 | 0 | 0.513 | 0.001 | 0 | 1
Accessibility | mDeref | 1 | 0.437 | 1 | 0.414 | 1 | 2
 | mAvai | 0.9961 | 0.9998 | 1 | 0.9999 | 0.7306 | 2
 | mSPARQL | 1 | 0 | 0 | 1 | 1 | 1
 | mExport | 1 | 1 | 1 | 1 | 1 | 0
 | mNegot | 0.5 | 0 | 0 | 1 | 1 | 0
 | mHTMLRDF | 1 | 1 | 0 | 1 | 1 | 0
 | mMeta | 1 | 0 | 1 | 0 | 0 | 0
Licensing | mmacLicense | 1 | 0 | 0 | 1 | 0 | 0
Interlinking | mInst | 0.251 | 0 | 0.382 | 0 | 0.310 | 3
 | mURIs | 0.929 | 0.908 | 0.894 | 0.957 | 0.956 | 1
Unweighted Average | | 0.683 | 0.603 | 0.496 | 0.752 | 0.625 |
Weighted Average | | 0.701 | 0.493 | 0.556 | 0.714 | 0.648 |


4. Qualitative assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately.
The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discographies. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for this use case. Here, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in the existing literature. In summary, related work has mainly proposed generic guidelines for publishing Linked Data [26], DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (mgraph), Indicating unknown and empty values (mNoVal), Check of schema restrictions during insertion of new statements (mcheckRestr), Creating a ranking of statements (mRanking), Timeliness frequency of the KG (mFreq), Specification of the validity period of statements (mValidity), and Availability of the KG (mAvai), have not been proposed so far, to the best of our knowledge. In the following, we present single existing approaches for Linked Data quality criteria in more detail.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (mDescr) and Column completeness (mcCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduced further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (mSPARQL) and Provisioning of an RDF export (mExport). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (mpropVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16. Overview of related work regarding data quality criteria for KGs.

DQ Metric [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]

msynRDF X X

msynLit X X X X

msemTriple X X X X

mfact X X

mconClass X X X

mconRelat X X X X X X

mcSchema X X

mcCol X X X X

mcPop X X

mChange X X

mDescr X X X X

mLang X

muSer X

muURI X

mReif X X X

miSerial X

mextVoc X X

mpropV oc X

mDeref X X X X

mSPARQL X

mExport X X

mNegot X X X

mHTMLRDF X

mMeta X X X

mmacLicense X X X

mInst X X X

mURIs X X

Flemming [20] introduces a framework for the assessment of Linked Data quality. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (mLang) and Validity of external URIs (mURIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia, as well as four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF


literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy, as well as the consistency of data, in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates for tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. to a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality, using 34 DQ metrics.

Tartir et al. [45] introduce with the system OntoQA metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on the instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and its subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing, in a table, the most frequent classes with the highest numbers of instances. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of the KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means that if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
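The deduplication described above amounts to collecting, per domain, the set of distinct instances rather than summing per-class counts. A small sketch (our own illustration, with an invented class-to-domain mapping):

```python
from collections import defaultdict

def domain_coverage(instance_classes, class_to_domain):
    """Count each instance at most once per domain, even if several of
    its classes (e.g. dbo:Place and dbo:PopulatedPlace) map to the same
    domain."""
    members = defaultdict(set)
    for instance, classes in instance_classes.items():
        for cls in classes:
            if cls in class_to_domain:
                members[class_to_domain[cls]].add(instance)
    return {domain: len(insts) for domain, insts in members.items()}

# An instance typed as both classes is counted once in "geography":
coverage = domain_coverage(
    {"dbr:Karlsruhe": {"dbo:Place", "dbo:PopulatedPlace"}},
    {"dbo:Place": "geography", "dbo:PopulatedPlace": "geography"},
)
print(coverage)  # {'geography': 1}
```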

8. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., 41(3):16:1–16:52, July 2009.

M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO 51

[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. Accessed July 20, 2015.

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009, Heraklion, pages 723–737. Springer, Berlin Heidelberg, 2009.

[34] D Kontokostas P Westphal S Auer S HellmannJ Lehmann R Cornelissen and A Zaveri Test-drivenevaluation of linked data quality In Proceedings of the 23rdinternational conference on World Wide Web pages 747ndash758ACM 2014

[35] D Kontokostas A Zaveri S Auer and J LehmannTripleCheckMate A Tool for Crowdsourcing the QualityAssessment of Linked Data In Knowledge Engineering andthe Semantic Web ndash 4th International Conference KESW 2013St Petersburg Russia October 7-9 2013 Proceedings pages265ndash272 Springer 2013

[36] C Matuszek J Cabral M J Witbrock and J DeOliveira AnIntroduction to the Syntax and Content of Cyc In AAAI SpringSymposium Formalizing and Compiling Background

52 M Faumlrber et al Linked Data Quality of DBpedia Freebase OpenCyc Wikidata and YAGO

Knowledge and Its Applications to Knowledge Representationand Question Answering pages 44ndash49 AAAI - Association forthe Advancement of Artificial Intelligence 2006

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing Data Quality in Cooperative Information Systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355. Springer, Berlin Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: Things, Not Strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. Retrieved on Aug 29, 2016.

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandecic, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward Quality Data: An Attribute-Based Approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.


Discussion: Trustworthiness has been discussed in the literature as follows:

– Believability: According to Naumann [39], believability is the "expected accuracy" of a data source.

– Reputation: The essential difference of believability to accuracy is that for believability, data is trusted without verification [11]. Thus, believability is closely related to the reputation of a dataset.

– Objectivity: According to Naumann [39], the objectivity of a data source is strongly related to its verifiability: the more verifiable a data source or statement is, the more objective it is. The authors of this article would not go so far, since biased statements can also be verifiable.

– Verifiability: Heath et al. [26] emphasize that it is essential for trustworthy applications to be able to verify the origin of data.

In summary, believability considers the subject (data consumer) side, reputation takes the general social view on trustworthiness, objectivity considers the object (data provider) side, while verifiability focuses on the possibility of verification.

Definition of metric: We define the metric for the data quality dimension Trustworthiness as a combination of trustworthiness metrics on both KG and statement level. Believability and reputation are thereby covered by the DQ criterion Trustworthiness on KG level (metric m_graph(h_g)), while objectivity and verifiability are covered by the DQ criteria Trustworthiness on statement level (metric m_fact(g)) and Indicating unknown and empty values (metric m_NoVal(g)). Hence, the fulfillment degree of a KG g w.r.t. the dimension Trustworthiness is measured by the metrics m_graph, m_fact, and m_NoVal, which are defined as follows.

Trustworthiness on KG level: The measure of Trustworthiness on KG level gives a basic indication of the trustworthiness of the KG. In this assessment, the method of data curation as well as the method of data insertion is taken into account. Regarding the method of data curation, we distinguish between manual and automated methods. Regarding the data insertion, we differentiate between (1) whether the data is entered by experts (of a specific domain), (2) whether the knowledge comes from volunteers contributing in a community, and (3) whether the knowledge is extracted automatically from a data source. This data source can itself be either structured, semi-structured, or unstructured. We assume that a closed system, where experts or other registered users feed knowledge into the system, is less vulnerable to harmful behavior of users than an open system where data is curated by a community. Therefore, we assign the values of the metric for Trustworthiness on KG level as follows:

m_graph(h_g) =
  1     if manual data curation and manual data insertion in a closed system
  0.75  if manual data curation and insertion, both by a community
  0.5   if manual data curation, and data insertion by a community or by automated knowledge extraction
  0.25  if automated data curation, and data insertion by automated knowledge extraction from structured data sources
  0     if automated data curation, and data insertion by automated knowledge extraction from unstructured data sources

Note that all proposed DQ metrics should be seen as suggestions of how to formulate DQ metrics. Hence, other numerical values and other classification schemes (e.g., for m_graph(h_g)) might be used for defining the DQ metrics.

Trustworthiness on statement level: The fulfillment of Trustworthiness on statement level is determined by assessing whether a provenance vocabulary is used. By means of a provenance vocabulary, the source of statements can be stored. Storing source information is an important precondition for assessing statements easily w.r.t. semantic validity. We distinguish between provenance information provided for triples and provenance information provided for resources.

The most widely used ontologies for storing provenance information are the Dublin Core Metadata terms16 with properties such as dcterms:provenance and dcterms:source, and the W3C PROV ontology17 with properties such as prov:wasDerivedFrom.

16 See http://purl.org/dc/terms/, requested on Feb 4, 2017.

17 See https://www.w3.org/TR/prov-o/, requested on Dec 27, 2016.


m_fact(g) =
  1    if provenance on statement level is used
  0.5  if provenance on resource level is used
  0    otherwise

Indicating unknown and empty values: If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow to represent that a person has no children, and unknown values allow to represent that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG:

m_NoVal(g) =
  1    if unknown and empty values are used
  0.5  if either unknown or empty values are used
  0    otherwise

3.2.3. Consistency
Definition of dimension: Consistency implies that "two or more values [in a dataset] do not conflict each other" [37].

Discussion: Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse-functional relations specify that a value should only occur once per resource; this means that the subject is the only resource linked to the given object via the given relation.

Definition of metric: We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG, and (ii) whether already existing statements in the KG are consistent with specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements: Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side in the user interface; for instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

m_checkRestr(h_g) =
  1  if schema restrictions are checked
  0  otherwise

Consistency of statements w.r.t. class constraints: This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. That is, let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.18 Furthermore, let cg(e) be the set of all classes of instance e in g, defined as cg(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_conClass(g) = |{(c1, c2) ∈ CC | ¬∃e: (c1 ∈ cg(e) ∧ c2 ∈ cg(e))}| / |CC|

In case of an empty set of class constraints CC, the metric should evaluate to 1.
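As a minimal illustration of this definition, the following sketch computes m_conClass over a KG given as a set of (subject, predicate, object) string triples in compact prefix notation. This is an assumption of the sketch, not the authors' implementation; class-hierarchy reasoning (see footnote 18) is deliberately left out.

```python
def m_con_class(triples):
    """Share of owl:disjointWith constraints that no instance violates.

    triples: iterable of (subject, predicate, object) string tuples.
    Returns 1.0 if no disjointness constraints are present, as suggested
    in the text above."""
    constraints = {(c1, c2) for (c1, p, c2) in triples
                   if p == "owl:disjointWith"}
    if not constraints:
        return 1.0
    # collect the asserted classes per instance
    classes_of = {}
    for s, p, o in triples:
        if p == "rdf:type":
            classes_of.setdefault(s, set()).add(o)
    consistent = [(c1, c2) for (c1, c2) in constraints
                  if not any(c1 in cs and c2 in cs
                             for cs in classes_of.values())]
    return len(consistent) / len(constraints)
```

For example, a KG in which one of two disjointness axioms is violated by some instance yields a score of 0.5.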

Consistency of statements w.r.t. relation constraints: The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_conRelat(g) = (1/n) · Σ_{i=1..n} m_conRelat_i(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,19 we can state:

m_conRelat(g) = (m_conRelatRg(g) + m_conRelatFct(g)) / 2

Let Rr be the set of all rdfs:range constraints,

Rr = {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)},

and Rf be the set of all owl:FunctionalProperty constraints,

Rf = {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}.

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

18 Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

19 We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

m_conRelatRg(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr: datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr}|

m_conRelatFct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf ∧ ¬∃(s, p, o2) ∈ g: o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf}|

In case of an empty set of relation constraints (Rr or Rf), the respective metric should evaluate to 1.
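The two relation-constraint metrics and their average can be sketched as follows. The sketch assumes triples as string tuples and encodes literal datatypes N-Triples-style as "value^^xsd:type"; the toy datatype() helper is an assumption for illustration, not part of the formal definition.

```python
def datatype(o):
    # toy datatype lookup: literals are written as "value^^xsd:type"
    return o.split("^^", 1)[1] if "^^" in o else None

def m_con_relat_rg(triples, ranges):
    """ranges: dict mapping a relation to its declared rdfs:range datatype."""
    relevant = [(s, p, o) for (s, p, o) in triples if p in ranges]
    if not relevant:
        return 1.0
    consistent = [t for t in relevant if datatype(t[2]) == ranges[t[1]]]
    return len(consistent) / len(relevant)

def m_con_relat_fct(triples, functional):
    """functional: set of relations declared as owl:FunctionalProperty."""
    relevant = [(s, p, o) for (s, p, o) in triples if p in functional]
    if not relevant:
        return 1.0
    # a triple is consistent if its subject has no second, distinct value
    consistent = [(s, p, o) for (s, p, o) in relevant
                  if not any(s2 == s and p2 == p and o2 != o
                             for (s2, p2, o2) in relevant)]
    return len(consistent) / len(relevant)

def m_con_relat(triples, ranges, functional):
    # average of the two single metrics, as in the definition above
    return (m_con_relat_rg(triples, ranges)
            + m_con_relat_fct(triples, functional)) / 2
```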

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension: Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion: According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric: The dimension Relevancy is determined by the criterion Creating a ranking of statements.20 The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

20 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements: By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions which he no longer holds are ranked with normal rank (wdo:NormalRank).

m_Ranking(g) =
  1  if a ranking of statements is supported
  0  otherwise

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension: Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion: Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;

2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing; and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. Completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric: We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness. The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness: By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities, and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_cSchema(g) = noclat_g / noclat

Column completeness: In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_{k,p}, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1/|H|) · Σ_{(k,p) ∈ H} no_{k,p} / no_k

We thereby let H = {(k, p) ∈ K × P | k ∈ C_g ∧ ∃(x, p, o): p ∈ P^imp_g ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k1, ..., kn} and considered relations P = {p1, ..., pm}.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring the Column completeness, we selected only those relations for an assessment where a value of the relation typically exists for all given instances.
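The averaging in m_cCol can be illustrated with a small sketch, again assuming string triples in compact prefix notation; the pairs argument plays the role of the set H of class-relation combinations, and classes without instances are simply skipped.

```python
def m_c_col(triples, pairs):
    """Average, over the given (class, relation) pairs, of the share of
    the class's instances that have at least one value for the relation."""
    instances_of = {}   # class -> set of its instances (via rdf:type)
    for s, p, o in triples:
        if p == "rdf:type":
            instances_of.setdefault(o, set()).add(s)
    subjects_with = {}  # relation -> set of subjects using it
    for s, p, o in triples:
        subjects_with.setdefault(p, set()).add(s)
    ratios = []
    for k, r in pairs:
        members = instances_of.get(k, set())
        if members:
            ratios.append(len(members & subjects_with.get(r, set()))
                          / len(members))
    return sum(ratios) / len(ratios) if ratios else 1.0
```

With two instances of a class of which only one carries the assessed relation, the pair contributes a ratio of 0.5.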

Population completeness: The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|

3.3.3. Timeliness
Definition of dimension: Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion: Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources, which results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

21 For an evaluation of predicting which relations are of this nature, see [1].

Definition of metric: The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements. The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG: The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately, but the RDF export files are available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1     if continuous updates
  0.5   if discrete periodic updates
  0.25  if discrete non-periodic updates
  0     otherwise

Specification of the validity period of statements: Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start dates and, where applicable, end dates of statements by providing suitable forms of representation.

m_Validity(g) =
  1  if the specification of validity periods is supported
  0  otherwise

Specification of the modification date of statements: The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1  if the specification of modification dates for statements is supported
  0  otherwise


3.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension: The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion: This dimension focuses on the understandability of a data source by a human data consumer; in contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: a KG) can be improved by means such as descriptive labels and literals in multiple languages.

Definition of metric: The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. the dimension Ease of understanding is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources: Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

m_Descr(g) = |{u | u ∈ U^local_g ∧ ∃(u, p, o) ∈ g: p ∈ P_lDesc}| / |{u | u ∈ U^local_g}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or a description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).

Note that we keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the KG's own namespace), but perform the evaluation only on the entities. The evaluation results on the basis of entities are revealing: DBpedia, for instance, deviates considerably here, since some entities (created by intermediate-node mappings) have no rdfs:label.

Labels in multiple languages: Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG:

m_Lang(g) =
  1  if labels are provided in English and in at least one other language
  0  otherwise

Understandable RDF serialization: RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats, such as N3, N-Triples, and Turtle. We measure this criterion by determining the serialization formats supported during the dereferencing of resources:

m_uSer(h_g) =
  1  if RDF serializations other than RDF/XML are available
  0  otherwise

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs: Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and to memorize for humans compared to opaque URIs,24

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

such as wdt:Q1040. The criterion Self-describing URIs evaluates whether self-describing URIs or generic IDs are used for the identification of resources:

m_uURI(g) =
  1    if self-describing URIs are always used
  0.5  if self-describing URIs are partly used
  0    otherwise

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension: We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].

– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].

– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability: In contrast to the dimension Ease of understanding, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration of whether blank nodes are used: according to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency: In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation: Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. Those features should be avoided, according to Heath et al., in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may get complicated if RDF reification, RDF collections, and RDF containers are used. We agree on that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

Definition of metric: The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification,
– Provisioning of several serialization formats,
– Using external vocabulary, and
– Interoperability of proprietary vocabulary.

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification: The use of RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure this criterion by evaluating whether blank nodes and RDF reification are used:

m_Reif(g) =
  1    if no blank nodes and no RDF reification are used
  0.5  if either blank nodes or RDF reification are used
  0    otherwise

Provisioning of several serialization formats: The interpretability of RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing:

m_iSerial(h_g) =
    1,   if RDF/XML and further formats are supported
    0.5, if only RDF/XML is supported
    0,   otherwise

Using external vocabulary. Using a common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows a comfortable data integration. We measure the criterion of using an external vocabulary by setting the number of triples with external vocabulary in predicate position in relation to the number of all triples in the KG.

m_extVoc(g) = |{(s,p,o) | (s,p,o) ∈ g ∧ p ∈ P_g^external}| / |{(s,p,o) ∈ g}|
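As a sketch, this metric can be computed over a set of triples; the proprietary namespace and the example triples below are our own illustrative assumptions, not data from the paper:

```python
# Sketch of the m_extVoc metric: the share of triples whose predicate
# stems from an external vocabulary. LOCAL_NS is a hypothetical
# proprietary namespace of the KG; the triples are made up for illustration.
LOCAL_NS = "http://example.org/kg/"

def m_ext_voc(triples):
    """Ratio of triples with an external-vocabulary predicate."""
    if not triples:
        return 0.0
    external = sum(1 for _, p, _ in triples if not p.startswith(LOCAL_NS))
    return external / len(triples)

triples = [
    ("http://example.org/kg/Berlin",
     "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),  # external (RDFS)
    ("http://example.org/kg/Berlin",
     "http://example.org/kg/population", "3500000"),           # proprietary
]
print(m_ext_voc(triples))  # 0.5
```

A real evaluation would iterate over the KG's full triple set (e.g., an N-Triples dump) instead of an in-memory list, but the ratio computation stays the same.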

Interoperability of proprietary vocabulary. Linking on schema level means to link the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources.

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x,p,o) ∈ g: p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as being not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility
Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], the availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the daytime, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code, and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

m_Deref(h_g) = |dereferencable(U_g)| / |U_g|
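A minimal sketch of such a dereferencing check, assuming Python; the set of accepted RDF media types and the helper names are our own choices, not prescribed by the paper:

```python
import urllib.request

# RDF media types we accept as evidence that an RDF document was returned
# (an illustrative, non-exhaustive selection).
RDF_TYPES = {"application/rdf+xml", "text/turtle",
             "application/n-triples", "application/ld+json"}

def is_rdf_response(status, content_type):
    """Dereferencing counts as successful only with HTTP 200 and RDF content."""
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type in RDF_TYPES

def m_deref(uris, timeout=10):
    """Share of sample URIs that dereference to an RDF document."""
    ok = 0
    for uri in uris:
        req = urllib.request.Request(
            uri, headers={"Accept": "application/rdf+xml"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if is_rdf_response(resp.status,
                                   resp.headers.get("Content-Type", "")):
                    ok += 1
        except Exception:
            pass  # timeouts and HTTP errors count as failed dereferencing
    return ok / len(uris) if uris else 1.0

print(is_rdf_response(200, "text/turtle; charset=utf-8"))  # True
print(is_rdf_response(200, "text/plain"))                  # False
```

In practice one would run m_deref over a random sample of U_g rather than all URIs of a large KG.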

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability of dereferencing URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.^26

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (including potentially many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however, we do not measure these restrictions here.

^26 See http://pingdom.com, requested on Mar 1, 2016.

m_SPARQL(h_g) =
    1, if a SPARQL endpoint is publicly available
    0, otherwise

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

m_Export(h_g) =
    1, if an RDF export is available
    0, otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents, during the dereferencing of resources, in the desired RDF serialization format. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to the fact that serialized RDF data is not processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type, and by comparing the accept header of the HTTP request with the content type of the HTTP response.

m_Negot(h_g) =
    1,   if CN is supported and correct content types are returned
    0.5, if CN is supported but wrong content types are returned
    0,   otherwise
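The comparison of requested and returned content types can be sketched as follows; the scoring thresholds mirror the three cases of the metric, while the pair-based input format is our own simplification:

```python
# Illustrative set of RDF media types; not exhaustive.
RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def m_negot(responses):
    """responses: (requested Accept type, returned Content-Type) pairs
    collected from test dereferencing requests."""
    pairs = [(acc, ct.split(";")[0].strip().lower()) for acc, ct in responses]
    if all(acc == ct for acc, ct in pairs):
        return 1.0   # CN supported and correct content types returned
    if any(ct in RDF_TYPES for _, ct in pairs):
        return 0.5   # server negotiates, but returns wrong content types
    return 0.0

print(m_negot([("text/turtle", "text/turtle"),
               ("application/rdf+xml",
                "application/rdf+xml; charset=utf-8")]))  # 1.0
print(m_negot([("text/turtle", "application/rdf+xml")]))  # 0.5
print(m_negot([("text/turtle", "text/html")]))            # 0.0
```

Charset parameters are stripped before comparison, so `application/rdf+xml; charset=utf-8` still counts as a correct answer to an `application/rdf+xml` request.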

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource, in order to make the discovery of corresponding RDF data easier (for Linked Data aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.^27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

m_HTMLRDF(h_g) =
    1, if the Autodiscovery pattern is used at least once
    0, otherwise
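Detecting the Autodiscovery pattern in a page can be sketched with Python's standard HTML parser; the example page is hypothetical:

```python
from html.parser import HTMLParser

class AutodiscoveryFinder(HTMLParser):
    """Collects <link rel="alternate" type="application/rdf+xml" href="...">
    links from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and "rdf" in a.get("type", "")):
            self.links.append(a.get("href"))

html = ('<html><head><link rel="alternate" '
        'type="application/rdf+xml" href="company.rdf"></head></html>')
finder = AutodiscoveryFinder()
finder.feed(html)
m_htmlrdf = 1 if finder.links else 0
print(m_htmlrdf)  # 1
```

The `"rdf" in type` check only targets the RDF/XML case from the footnote example; a fuller implementation would match all RDF media types.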

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, also the meta-information about KGs needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary^28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also a meta-information about the KG) is considered in the data quality dimension License later on.

m_Meta(g) =
    1, if machine-readable metadata about g is available
    0, otherwise

3.5.2. License
Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)^29 publishes several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY^30 requires specifying the source of the data. CC-BY-SA^31 requires in addition that, if the data is published, it is published under the same legal conditions. CC0^32 defines the respective data as public domain and without any restrictions.

^27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.
^28 See namespace http://www.w3.org/TR/void.
^29 See http://creativecommons.org, requested on Mar 1, 2016.

Noteworthy is that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to become aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,^33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts, or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

m_macLicense(g) =
    1, if machine-readable licensing information is available
    0, otherwise
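A sketch of this check over a KG's triples; the license predicates are the ones named above, while the example triple is a hypothetical KG description:

```python
# Predicates that indicate machine-readable licensing information.
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",   # cc:license
    "http://purl.org/dc/terms/license",        # dcterms:license
    "http://purl.org/dc/terms/rights",         # dcterms:rights
}

def m_mac_license(triples):
    """1 if any triple states the KG's license machine-readably, else 0."""
    return 1 if any(p in LICENSE_PREDICATES for _, p, _ in triples) else 0

triples = [("http://example.org/kg",           # hypothetical dataset URI
            "http://purl.org/dc/terms/license",
            "http://creativecommons.org/licenses/by-sa/4.0/")]
print(m_mac_license(triples))  # 1
print(m_mac_license([]))       # 0
```

For KGs that publish their metadata in a separate VoID file, the same scan would be applied to the parsed VoID triples instead.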

^30 See https://creativecommons.org/licenses/by/4.0, requested on Mar 1, 2016.
^31 See https://creativecommons.org/licenses/by-sa/4.0, requested on Mar 1, 2016.
^32 See http://creativecommons.org/publicdomain/zero/1.0, requested on Mar 3, 2016.
^33 Using the namespace http://creativecommons.org/ns.

3.5.3. Interlinking
Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [49].

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is, on the instance level, usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,^34 namely (i) Berlin the capital,^35 (ii) Berlin the state,^36 and (iii) Berlin the city.^37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs; the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level^38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

^34 See http://www.geonames.org, requested on Dec 31, 2016.
^35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.
^36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.
^37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.
^38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
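A sketch of m_Inst over plain triples; the instance set, the local namespace, and the GeoNames target URI are illustrative assumptions:

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LOCAL_NS = "http://example.org/kg/"  # hypothetical namespace of the KG itself

def m_inst(instances, triples):
    """Share of instances with at least one owl:sameAs link to an external KG.
    An owl:sameAs target outside the KG's own namespace counts as external."""
    if not instances:
        return 1.0
    linked = {s for s, p, o in triples
              if p == SAME_AS and not o.startswith(LOCAL_NS)}
    return len(linked & instances) / len(instances)

instances = {"http://example.org/kg/Berlin", "http://example.org/kg/Paris"}
triples = [("http://example.org/kg/Berlin", SAME_AS,
            "http://sws.geonames.org/2950159/")]
print(m_inst(instances, triples))  # 0.5
```

The namespace test is a simplification of the U_g^ext membership check; a full implementation would use the KG's actual set of locally resolvable URIs.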

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations; Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs are not available anymore. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x,p,y) ∈ g: p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext} and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples for such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric evaluates to 1.
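The resolvable(x) test and the metric can be sketched as follows; the timeout value and the exact error classification are our own choices:

```python
import socket
import urllib.error
import urllib.request

def resolvable(uri, timeout=10):
    """True only if the URI answers with HTTP 200; timeouts, client errors
    (4xx), and server errors (5xx) count as invalid."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:                   # 4xx / 5xx responses
        return False
    except (urllib.error.URLError, socket.timeout):  # unreachable / timeout
        return False

def m_uris(sample):
    """Share of resolvable URIs; an empty sample set A evaluates to 1."""
    if not sample:
        return 1.0
    return sum(resolvable(u) for u in sample) / len(sample)

print(m_uris([]))  # 1.0
```

Note that HTTPError is caught before the more general URLError, since it is a subclass and would otherwise never be reached as a distinct case.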

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category
  * Accuracy
    * Syntactic validity of RDF documents
    * Syntactic validity of literals
    * Semantic validity of triples
  * Trustworthiness
    * Trustworthiness on KG level
    * Trustworthiness on statement level
    * Using unknown and empty values
  * Consistency
    * Check of schema restrictions during insertion of new statements
    * Consistency of statements w.r.t. class constraints
    * Consistency of statements w.r.t. relation constraints
– Contextual category
  * Relevancy
    * Creating a ranking of statements
  * Completeness
    * Schema completeness
    * Column completeness
    * Population completeness
  * Timeliness
    * Timeliness frequency of the KG
    * Specification of the validity period of statements
    * Specification of the modification date of statements
– Representational data quality
  * Ease of understanding
    * Description of resources
    * Labels in multiple languages
    * Understandable RDF serialization
    * Self-describing URIs
  * Interoperability
    * Avoiding blank nodes and RDF reification
    * Provisioning of several serialization formats
    * Using external vocabulary
    * Interoperability of proprietary vocabulary
– Accessibility category
  * Accessibility
    * Dereferencing possibility of resources
    * Availability of the KG
    * Provisioning of public SPARQL endpoint
    * Provisioning of an RDF export
    * Support of content negotiation
    * Linking HTML sites to RDF serializations
    * Provisioning of KG metadata
  * License
    * Provisioning machine-readable licensing information
  * Interlinking
    * Interlinking via owl:sameAs
    * Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia^39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia is updated roughly once a year.^40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,^41 GeoNames, MusicBrainz,^42 CIA World Factbook,^43 DBLP,^44 Project Gutenberg,^45 DBtune Jamendo,^46 Eurostat,^47 Uniprot,^48 and Bio2RDF.^49,^50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings: for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

^39 See http://dbpedia.org, requested on Nov 1, 2016.
^40 There is also DBpedia live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.
^41 See http://umbel.org, requested on Dec 31, 2016.
^42 See http://musicbrainz.org, requested on Dec 31, 2016.
^43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.
^44 See http://www.dblp.org, requested on Dec 31, 2016.
^45 See https://www.gutenberg.org, requested on Dec 31, 2016.
^46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.
^47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.
^48 See http://www.uniprot.org, requested on Dec 31, 2016.
^49 See http://bio2rdf.org, requested on Dec 31, 2016.
^50 See a complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase^51 is a KG announced by Metaweb Technologies, Inc. in 2007 and was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,^52 FMD,^53 and MusicBrainz.^54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.^55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc^56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store – in a machine-processable way – millions of common sense facts, such as "Every tree is a plant". The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG, called OpenCyc,^57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc^58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata^59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably, due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

^51 See http://freebase.com, requested on Nov 1, 2016.
^52 See http://www.nndb.com, requested on Dec 31, 2016.
^53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
^54 See http://musicbrainz.org, requested on Dec 31, 2016.
^55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
^56 See http://www.cyc.com, requested on Dec 31, 2016.
^57 See http://www.opencyc.org, accessed on Nov 1, 2016.
^58 See http://researchcyc.com, requested on Dec 31, 2016.
^59 See http://wikidata.org, accessed on Nov 1, 2016.

– YAGO: YAGO^60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.^61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

^60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
^61 See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples
Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefore.

Influence of reification on the number of triples.

DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples of DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.^62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identifiable. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

^62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.^63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.

5.1.2. Classes
Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.^64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead uses only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.
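The two counting strategies can be sketched over plain triples; the RDF/RDFS/OWL URIs below are the standard ones, while the example class names are hypothetical:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

def classes_by_typing(triples):
    """Classes explicitly declared, e.g. via rdf:type owl:Class."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}

def classes_by_hierarchy(triples, pred=SUBCLASS_OF):
    """Classes occurring in the subclass hierarchy; for Wikidata, pred
    would be the 'subclass of' relation (wdt:P279) instead."""
    return {x for s, p, o in triples if p == pred for x in (s, o)}

triples = [
    ("ex:City", RDF_TYPE, OWL_CLASS),        # explicitly declared class
    ("ex:Capital", SUBCLASS_OF, "ex:City"),  # appears only in the hierarchy
]
print(len(classes_by_typing(triples)))     # 1 (ex:City)
print(len(classes_by_hierarchy(triples)))  # 2 (ex:Capital, ex:City)
```

The tiny example already shows why the two methods can disagree: ex:Capital is only visible via the hierarchy, not via explicit typing.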

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does it come to this gap between DBpedia and YAGO with respect to the number of classes, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the mostly used infobox templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

^63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping, see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).
^64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and, in case of Wikidata, "instance of" (wdt:P31)) on the instance level into account. However, this would result only in a lower bound estimation, as here those classes are not considered which have no instances.

[Fig. 1: Coverage of classes having at least one instance, shown as a bar chart (coverage in %) per KG: DBpedia, Freebase, OpenCyc, Wikidata, YAGO.]

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                    DB    FB    OC    WD    YA
  Reach of method   88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the mostly used classes to the domains people, media, organizations, geography, and biology.^65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs, classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1 our method to de-termine the coverage of domains and hence the reachof our evaluation includes about 80 of all entities ofeach KG except Wikidata It is calculated as the ratio ofthe number of unique entities of all considered domainsof a given KG divided by the number of all entities ofthis KG66 If the ratio was at 100 we were able toassign all entities of a KG to the chosen domains
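The reach computation can be sketched as follows; the entity sets are invented placeholders, and the union over domains implements the uniqueness requirement from footnote 66:

```python
# Sketch of the reach computation (Table 1), under the assumption that
# per-domain entity sets are available; all names are illustrative.
domain_entities = {
    "people":        {"e1", "e2", "e3"},
    "media":         {"e3", "e4"},          # e3 belongs to two domains
    "organizations": {"e5"},
}
all_entities = {"e1", "e2", "e3", "e4", "e5", "e6"}

# Union, not sum: entities may be in several domains at the same time
# (footnote 66), so they must only be counted once.
covered = set().union(*domain_entities.values())
reach = len(covered) / len(all_entities)
print(f"{reach:.0%}")  # 83%
```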

Fig. 3 shows the number of entities per domain in the different KGs with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).

66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.

22 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Fig. 2. Distribution of classes w.r.t. the number of instances per KG (x-axis: classes; y-axis: number of instances; both logarithmic).

Fig. 3. Number of entities per domain (persons, media, organizations, geography, biology), on a logarithmic scale.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% would mean that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track alone is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain, in percent.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. One reason for that is the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organizations. This is relatively few, considering that the total number of entities is around 18.7M and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times,67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organizations belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates

Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations, as short term for explicitly defined relations, refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.

– In contrast, we use predicates to denote links used in the KG independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P_g^imp, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
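A minimal sketch of how the two key statistics are obtained and how they can diverge; the triples are invented, and rdf:type/rdf:Property typing stands in for the KG-specific schema-level declarations:

```python
# Illustrative sketch distinguishing P_g (explicitly declared relations)
# from P_g^imp (predicates actually used in triples); data is made up.
triples = [
    ("dbo:author", "rdf:type", "rdf:Property"),    # declared, never used
    ("dbo:birthDate", "rdf:type", "rdf:Property"), # declared and used
    ("dbr:Goethe", "dbo:birthDate", '"1749-08-28"'),
    ("dbr:Goethe", "dbp:name", '"Goethe"'),        # used, never declared
]

# P_g: relations explicitly typed as rdf:Property on the schema level.
relations = {s for s, p, o in triples
             if p == "rdf:type" and o == "rdf:Property"}

# P_g^imp: unique RDF terms on the predicate position of all triples.
predicates = {p for s, p, o in triples}

print(relations - predicates)  # declared but unused: {'dbo:author'}
```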

Evaluation results.

Relations.

Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 71K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and, hence, without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned69 and may overlap until DBpedia version 2016-04.70

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace. So-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot insert arbitrary new relations as it was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.

69 For instance: The DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.

70 For instance, dbp:alias and dbo:alias.

criteria are met.71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process is a likely reason why, in relative terms, more relations are actually used in Wikidata than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually both for YAGO and for DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.

2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.

3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications, for instance if only the year is known, are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.

4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.

5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.
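The grouping underlying Fig. 5 can be sketched as follows; the usage counts are invented, while in the evaluation they would come from counting each defined relation on the predicate position of the KG's triples:

```python
from collections import Counter

# Hypothetical usage counts: how often each defined relation occurs
# on the predicate position (0 means defined but never used).
usage = Counter({"relA": 0, "relB": 12, "relC": 700, "relD": 0})

def bucket(n: int) -> str:
    """Map a usage count to one of the three groups of Fig. 5."""
    if n == 0:
        return "0"
    return "1-500" if n <= 500 else ">500"

groups = Counter(bucket(n) for n in usage.values())
share = {g: c / len(usage) for g, c in groups.items()}
print(share)  # {'0': 0.5, '1-500': 0.25, '>500': 0.25}
```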

Predicates.

Ranking regarding predicates. Freebase is, like in the case of the ranking regarding relations, ranked first here. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows:

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates varies considerably here, since also facts are extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which, hence, occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once. This relativizes the high number. Most of the predicates are keys in the sense of ids and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 165 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) by means of an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows to refer to a value (in Wikidata terminology). Besides those extensions, there is the "r" extension to refer to a reference, and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].
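As an illustration, the expansion of a claim into such reified triples can be sketched as follows; the function name and the statement-node id are our own placeholders, and only the "s"/"v"/"q" extension scheme is taken from the description above:

```python
def reify(subject, prop, value, stmt_id, qualifiers=None):
    """Expand one (possibly qualified) claim into triples using the
    's' (statement) and 'v' (value) relation extensions; qualifier
    relations get the 'q' extension."""
    triples = [
        (subject, prop + "s", stmt_id),  # e.g. wdt:Q76 wdt:P31s wdt:Q76S123
        (stmt_id, prop + "v", value),    # e.g. wdt:Q76S123 wdt:P31v wdt:Q5
    ]
    for q_prop, q_val in (qualifiers or {}).items():
        triples.append((stmt_id, q_prop + "q", q_val))
    return triples

print(reify("wdt:Q76", "wdt:P31", "wdt:Q5", "wdt:Q76S123"))
# [('wdt:Q76', 'wdt:P31s', 'wdt:Q76S123'), ('wdt:Q76S123', 'wdt:P31v', 'wdt:Q5')]
```

This also makes visible why the number of unique predicates exceeds the number of relations: every relation spawns several derived predicates.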

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,72 while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities

Evaluation method. We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples where the predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky. In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.73 In this way, abstract classes such as cych:ExistingObjectType are neglected.

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom with only about 41K entities.

Differences in the number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English-speaking artists are covered, such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") of Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive/, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG.

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all artists' works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |E_g| / |C_g|. Obvious is the difference between DBpedia and YAGO, despite their similar number of entities. The reason for that is that the number of classes in the DBpedia ontology is small (as it was created manually), while in YAGO it is large (as it was created automatically).

Comparing the number of instances with the number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities but only a large fraction of them (more precisely, the classes with the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

Fig. 8. Ratio of the number of instances to the number of entities for each KG.

5.1.6. Subjects and Objects

Evaluation method. The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) on the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementary, the number of literals is given as O_g^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.
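These counts can be sketched over N-Triples lines as follows; the minimal term test (literals start with a double quote) ignores N-Triples corner cases and is for illustration only:

```python
# Rough sketch of the S_g / O_g / O_g^lit counts over N-Triples lines;
# the example lines and URIs are invented.
lines = [
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',
    '<http://ex.org/b> <http://ex.org/p> "a literal" .',
    '_:node1 <http://ex.org/p> <http://ex.org/a> .',
]

subjects, objects, literals = set(), set(), set()
for line in lines:
    # Strip the trailing " ." and split into the three term positions.
    s, p, o = line.rstrip(" .\n").split(None, 2)
    subjects.add(s)                        # S_g: URIs and blank nodes
    (literals if o.startswith('"') else objects).add(o)

print(len(subjects), len(objects), len(literals))  # 3 2 1
```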

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 9). We can observe that DBpedia has 2.65 times more unique objects than unique subjects, while YAGO, on the other side, has 19 times more unique subjects than unique objects.


Table 2
Summary of key statistics

                                             DBpedia        Freebase      OpenCyc       Wikidata          YAGO
Number of triples |{(s, p, o) ∈ g}|      411,885,960   3,124,791,156    2,412,520    748,530,833 1,001,461,792
Number of classes |C_g|                          736          53,092      116,822        302,280       569,751
Number of relations |P_g|                      2,819          70,902       18,028          1,874           106
No. of unique predicates |P_g^imp|            60,231         784,977          165          4,839        88,736
Number of entities |E_g|                   4,298,433      49,947,799       41,029     18,697,897     5,130,031
Number of instances |I_g|                 20,764,283     115,880,761      242,383    142,213,806    12,291,250
Avg. no. of entities per class |E_g|/|C_g|   5,840.3           940.8         0.35           61.9           9.0
No. of unique subjects |S_g|              31,391,413     125,144,313      261,097    142,278,154   331,806,927
No. of unique non-literals in obj. pos. |O_g|
                                          83,284,634     189,466,866      423,432    101,745,685    17,438,196
No. of unique literals in obj. pos. |O_g^lit|
                                         161,398,382   1,782,723,759    1,081,818    308,144,682   682,313,508

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing the provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used on the first position of N-Triples. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed into triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of the number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge are lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also, the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs. MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3
Evaluation results for the KGs regarding the dimension Accuracy

                DB     FB     OC     WD     YA
m_synRDF         1      1      1      1      1
m_synLit      0.99      1      1      1   0.62
m_semTriple   0.99     <1      1   0.99   0.99

Syntactic validity of RDF documents (m_synRDF).

Evaluation method. For evaluating the syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena Framework.77

Evaluation result. All considered KGs provide syntactically valid RDF documents. In case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit).

Evaluation method. We evaluate the syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains, namely people, cities, and books, and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org/, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a manually created regular expression (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.

Evaluation results. All KGs except YAGO performed very well regarding the syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena Framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We made the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data but also about curating given KG data. In case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals on the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of the data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
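For illustration, a widely used ISBN format regular expression of the kind referenced above (not necessarily the exact one from the evaluation) can be applied as follows; it checks only the format, not the check digit:

```python
import re

# Format check for ISBN-10 and ISBN-13, with or without an "ISBN"
# prefix and with or without delimiters; check digits are not verified.
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:?\ ?)?"
    r"(?=[0-9X]{10}$"
    r"|(?=(?:[0-9]+[-\ ]){3})[-\ 0-9X]{13}$"
    r"|97[89][0-9]{10}$"
    r"|(?=(?:[0-9]+[-\ ]){4})[-\ 0-9]{17}$)"
    r"(?:97[89][-\ ]?)?[0-9]{1,5}[-\ ]?[0-9]+[-\ ]?[0-9]+[-\ ]?[0-9X]$"
)

def looks_like_isbn(value: str) -> bool:
    return ISBN_RE.match(value) is not None

print(looks_like_isbn("ISBN 978-3-16-148410-0"))  # True
print(looks_like_isbn("9789780307986931"))        # False: 16 digits
```

The 16-digit Freebase value mentioned in footnote 82 is exactly the kind of literal such a check rejects.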

Semantic validity of triples (msemTriple)

Evaluation method The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 an authority file concerning especially persons and corporate bodies, which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states April 24 as the death date of "Anton Erkelenz" (wdt:Q589196), whereas GND states April 25. For DBpedia and YAGO we encountered 3 errors each, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).
83 See dbr:Prince_Caspian.
84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.
85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 31

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is hard to perform in those cases.

2. Contrary to our assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.
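Issue 3, matching values of different granularity, can be sketched as follows (a hedged illustration with hypothetical helper names; real reference values may also be BC dates, which the standard library type used here does not cover):

```python
from datetime import date

# Hedged sketch: compare an exact KG date against a reference value that
# may only be known at year or year-month granularity.
def dates_compatible(kg_date: date, reference: str) -> bool:
    """`reference` is an ISO-like string: 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'."""
    parts = [int(p) for p in reference.split("-")]
    if len(parts) == 1:                       # year only
        return kg_date.year == parts[0]
    if len(parts) == 2:                       # year and month
        return (kg_date.year, kg_date.month) == tuple(parts)
    return kg_date == date(*parts)            # exact date
```

Under this reading, a KG fact "born 1878-04-24" matches the reference "1878" but conflicts with the reference "1878-04-25".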

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be found easily, but possibly wrong values within the interval are not detected.
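The interval-based idea can be sketched as follows (a hedged illustration of the principle, not the actual implementation by Kontokostas et al.):

```python
# Hedged sketch: triples whose numeric value falls outside a plausibility
# interval for a given predicate are flagged for manual review.
def flag_outliers(triples, predicate, low, high):
    """`triples` is an iterable of (subject, predicate, numeric_value)."""
    return [
        (s, p, v)
        for s, p, v in triples
        if p == predicate and not (low <= v <= high)
    ]
```

As the text notes, such a test finds only outliers: a wrong but plausible value inside the interval passes unnoticed.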

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where the manual assessment of 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (mgraph)

Evaluation method Regarding the trustworthiness of a KG in general, we differentiate between the method of how new data is inserted into the KG and the method of how existing data is curated.

86 With a weighted average of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4
Evaluation results for the KGs regarding the dimension Trustworthiness

         DB    FB    OC    WD    YA
mgraph   0.5   0.5   1     0.75  0.25
mfact    0.5   1     0     1     1
mNoVal   0     1     0     1     0

Evaluation results The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. community involvement: any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level (mfact)

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial and the fulfillment degree is hence of a rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import data automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).
88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times higher than the number of instances in the KG. The reason for this is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not regarded as sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 7.4M, this corresponds to a coverage of around 13%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.
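The coverage figure can be reproduced with a back-of-the-envelope calculation (assuming the rounded counts stated above):

```python
# Back-of-the-envelope check of the coverage figure above, assuming the
# rounded counts of ~971K wdo:Reference and ~7.4M wdo:Statement instances.
references = 971_000
statements = 7_400_000
coverage = references / statements
print(f"{coverage:.1%}")  # → 13.1%
```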

Freebase uses a proprietary vocabulary for representing provenance: via n-ary relations, which in Freebase are called Compound Value Types (CVTs), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.
90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).
91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.
92 This is the number of instances of wdo:Reference.
93 This is the number of instances of wdo:Statement.
94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5
Evaluation results for the KGs regarding the dimension Consistency

             DB    FB    OC    WD    YA
mcheckRestr  0     1     0     1     0
mconClass    0.88  1     <1    1     0.33
mconRelat    0.99  0.45  1     0.50  0.99

Indicating unknown and empty values (mNoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

Freebase supports the representation of unknown values and empty values by providing explicit relations for such cases.95 In YAGO, inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (mcheckRestr)

The values of the metric mcheckRestr, indicating checks of restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks of schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (mconClass)

Evaluation method For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only such relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and as dbo:Animal.

95 E.g., freebase:freebase.valuenotation.has_no_value.

Evaluation results We obtained mixed results here; only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.
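The disjointness check described above can be sketched as follows (a hedged illustration; the function and variable names are our own):

```python
# Hedged sketch of the mconClass check: given owl:disjointWith pairs and
# the direct class memberships of resources, report every resource that
# is instantiated with both classes of a disjoint pair.
def disjointness_violations(disjoint_pairs, instance_types):
    """`instance_types` maps a resource IRI to its set of direct classes."""
    violations = []
    for resource, types in instance_types.items():
        for class_a, class_b in disjoint_pairs:
            if class_a in types and class_b in types:
                violations.append((resource, class_a, class_b))
    return violations
```

Applied to the DBpedia example above, every resource typed as both dbo:Agent and dbo:Place would be reported once per violated pair.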

Consistency of statements w.r.t. relation constraints (mconRelat)

Evaluation method Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistency checks regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.

Evaluation results In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year as well as an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6
Evaluation results for the KGs regarding the dimension Relevancy

          DB    FB    OC    WD    YA
mRanking  0     1     0     1     0

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of its usages, the data type xsd:gYear is used instead.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify a cardinality restriction by setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.
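The functional-property check can be sketched as follows (a hedged illustration over an in-memory triple list, not the evaluation code actually used):

```python
from collections import defaultdict

# Hedged sketch of the mconRelat functional-property check: a functional
# (datatype) property may appear at most once per subject, so subjects
# with more than one distinct value violate the constraint.
def functional_property_violations(triples, functional_props):
    values = defaultdict(set)             # (subject, predicate) -> objects
    for s, p, o in triples:
        if p in functional_props:
            values[(s, p)].add(o)
    return {key for key, objs in values.items() if len(objs) > 1}
```

A subject with two distinct birth dates, for instance, would be reported, while repeated assertions of the same value would not.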

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (mRanking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

Table 7
Evaluation results for the KGs regarding the dimension Completeness

              DB    FB    OC    WD    YA
mcSchema      0.91  0.76  0.92  1     0.95
mcColumn      0.40  0.43  0     0.29  0.33
mcPop         0.93  0.94  0.48  0.99  0.89
mcPop (short) 1     1     0.82  1     0.90
mcPop (long)  0.86  0.88  0.14  0.98  0.88

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (mcSchema)

Evaluation method Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia DBpedia shows a good score regarding Schema completeness; its schema is mainly limited due to the characteristics of how information is stored in and extracted from Wikipedia.

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.
99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison/, requested on Jan 29, 2017.

1. Classes The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations Relations are covered considerably well in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and is not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase Freebase shows a very ambivalent Schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes Freebase lacks a class hierarchy, and subclasses of classes are often located in different Freebase domains (for instance, the classes freebase:music.artist for musicians and freebase:sports.pro_athlete for sportsmen are logically subclasses of the class freebase:people.person, but are not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthily, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as a tree, but via the generic class freebase:biology.organism_classification.

2. Relations Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.
101 Freebase ID freebase:m.0htd3.


OpenCyc In total, OpenCyc exposes quite a high Schema completeness score. This is due to the fact that OpenCyc has been created manually and focuses on generic and common-sense knowledge.

1. Classes The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has considerably fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discussing their outreach, and finally approving or disapproving them seems to be appropriate.

YAGO Due to its concentration on modeling classes, YAGO shows one of the best overall Schema completeness fulfillment scores among the KGs.

1. Classes To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we measure a full completeness score for the YAGO classes.

2. Relations The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the dedicated relation dbp:foundationYear. Often, the meaning of YAGO relations becomes fully clear only after considering the associated classes via the domain and range of the relations. Expanding the YAGO schema with further, more fine-grained relations appears reasonable.

Column completeness (mcColumn)

Evaluation method For evaluating the KGs w.r.t. Column completeness, 25 class-relation combinations102 were created for each KG, based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8
Metric values of mcColumn for single class-relation pairs

Relation          DB    FB    OC    WD    YA
Person–birthdate  0.48  0.48  0     0.70  0.77
Person–sex        –     0.57  0     0.94  0.64
Book–author       0.91  0.93  0     0.82  0.28
Book–ISBN         0.73  0.63  –     0.18  0.01
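The per-pair metric can be sketched as follows (a hedged illustration; the data-structure choices are our own):

```python
# Hedged sketch of the mcColumn computation: for a class-relation pair,
# the score is the fraction of the class's instances that have at least
# one value for the relation.
def column_completeness(instances, facts, relation):
    """`instances` is the set of subjects belonging to the evaluated class;
    `facts` maps (subject, predicate) pairs to values."""
    if not instances:
        return 0.0
    covered = sum(1 for s in instances if (s, relation) in facts)
    return covered / len(instances)
```

For example, if three of four book instances carry an author value, the Book–author pair scores 0.75.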

Evaluation results In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation pairs which are well represented on the instance level, while the remaining pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs:

DBpedia DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also the coverage of ISBNs for books is quite high (63.4%).

OpenCyc OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation pairs depended on which class-relation pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation pairs were used if 25 pairs were not available in the respective KG.

Wikidata Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO YAGO obtains a coverage of 63.5% for gender relations since, in contrast to DBpedia, it extracts this implicit information from Wikipedia.

Population completeness (mcPop)

Evaluation method In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called "short head") and two rather unknown entities (called "long tail") for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements: for instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
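The resulting score can be sketched as follows (a hedged illustration; in practice the matching of gold-standard entities to KG resources requires disambiguation rather than plain name lookup):

```python
# Hedged sketch of the mcPop computation: the fraction of gold-standard
# entities (here naively identified by name) that can be found in a KG.
def population_completeness(gold_entities, kg_entities):
    found = sum(1 for e in gold_entities if e in kg_entities)
    return found / len(gold_entities)
```

Computed separately over the 50 short-head and 50 long-tail entities, this yields the mcPop (short) and mcPop (long) rows of Table 7.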

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.
104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison/, requested on Jan 29, 2017.
105 See http://www.iucnredlist.org, requested on Apr 2, 2016.
106 Note that selecting entities by their importance or popularity is hard in general and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains of each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO as instances. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: while most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This results from the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: a Wikidata entry is added as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measured that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs, Cyc and ResearchCyc, are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Fig. 10 shows a bar chart: one group of bars per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO), with the y-axis ranging from 0 to 1 and one bar per domain (People, Media, Organizations, Geography, Biology).]

Fig. 10. Population completeness regarding the different domains per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

           DB    FB    OC    WD    YA
mFreq      0.5   0     0.25  1     0.25
mValidity  0     1     0     1     1
mChange    0     1     0     0     0

Timeliness frequency of the KG (mFreq)

Evaluation results The KGs are very diverse regarding the frequency with which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the timeliness frequency of a KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.107 Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.
108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its shutdown and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year. YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements (mValidity)

Evaluation results Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

109 See http://sw.opencyc.org, requested on Nov 8, 2016.
110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.
111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

Table 10
Evaluation results for the KGs regarding the dimension Ease of understanding

        DB    FB    OC    WD    YA
mDescr  0.70  0.97  1     <1    1
mLang   1     1     0     1     1
muSer   1     1     0     1     1
muURI   1     0.5   1     0     1

DBpedia and OpenCyc do not realize any specifi-cation possibility In YAGO Freebase and Wikidatathe temporal validity period of statements can be spec-ified In YAGO this modeling possibility is madeavailable via the relations yagooccursSinceyagooccursUntil and yagooccursOnDateWikidata provides the relations ldquostart timerdquo (wdtP580)and ldquoend timerdquo (wdtP582) In Freebase CompoundValue Types (CVTs) are used to represent relations withhigher arity [44] As part of this representation validityperiods of statements can be specified An example isldquoVancouverrsquos population in 1997rdquo

Specification of the modification date of statementsmChange

Evaluation results The modification date of state-ments can only be specified in Freebase but not in theother KGs Together with the criteria on Timelinessthis reflects that the considered KGs are mostly notsufficiently equipped with possibilities for modelingtemporal aspects within and about the KG

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (m_Descr)

Evaluation method. We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.112

Evaluation result. For all KGs, the rule applies that if no label is available, usually no description is available either. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.114

Labels in multiple languages (m_Lang)

Evaluation method. Here we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.
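The measurement can be sketched as follows: a minimal script that counts the language tags of rdfs:label literals in N-Triples data. The regular expression and the toy triples are simplifications for illustration; the actual evaluation operated on the full KG dumps.

```python
import re
from collections import Counter

# Minimal sketch: count language annotations of rdfs:label literals in
# N-Triples data. The sample triples below are made up for illustration.
LABEL_RE = re.compile(
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"(?:[^"\\]|\\.)*"(?:@([A-Za-z-]+))?')

def language_distribution(ntriples: str) -> Counter:
    counts = Counter()
    for line in ntriples.splitlines():
        m = LABEL_RE.search(line)
        if m:
            # Literals without a tag carry no language information.
            counts[m.group(1) or "untagged"] += 1
    return counts

data = '''\
<http://example.org/e1> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .
<http://example.org/e1> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .
<http://example.org/e2> <http://www.w3.org/2000/01/rdf-schema#label> "Paris" .
'''
print(language_distribution(data))  # Counter({'en': 1, 'de': 1, 'untagged': 1})
```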

Evaluation results. DBpedia provides labels in 13 languages; further languages are provided by the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also cover many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages. We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.

coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

Understandable RDF serialization (m_uSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (m_uURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In the case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116

5.2.8. Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (m_Reif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In the case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11
Evaluation results for the KGs regarding the dimension Interoperability

            DB    FB    OC    WD    YA
m_Reif      0.5   0.5   0.5   0     0.5
m_iSerial   1     0     0.5   1     1
m_extVoc    0.61  0.11  0.41  0.68  0.13
m_propVoc   0.15  0     0.51  >0    0

are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.

Blank nodes are non-dereferencable, anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (m_iSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in the Turtle format.

Using external vocabulary (m_extVoc)

Evaluation method. This criterion indicates the extent to which external vocabulary is used. For that, for each KG we divide the number of occurrences of triples with external relations by the number of all relations in this KG.
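The idea of the metric can be sketched as follows, assuming triples are given as (subject, predicate, object) tuples. The internal namespaces and the toy triples (here in DBpedia style) are illustrative, not the measured data.

```python
# Illustrative sketch of the m_extVoc idea: the share of triples whose
# predicate comes from a vocabulary outside the KG's own namespaces.
INTERNAL = ("http://dbpedia.org/ontology/", "http://dbpedia.org/property/")

def external_vocab_ratio(triples):
    predicates = [p for (_, p, _) in triples]
    external = [p for p in predicates if not p.startswith(INTERNAL)]
    return len(external) / len(predicates)

triples = [
    ("dbr:Berlin", "http://dbpedia.org/ontology/country", "dbr:Germany"),
    ("dbr:Berlin", "http://www.w3.org/2000/01/rdf-schema#label", '"Berlin"@en'),
    ("dbr:Berlin", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "dbo:City"),
    ("dbr:Berlin", "http://dbpedia.org/ontology/populationTotal", '"3520031"'),
]
print(external_vocab_ratio(triples))  # 0.5
```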

Evaluation results. DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for that fact: (1) information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals; (2) Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e.,

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.

about half) are used for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary (m_propVoc)

Evaluation method. This criterion determines the extent to which URIs of proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations such as rdfs:subPropertyOf could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results. In general, we obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We obtained the following single findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equiva-

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL reference states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

            DB    FB    OC    WD    YA
m_Deref     1     0.44  1     0.41  1
m_Avai      <1    <1    1     <1    0.73
m_SPARQL    1     0     0     1     1
m_Export    1     1     1     1     1
m_Negot     0.5   0     0     1     1
m_HTMLRDF   1     1     0     1     1
m_Meta      1     0     1     0     0

lent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org, and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked.121

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (m_Deref)

Evaluation method. We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of the triples of each KG. We submitted HTTP requests with the HTTP Accept header field set to application/rdf+xml in order to perform content negotiation.
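Such a dereferencing test can be sketched with Python's standard library. The resource URI below is only an example; a real evaluation would additionally record redirects, response bodies, and the individual error codes.

```python
import urllib.request

# Sketch of the dereferencing check: request a resource URI with an
# Accept header asking for RDF/XML. The example URI is illustrative.
def build_request(uri: str) -> urllib.request.Request:
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

def dereferenceable(uri: str, timeout: float = 10.0) -> bool:
    # Network call; a real evaluation would also inspect the returned
    # Content-Type and body, and log redirects and error codes.
    try:
        with urllib.request.urlopen(build_request(uri), timeout=timeout) as resp:
            return resp.status == 200 and "rdf" in resp.headers.get("Content-Type", "")
    except OSError:
        return False

req = build_request("http://dbpedia.org/resource/Karlsruhe")
print(req.get_header("Accept"))  # application/rdf+xml
```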

Evaluation results. In the case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfill the criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K due to the small number of unique predicates. We observed almost

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which likewise does not contain many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s", as in wdt:P1024s, is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503; e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferencable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (m_Avai)

Evaluation method. We measured the availability of the officially hosted KGs with the monitoring service Pingdom.122 For each KG, an uptime test was set up which checked the availability of the resource "Hamburg" as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200) every minute over a time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result. While the other KGs showed almost no outages and were, on average, online again after some minutes, YAGO outages took place frequently and lasted on average 3.5 hours.123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (m_SPARQL)

The SPARQL endpoints of DBpedia and YAGO are

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

provided by a Virtuoso server,124 the Wikidata SPARQL endpoint by Blazegraph.125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: the maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in the case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (m_Export)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (m_Negot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints of DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (m_HTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate"

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13
Evaluation results for the KGs regarding the dimension License

               DB  FB  OC  WD  YA
m_macLicense   1   0   0   1   0

type="[content type]" href="[URL]"> in the HTML header.
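This linking pattern can be detected, for instance, with a small HTML parser. The HTML snippet below is a constructed example of the pattern just described, not the output of a specific KG.

```python
from html.parser import HTMLParser

# Sketch: collect <link rel="alternate"> entries pointing to RDF
# serializations in a resource's HTML head. The snippet is made up.
class AlternateLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.alternates = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates.append((a.get("type"), a.get("href")))

html = '''<html><head>
<link rel="alternate" type="application/rdf+xml" href="/data/Hamburg.rdf"/>
<link rel="alternate" type="text/turtle" href="/data/Hamburg.ttl"/>
</head><body></body></html>'''
finder = AlternateLinkFinder()
finder.feed(html)
print(finder.alternates)
# [('application/rdf+xml', '/data/Hamburg.rdf'), ('text/turtle', '/data/Hamburg.ttl')]
```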

Provisioning of metadata about the KG (m_Meta)

For this criterion, we analyzed whether KG metadata is available, for instance in the form of a VoID file.126 DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (m_macLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC BY-SA128 and the GNU Free Documentation License (GNU FDL).129 Wikidata embeds licensing information in the RDF document during the dereferencing of resources by linking with cc:license to the license CC0.130 YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC BY.131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.132

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

          DB    FB    OC    WD       YA
m_Inst    0.25  0     0.38  0 (0.9)  0.31
m_URIs    0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (m_Inst)

Evaluation method. Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.

Evaluation result. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided nor is a corresponding proprietary relation available. Instead, Wikidata uses for each linked data set a proprietary relation (called "identifier") to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.
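Such identifier literals could be translated into owl:sameAs triples roughly as follows. The property wdt:P646 and the Freebase RDF namespace are real; the helper function and the sample item/ID pairing are hypothetical.

```python
# Sketch of turning Wikidata identifier literals into owl:sameAs triples,
# here for "Freebase identifier" (wdt:P646). The mapping of "/m/..." IDs
# to http://rdf.freebase.com/ns/m.... URIs follows Freebase's RDF
# namespace convention; the sample item below is illustrative.
def freebase_sameas(item_qid: str, freebase_id: str) -> str:
    # "/m/01x3gpk" -> "m.01x3gpk"
    local_name = freebase_id.strip("/").replace("/", ".")
    return ("<http://www.wikidata.org/entity/%s> "
            "<http://www.w3.org/2002/07/owl#sameAs> "
            "<http://rdf.freebase.com/ns/%s> ." % (item_qid, local_name))

print(freebase_sameas("Q1040", "/m/01x3gpk"))
```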

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we viewed each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.

only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but often HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links were excluded, YAGO would mostly contain links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc,134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method. External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
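The error categories can be sketched as a simple classification of HTTP outcomes; the function is our own illustration, with a timeout represented here as None.

```python
# Sketch of the error classification used for external-link checking:
# map an HTTP status code (or a timeout) to the categories named above.
def classify(status):
    if status is None:
        return "timeout"
    if 200 <= status < 400:          # redirects still lead to a resource
        return "ok"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"

print(classify(200), classify(404), classify(503), classify(None))
# ok client error server error timeout
```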

Evaluation result. The external links are in most cases valid for all KGs: all KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia, and Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources on wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation "reference URL" (wdt:P854), which, among other relations, states provenance information, belongs to the links linking to external

134 I.e., sw.cyc.com.
135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Web resources. Here, we were able to resolve around 95.5% without errors.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions, which do not follow a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in date values; obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBNs, have been corrected in newer versions of Wikidata; this indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBNs) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Especially good values are achieved here by Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around 13% of the statements have provenance information attached. Note, however, that not every state-

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

ment in Wikidata requires a reference, and that it is hard to evaluate which statements lack such a reference.

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made in the user interface during the insertion of new facts.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is particularly worthwhile in the case of statements which are valid only for a limited time.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard all exist in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of

each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata achieves the highest fulfillment degree for this criterion, as it is continuously updated and the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate-node mapping template is the main reason for this: by means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels in languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understand-

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1, etc., representing different engine variants.

able RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

20. Self-describing URIs: We can find mixed paradigms regarding URI generation: DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (in part; Freebase classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to the instantiation of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; we can mention as a reason the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferencable at all, as well as blank nodes. For Free-

base, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large number of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, lasting on average 3.5 hours.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query; this might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation. DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations. All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata. Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information. Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs. The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

46 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Step 1: Requirements Analysis
– Identifying the preselection criteria P
– Assigning a weight w_i to each DQ criterion c_i ∈ C

Step 2: Preselection based on the Preselection Criteria
– Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
– Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
– Calculating the fulfillment degree h(g) for each KG g ∈ G_P
– Determining the KG g with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
– Assessing the selected KG g w.r.t. qualitative aspects
– Comparing the selected KG g with other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.
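Item 33 above notes that Wikidata stores external identifiers as plain literals rather than owl:sameAs links. The following is a minimal sketch, not the paper's method, of how such identifier literals could be turned into owl:sameAs triples, assuming a hand-maintained mapping from Wikidata properties to URI templates (the property wdt:P434 and the MusicBrainz URL pattern are used purely for illustration):

```python
# Sketch: deriving owl:sameAs triples from Wikidata external-identifier
# literals. The property-to-URI-template mapping is an assumption.
FORMATTERS = {
    # assumed: wdt:P434 holds MusicBrainz artist identifiers
    "wdt:P434": "https://musicbrainz.org/artist/{}",
}

def identifier_to_sameas(subject: str, prop: str, identifier: str):
    """Return an owl:sameAs triple for an identifier literal, or None
    if no formatter URL is known for the given property."""
    template = FORMATTERS.get(prop)
    if template is None:
        return None
    return (subject, "owl:sameAs", template.format(identifier))
```

Applied to all identifier statements of an entity, this yields the equivalence links that the other KGs express directly via owl:sameAs.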

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g_1, ..., g_n}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria or general criteria and need to be selected dependent on the use case. The Timeliness frequency of the KG is an example of a quality criterion; the license under which a KG is provided (e.g., CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are neglected which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics) and, if necessary, an alternative KG can be selected for being applied in the given scenario.
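The steps above can be sketched as follows. We assume, consistent with the averages in Table 15, that the fulfillment degree h(g) is the weighted average of the DQ metric scores m_i(g); the KG names, metric names, and scores below are toy values:

```python
# Minimal sketch of Steps 2-3 of the recommendation process.
def fulfillment_degree(scores: dict, weights: dict) -> float:
    """h(g) = sum_i w_i * m_i(g) / sum_i w_i (assumed weighted average)."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight

def recommend(kgs: dict, weights: dict, preselect) -> str:
    """Drop KGs failing the preselection predicate (Step 2), then pick
    the KG with the highest fulfillment degree (Step 3)."""
    candidates = {g: s for g, s in kgs.items() if preselect(g)}
    return max(candidates, key=lambda g: fulfillment_degree(candidates[g], weights))

# Toy example with two criteria; KG "b" is excluded by preselection.
kgs = {"a": {"mFreq": 0.5, "mcPop": 0.9},
       "b": {"mFreq": 0.0, "mcPop": 1.0},
       "c": {"mFreq": 1.0, "mcPop": 0.8}}
weights = {"mFreq": 3, "mcPop": 3}
best = recommend(kgs, weights, preselect=lambda g: g != "b")
```

Step 4, the qualitative assessment, remains a manual activity and is not captured by the sketch.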

Use case application. In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case. The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. For being able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG Recommendation Framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.¹³⁸

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative Assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

138 We assume that in this use case the dereferencing of HTTP URIs rather than the execution of SPARQL queries is desired.


Table 15. Framework with an example weighting which would be reasonable for a user setting as given in [33].

Dimension               Metric          DBpedia   Freebase   OpenCyc   Wikidata   YAGO     Example of User Weighting w_i
Accuracy                m_synRDF        1         1          1         1          1        1
                        m_synLit        0.994     1          1         1          0.624    1
                        m_semTriple     0.990     0.995      1         0.993      0.993    1
Trustworthiness         m_graph         0.5       0.5        1         0.75       0.25     0
                        m_fact          0.5       1          0         1          1        1
                        m_NoVal         0         1          0         1          0        0
Consistency             m_checkRestr    0         1          0         1          0        0
                        m_conClass      0.875     1          0.999     1          0.333    0
                        m_conRelat      0.992     0.451      1         0.500      0.992    0
Relevancy               m_Ranking       0         1          0         1          0        1
Completeness            m_cSchema       0.905     0.762      0.921     1          0.952    1
                        m_cCol          0.402     0.425      0         0.285      0.332    2
                        m_cPop          0.93      0.94       0.48      0.99       0.89     3
Timeliness              m_Freq          0.5       0          0.25      1          0.25     3
                        m_Validity      0         1          0         1          1        0
                        m_Change        0         1          0         0          0        0
Ease of understanding   m_Descr         0.704     0.972      1         0.9999     1        1
                        m_Lang          1         1          0         1          1        0
                        m_uSer          1         1          0         1          1        0
                        m_uURI          1         0.5        1         0          1        1
Interoperability        m_Reif          0.5       0.5        0.5       0          0.5      0
                        m_iSerial       1         0          0.5       1          1        1
                        m_extVoc        0.61      0.108      0.415     0.682      0.134    1
                        m_propVoc       0.150     0          0.513     0.001      0        1
Accessibility           m_Deref         1         0.437      1         0.414      1        2
                        m_Avai          0.9961    0.9998     1         0.9999     0.7306   2
                        m_SPARQL        1         0          0         1          1        1
                        m_Export        1         1          1         1          1        0
                        m_Negot         0.5       0          0         1          1        0
                        m_HTMLRDF       1         1          0         1          1        0
                        m_Meta          1         0          1         0          0        0
Licensing               m_macLicense    1         0          0         1          0        0
Interlinking            m_Inst          0.251     0          0.382     0          0.310    3
                        m_URIs          0.929     0.908      0.894     0.957      0.956    1

Unweighted Average                      0.683     0.603      0.496     0.752      0.625
Weighted Average                        0.701     0.493      0.556     0.714      0.648


4. Qualitative Assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately.
The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in the combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions and extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as a collective term for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (m_graph), Indicating unknown and empty values (m_NoVal), Check of schema restrictions during insertion of new statements (m_checkRestr), Creating a ranking of statements (m_Ranking), Timeliness frequency of the KG (m_Freq), Specification of the validity period of statements (m_Validity), and Availability of the KG (m_Avai), have not been proposed so far, to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (m_Descr) and Column completeness (m_cCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, “Do you refer to additional access methods?” leads to the criteria Provisioning of public SPARQL endpoint (m_SPARQL) and Provisioning of an RDF export (m_Export). Also, “Do you map proprietary vocabulary terms to other vocabularies?” leads to the criterion Interoperability of proprietary vocabulary (m_propVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16. Overview of related work regarding data quality criteria for KGs.

DQ Metric vs. related work [40] [45] [29] [26] [20] [22] [30] [48] [2] [34] (an X marks that the metric appears in the respective work):

m_synRDF: X X
m_synLit: X X X X
m_semTriple: X X X X
m_fact: X X
m_conClass: X X X
m_conRelat: X X X X X X
m_cSchema: X X
m_cCol: X X X X
m_cPop: X X
m_Change: X X
m_Descr: X X X X
m_Lang: X
m_uSer: X
m_uURI: X
m_Reif: X X X
m_iSerial: X
m_extVoc: X X
m_propVoc: X
m_Deref: X X X X
m_SPARQL: X
m_Export: X X
m_Negot: X X X
m_HTMLRDF: X
m_Meta: X X X
m_macLicense: X X X
m_Inst: X X X
m_URIs: X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (m_Lang) and Validity of external URIs (m_URIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that introduces criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.
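Such key figures can be obtained with simple aggregation queries against a public endpoint rather than by parsing dumps; the following sketch only builds the request URL (the endpoint and the `format` parameter are assumptions, and actually issuing the HTTP request is left to the caller, e.g. via urllib):

```python
# Sketch: counting instances per class via SPARQL instead of dump parsing.
from urllib.parse import urlencode

COUNT_INSTANCES_PER_CLASS = """
SELECT ?class (COUNT(?s) AS ?cnt)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?cnt)
"""

def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

url = build_request_url("https://dbpedia.org/sparql", COUNT_INSTANCES_PER_CLASS)
```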

Tartir et al. [45] introduce with the system OntoQA metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on the instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both the schema and the instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and their subclasses. In our case, we cannot use this approach, since Freebase has no hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances as a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverages of the KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
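The described deduplication can be sketched as follows, with an assumed hand-crafted class-to-domain mapping; an instance typed as both dbo:Place and dbo:PopulatedPlace is counted once in the domain geography:

```python
# Sketch of domain-coverage counting with per-domain deduplication.
CLASS_TO_DOMAIN = {  # assumed, manually curated mapping
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_counts(instance_classes: dict) -> dict:
    """instance_classes maps each instance to the set of its classes."""
    counts = {}
    for instance, classes in instance_classes.items():
        domains = {CLASS_TO_DOMAIN[c] for c in classes if c in CLASS_TO_DOMAIN}
        for d in domains:  # a set, so each domain is counted once per instance
            counts[d] = counts.get(d, 0) + 1
    return counts

counts = domain_counts({
    "dbr:Berlin": {"dbo:Place", "dbo:PopulatedPlace"},
    "dbr:Elvis_Presley": {"dbo:MusicalArtist"},
})
```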

8. Conclusion

Freely available knowledge graphs (KGs) have not been in the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-02-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma Thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. Accessed July 20, 2015.

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737. Springer, Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355. Springer, Berlin Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. Retrieved on Aug 29, 2016.

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.



m_fact(g) =
  1,   if provenance on statement level is used
  0.5, if provenance on resource level is used
  0,   otherwise

Indicating unknown and empty values. If the data model of the considered KG supports the representation of unknown and empty values, more complex statements can be represented. For instance, empty values allow representing that a person has no children, and unknown values allow representing that the birth date of a person is not known. This kind of higher explanatory power of a KG increases the trustworthiness of the KG.

m_NoVal(g) =
  1,   if unknown and empty values are used
  0.5, if either unknown or empty values are used
  0,   otherwise

3.2.3. Consistency

Definition of dimension. Consistency implies that “two or more values [in a dataset] do not conflict each other” [37].

Discussion. Due to the high variety of data providers in the Web of Data, a user must expect data inconsistencies. Data inconsistencies may be caused by (i) different information providers, (ii) different levels of knowledge, and (iii) different views of the world [11].

In OWL, restrictions can be introduced to ensure consistent modeling of knowledge to some degree. The OWL schema restrictions can be divided into class restrictions and relation restrictions [7].

Class restrictions refer to classes. For instance, one can specify via owl:disjointWith that two classes have no common instance.

Relation restrictions refer to the usage of relations. They can be classified into value constraints and cardinality constraints.

Value constraints determine the range of relations. owl:someValuesFrom, for instance, specifies that at least one value of a relation belongs to a certain class. If the expected data type of a relation is specified via rdfs:range, we also consider this as a relation restriction.

Cardinality constraints limit the number of times a relation may exist per resource. Via owl:FunctionalProperty and owl:InverseFunctionalProperty, global cardinality constraints can be specified. Functional relations permit at most one value per resource (e.g., the birth date of a person). Inverse functional relations specify that a value should only occur once per resource; this means that the subject is the only resource linked to the given object via the given relation.

Definition of metric. We can measure the data quality dimension Consistency by means of (i) whether schema constraints are checked during the insertion of new statements into the KG and (ii) whether already existing statements in the KG are consistent with the specified class and relation constraints. The fulfillment degree of a KG g w.r.t. the dimension Consistency is measured by the metrics m_checkRestr, m_conClass, and m_conRelat, which are defined as follows.

Check of schema restrictions during insertion of new statements. Checking the schema restrictions during the insertion of new statements can help to reject facts that would render the KG inconsistent. Such simple checks are often done on the client side, in the user interface; for instance, the application checks whether data with the right data type is inserted. Due to the dependency on the actually inserted data, the check needs to be custom-designed. Simple rules are applicable; however, inconsistencies can still appear if no suitable rules are available. Examples of consistency checks are: checking the expected data types of literals; checking whether the entity to be inserted has a valid entity type (i.e., checking the rdf:type relation); and checking whether the assigned classes of the entity are disjoint, i.e., contradicting each other (utilizing owl:disjointWith relations).

m_checkRestr(g) =
  1, if schema restrictions are checked
  0, otherwise

Consistency of statements w.r.t. class constraints. This metric is intended to measure the degree to which the instance data is consistent with the class restrictions (e.g., owl:disjointWith) specified on the schema level.

In the following, we limit ourselves to the class constraints given by all owl:disjointWith statements defined on the schema level of the considered KG. I.e., let CC be the set of all class constraints, defined as CC = {(c1, c2) | (c1, owl:disjointWith, c2) ∈ g}.¹⁸ Furthermore, let cg(e) be the set of all classes of instance e in g, defined as cg(e) = {c | (e, rdf:type, c) ∈ g}. Then we define m_conClass(g) as follows:

m_conClass(g) = |{(c1, c2) ∈ CC | ¬∃e: (c1 ∈ cg(e) ∧ c2 ∈ cg(e))}| / |CC|

In case of an empty set of class constraints CC, the metric should evaluate to 1.
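A sketch of computing m_conClass over a KG represented as a set of triples (the prefixed names are illustrative stand-ins for full URIs):

```python
# Sketch: fraction of owl:disjointWith constraints with no violating instance.
def m_con_class(g: set) -> float:
    cc = {(c1, c2) for (c1, p, c2) in g if p == "owl:disjointWith"}
    if not cc:
        return 1.0  # empty set of class constraints
    classes = {}  # cg(e): the classes of each instance e
    for (s, p, o) in g:
        if p == "rdf:type":
            classes.setdefault(s, set()).add(o)
    satisfied = sum(
        1 for (c1, c2) in cc
        if not any(c1 in cls and c2 in cls for cls in classes.values())
    )
    return satisfied / len(cc)

g = {
    ("ex:Person", "owl:disjointWith", "ex:Place"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:berlin", "rdf:type", "ex:Place"),
}
```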

Consistency of statements w.r.t. relation constraints. The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_conRelat(g) = (1/n) Σ_{i=1}^{n} m_conRelat_i(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,19 we can state:

m_conRelat(g) = (m_conRelatRg(g) + m_conRelatFct(g)) / 2

Let Rr be the set of all rdfs:range constraints:

Rr = {(p, d) | (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}

18 Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

19 We chose those relations (and, for instance, not owl:InverseFunctionalProperty), as only those relations are used by more than half of the considered KGs.

and let Rf be the set of all owl:FunctionalProperty constraints:

Rf = {(p, d) | (p, rdf:type, owl:FunctionalProperty) ∈ g ∧ (p, rdfs:range, d) ∈ g ∧ isDatatype(d)}

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

m_conRelatRg(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr: datatype(o) = d}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rr}|

m_conRelatFct(g) = |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf ∧ ¬∃(s, p, o2) ∈ g: o ≠ o2}| / |{(s, p, o) ∈ g | ∃(p, d) ∈ Rf}|

In case of an empty set of relation constraints (Rr or Rf), the respective metric should evaluate to 1.
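Both relation-constraint checks can be sketched over the same triple-set model. This is an illustrative assumption-laden sketch, not the paper's evaluation code: literals are carried as (value, datatype) pairs so that datatype(o) is available, and the set of functional properties is passed in by hand.

```python
# Sketch (assumed data model, not the paper's): literals are (value, datatype)
# pairs; functional properties are supplied explicitly.
RDFS_RANGE = "rdfs:range"

def m_con_relat_rg(g, is_datatype):
    """Share of affected triples whose literal datatype matches rdfs:range."""
    rr = {p: d for (p, q, d) in g if q == RDFS_RANGE and is_datatype(d)}
    scoped = [(s, p, o) for (s, p, o) in g if p in rr]
    if not scoped:
        return 1.0                       # no applicable constraints
    ok = sum(1 for (s, p, (val, dt)) in scoped if dt == rr[p])
    return ok / len(scoped)

def m_con_relat_fct(g, functional_props):
    """Share of affected triples whose subject has no second, differing value."""
    scoped = [(s, p, o) for (s, p, o) in g if p in functional_props]
    if not scoped:
        return 1.0
    ok = sum(1 for (s, p, o) in scoped
             if not any(s2 == s and p2 == p and o2 != o
                        for (s2, p2, o2) in g))
    return ok / len(scoped)

g_rg = {("dbo:birthDate", RDFS_RANGE, "xsd:date"),
        (":Obama", "dbo:birthDate", ("1961-08-04", "xsd:date")),
        (":Merkel", "dbo:birthDate", ("1954", "xsd:gYear"))}  # wrong datatype
# m_con_relat_rg(g_rg, lambda d: d.startswith("xsd:")) == 0.5
```

A functional property with two differing values for the same subject lowers m_conRelatFct accordingly, and an empty constraint set evaluates to 1 in both metrics, as stated above.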

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension. Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion. According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension Relevancy is determined by the criterion Creating a ranking of statements.20 The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

20 We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions, which he holds no more, are ranked as normal rank (wdo:NormalRank).

m_Ranking(g) =
  1  if a ranking of statements is supported
  0  otherwise

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].
– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;
2. Column completeness, i.e., the extent to which values of relations on the instance level – i.e., facts – are not missing; and
3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. Completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness. The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_cSchema(g) = noclat_g / noclat

Column completeness. In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of relations used for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on the instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1/|H|) Σ_{(k,p) ∈ H} no_kp / no_k

We thereby let H = {(k, p) ∈ (K × P) | k ∈ Cg ∧ ∃(x, p, o) ∈ g: p ∈ P^imp_g ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k1, ..., kn} and considered relations P = {p1, ..., pm}.

Note that there are also relations which are dedicated to the instances of a specific class but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring Column completeness, we selected only those relations for assessment for which a value typically exists for all given instances.
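The averaging over class-relation pairs can be sketched as follows. This is our own illustration: the considered classes K and relations P are supplied by hand (the paper uses a curated selection), and the KG is again a set of string triples.

```python
# Sketch: Column completeness with the considered classes K and relations P
# supplied by hand (the paper uses a curated selection; ours is illustrative).
RDF_TYPE = "rdf:type"

def m_c_col(g, considered_classes, considered_relations):
    instances_of = {}                       # class k -> its instances
    for (s, p, o) in g:
        if p == RDF_TYPE and o in considered_classes:
            instances_of.setdefault(o, set()).add(s)
    ratios = []                             # one no_kp / no_k per pair in H
    for k, members in instances_of.items():
        # only class-relation pairs that occur on the instance level
        used = {p for (s, p, o) in g
                if s in members and p in considered_relations}
        for p in used:
            no_kp = sum(1 for m in members
                        if any(s == m and q == p for (s, q, o) in g))
            ratios.append(no_kp / len(members))
    return sum(ratios) / len(ratios) if ratios else 1.0

g = {(":a", RDF_TYPE, "dbo:Person"),
     (":b", RDF_TYPE, "dbo:Person"),
     (":a", "dbo:birthDate", "1961-08-04")}
# m_c_col(g, {"dbo:Person"}, {"dbo:birthDate"}) == 0.5
```

Only one of the two dbo:Person instances has a birth date, so the single pair in H contributes 1/2 and the metric evaluates to 0.5.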

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|
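Since both the gold standard and the KG's entity set are plain sets, the metric reduces to a set intersection. A minimal sketch (our own, with made-up entity identifiers):

```python
# Minimal sketch: population completeness against a gold standard GS,
# with both sets given as Python sets of entity identifiers.
def m_c_pop(kg_entities, gold_standard):
    if not gold_standard:
        return 1.0                  # nothing required, nothing missing
    return len(gold_standard & kg_entities) / len(gold_standard)

# e.g. a gold standard of three cities, two of which the KG covers -> 2/3
```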

3.3.3. Timeliness
Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion. Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

21 For an evaluation of predicting which relations are of this nature, see [1].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements. The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately but the RDF export files are available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1     if continuous updates
  0.5   if discrete periodic updates
  0.25  if discrete non-periodic updates
  0     otherwise

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start and possibly end dates of statements by means of providing suitable forms of representation.

m_Validity(g) =
  1  if the specification of validity periods is supported
  0  otherwise

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1  if the specification of modification dates for statements is supported
  0  otherwise


3.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here, a KG) can be improved by means such as descriptive labels and literals in multiple languages.

Definition of metric. The dimension understandability is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. the dimension Ease of understanding is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

m_Descr(g) = |{u | u ∈ U^local_g ∧ ∃(u, p, o) ∈ g: p ∈ P_lDesc}| / |{u | u ∈ U^local_g}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).

Moreover, the result of the evaluation on the basis of entities is noteworthy: DBpedia deviates considerably, since some entities (created via intermediate node mapping) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the same namespace), but perform the evaluation on the basis of entities only.
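A sketch of m_Descr over the triple-set model (our illustration, not the authors' implementation): "local" resources are approximated by a shared namespace prefix, and the relation set P_lDesc is hard-coded.

```python
# Illustrative sketch (not the authors' implementation): m_Descr over a KG
# modeled as a set of (s, p, o) string triples; "local" resources are
# approximated by a shared namespace prefix.
LABEL_OR_DESC = {"rdfs:label", "rdfs:comment", "schema:description"}

def m_descr(g, local_prefix):
    local = {s for (s, p, o) in g if s.startswith(local_prefix)}
    local |= {o for (s, p, o) in g if o.startswith(local_prefix)}
    if not local:
        return 1.0
    described = {s for (s, p, o) in g if p in LABEL_OR_DESC and s in local}
    return len(described) / len(local)

g = {
    ("dbp:Berlin", "rdfs:label", "Berlin"),
    ("dbp:Berlin", "rdf:type", "dbp:City"),
    ("dbp:IntermediateNode1", "rdf:type", "dbp:City"),  # unlabeled resource
}
# m_descr(g, "dbp:") -> 1/3: only one of three local resources has a label
```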

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

m_Lang(g) =
  1  if labels are provided in English and at least one other language
  0  otherwise

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion via the serialization formats supported during the dereferencing of resources.

m_uSer(h_g) =
  1  if RDF serializations other than RDF/XML are available
  0  otherwise

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and to remember for humans compared to opaque URIs24 such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources.

22 Using the namespace http://www.w3.org/2004/02/skos/core#.
23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.
24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

m_uURI(g) =
  1    if self-describing URIs are always used
  0.5  if self-describing URIs are partly used
  0    otherwise

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].
– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].
– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability. In contrast to the dimension understandability, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration of whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of one's own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. Those features should be avoided, according to Heath et al., in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may get complicated if RDF reification, RDF collections, and RDF containers are used. We agree on that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, online available at http://www.w3.org/TR/rdf-schema/, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered as ambivalent. On the one hand, these RDF features are not very common, and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used:

m_Reif(g) =
  1    if neither blank nodes nor RDF reification are used
  0.5  if either blank nodes or RDF reification are used
  0    otherwise

Provisioning of several serialization formats. The interpretability of RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing:

m_iSerial(h_g) =
  1    if RDF/XML and further formats are supported
  0.5  if only RDF/XML is supported
  0    otherwise

Using external vocabulary. Using common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows comfortable data integration. We measure the criterion of using external vocabulary by relating the number of triples with external vocabulary in the predicate position to the number of all triples in the KG:

m_extVoc(g) = |{(s, p, o) ∈ g | p ∈ P^external_g}| / |{(s, p, o) ∈ g}|
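A sketch of this ratio over the triple-set model. Note the simplifying assumption: we approximate P^external_g by the predicate's namespace prefix, which is not how the paper determines external vocabulary.

```python
# Sketch: m_extVoc with predicates as prefixed strings; "proprietary"
# predicates are assumed to share the KG's own prefix (a simplifying
# assumption; P_external_g is not computed this way in the paper).
def m_ext_voc(g, local_prefix):
    if not g:
        return 0.0
    external = [t for t in g if not t[1].startswith(local_prefix)]
    return len(external) / len(g)

g = {(":Obama", "dbo:birthPlace", ":Honolulu"),
     (":Obama", "rdf:type", "dbo:Person"),
     (":Obama", "owl:sameAs", "wd:Q76"),
     (":Obama", "dbo:party", ":Democrats")}
# two of four predicates (rdf:type, owl:sameAs) are external -> 0.5
```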

Interoperability of proprietary vocabulary. Linking on the schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on the schema level by calculating the ratio to which classes and relations have at least one equivalence link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources:

m_propVoc(g) = |{x ∈ Pg ∪ Cg | ∃(x, p, o) ∈ g: p ∈ P_eq ∧ o ∈ U^ext_g}| / |Pg ∪ Cg|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass} and U^ext_g consists of all URIs in Ug which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account. In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility
Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. The availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different factors influencing the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this data quality dimension.
2. The response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.
3. In the context of Linked Data, data requests can be made (i) via SPARQL endpoints, (ii) via RDF dumps (export files), and (iii) via Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:

– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of a public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here, all URIs Ug), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned:

m_Deref(h_g) = |dereferencable(Ug)| / |Ug|
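The per-URI decision can be sketched as follows. This is our own illustration: the actual HTTP fetching (a GET request with an RDF media type in the Accept header) is left out so the logic stays self-contained, and the set of accepted RDF media types is our assumption, not taken from the paper.

```python
# Sketch: m_Deref from already-collected probe results. Fetching each URI
# is omitted; RDF_CONTENT_TYPES is our assumption, not from the paper.
RDF_CONTENT_TYPES = {"application/rdf+xml", "text/turtle",
                     "application/n-triples"}

def is_dereferenceable(status_code, content_type):
    """A URI counts as dereferenced if the server answers 200 with RDF."""
    media_type = content_type.split(";")[0].strip().lower()
    return status_code == 200 and media_type in RDF_CONTENT_TYPES

def m_deref(probes):
    """probes: one (status_code, content_type) pair per sampled URI."""
    if not probes:
        return 0.0
    ok = sum(1 for sc, ct in probes if is_dereferenceable(sc, ct))
    return ok / len(probes)
```

For instance, a sample where only one of three URIs answers 200 with an RDF media type yields m_Deref = 1/3.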

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom:26

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of a public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially involving many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however, we do not measure these restrictions here:

m_SPARQL(h_g) =
  1  if a SPARQL endpoint is publicly available
  0  otherwise

26 See http://pingdom.com, requested on Mar 1, 2016.

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user will not use it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available:

m_Export(h_g) =
  1  if an RDF export is available
  0  otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content types are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type; this may lead to serialized RDF data not being processed further. An example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as the desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response:

m_Negot(h_g) =
  1    if CN is supported and correct content types are returned
  0.5  if CN is supported but wrong content types are returned
  0    otherwise
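The comparison underlying one CN probe can be sketched as follows (our illustration; the HTTP request itself, with an RDF media type in the Accept header, is omitted):

```python
# Sketch of the scoring logic for one content-negotiation probe.
def negotiation_score(accept, returned_content_type):
    """accept: requested media type; returned_content_type: the response's
    Content-Type header, or None if the server did not negotiate at all."""
    if returned_content_type is None:
        return 0.0                                   # CN not supported
    media_type = returned_content_type.split(";")[0].strip().lower()
    return 1.0 if media_type == accept else 0.5      # wrong type -> 0.5
```

For instance, a server answering a text/turtle request with a Content-Type of text/plain would score 0.5, reflecting the mis-declared RDF data discussed above.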

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of corresponding RDF data easier (for Linked Data-aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described:

m_HTMLRDF(h_g) =
  1  if the Autodiscovery pattern is used at least once
  0  otherwise

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later on in the data quality dimension License:

m_Meta(g) =
  1  if machine-readable metadata about g is available
  0  otherwise

3.5.2. License
Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires in addition that, if the data is published, it is published under the same legal conditions. CC032 defines the respective data as public domain and without any restrictions.

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.
28 See the namespace http://www.w3.org/TR/void/.
29 See http://creativecommons.org, requested on Mar 1, 2016.

It is noteworthy that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies refer to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information. The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26]. Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format:

m_macLicense(g) =
  1  if machine-readable licensing information is available
  0  otherwise

3.5.3. Interlinking
Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [49].

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.
31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.
32 See https://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.
33 Using the namespace http://creativecommons.org/ns#.

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking on the instance level is usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries at different granularities. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35 (ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level^38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

34 See http://www.geonames.org, requested on Dec 31, 2016.
35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.
36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.
37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.
38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

$$ m_{Inst}(g) = \frac{|\{x \in I_g \setminus (P_g \cup C_g) \mid \exists\, (x, \texttt{owl:sameAs}, y) \in g \wedge y \in U_g^{ext}\}|}{|I_g \setminus (P_g \cup C_g)|} $$
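A minimal sketch of this metric, under the assumption that the instance set, the schema terms (relations and classes), and the externality test are given; in the real evaluation these would be derived from the KG itself:

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def m_inst(instances, schema_terms, triples, is_external):
    # Candidates: instances that are neither relations nor classes,
    # i.e. I_g \ (P_g ∪ C_g) in the definition above.
    candidates = instances - schema_terms
    if not candidates:
        return 1.0
    # Candidates with at least one owl:sameAs link to an external URI.
    linked = {s for (s, p, o) in triples
              if p == OWL_SAME_AS and s in candidates and is_external(o)}
    return len(linked) / len(candidates)

# Toy example: one of two entities is interlinked with GeoNames.
triples = {("dbo:Berlin", OWL_SAME_AS, "http://sws.geonames.org/2950159/")}
score = m_inst({"dbo:Berlin", "dbo:Karlsruhe"}, set(), triples,
               lambda uri: uri.startswith("http://sws.geonames.org/"))
print(score)  # 0.5
```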

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs are not available anymore. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

$$ m_{URIs}(g) = \frac{|\{x \in A \mid resolvable(x)\}|}{|A|} $$

where $A = \{y \mid \exists\, (x, p, y) \in g : p \in P_{eq} \wedge x \in U_g \setminus (C_g \cup P_g) \wedge x \in U_g^{local} \wedge y \in U_g^{ext}\}$, and $resolvable(x)$ returns true if HTTP status code 200 is returned. $P_{eq}$ is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric should evaluate to 1.
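The metric can be sketched as follows; the HTTP check is injected as a callable so the computation is separated from network access. The urllib-based default resolver is a plausible but untested assumption, and the sample set A is assumed to be given.

```python
import urllib.request

def http_resolvable(uri, timeout=10):
    # True iff the URI answers with HTTP 200; timeouts and 4xx/5xx
    # responses (which urllib raises as exceptions) count as invalid.
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def m_uris(sample_uris, resolvable=http_resolvable):
    # sample_uris corresponds to the set A in the definition above.
    if not sample_uris:
        return 1.0  # an empty A evaluates to 1 by definition
    return sum(1 for u in sample_uris if resolvable(u)) / len(sample_uris)

# Offline toy run with a stubbed resolver: one of two URIs resolves.
status = {"http://example.org/ok": True, "http://example.org/gone": False}
print(m_uris(set(status), resolvable=lambda u: status[u]))  # 0.5
print(m_uris(set()))  # 1.0
```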

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category
  ∗ Accuracy
    ∗ Syntactic validity of RDF documents
    ∗ Syntactic validity of literals
    ∗ Semantic validity of triples
  ∗ Trustworthiness
    ∗ Trustworthiness on KG level
    ∗ Trustworthiness on statement level
    ∗ Using unknown and empty values
  ∗ Consistency
    ∗ Check of schema restrictions during insertion of new statements
    ∗ Consistency of statements w.r.t. class constraints
    ∗ Consistency of statements w.r.t. relation constraints

– Contextual category
  ∗ Relevancy
    ∗ Creating a ranking of statements
  ∗ Completeness
    ∗ Schema completeness
    ∗ Column completeness
    ∗ Population completeness
  ∗ Timeliness
    ∗ Timeliness frequency of the KG
    ∗ Specification of the validity period of statements
    ∗ Specification of the modification date of statements

– Representational data quality
  ∗ Ease of understanding
    ∗ Description of resources
    ∗ Labels in multiple languages
    ∗ Understandable RDF serialization
    ∗ Self-describing URIs
  ∗ Interoperability
    ∗ Avoiding blank nodes and RDF reification
    ∗ Provisioning of several serialization formats
    ∗ Using external vocabulary
    ∗ Interoperability of proprietary vocabulary

– Accessibility category
  ∗ Accessibility
    ∗ Dereferencing possibility of resources
    ∗ Availability of the KG
    ∗ Provisioning of public SPARQL endpoint
    ∗ Provisioning of an RDF export
    ∗ Support of content negotiation
    ∗ Linking HTML sites to RDF serializations
    ∗ Provisioning of KG metadata
  ∗ License
    ∗ Provisioning machine-readable licensing information
  ∗ Interlinking
    ∗ Interlinking via owl:sameAs
    ∗ Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia^39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia is updated roughly once a year.^40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,^41 GeoNames, Musicbrainz,^42 CIA World Factbook,^43 DBLP,^44 Project Gutenberg,^45 DBtune Jamendo,^46 Eurostat,^47 Uniprot,^48 and Bio2RDF.^49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39 See http://dbpedia.org, requested on Nov 1, 2016.
40 There is also DBpedia live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia live provides data for each hour, for other time ranges DBpedia live data is only available once a month.
41 See http://umbel.org, requested on Dec 31, 2016.
42 See http://musicbrainz.org, requested on Dec 31, 2016.
43 See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.
44 See http://www.dblp.org, requested on Dec 31, 2016.
45 See https://www.gutenberg.org, requested on Dec 31, 2016.
46 See http://dbtune.org/jamendo, requested on Dec 31, 2016.
47 See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.
48 See http://www.uniprot.org, requested on Dec 31, 2016.
49 See http://bio2rdf.org, requested on Dec 31, 2016.
50 See a complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase^51 is a KG announced by Metaweb Technologies, Inc. in 2007 and was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,^52 FMD,^53 and MusicBrainz.^54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.^55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc^56 project started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store – in a machine-processable way – millions of common sense facts such as "Every tree is a plant". The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc^57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc^58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata^59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including

51 See http://freebase.com, requested on Nov 1, 2016.
52 See http://www.nndb.com, requested on Dec 31, 2016.
53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
54 See http://musicbrainz.org, requested on Dec 31, 2016.
55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
56 See http://www.cyc.com, requested on Dec 31, 2016.
57 See http://www.opencyc.org, accessed on Nov 1, 2016.
58 See http://researchcyc.com, requested on Dec 31, 2016.
59 See http://wikidata.org, accessed on Nov 1, 2016.

Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

– YAGO: YAGO^60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.^61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.
61 See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples
Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets such as MusicBrainz have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form; in particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefore.

Influence of reification on the number of triples.

DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.^62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identified. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

high number of unique subjects concerning the set of all triples.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.^63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.

5.1.2. Classes
Methods for counting classes. The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.^64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but uses instead only "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.
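The two counting methods can be sketched as follows; the abbreviated IRIs and the toy triples are assumptions made for readability:

```python
RDF_TYPE = "rdf:type"
CLASS_TYPES = {"owl:Class", "rdfs:Class"}
SUBCLASS_OF = "rdfs:subClassOf"   # "subclass of" (wdt:P279) in Wikidata

def classes_by_typing(triples):
    # Method 1: terms explicitly typed as owl:Class or rdfs:Class.
    return {s for (s, p, o) in triples if p == RDF_TYPE and o in CLASS_TYPES}

def classes_by_hierarchy(triples):
    # Method 2: every subject or object of a subclass-of triple.
    return {t for (s, p, o) in triples if p == SUBCLASS_OF for t in (s, o)}

triples = {
    ("ex:Person", RDF_TYPE, "owl:Class"),
    ("ex:Artist", SUBCLASS_OF, "ex:Person"),  # ex:Artist is never typed
}
print(classes_by_typing(triples))             # {'ex:Person'}
print(sorted(classes_by_hierarchy(triples)))  # ['ex:Artist', 'ex:Person']
```

The toy example shows why the two methods can disagree: a class that only occurs in the hierarchy, but is never explicitly typed, is counted by the second method only.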

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does it come to this gap between DBpedia and YAGO with respect to the number of classes, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).
64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and "instance of" (wdt:P31) in case of Wikidata) on the instance level into account. However, this would result only in a lower-bound estimation, as classes which have no instances would not be considered.


Fig. 1. Coverage of classes having at least one instance (coverage in percent for DBpedia, Freebase, OpenCyc, Wikidata, and YAGO).

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes, which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                  DB    FB    OC    WD    YA
Reach of method   88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.^65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.^66 If the ratio were at 100%, we would be able to assign all entities of a KG to the chosen domains.
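This ratio can be sketched as follows; the function and variable names, as well as the toy data, are ours:

```python
def domain_reach(entities_per_domain, all_entities):
    # Union rather than sum, since an entity may belong to several
    # domains at the same time (see footnote 66).
    covered = set().union(*entities_per_domain.values())
    return len(covered) / len(all_entities)

# Toy example: entity "b" is both a person and a media entity, so it is
# counted once; entity "d" belongs to no considered domain.
reach = domain_reach(
    {"people": {"a", "b"}, "media": {"b", "c"}},
    all_entities={"a", "b", "c", "d"},
)
print(reach)  # 0.75
```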

Fig. 3 shows the number of entities per domain in the different KGs with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).
66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG (log-log plot of classes vs. number of instances for DBpedia, Freebase, OpenCyc, Wikidata, and YAGO).

Fig. 3. Number of entities per domain (persons, media, organizations, geography, biology; logarithmic scale) for DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% would mean that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain in percent (persons, media, organizations, geography, biology) for DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organizations. This is relatively few, considering the total amount of entities being around 18.7M and considering the number of organizations in other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times,^67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organizations belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as a short term for explicitly defined relations – refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.

are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.

– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P^imp_g, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
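The distinction can be made concrete with a small sketch; the toy triples and the abbreviated IRIs are assumptions for illustration:

```python
def declared_relations(triples):
    # P_g: links explicitly declared on the schema level,
    # e.g. via rdf:type rdf:Property.
    return {s for (s, p, o) in triples
            if p == "rdf:type" and o == "rdf:Property"}

def used_predicates(triples):
    # P_g^imp: all unique RDF terms on the predicate position.
    return {p for (_s, p, _o) in triples}

triples = {
    ("ex:knows", "rdf:type", "rdf:Property"),  # declared, never used
    ("ex:Alice", "ex:worksFor", "ex:KIT"),     # used, never declared
}
print(declared_relations(triples))  # {'ex:knows'}
print(used_predicates(triples))     # {'rdf:type', 'ex:worksFor'}
```

As the toy data shows, the two sets can diverge in both directions: a declared relation may never occur as a predicate, and a used predicate may never be declared.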

Evaluation results:

Relations. Ranking regarding relations. As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.^68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments; therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned^69 and may overlap until DBpedia version 2016-04.^70

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as it was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.
69 For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.
70 For instance, dbp:alias and dbo:alias.

criteria are met.^71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process can be mentioned as a likely reason why in Wikidata, in relative terms, more relations are actually used than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.
2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.
3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.
4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.
5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates. Ranking regarding predicates. Freebase is here – like in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows.

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates varies considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the

studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once, which relativizes the high number. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 164 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) by an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows to refer to a value (in Wikidata terminology). Besides those extensions, there is the "r" extension to refer to a reference, and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].
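The expansion into an n-ary relation can be sketched as follows; the statement-node naming scheme used here is a simplification of Wikidata's actual identifiers:

```python
def reify_nary(subject, relation, value, stmt_id):
    # Expand (subject, relation, value) into the n-ary form shown above:
    # the "s" extension points to the intermediate statement node,
    # the "v" extension points from that node to the value.
    stmt_node = f"{subject}S{stmt_id}"  # simplified statement-node IRI
    return [
        (subject, relation + "s", stmt_node),
        (stmt_node, relation + "v", value),
    ]

print(reify_nary("wdt:Q76", "wdt:P31", "wdt:Q5", "123"))
# [('wdt:Q76', 'wdt:P31s', 'wdt:Q76S123'),
#  ('wdt:Q76S123', 'wdt:P31v', 'wdt:Q5')]
```

The sketch also makes visible why such modeling inflates the predicate count: every relation spawns one predicate per extension actually used.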

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,72 while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities

Evaluation method: We distinguish between instances Ig and entities Eg of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG (x-axis: DBpedia, Freebase, OpenCyc, Wikidata, YAGO; y-axis: number of instances, logarithmic scale from 10^0 to 10^9).

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky: in DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata they are instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.73 In this way, abstract classes such as cych:ExistingObjectType are neglected.
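The distinction between instances and entities can be sketched as follows, assuming a toy set of rdf:type triples and the KG-specific "entity" classes named above (the data and helper names are hypothetical):

```python
# Sketch: separating entities from instances via KG-specific entity
# classes, e.g. owl:Thing for DBpedia/YAGO, freebase:common.topic for
# Freebase, wdo:Item for Wikidata.

ENTITY_CLASS = {
    "dbpedia": "owl:Thing",
    "freebase": "freebase:common.topic",
    "wikidata": "wdo:Item",
}

def instances_and_entities(triples, kg):
    # instances: subjects of all class-affiliation triples
    instances = {s for s, p, o in triples if p == "rdf:type"}
    # entities: the subset typed with the KG's entity class
    entities = {s for s, p, o in triples
                if p == "rdf:type" and o == ENTITY_CLASS[kg]}
    return instances, entities

triples = [
    ("wd:Q76", "rdf:type", "wdo:Item"),            # an entity
    ("wds:Q76-abc", "rdf:type", "wdo:Statement"),  # an instance, not an entity
]
inst, ent = instances_and_entities(triples, "wikidata")
print(len(inst), len(ent))  # 2 1
```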

Ranking w.r.t. the number of instances: Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities: Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M); OpenCyc is at the bottom with only about 41K entities.

Differences in the number of entities: The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums by Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In the case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered, such as the album "Thriller" and the single "Billie Jean". Rather unknown songs such as "The Lady in My Life" are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and the localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works of Helene Fischer ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive/, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG (logarithmic scale from 10^0 to 10^4).

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, e.g., of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class: Fig. 7 shows the average number of entities per class, which can be written as |Eg|/|Cg|. Obvious is the difference between DBpedia and YAGO (despite the similar number of entities). The reason for this is that the number of classes in the DBpedia ontology is small (as it was created manually), while in YAGO it is large (as it was created automatically).

Comparing the number of instances with the number of entities: Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for this, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

5.1.6. Subjects and Objects

Evaluation method: The number of unique subjects and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) in the subject position of N-Triples: Sg = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources in the object position of N-Triples, excluding literals: Og = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementary, the number of literals is given as Og^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.
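These definitions can be sketched over an N-Triples file with a simplified line pattern (a hypothetical helper; a real parser would also handle blank nodes and escape sequences):

```python
import re

# Sketch: counting unique subjects S_g, unique non-literal objects O_g,
# and unique literals O_g^lit from N-Triples lines. The regex assumes
# IRI subjects and predicates; literals are recognized by not starting
# with '<'.

TRIPLE = re.compile(r'^(<[^>]+>)\s+<[^>]+>\s+(.+?)\s*\.\s*$')

def subject_object_counts(ntriples_lines):
    subjects, objects, literals = set(), set(), set()
    for line in ntriples_lines:
        m = TRIPLE.match(line)
        if not m:
            continue
        s, o = m.groups()
        subjects.add(s)
        if o.startswith('<'):        # IRI -> counts towards O_g
            objects.add(o)
        else:                        # literal -> counts towards O_g^lit
            literals.add(o)
    return len(subjects), len(objects), len(literals)

lines = [
    '<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .',
    '<http://ex.org/a> <http://ex.org/q> "42"^^<http://www.w3.org/2001/XMLSchema#integer> .',
]
print(subject_object_counts(lines))  # (1, 1, 1)
```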

Ranking of KGs regarding the number of unique subjects: The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects: The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of numberof unique subjects to number of unique objects Theratios of the number of unique subjects to the number ofunique objects vary considerably between the KGs (seeFig 8) We can observe that DBpedia has 265 timesmore objects than subjects while YAGO on the otherside has 19 times more unique subjects than objects


Table 2. Summary of key statistics.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO |
|---|---|---|---|---|---|
| Number of triples | 411,885,960 | 3,124,791,156 | 2,412,520 | 748,530,833 | 1,001,461,792 |
| Number of classes (Cg) | 736 | 53,092 | 116,822 | 302,280 | 569,751 |
| Number of relations (Pg) | 2,819 | 70,902 | 18,028 | 1,874 | 106 |
| No. of unique predicates (Pg^imp) | 60,231 | 784,977 | 165 | 4,839 | 88,736 |
| Number of entities (Eg) | 4,298,433 | 49,947,799 | 41,029 | 18,697,897 | 5,130,031 |
| Number of instances (Ig) | 20,764,283 | 115,880,761 | 242,383 | 142,213,806 | 12,291,250 |
| Avg. number of entities per class (Eg/Cg) | 5,840.3 | 940.8 | 0.35 | 61.9 | 9.0 |
| No. of unique subjects (Sg) | 31,391,413 | 125,144,313 | 261,097 | 142,278,154 | 331,806,927 |
| No. of unique non-literals in obj. pos. (Og) | 83,284,634 | 189,466,866 | 423,432 | 101,745,685 | 17,438,196 |
| No. of unique literals in obj. pos. (Og^lit) | 161,398,382 | 1,782,723,759 | 1,081,818 | 308,144,682 | 682,313,508 |

Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow making statements about statements (for instance, storing the provenance information of statements). To that end, IDs (instead of blank nodes) which identify the triples are used in the first position. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed into triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics

Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of the number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge become lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also, the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies greatly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, the domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs: MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as the data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations is used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements pointing to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy

The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3. Evaluation results for the KGs regarding the dimension Accuracy.

| | DB | FB | OC | WD | YA |
|---|---|---|---|---|---|
| m_synRDF | 1 | 1 | 1 | 1 | 1 |
| m_synLit | 0.99 | 1 | 1 | 1 | 0.62 |
| m_semTriple | 0.99 | <1 | 1 | 0.99 | 0.99 |

Syntactic validity of RDF documents (m_synRDF)

Evaluation method: For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, RDF/XML serializations of the resource are available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.77

Evaluation result: All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit)

Evaluation method: We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains – namely people, cities, and books – and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if distinct data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which was created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.
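For xsd:date, such a check can be sketched with the Python standard library instead of Jena (a hypothetical stand-in for the paper's setup); wildcard date values of the kind used by YAGO fail it:

```python
from datetime import date

# Sketch: validating the lexical form of xsd:date literals. YAGO-style
# wildcard dates such as "1940-##-##" are not valid xsd:date values and
# are rejected here.

def is_valid_xsd_date(literal: str) -> bool:
    try:
        date.fromisoformat(literal)   # expects YYYY-MM-DD
        return True
    except ValueError:
        return False

print(is_valid_xsd_date("1961-08-04"))  # True
print(is_valid_xsd_date("1940-##-##"))  # False
```

Note that, like the Jena check mentioned in footnote 79, this simple check also rejects negative (BC) years, although xsd:date permits them.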

Evaluation results: All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the decimals 0-9, periods, and commas.

ISBN: The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns/, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We found the following findings for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of the data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
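A regular expression in the spirit of the one referenced above can be sketched as follows; it checks the syntactic forms only, not the checksum digit, and the exact pattern used in the paper's evaluation may differ:

```python
import re

# Sketch: syntactic ISBN-10/ISBN-13 validation, accepting an optional
# "ISBN" prefix and optional hyphen/space delimiters.

ISBN = re.compile(
    r'^(?:ISBN(?:-1[03])?:? )?'
    r'(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$'
    r'|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)'
    r'(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$'
)

def is_valid_isbn(s: str) -> bool:
    return bool(ISBN.match(s))

print(is_valid_isbn("978-0-306-40615-7"))                # True
print(is_valid_isbn("9789780307986931"))                 # False (16 digits)
print(is_valid_isbn("ISBN 0755111974 (hardcover edition)"))  # False
```

The last example mirrors the DBpedia failure mode described in the text, where infobox comments are extracted together with the ISBN as one coherent string.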

Semantic validity of triples (m_semTriple)

Evaluation method: The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file especially concerning persons and corporate bodies and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result: We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as the death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3 errors, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is in those cases hard to perform.

2. Contrary to our assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowdsourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can be easily found, but possible wrong values within the interval are not detected.

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method: Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted average of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics/, requested on Mar 3, 2016.

Table 4. Evaluation results for the KGs regarding the dimension Trustworthiness.

| | DB | FB | OC | WD | YA |
|---|---|---|---|---|---|
| m_graph | 0.5 | 0.5 | 1 | 0.75 | 0.25 |
| m_fact | 0.5 | 1 | 0 | 1 | 1 |
| m_NoVal | 0 | 1 | 0 | 1 | 0 |

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results: The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not just inserted but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports did not depend on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of a rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times the number of instances in the KG. The reason for this is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but statements with such a reference are not considered sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which in Freebase are called Compound Value Types (CVT), data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5. Evaluation results for the KGs regarding the dimension Consistency.

| | DB | FB | OC | WD | YA |
|---|---|---|---|---|---|
| m_checkRestr | 0 | 1 | 0 | 1 | 0 |
| m_conClass | 0.88 | 1 | <1 | 1 | 0.33 |
| m_conRelat | 0.99 | 0.45 | 1 | 0.50 | 0.99 |

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases.95 Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks of schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method: For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and as dbo:Animal.

Evaluation results: We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

95 E.g., freebase:freebase.valuenotation.has_no_value.

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method: Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistency regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.

Evaluation results: In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range: Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy.

| | DB | FB | OC | WD | YA |
|---|---|---|---|---|---|
| m_Ranking | 0 | 1 | 0 | 1 | 0 |

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, however, the data type xsd:gYear is used.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty: The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify this cardinality restriction by setting the relation to "single;" however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase, 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy
The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements mRanking

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview; requested on Jan 29, 2017.

34 M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Table 7
Evaluation results for the KGs regarding the dimension Completeness

                DB    FB    OC    WD    YA
mcSchema        0.91  0.76  0.92  1     0.95
mcColumn        0.40  0.43  0     0.29  0.33
mcPop           0.93  0.94  0.48  0.99  0.89
mcPop (short)   1     1     0.82  1     0.90
mcPop (long)    0.86  0.88  0.14  0.98  0.88

5.2.5. Completeness
The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness mcSchema

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness; its schema is mainly limited due to the characteristics of how information is stored in and extracted from Wikipedia.

98 See https://developers.google.com/freebase/v1/search-cookbook/scoring-and-ranking; requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison; requested on Jan 29, 2017.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. As a reason for such gaps in the modeling, we can mention the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and is not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes freebase:music.artist for musicians and freebase:sports.pro_athlete for sportsmen are logically subclasses of the class freebase:people.person for people, but are not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as tree, but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.

101 Freebase ID freebase:m.0htd3.


OpenCyc: In total, OpenCyc exposes quite a high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has considerably fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique, but rather abstract relations, which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes, using the domain and range of the relations. Expanding the YAGO schema with further, more fine-grained relations appears reasonable.

Column completeness mcColumn

Evaluation method. For evaluating the KGs w.r.t. Column completeness, for each KG 25 class-relation-combinations102 were created, based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8
Metric values of mcCol for single class-relation-pairs

Relation           DB    FB    OC    WD    YA
Person–birth date  0.48  0.48  0     0.70  0.77
Person–sex         –     0.57  0     0.94  0.64
Book–author        0.91  0.93  0     0.82  0.28
Book–ISBN          0.73  0.63  –     0.18  0.01
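The score for a single class-relation pair is the fraction of the class's instances that carry at least one value for the relation. A sketch over a hypothetical mini-KG (the instance names are made up; the real evaluation ran over the full KGs):

```python
# Sketch: Column completeness for one class-relation pair.
def column_completeness(triples, cls, prop):
    """Fraction of instances of `cls` having at least one value for `prop`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    covered = {s for s, p, o in triples if p == prop and s in instances}
    return len(covered) / len(instances) if instances else 0.0

kg = [
    ("ex:p1", "rdf:type", "dbo:Person"), ("ex:p1", "dbo:birthDate", "1980-01-01"),
    ("ex:p2", "rdf:type", "dbo:Person"), ("ex:p2", "dbo:birthDate", "1990-05-05"),
    ("ex:p3", "rdf:type", "dbo:Person"),  # no birth date
    ("ex:p4", "rdf:type", "dbo:Person"),  # no birth date
]
print(column_completeness(kg, "dbo:Person", "dbo:birthDate"))  # 0.5
```

The KG-level mcColumn score would then be the average of such per-pair scores over the selected class-relation pairs.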

Evaluation results. In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on the instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs:

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). The coverage of ISBNs for books is also quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation-pairs depended on which class-relation-pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation-pairs were used if no 25 pairs were available in the respective KG.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness mcPop

Evaluation method. In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called short head) and two rather unknown entities (called long tail) for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics; requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison; requested on Jan 29, 2017.

105 See http://www.iucnredlist.org; requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as the PageRank scores, may be taken into account.

Evaluation results. All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: a Wikidata entry is added as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measure that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness
The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Figure: grouped bar chart per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO); y-axis from 0 to 1; legend: People, Media, Organizations, Geography, Biology.]

Fig. 10. Population completeness regarding the different domains per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

           DB   FB  OC    WD  YA
mFreq      0.5  0   0.25  1   0.25
mValidity  0    1   0     1   1
mChange    0    1   0     0   0

Timeliness frequency of the KG mFreq

Evaluation results. The KGs are very diverse regarding the frequency with which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness:

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.107 Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real-time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org; requested on Mar 4, 2016.

Freebase had been updated continuously until its shut-down and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year. YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements mValidity

Evaluation results. Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

DBpedia and OpenCyc do not provide any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997."

109 See http://sw.opencyc.org; requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports; requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit; requested on Nov 8, 2016.

Table 10
Evaluation results for the KGs regarding the dimension Ease of understanding

         DB    FB    OC   WD   YA
mDescr   0.70  0.97  1    <1   1
mLang    1     1     0    1    1
muSer    1     1     0    1    1
muURI    1     0.5   1    0    1

Specification of the modification date of statements mChange

Evaluation results. The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding
Description of resources mDescr

Evaluation method. We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.112

Evaluation result. For all KGs, the rule applies that if there is no label available, there is usually also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.114

Labels in multiple languages mLang

Evaluation method. Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.
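The measurement can be sketched as follows, assuming a mapping from entities to their (label text, language tag) pairs; untagged literals (tag None) count as carrying no language information, in line with footnote 115. The example entities are illustrative:

```python
# Sketch: share of entities having an rdfs:label with a given language tag.
def language_coverage(labels, lang):
    """`labels` maps an entity to its set of (text, language-tag) labels;
    a missing tag (None) counts as no language information."""
    covered = sum(1 for tags in labels.values()
                  if any(tag == lang for _, tag in tags))
    return covered / len(labels)

labels = {
    "wd:Q64":    {("Berlin", "en"), ("Berlin", "de")},
    "wd:Q1731":  {("Dresden", "de")},   # no English label
    "wd:Q99999": {("???", None)},       # untagged literal
}
print(round(language_coverage(labels, "en"), 2))  # 0.33
print(round(language_coverage(labels, "de"), 2))  # 0.67
```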

Evaluation results. DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG; therefore, it provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:CareerStation to 10 entities of his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.

Understandable RDF serialization muSer

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs muURI

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In the case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116

5.2.8. Interoperability
The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification mReif

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In the case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta-information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity of dealing with reification.

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11
Evaluation results for the KGs regarding the dimension Interoperability

           DB    FB    OC    WD    YA
mReif      0.5   0.5   0.5   0     0.5
miSerial   1     0     0.5   1     1
mextVoc    0.61  0.11  0.41  0.68  0.13
mpropVoc   0.15  0     0.51  >0    0
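The quad-to-triple conversion described for YAGO can be sketched as dropping the statement ID from each quad; the example quads below are made up for illustration:

```python
# Sketch: loading N-Quads into a triple store ignores the statement IDs.
def quads_to_triples(quads):
    """Drop the statement ID (first element) of each quad, keeping the triple."""
    return [(s, p, o) for stmt_id, s, p, o in quads]

quads = [
    ("id_1", "yago:Einstein", "yago:wasBornIn", "yago:Ulm"),
    ("id_2", "yago:Einstein", "yago:hasGender", "yago:male"),
]
for triple in quads_to_triples(quads):
    print(triple)
# ('yago:Einstein', 'yago:wasBornIn', 'yago:Ulm')
# ('yago:Einstein', 'yago:hasGender', 'yago:male')
```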

Blank nodes are non-dereferenceable anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats miSerial

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in the Turtle format.

Using external vocabulary mextVoc

Evaluation method. This criterion indicates the extent to which external vocabulary is used. For that, for each KG, we divide the number of triples with external relations by the number of all triples in this KG.
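A minimal sketch of this ratio computation, assuming predicates can be classified as internal or external by their namespace prefix (in practice, full IRIs rather than the illustrative prefixed names below would be matched):

```python
# Sketch: share of triples whose predicate is from an external vocabulary.
def external_vocab_ratio(triples, internal_prefixes):
    """Fraction of triples whose predicate does not start with one of the
    KG's own namespace prefixes (simplified prefix test)."""
    external = sum(1 for _, p, _ in triples
                   if not p.startswith(tuple(internal_prefixes)))
    return external / len(triples)

kg = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "rdfs:label", "Berlin"),           # external (RDFS)
    ("dbr:Berlin", "owl:sameAs", "wd:Q64"),           # external (OWL)
    ("dbr:Berlin", "dbo:populationTotal", "3500000"),
]
print(external_vocab_ratio(kg, ["dbo:", "dbp:"]))  # 0.5
```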

Evaluation results. DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for this: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e., about half) are used for instantiations of statements, i.e., for reification.

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.

Interoperability of proprietary vocabulary mpropVoc

Evaluation method. This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results. In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. We made the following single findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, and these links are only on the instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external linking via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equivalent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves a linking coverage of 2.1% here. Although this is low, frequently used relations are linked.121

118 OpenCyc uses owl:sameAs both on the schema and the instance level. This is appropriate, as the OWL primer states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl; requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

           DB   FB    OC    WD    YA
mDeref     1    1     0.44  0.41  1
mAvai      <1   0.73  <1    <1    1
mSPARQL    1    1     0     1     0
mExport    1    1     1     1     1
mNegot     0.5  1     0     1     0
mHTMLRDF   1    1     1     1     0
mMeta      1    0     0     0     1

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility
The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources mDeref

Evaluation method. We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
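Such dereferencing requests can be sketched with Python's standard library; the code below only builds and inspects the request object, since actually opening the (illustrative) URI would require network access:

```python
# Sketch: a dereferencing request asking for RDF/XML via content negotiation.
import urllib.request

def rdf_request(uri):
    """Build a request with the Accept header set to application/rdf+xml.
    Sending it with urllib.request.urlopen(req) would perform the actual
    dereferencing; the URI below is only illustrative."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

req = rdf_request("http://dbpedia.org/resource/Hamburg")
print(req.get_header("Accept"))  # application/rdf+xml
```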

Evaluation results. In the case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that they fulfilled this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K, due to the small number of unique predicates. We observed almost the same picture for YAGO, namely no notable errors during dereferencing.

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s," as in wdt:P1024s, is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in the subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503; e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key, the Freebase KG is only usable to a limited extent.

Availability of the KG mAvai

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom.122 For each KG, an uptime test was set up which checked the availability of the resource Hamburg as representative resource for successful URI resolving (i.e., returning the status code HTTP 200) every minute over the time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result: While the other KGs showed almost no outages and were online again after some minutes on average, YAGO outages took place frequently and lasted on average 3.5 hours.123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.
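The metric itself reduces to simple arithmetic over the minute-by-minute uptime checks; a toy sketch (the outage numbers below are invented, not the measured ones):

```python
# Toy computation of the availability metric: the share of one-minute
# uptime checks that succeeded. Numbers are invented for illustration.
def availability(total_checks: int, failed_checks: int) -> float:
    """Fraction of checks that returned HTTP 200."""
    return (total_checks - failed_checks) / total_checks

total = 60 * 24 * 60  # 60 days of one-minute checks
print(round(availability(total, 300), 4))  # e.g., 300 failed checks
```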

Availability of a public SPARQL endpoint mSPARQL

The SPARQL endpoints of DBpedia and YAGO are provided by a Virtuoso server,124 the Wikidata SPARQL endpoint via Blazegraph.125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language for the Freebase KG was available.

122 See https://www.pingdom.com, requested Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison/, requested on Jan 31, 2017).

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions. The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.
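A hedged sketch of issuing a query against such a public endpoint while respecting its execution limit (the endpoint URL and query are illustrative; `urlopen(request, timeout=30)` would execute it):

```python
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint, used here only as an example.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request asking for JSON results via content negotiation."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )

# Keeping result sets small (LIMIT) helps a query finish well below the
# 30-second cap instead of being cut off by the endpoint.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_request(WIKIDATA_ENDPOINT, query)
```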

Provisioning of an RDF export mExport

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG. Mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation mNegot

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation, and only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.
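This quirk can be handled with a simple accept-header fallback; a sketch under our own naming assumptions:

```python
import urllib.request

# Standard N-Triples media type first, Virtuoso's text/plain second.
ACCEPTABLE = ("application/n-triples", "text/plain")

def is_ntriples_response(content_type: str) -> bool:
    """True if the server answered with a media type usable as N-Triples."""
    return content_type.split(";")[0].strip() in ACCEPTABLE

def fetch_ntriples(uri: str) -> bytes:
    """Try each accept header in turn until a usable representation arrives."""
    for accept in ACCEPTABLE:
        request = urllib.request.Request(uri, headers={"Accept": accept})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                if is_ntriples_response(response.headers.get("Content-Type", "")):
                    return response.read()
        except OSError:
            continue
    raise RuntimeError(f"no N-Triples representation for {uri}")
```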

Linking HTML sites to RDF serializations mHTMLRDF

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate" type="[content type]" href="[URL]"> in the HTML header.

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.

Table 13
Evaluation results for the KGs regarding the dimension License

             DB  FB  OC  WD  YA
mmacLicense   1   0   0   1   0
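Such alternate links can be harvested with the standard library; a small sketch (the sample HTML is hypothetical):

```python
from html.parser import HTMLParser

class AlternateLinkParser(HTMLParser):
    """Collect <link rel="alternate"> targets and their media types."""

    def __init__(self):
        super().__init__()
        self.alternates = []  # list of (type, href) tuples

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "alternate":
            self.alternates.append(
                (attributes.get("type"), attributes.get("href"))
            )

html = """<html><head>
<link rel="alternate" type="application/rdf+xml" href="/data/Karlsruhe.rdf">
<link rel="alternate" type="text/turtle" href="/data/Karlsruhe.ttl">
</head><body></body></html>"""

parser = AlternateLinkParser()
parser.feed(html)
# parser.alternates now holds both advertised RDF serializations.
```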

Provisioning of metadata about the KG mmeta

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file.126 DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.
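A VoID self-description of the kind DBpedia embeds can be as small as the following sketch (all URIs and the triple count are placeholders):

```turtle
# Hypothetical VoID dataset description; URIs and counts are invented.
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/void.ttl#MyKG>
    a void:Dataset ;
    dcterms:title "Example Knowledge Graph" ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:dataDump <http://example.org/dumps/kg.nt.gz> ;
    void:triples 1234567 .
```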

5.2.10. License
The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information mmacLicense

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA128 and the GNU Free Documentation License (GNU FDL).129 Wikidata embeds licensing information during the dereferencing of resources in the RDF document by linking with cc:license to the license CC0.130 YAGO and Freebase do not provide machine-readable licensing information. However, their data is published under the license CC-BY.131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.132

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

        DB    FB    OC    WD       YA
mInst   0.25  0     0.38  0 (0.9)  0.31
mURIs   0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking
The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs mInst

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
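This computation can be sketched on a toy triple set (namespaces and triples are invented for illustration):

```python
# Sketch of the m_Inst computation on a toy KG.
KG_NAMESPACE = "http://example-kg.org/resource/"

triples = [
    # (subject, predicate, object)
    ("http://example-kg.org/resource/Karlsruhe", "owl:sameAs",
     "http://sws.geonames.org/2892794/"),                  # external link
    ("http://example-kg.org/resource/Karlsruhe", "owl:sameAs",
     "http://example-kg.org/resource/Karlsruhe_City"),     # internal, ignored
    ("http://example-kg.org/resource/Hamburg", "rdf:type",
     "http://example-kg.org/ontology/City"),
]
instances = {"http://example-kg.org/resource/Karlsruhe",
             "http://example-kg.org/resource/Hamburg"}

def m_inst(triples, instances, namespace):
    """Share of instances with at least one external owl:sameAs link."""
    linked = {
        s for s, p, o in triples
        if p == "owl:sameAs" and s in instances and not o.startswith(namespace)
    }
    return len(linked) / len(instances)

print(m_inst(triples, instances, KG_NAMESPACE))  # 1 of 2 instances
```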

Evaluation result: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided nor is a corresponding proprietary relation available. Instead, Wikidata uses for each linked data set a proprietary relation (called identifier) to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided in the browser interface as hyperlinks, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.
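Deriving owl:sameAs-style triples from such identifier statements could look as follows (the property mapping and the Freebase URI pattern are our assumptions for illustration):

```python
# Hypothetical mapping from Wikidata identifier properties to external
# URI templates; only the Freebase case is shown.
IDENTIFIER_TARGETS = {
    "wdt:P646": "http://rdf.freebase.com/ns/{id}",  # Freebase identifier
}

def identifiers_to_sameas(statements):
    """Turn (subject, property, literal-id) statements into sameAs triples."""
    links = []
    for subject, prop, literal in statements:
        template = IDENTIFIER_TARGETS.get(prop)
        if template is not None:
            # Freebase M-IDs like "/m/01x3gpk" become "m.01x3gpk" segments.
            identifier = literal.lstrip("/").replace("/", ".")
            links.append((subject, "owl:sameAs", template.format(id=identifier)))
    return links

statements = [("wd:Q1040", "wdt:P646", "/m/01x3gpk")]
print(identifiers_to_sameas(statements))
```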

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc,134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.135

Validity of external URIs mURIs

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
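A sketch of this link check, with the error classes used by the metric (function names and the classification boundaries are ours):

```python
import urllib.error
import urllib.request

def classify_status(status: int) -> str:
    """Map an HTTP status code to the error classes used by the metric."""
    if 200 <= status < 400:
        return "valid"
    if 400 <= status < 500:
        return "client error"
    return "server error"

def check_uri(uri: str, timeout: float = 10.0) -> str:
    """Classify an external link as valid, erroneous, or unreachable."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as error:
        return classify_status(error.code)
    except urllib.error.URLError:
        return "unreachable"

# m_URIs is then the share of external links classified as "valid".
```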

Evaluation result: The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources, namely to wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation reference URL (wdt:P854), which states provenance information among other relations, belongs to the links linking to external Web resources. Here, we were able to resolve around 95.5% without errors.

134 I.e., sw.cyc.com.
135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results
We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which are not formatted in a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBN numbers, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN numbers) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way in which data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is particularly worthwhile in the case of statements whose validity is only temporally limited.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard existed in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata achieves the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., the term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We can find mixed paradigms regarding the URI generation. DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (where, in part, classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value. We can mention as a reason for this the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferencable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase requires an API key for a large number of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability. We measured around 100 outages for YAGO in a time interval of 8 weeks, taking on average 3.5 hours each.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time per query of 30 seconds. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation. While OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form. DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on the resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
- Identifying the preselection criteria P
- Assigning a weight wi to each DQ criterion ci ∈ C

Step 2: Preselection based on the Preselection Criteria
- Manually selecting the KGs GP that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
- Calculating the DQ metric mi(g) for each DQ criterion ci ∈ C
- Calculating the fulfillment degree h(g) for each KG g ∈ GP
- Determining the KG g* with the highest fulfillment degree h(g*)

Step 4: Qualitative Assessment of the Result
- Assessing the selected KG g* w.r.t. qualitative aspects
- Comparing the selected KG g* with other KGs in GP

Fig. 11. Proposed process for using our KG recommendation framework.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g1, ..., gn}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion. The license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are neglected which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics) and, if necessary, an alternative KG can be selected for the given scenario.
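Step 3 can be sketched as follows, assuming the fulfillment degree h(g) from Section 3.1 is the weighted average of the DQ metric values; the weights and metric values below are toy numbers, not those of Table 15:

```python
def fulfillment_degree(metric_values, weights):
    """Weighted average h(g) over all DQ criteria."""
    total_weight = sum(weights.values())
    return sum(weights[c] * metric_values[c] for c in weights) / total_weight

# Toy weighting emphasizing timeliness, population completeness,
# dereferencing, and availability.
weights = {"mFreq": 3, "mcPop": 3, "mDeref": 2, "mAvai": 2}
kg_metrics = {
    "toy-kg-a": {"mFreq": 1.0, "mcPop": 0.99, "mDeref": 0.414, "mAvai": 0.9999},
    "toy-kg-b": {"mFreq": 0.5, "mcPop": 0.93, "mDeref": 1.0, "mAvai": 0.9961},
}
best = max(kg_metrics, key=lambda g: fulfillment_degree(kg_metrics[g], weights))
```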

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. To be able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG Recommendation Framework can be applied as follows:

1. Requirements analysis

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and that the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

138 We assume that in this use case rather the dereferencing of HTTP URIs than the execution of SPARQL queries is desired.


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension              Metric       DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example of User Weighting wi

Accuracy               msynRDF      1        1         1        1         1       1
                       msynLit      0.994    1         1        1         0.624   1
                       msemTriple   0.990    0.995     1        0.993     0.993   1

Trustworthiness        mgraph       0.5      0.5       1        0.75      0.25    0
                       mfact        0.5      1         0        1         1       1
                       mNoVal       0        1         0        1         0       0

Consistency            mcheckRestr  0        1         0        1         0       0
                       mconClass    0.875    1         0.999    1         0.333   0
                       mconRelat    0.992    0.451     1        0.500     0.992   0

Relevancy              mRanking     0        1         0        1         0       1

Completeness           mcSchema     0.905    0.762     0.921    1         0.952   1
                       mcCol        0.402    0.425     0        0.285     0.332   2
                       mcPop        0.93     0.94      0.48     0.99      0.89    3

Timeliness             mFreq        0.5      0         0.25     1         0.25    3
                       mValidity    0        1         0        1         1       0
                       mChange      0        1         0        0         0       0

Ease of understanding  mDescr       0.704    0.972     1        0.9999    1       1
                       mLang        1        1         0        1         1       0
                       muSer        1        1         0        1         1       0
                       muURI        1        0.5       1        0         1       1

Interoperability       mReif        0.5      0.5       0.5      0         0.5     0
                       miSerial     1        0         0.5      1         1       1
                       mextVoc      0.61     0.108     0.415    0.682     0.134   1
                       mpropVoc     0.150    0         0.513    0.001     0       1

Accessibility          mDeref       1        0.437     1        0.414     1       2
                       mAvai        0.9961   0.9998    1        0.9999    0.7306  2
                       mSPARQL      1        0         0        1         1       1
                       mExport      1        1         1        1         1       0
                       mNegot       0.5      0         0        1         1       0
                       mHTMLRDF     1        1         0        1         1       0
                       mMeta        1        0         1        0         0       0

Licensing              mmacLicense  1        0         0        1         0       0

Interlinking           mInst        0.251    0         0.382    0         0.310   3
                       mURIs        0.929    0.908     0.894    0.957     0.956   1

Unweighted Average                  0.683    0.603     0.496    0.752     0.625
Weighted Average                    0.701    0.493     0.556    0.714     0.648


4. Qualitative assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately. The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. the discography. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in the existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level mgraph, Indicating unknown and empty values mNoVal, Check of schema restrictions during insertion of new statements mcheckRestr, Creating a ranking of statements mRanking, Timeliness frequency of the KG mFreq, Specification of the validity period of statements mValidity, and Availability of the KG mAvai, have not been proposed so far, to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources mDescr and Column completeness mcCol.

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint mSPARQL and Provisioning of an RDF export mExport. Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary mpropVoc. Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]

msynRDF X X

msynLit X X X X

msemTriple X X X X

mfact X X

mconClass X X X

mconRelat X X X X X X

mcSchema X X

mcCol X X X X

mcPop X X

mChange X X

mDescr X X X X

mLang X

muSer X

muURI X

mReif X X X

miSerial X

mextV oc X X

mpropV oc X

mDeref X X X X

mSPARQL X

mExport X X

mNegot X X X

mHTMLRDF X

mMeta X X X

mmacLicense X X X

mInst X X X

mURIs X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages mLang and Validity of external URIs mURIs for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole DBpedia KG and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and in several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets. The authors therefore propose in addition a coherence metric. Accordingly, we analyze not only simple statistical key figures but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce with the system OntoQA metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and their subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances as a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverages of KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
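This duplicate-free counting can be illustrated with a minimal Python sketch; the class-to-domain mapping and all triples are hypothetical, and the KG is modeled as a plain set of (subject, predicate, object) string tuples:

```python
# Sketch of the adapted domain-coverage counting: each instance is counted
# at most once per domain, even if it is typed with several classes that
# belong to the same domain (e.g., dbo:Place and dbo:PopulatedPlace).
from collections import defaultdict

CLASS_TO_DOMAIN = {  # assumed manual mapping of frequent classes to domains
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:Person": "people",
}

def domain_coverage(triples):
    instances_per_domain = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type" and o in CLASS_TO_DOMAIN:
            # adding to a set deduplicates instances within a domain
            instances_per_domain[CLASS_TO_DOMAIN[o]].add(s)
    return {domain: len(insts) for domain, insts in instances_per_domain.items()}

g = {
    ("dbr:Karlsruhe", "rdf:type", "dbo:Place"),
    ("dbr:Karlsruhe", "rdf:type", "dbo:PopulatedPlace"),
    ("dbr:Ada_Lovelace", "rdf:type", "dbo:Person"),
}
result = domain_coverage(g)  # Karlsruhe is counted once in "geography"
```

Here Karlsruhe contributes a count of 1 to the geography domain although it carries two geography classes.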

8. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 11:1–11:8. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing Linked Data Quality Assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A Crystallization Point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality Characteristics of Linked Data Publishing Data Sources). Diploma thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the Origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628 of CEUR Workshop Proceedings, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An Empirical Survey of Linked Data Conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Online; accessed July 20, 2015].

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference, ESWC 2009, Heraklion, pages 723–737. Springer, Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-Driven Evaluation of Linked Data Quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing Data Quality in Cooperative Information Systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference, ISWC '10, pages 355–355. Springer, Berlin Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: Things, Not Strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. [Online; accessed Aug 29, 2016].

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward Quality Data: An Attribute-Based Approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-Driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.


jointWith, c2) ∈ g}.^{18} Furthermore, let c_g(e) be the set of all classes of instance e in g, defined as c_g(e) = \{c \mid (e, rdf\colon type, c) \in g\}. Then we define m_conClass(g) as follows:

m_{conClass}(g) = \frac{|\{(c_1, c_2) \in CC \mid \nexists e\colon c_1 \in c_g(e) \wedge c_2 \in c_g(e)\}|}{|\{(c_1, c_2) \in CC\}|}

In case of an empty set of class constraints CC, the metric should evaluate to 1.
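The metric can be made concrete with a minimal Python sketch; the graph g and the disjointness constraints CC are invented for illustration, and the KG is again modeled as a set of (s, p, o) string tuples:

```python
# Sketch of m_conClass: the share of class constraints (c1, c2) for which
# no entity is an instance of both c1 and c2.
def classes_of(g, e):
    return {o for s, p, o in g if s == e and p == "rdf:type"}

def m_con_class(g, cc):
    if not cc:  # empty set of class constraints -> metric evaluates to 1
        return 1.0
    entities = {s for s, p, o in g if p == "rdf:type"}
    violated = {
        (c1, c2)
        for c1, c2 in cc
        if any(c1 in classes_of(g, e) and c2 in classes_of(g, e) for e in entities)
    }
    return (len(cc) - len(violated)) / len(cc)

g = {
    ("ex:Rex", "rdf:type", "ex:Dog"),
    ("ex:Rex", "rdf:type", "ex:Cat"),  # violates the Dog/Cat disjointness
    ("ex:Ann", "rdf:type", "ex:Person"),
}
cc = {("ex:Dog", "ex:Cat"), ("ex:Person", "ex:City")}
print(m_con_class(g, cc))  # 0.5
```

One of the two constraints is violated (Rex is both a Dog and a Cat), so the metric evaluates to 0.5.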

Consistency of statements w.r.t. relation constraints. The metric for this criterion is intended for measuring the degree to which the instance data is consistent with the relation restrictions (e.g., indicated via rdfs:range and owl:FunctionalProperty) specified on the schema level. We evaluate this criterion by averaging over the scores obtained from single metrics m_conRelat_i, indicating the consistency of statements w.r.t. different relation constraints:

m_{conRelat}(g) = \frac{1}{n} \sum_{i=1}^{n} m_{conRelat_i}(g)

In case of evaluating the consistency of instance data concretely w.r.t. given rdfs:range and owl:FunctionalProperty statements,^{19} we can state:

m_{conRelat}(g) = \frac{m_{conRelatRg}(g) + m_{conRelatFct}(g)}{2}

Let R_r be the set of all rdfs:range constraints,

R_r = \{(p, d) \mid (p, rdfs\colon range, d) \in g \wedge isDatatype(d)\}

and R_f be the set of all owl:FunctionalProperty constraints,

R_f = \{(p, d) \mid (p, rdf\colon type, owl\colon FunctionalProperty) \in g \wedge (p, rdfs\colon range, d) \in g \wedge isDatatype(d)\}

Then we can define the metrics m_conRelatRg(g) and m_conRelatFct(g) as follows:

m_{conRelatRg}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r\colon datatype(o) = d\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_r\}|}

m_{conRelatFct}(g) = \frac{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f \wedge \nexists (s, p, o_2) \in g\colon o \neq o_2\}|}{|\{(s, p, o) \in g \mid \exists (p, d) \in R_f\}|}

In case of an empty set of relation constraints (R_r or R_f), the respective metric should evaluate to 1.

^{18} Implicit restrictions which can be deduced from the class hierarchy, e.g., that a restriction for dbo:Animal also counts for dbo:Mammal, a subclass of dbo:Animal, are not considered by us here.

^{19} We chose those relations (and, for instance, not owl:InverseFunctionalProperty) as only those relations are used by more than half of the considered KGs.
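A minimal Python sketch of the two sub-metrics; the triples, the constraint sets, and the datatype() helper are hypothetical, and the range and functional constraints are given as property-to-datatype mappings:

```python
# Sketch of m_conRelatRg (range consistency) and m_conRelatFct (uniqueness
# of values for functional properties) over literal-valued statements.
def datatype(o):
    # stand-in: read the datatype suffix of a literal such as "..."^^xsd:date
    return o.rsplit("^^", 1)[1] if "^^" in o else "xsd:string"

def m_con_relat_rg(g, r_r):
    constrained = [(s, p, o) for s, p, o in g if p in r_r]
    if not constrained:  # empty constraint set -> metric evaluates to 1
        return 1.0
    ok = [(s, p, o) for s, p, o in constrained if datatype(o) == r_r[p]]
    return len(ok) / len(constrained)

def m_con_relat_fct(g, r_f):
    constrained = [(s, p, o) for s, p, o in g if p in r_f]
    if not constrained:
        return 1.0
    ok = [
        (s, p, o)
        for s, p, o in constrained
        if not any(s2 == s and p2 == p and o2 != o for s2, p2, o2 in constrained)
    ]
    return len(ok) / len(constrained)

g = {
    ("ex:Ann", "ex:birthDate", '"1980-01-01"^^xsd:date'),
    ("ex:Bob", "ex:birthDate", '"about 1980"^^xsd:string'),  # range violation
    ("ex:Bob", "ex:birthDate", '"1980-06-01"^^xsd:date'),    # second value
}
r_r = {"ex:birthDate": "xsd:date"}  # rdfs:range constraints
r_f = {"ex:birthDate": "xsd:date"}  # owl:FunctionalProperty constraints
print(m_con_relat_rg(g, r_r))   # 2 of 3 statements respect the range
print(m_con_relat_fct(g, r_f))  # only Ann's statement has a unique value
```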

3.3. Contextual Category

Contextual data quality "highlights the requirement that data quality must be considered within the context of the task at hand" [47]. This category contains the three dimensions (i) Relevancy, (ii) Completeness, and (iii) Timeliness. Wang et al.'s further dimensions in this category, appropriate amount of data and value-added, are considered by us as being part of the dimension Completeness.

3.3.1. Relevancy
Definition of dimension. Relevancy is "the extent to which data are applicable and helpful for the task at hand" [47].

Discussion. According to Bizer [11], Relevancy is an important quality dimension, since the user is confronted with a variety of potentially relevant information on the Web.

Definition of metric. The dimension Relevancy is determined by the criterion Creating a ranking of statements.^{20} The fulfillment degree of a KG g w.r.t. the dimension Relevancy is measured by the metric m_Ranking, which is defined as follows.

^{20} We do not consider the relevancy of literals, as there is no ranking of literals provided for the considered KGs.


Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions which he no longer holds are ranked with normal rank (wdo:NormalRank).

m_{Ranking}(g) = \begin{cases} 1 & \text{ranking of statements supported} \\ 0 & \text{otherwise} \end{cases}

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.

3.3.2. Completeness
Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: Appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].
– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing;
2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing; and
3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for an investor who is looking for an overview of European stocks. The completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness. The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of classes and relations in the gold standard, noclat:

m_{cSchema}(g) = \frac{noclat_g}{noclat}

Column completeness. In the traditional database area (with a fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class which are defined on the schema level (each relation has one column) exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of used relations for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation p, no_kp, to the number of all instances having class k, no_k. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_{cCol}(g) = \frac{1}{|H|} \sum_{(k, p) \in H} \frac{no_{kp}}{no_k}

We thereby let H = \{(k, p) \in K \times P \mid k \in C_g \wedge \exists (x, p, o) \in g\colon p \in P^{imp}_g \wedge (x, rdf\colon type, k) \in g\} be the set of all combinations of the considered classes K = \{k_1, \dots, k_n\} and considered relations P = \{p_1, \dots, p_m\}.

Note that there are also relations which are dedicated to the instances of a specific class but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.^{21} For measuring the Column completeness, we selected only those relations for an assessment where a value of the relation typically exists for all given instances.
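The averaging over class-relation pairs can be sketched in a few lines of Python (the classes, relations, and triples are hypothetical):

```python
# Sketch of m_cCol: for each considered class-relation pair (k, p), compute
# the share of instances of k having at least one value for p, then average.
def m_c_col(g, pairs):
    ratios = []
    for k, p in pairs:
        instances = {s for s, pr, o in g if pr == "rdf:type" and o == k}
        if not instances:
            continue
        with_value = {s for s, pr, o in g if pr == p and s in instances}
        ratios.append(len(with_value) / len(instances))
    return sum(ratios) / len(ratios) if ratios else 1.0

g = {
    ("ex:Ann", "rdf:type", "ex:Person"),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:Ann", "ex:birthDate", '"1980-01-01"^^xsd:date'),
}
print(m_c_col(g, [("ex:Person", "ex:birthDate")]))  # 0.5
```

Only one of the two Person instances has a birth date, so the single considered pair yields 0.5.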

Population completeness. The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_{cPop}(g) = \frac{|\{e \mid e \in GS \wedge e \in E_g\}|}{|\{e \mid e \in GS\}|}
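As a minimal Python sketch (the gold standard and the KG's entity set are invented):

```python
# Sketch of m_cPop: the share of gold-standard entities ("short head" and
# "long tail" alike, weighted equally) that are contained in the KG.
def m_c_pop(kg_entities, gold_standard):
    return len(gold_standard & kg_entities) / len(gold_standard)

gold = {"ex:Tokyo", "ex:Delhi", "ex:Wadersloh"}       # hypothetical gold standard
kg_entities = {"ex:Tokyo", "ex:Delhi", "ex:Berlin"}   # entities E_g of the KG
print(m_c_pop(kg_entities, gold))  # two of the three gold entities are found
```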

3.3.3. Timeliness
Definition of dimension. Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion. Timeliness does not describe the creation date of a statement but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages to the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

^{21} For an evaluation about the prediction which relations are of this nature, see [1].

Definition of metric. The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period, and Specification of the modification date of statements. The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG. The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately but the RDF export files are available in discrete, varying updating intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_{Freq}(g) = \begin{cases} 1 & \text{continuous updates} \\ 0.5 & \text{discrete periodic updates} \\ 0.25 & \text{discrete non-periodic updates} \\ 0 & \text{otherwise} \end{cases}

Specification of the validity period of statements. Specifying the validity period of statements makes it possible to temporally limit the validity of statements. By using this criterion, we measure whether the KG supports the specification of start and possibly end dates of statements by means of providing suitable forms of representation.

m_{Validity}(g) = \begin{cases} 1 & \text{specification of validity period supported} \\ 0 & \text{otherwise} \end{cases}

Specification of the modification date of statements. The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_{Change}(g) = \begin{cases} 1 & \text{specification of modification dates for statements supported} \\ 0 & \text{otherwise} \end{cases}


3.4. Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding the human-readability) and (ii) Interoperability (i.e., regarding the machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as being part of the dimension Interoperability.

3.4.1. Ease of Understanding
Definition of dimension. The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion. This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here, a KG) can be improved by features such as descriptive labels and literals in multiple languages.

Definition of metric. The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. this dimension is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources. Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: Given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace:

m_{Descr}(g) = \frac{|\{u \mid u \in U^{local}_g \wedge \exists (u, p, o) \in g\colon p \in P_{lDesc}\}|}{|\{u \mid u \in U^{local}_g\}|}

P_{lDesc} is the set of implicitly used relations in g indicating that the value is a label or a description (e.g., P_{lDesc} = {rdfs:label, rdfs:comment}).

Moreover, the result of the evaluation on the basis of the entities is noteworthy: DBpedia deviates considerably here, since some entities (created by intermediate-node mappings) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the same namespace), but perform the evaluation only on the entities.
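A minimal Python sketch of m_Descr; the labeling relations P_lDesc, the local resources, and the triples are assumptions for illustration:

```python
# Sketch of m_Descr: the share of resources in the local namespace carrying
# at least one label or description.
P_LDESC = {"rdfs:label", "rdfs:comment", "schema:description"}

def m_descr(g, local_resources):
    described = {
        u for u in local_resources
        if any(s == u and p in P_LDESC for s, p, o in g)
    }
    return len(described) / len(local_resources)

g = {
    ("ex:Ann", "rdfs:label", '"Ann"@en'),
    ("ex:IntermediateNode7", "rdf:type", "ex:CareerStation"),  # no label
}
print(m_descr(g, {"ex:Ann", "ex:IntermediateNode7"}))  # 0.5
```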

Labels in multiple languages. Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.^{22} The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG:

m_{Lang}(g) = \begin{cases} 1 & \text{labels provided in English and at least one other language} \\ 0 & \text{otherwise} \end{cases}

Understandable RDF serialization. RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats such as N3, N-Triples, and Turtle. We measure this criterion via the supported serialization formats during the dereferencing of resources:

m_{uSer}(h_g) = \begin{cases} 1 & \text{other RDF serializations than RDF/XML available} \\ 0 & \text{otherwise} \end{cases}

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs. Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.^{23} recommend using short, memorable URIs in the Semantic Web context, which are easier to understand and remember for humans compared to opaque URIs^{24} such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources:

m_{uURI}(g) = \begin{cases} 1 & \text{self-describing URIs always used} \\ 0.5 & \text{self-describing URIs partly used} \\ 0 & \text{otherwise} \end{cases}

^{22} Using the namespace http://www.w3.org/2004/02/skos/core#.

^{23} See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

^{24} For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.

3.4.2. Interoperability
Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects Interpretability, Representational consistency, and Concise representation.

Definition of dimension. We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].
– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].
– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding Interpretability. In contrast to the dimension Understandability, which focuses on the understandability of RDF KG data for the user as data consumer, Interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding Representational consistency. In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding Concise representation. Heath et al. [26] made the observation that the RDF features (i) RDF reification,^{25} (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. Those features should be avoided, according to Heath et al., in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree with that but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

^{25} In the literature it is often not differentiated between reification in the general sense and reification in the sense of the specific

Definition of metric. The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification. Using RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common, and they complicate the processing and querying of RDF data [30,26]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

m_Reif(g) =
    1    if no blank nodes and no RDF reification are used
    0.5  if either blank nodes or RDF reification are used
    0    otherwise
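The metric can be sketched in Python over a KG represented as a list of (subject, predicate, object) tuples in CURIE notation. This representation, and the detection of reification via rdf:type rdf:Statement triples, are simplifying assumptions of this sketch, not the paper's implementation.

```python
RDF_TYPE = "rdf:type"
RDF_STATEMENT = "rdf:Statement"

def is_blank(term) -> bool:
    # In N-Triples, blank nodes are serialized with the prefix '_:'.
    return isinstance(term, str) and term.startswith("_:")

def m_reif(triples):
    """m_Reif: 1 if neither blank nodes nor RDF standard reification
    are used, 0.5 if exactly one of the two features is used, 0 otherwise."""
    has_blank = any(is_blank(s) or is_blank(o) for s, _, o in triples)
    has_reif = any(p == RDF_TYPE and o == RDF_STATEMENT for _, p, o in triples)
    if not has_blank and not has_reif:
        return 1.0
    if has_blank != has_reif:   # exactly one of the two features present
        return 0.5
    return 0.0
```

For example, a graph containing only plain entity triples scores 1, adding a blank-node triple drops the score to 0.5, and adding an rdf:Statement instantiation as well drops it to 0.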

Provisioning of several serialization formats. The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

²⁵(cont.) proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, available online at http://www.w3.org/TR/rdf-schema, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term "reification" by default for the general sense, and "standard reification" or "RDF reification" for referring to the modeling of reification according to the RDF standard.

14    M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

m_iSerial(h_g) =
    1    if RDF/XML and further formats are supported
    0.5  if only RDF/XML is supported
    0    otherwise

Using external vocabulary. Using common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [30,26] and allows a comfortable data integration. We measure the criterion of using external vocabulary by relating the number of triples with external vocabulary in predicate position to the number of all triples in the KG.

m_extVoc(g) = |{(s, p, o) ∈ g | p ∈ P_g^ext}| / |{(s, p, o) ∈ g}|
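The ratio can be computed directly over the triple set. In the following sketch, "external" is approximated by the predicate prefix not belonging to the KG's own namespaces; the prefixes and triples shown are illustrative, not taken from the paper.

```python
def m_ext_voc(triples, proprietary_prefixes):
    """m_extVoc: share of triples whose predicate stems from an external
    vocabulary. A predicate counts as external if it does not start with
    one of the KG's own namespace prefixes (a simplifying assumption)."""
    if not triples:
        return 0.0
    external = sum(
        1 for _, p, _ in triples
        if not any(p.startswith(prefix) for prefix in proprietary_prefixes)
    )
    return external / len(triples)
```

With two dbo: predicates and two externally defined predicates (rdfs:label, owl:sameAs) out of four triples, the metric evaluates to 0.5.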

Interoperability of proprietary vocabulary. Linking on the schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on the schema level by calculating the ratio to which classes and relations have at least one equivalence link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources.

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g: (p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext)}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass}, and U_g^ext consists of all URIs in U_g which are external to the KG g, meaning that h_g is not responsible for resolving these URIs.
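A minimal sketch of this schema-level metric, assuming the sets P_g ∪ C_g and U_g^ext are given as Python sets (the example terms are hypothetical):

```python
EQ_RELATIONS = {"owl:sameAs", "owl:equivalentProperty", "owl:equivalentClass"}

def m_prop_voc(triples, schema_terms, external_uris):
    """m_propVoc: share of proprietary classes/relations (schema_terms,
    i.e. P_g ∪ C_g) that carry at least one equivalence link to an
    external URI (external_uris plays the role of U_g^ext)."""
    if not schema_terms:
        return 0.0
    linked = {
        s for s, p, o in triples
        if s in schema_terms and p in EQ_RELATIONS and o in external_uris
    }
    return len(linked) / len(schema_terms)
```

If one of two proprietary schema terms carries an owl:equivalentClass link to an external class, the metric evaluates to 0.5.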

3.5. Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the following three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1. Accessibility

Definition of dimension. Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion. Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. Availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries), usually all data sources need to be available in order to execute the query. There can be different factors influencing the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. Response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time is dependent on empirical factors, such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric. We define the metric for the dimension Accessibility by means of metrics for the following criteria:


– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources. One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should be returned thereby. We assess the dereferencing possibility of resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code, and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

m_Deref(h_g) = |dereferenceable(U_g)| / |U_g|
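The check can be split into a per-URI success predicate and the ratio computation, roughly as below. The accepted RDF media types are an assumption of this sketch; the paper additionally inspects the returned document itself, which is omitted here. Collecting the responses would require actual HTTP requests (e.g., via urllib), which are only indicated in a comment.

```python
def successful(status_code, content_type):
    """A resource counts as dereferenceable if the server answers with
    HTTP 200 and an RDF media type (a simplifying assumption)."""
    rdf_types = ("application/rdf+xml", "text/turtle", "application/n-triples")
    return status_code == 200 and any(content_type.startswith(t) for t in rdf_types)

def m_deref(responses):
    """m_Deref: responses maps each sampled URI to its observed
    (status_code, content_type) pair."""
    if not responses:
        return 0.0
    ok = sum(1 for code, ctype in responses.values() if successful(code, ctype))
    return ok / len(responses)

# The responses could be collected e.g. with urllib.request:
#   req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
#   with urllib.request.urlopen(req) as resp:
#       responses[uri] = (resp.status, resp.headers.get("Content-Type", ""))
```

Injecting the observed responses keeps the metric itself deterministic and testable offline.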

Availability of the KG. The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query, usually all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.²⁶

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint. SPARQL endpoints allow the user to perform complex queries (potentially involving many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions on this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query. However, we do not measure these restrictions here.

²⁶See http://pingdom.com, requested on Mar 1, 2016.

m_SPARQL(h_g) =
    1    if a SPARQL endpoint is publicly available
    0    otherwise

Provisioning of an RDF export. If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local, private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

m_Export(h_g) =
    1    if an RDF export is available
    0    otherwise

Support of content negotiation. Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content type are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type. This may lead to serialized RDF data not being processed further; an example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as desired content type, and by comparing the accept header of the HTTP request with the content type of the HTTP response.

m_Negot(h_g) =
    1    if CN is supported and correct content types are returned
    0.5  if CN is supported but wrong content types are returned
    0    otherwise
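The comparison of requested and returned content types can be sketched as follows. The observations are assumed to be collected beforehand as (Accept header, returned Content-Type) pairs; the mapping of these observations onto the three metric values is a simplified reading of the criterion, not the paper's exact implementation.

```python
def m_negot(observations):
    """m_Negot: observations is a list of (accept_header, returned_type)
    pairs collected by dereferencing sample resources with different
    desired serializations."""
    if not observations:
        return 0.0
    if all(returned.startswith(accept) for accept, returned in observations):
        return 1.0  # CN supported, correct content types returned
    if any(returned for _, returned in observations):
        return 0.5  # server answers, but e.g. declares Turtle as text/plain
    return 0.0
```

A server that answers a text/turtle request with text/plain, as in the example criticized by Heath et al., would thus score 0.5.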

Linking HTML sites to RDF serializations. Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource, in order to make the discovery of corresponding RDF data easier (for Linked Data aware applications). For that reason, the so-called Autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.²⁷ We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

m_HTMLRDF(h_g) =
    1    if the Autodiscovery pattern is used at least once
    0    otherwise
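Detecting the Autodiscovery pattern in fetched HTML pages can be done with Python's standard html.parser; the following is an illustrative sketch (the paper does not prescribe a particular implementation).

```python
from html.parser import HTMLParser

class AutodiscoveryFinder(HTMLParser):
    """Looks for <link rel="alternate" type="application/rdf+xml" ...>,
    i.e. the Autodiscovery pattern, in an HTML page."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and "rdf" in a.get("type", "")):
            self.found = True

def m_html_rdf(html_pages):
    """m_HTMLRDF: 1 if the Autodiscovery pattern occurs in at least one
    of the sampled HTML pages, 0 otherwise."""
    for page in html_pages:
        finder = AutodiscoveryFinder()
        finder.feed(page)
        if finder.found:
            return 1.0
    return 0.0
```

A page whose head contains `<link rel="alternate" type="application/rdf+xml" href="company.rdf">` (the example from footnote 27) is recognized; a page without such a link is not.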

Provisioning of KG metadata. In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, the meta-information about KGs also needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary²⁸ [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later on, in the data quality dimension License.

m_Meta(g) =
    1    if machine-readable metadata about g is available
    0    otherwise

3.5.2. License

Definition of dimension. Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion. The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)²⁹ publishes several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY³⁰ requires specifying the source of the data. CC-BY-SA³¹ requires, in addition, that if the data is published, it is published under the same legal conditions. CC0³² defines the respective data as public domain and without any restrictions.

²⁷An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

²⁸See namespace http://www.w3.org/TR/void.

²⁹See http://creativecommons.org, requested on Mar 1, 2016.

It is noteworthy that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, the data is often not used, since companies point to uncertainties regarding these contracts.

Definition of metric. The dimension License is determined by the criterion Provisioning machine-readable licensing information.

The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information. Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [30,26].

Licenses can be specified in RDF via relations such as cc:license,³³ dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG, as additional facts, or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

m_macLicense(g) =
    1    if machine-readable licensing information is available
    0    otherwise
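A check for the licensing relations named above can be sketched as a simple scan over the triples; merging the KG and its VoID description into one triple list is a simplification of this sketch.

```python
LICENSE_RELATIONS = {"cc:license", "dcterms:license", "dcterms:rights"}

def m_mac_license(triples):
    """m_macLicense: 1 if the KG (or its VoID description, assumed to be
    merged into the same triple list here) contains a machine-readable
    licensing statement, 0 otherwise."""
    return 1.0 if any(p in LICENSE_RELATIONS for _, p, _ in triples) else 0.0
```

A dataset description with a dcterms:license triple pointing to, e.g., a CC0 URI scores 1; a KG without any of these relations scores 0.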

3.5.3. Interlinking

Definition of dimension. Interlinking is the extent "to which entities that represent the same concept are linked to each other, be it within or between two or more data sources" [49].

³⁰See https://creativecommons.org/licenses/by/4.0, requested on Mar 1, 2016.

³¹See https://creativecommons.org/licenses/by-sa/4.0, requested on Mar 1, 2016.

³²See http://creativecommons.org/publicdomain/zero/1.0, requested on Mar 3, 2016.

³³Using the namespace http://creativecommons.org/ns.

Discussion. According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is usually established on the instance level via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries in different granularity. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,³⁴ namely (i) Berlin the capital,³⁵ (ii) Berlin the state,³⁶ and (iii) Berlin the city.³⁷ Moreover, owl:sameAs relations are often created automatically by some mapping function. Due to mapping errors, the precision is often below 100% [18].

Definition of metric. The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs. The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on the instance level³⁸ by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

³⁴See http://www.geonames.org, requested on Dec 31, 2016.

³⁵See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

³⁶See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

³⁷See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

³⁸The interlinking on the schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
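A direct translation of this formula into Python, assuming the instance set I_g \ (P_g ∪ C_g) and the external URI set U_g^ext are given (the example resources are hypothetical):

```python
def m_inst(triples, instances, external_uris):
    """m_Inst: share of instances (entities that are neither classes nor
    relations, i.e. I_g minus (P_g union C_g)) with at least one
    owl:sameAs link into an external KG."""
    if not instances:
        return 0.0
    linked = {
        s for s, p, o in triples
        if s in instances and p == "owl:sameAs" and o in external_uris
    }
    return len(linked) / len(instances)
```

If one of two instances carries an owl:sameAs link to an external resource, the metric evaluates to 0.5.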

Validity of external URIs. The considered KG may contain outgoing links referring to RDF resources or Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations. Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs are not available anymore. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g: (p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext)}, and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric should evaluate to 1.
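The metric, including the empty-set convention, can be sketched as follows. The resolvable predicate is injected (it would in practice be backed by HTTP HEAD/GET requests checking for status 200 rather than timeouts or 4xx/5xx responses), which keeps the sketch testable offline; this separation is an assumption of the sketch, not the paper's setup.

```python
def m_uris(external_uris, resolvable):
    """m_URIs: share of sampled external URIs (the set A) that resolve
    successfully. resolvable is a predicate URI -> bool; an empty
    sample evaluates to 1, as defined above."""
    if not external_uris:
        return 1.0
    ok = sum(1 for uri in external_uris if resolvable(uri))
    return ok / len(external_uris)
```

With one resolvable and one dead URI in the sample, the metric evaluates to 0.5; with an empty sample, it evaluates to 1.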

3.6. Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category

  ∗ Accuracy
    · Syntactic validity of RDF documents
    · Syntactic validity of literals
    · Semantic validity of triples
  ∗ Trustworthiness
    · Trustworthiness on KG level
    · Trustworthiness on statement level
    · Using unknown and empty values
  ∗ Consistency
    · Check of schema restrictions during insertion of new statements
    · Consistency of statements w.r.t. class constraints
    · Consistency of statements w.r.t. relation constraints

– Contextual category

  ∗ Relevancy
    · Creating a ranking of statements
  ∗ Completeness
    · Schema completeness
    · Column completeness
    · Population completeness
  ∗ Timeliness
    · Timeliness frequency of the KG
    · Specification of the validity period of statements
    · Specification of the modification date of statements

– Representational data quality

  ∗ Ease of understanding
    · Description of resources
    · Labels in multiple languages
    · Understandable RDF serialization
    · Self-describing URIs
  ∗ Interoperability
    · Avoiding blank nodes and RDF reification
    · Provisioning of several serialization formats
    · Using external vocabulary
    · Interoperability of proprietary vocabulary

– Accessibility category

  ∗ Accessibility
    · Dereferencing possibility of resources
    · Availability of the KG
    · Provisioning of public SPARQL endpoint
    · Provisioning of an RDF export
    · Support of content negotiation
    · Linking HTML sites to RDF serializations
    · Provisioning of KG metadata
  ∗ License
    · Provisioning machine-readable licensing information
  ∗ Interlinking
    · Interlinking via owl:sameAs
    · Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia³⁹ is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.⁴⁰ By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,⁴¹ GeoNames, MusicBrainz,⁴² CIA World Factbook,⁴³ DBLP,⁴⁴ Project Gutenberg,⁴⁵ DBtune Jamendo,⁴⁶ Eurostat,⁴⁷ UniProt,⁴⁸ and Bio2RDF.⁴⁹⁵⁰ DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings; for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

³⁹See http://dbpedia.org, requested on Nov 1, 2016.

⁴⁰There is also DBpedia Live, which started in 2009 and which gets updated when Wikipedia is updated. See http://live.dbpedia.org, requested on Nov 1, 2016. Note, however, that DBpedia Live only provides a restricted set of relations compared to DBpedia. Also, the provisioning of data varies a lot: while for some time ranges DBpedia Live provides data for each hour, for other time ranges DBpedia Live data is only available once a month.

⁴¹See http://umbel.org, requested on Dec 31, 2016.

⁴²See http://musicbrainz.org, requested on Dec 31, 2016.

⁴³See https://www.cia.gov/library/publications/the-world-factbook, requested on Dec 31, 2016.

⁴⁴See http://www.dblp.org, requested on Dec 31, 2016.

⁴⁵See https://www.gutenberg.org, requested on Dec 31, 2016.

⁴⁶See http://dbtune.org/jamendo, requested on Dec 31, 2016.

⁴⁷See http://eurostat.linked-statistics.org, requested on Dec 31, 2016.

⁴⁸See http://www.uniprot.org, requested on Dec 31, 2016.

⁴⁹See http://bio2rdf.org, requested on Dec 31, 2016.

⁵⁰See a complete list of the links on the websites describing the single DBpedia versions, such as http://downloads.dbpedia.org/2016-04/links (requested on Nov 1, 2016).


– Freebase: Freebase⁵¹ is a KG announced by Metaweb Technologies, Inc. in 2007, which was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,⁵² FMD,⁵³ and MusicBrainz.⁵⁴ Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.⁵⁵ Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc⁵⁶ project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store – in a machine-processable way – millions of common sense facts, such as "every tree is a plant." The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG called OpenCyc⁵⁷ was released under the open source Apache license, Version 2. In July 2006, ResearchCyc⁵⁸ was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs be freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata⁵⁹ is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

⁵¹See http://freebase.com, requested on Nov 1, 2016.

⁵²See http://www.nndb.com, requested on Dec 31, 2016.

⁵³See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.

⁵⁴See http://musicbrainz.org, requested on Dec 31, 2016.

⁵⁵See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.

⁵⁶See http://www.cyc.com, requested on Dec 31, 2016.

⁵⁷See http://www.opencyc.org, accessed on Nov 1, 2016.

⁵⁸See http://researchcyc.com, requested on Dec 31, 2016.

⁵⁹See http://wikidata.org, accessed on Nov 1, 2016.

– YAGO: YAGO⁶⁰ – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.⁶¹ The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

⁶⁰See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.

⁶¹See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples

Ranking of KGs w.r.t. number of triples. The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG, with over 3.1B triples, while OpenCyc is the smallest KG, with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets such as MusicBrainz have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO. As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention here the following reasons: YAGO integrates the statements from different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia. For representing the anchor texts, the relation yago:hasWikipediaAnchorText (33.0M triples in total) is used. The provenance information of single statements is stored in a reified form. In particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied therefore.

Influence of reification on the number of triples. DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification in general describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.⁶² This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identifiable. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

⁶²The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In the case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.⁶³ Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but in addition each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.

5.1.2. Classes

Methods for counting classes. The number of classes can be calculated in different ways. Classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.⁶⁴ Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead only uses "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.

Ranking of KGs w.r.t. number of classes. Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).

Number of classes in YAGO and DBpedia. How does it come to this gap between DBpedia and YAGO with respect to the number of classes, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is deployed with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

⁶³In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is named Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).

⁶⁴The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type, and "instance of" (wdt:P31) in case of Wikidata) on the instance level into account. However, this would result only in a lower bound estimation, as here those classes are not considered which have no instances.

[Fig. 1: Coverage of classes having at least one instance (bar chart over DBpedia, Freebase, OpenCyc, Wikidata, and YAGO; coverage in %).]

Coverage of classes with at least one instance. Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 65%) and Wikidata (54%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of a KG is necessary, so that the low coverage of instances by classes is not necessarily an issue.

Correlation between number of classes and number of instances. In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power-law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                     DB    FB    OC    WD    YA
Reach of method     88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a

variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed to measure the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains, and hence the reach of our evaluation, includes about 80% of all entities of each KG except Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.66 If the ratio were 100%, we would be able to assign all entities of a KG to the chosen domains.
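The "reach" ratio described above can be sketched as follows; the domain partition and entity IDs are hypothetical. Taking the union over all domains (rather than summing per-domain counts) avoids double-counting entities that belong to several domains, as noted in footnote 66:

```python
# Reach of the evaluation method: unique entities covered by any of the
# considered domains, divided by all entities of the KG.
def reach(domain_entities, all_entities):
    covered = set().union(*domain_entities.values())
    return len(covered) / len(all_entities)

domains = {
    "people": {"e1", "e2"},
    "media": {"e2", "e3"},  # e2 belongs to two domains, counted once
}
print(reach(domains, {"e1", "e2", "e3", "e4"}))  # 0.75
```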

Fig. 3 shows the number of entities per domain in the different KGs with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).

66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes with respect to the number of instances per KG.


Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track is accountable for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organization. This is relatively few, considering that the total number of entities is around 18.7M and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times,67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organization belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method. In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as a short term for explicitly defined relations – refers to the (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.

are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used Pg to denote this set.

– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as Pg^imp, is nothing else than the set of unique RDF terms in the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.
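The distinction above can be sketched in a few lines. This is an illustrative sketch under simplified assumptions (triples as tuples, shortened prefixed identifiers, rdf:Property as the schema-level marker), not the authors' implementation:

```python
# Pg: relations explicitly declared on the schema level (typed as
# rdf:Property). Pg^imp: terms actually used in the predicate position.
def relations_and_predicates(triples):
    declared = {s for s, p, o in triples
                if p == "rdf:type" and o == "rdf:Property"}  # Pg
    used = {p for _, p, _ in triples}                        # Pg^imp
    return declared, used

g = [
    ("dbo:birthDate", "rdf:type", "rdf:Property"),  # declared, never used
    ("dbr:Alice", "dbp:name", '"Alice"'),           # used, never declared
]
declared, used = relations_and_predicates(g)
print(sorted(declared))  # ['dbo:birthDate']
print(sorted(used))      # ['dbp:name', 'rdf:type']
```

The toy graph shows both divergences at once: a relation that is declared but never used, and a predicate that is used but never declared.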

Evaluation results.

Relations. Ranking regarding relations: As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 70K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia. Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and hence without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |Pg| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned69 and may overlap until DBpedia version 2016-04.70

Freebase. The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace. So-called commons admins were able to approve those relations, so that they got included into the Freebase commons schema.

OpenCyc. For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata. In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot insert arbitrary new relations as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.

69 For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.

70 For instance, dbp:alias and dbo:alias.

criteria are met.71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process can be mentioned as a likely reason why, in relative terms, more relations are actually used in Wikidata than in Freebase.

YAGO. For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.

2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.

3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards ("#"), so that no multiple relations are needed.

4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.

5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations. Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In the case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In the case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.

Predicates. Ranking regarding predicates: Freebase is here – as in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows:

DBpedia. DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since facts are also extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which hence occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase. We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once. This relativizes the high number. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc. In contrast to the 18,028 unique relations, we measure only 165 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata. We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) via an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows one to refer to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].
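The expansion of a single statement into such reified triples can be sketched as follows. This is an illustrative sketch: the prefixes and the statement-node ID follow the abbreviated example above and are not actual Wikidata export identifiers:

```python
# n-ary (reified) modeling: the statement "Q76 P31 Q5" is expanded into
# two triples via an intermediate statement node.
def reify(subject, prop, value, statement_id):
    return [
        (subject, f"{prop}s", statement_id),  # "s": object is a statement
        (statement_id, f"{prop}v", value),    # "v": refer to the value
    ]

triples = reify("wdt:Q76", "wdt:P31", "wdt:Q5", "wdt:Q76S123")
for t in triples:
    print(t)
# ('wdt:Q76', 'wdt:P31s', 'wdt:Q76S123')
# ('wdt:Q76S123', 'wdt:P31v', 'wdt:Q5')
```

This also makes concrete why Wikidata's triple count and instance count grow with reification: every statement adds an intermediate node and extra triples.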

YAGO. YAGO contains more predicates than DBpedia, since infobox attributes from different language versions of Wikipedia are aggregated into one KG,72

while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method. We distinguish between instances Ig and entities Eg of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.


Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky. In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances and including at least one entity.73 In this way, abstract classes such as cych:ExistingObjectType are neglected.
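The per-KG entity determination described above can be sketched as a simple lookup of each KG's "entity class". This is an illustrative sketch with hypothetical data; OpenCyc is omitted, since the text describes a manual classification for it instead:

```python
# KG-specific classes whose instances count as entities (per the text).
ENTITY_CLASS = {
    "DBpedia": "owl:Thing",
    "YAGO": "owl:Thing",
    "Freebase": "freebase:common.topic",
    "Wikidata": "wdo:Item",
}

def entities(kg, type_assertions):
    """type_assertions: iterable of (instance, class) pairs."""
    target = ENTITY_CLASS[kg]
    return {i for i, cls in type_assertions if cls == target}

types = [("dbr:Hamburg", "owl:Thing"), ("dbo:City", "owl:Class")]
print(entities("DBpedia", types))  # {'dbr:Hamburg'}
```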

Ranking w.r.t. the number of instances. Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities. Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom, with only about 41K entities.

Differences in the number of entities. The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media, and especially song release tracks, are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums by Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs such as "Hab' den Himmel berührt" can be found.

2. In the case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English artists are covered – such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version) this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia and also imports data from sources such as GeoNames. However, the above-mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") by Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014, and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive/, requested on Dec 31, 2016.


Fig. 7. Average number of entities per class per KG.

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English and non-English artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all of an artist's works, such as those of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class. Fig. 7 shows the average number of entities per class, which can be written as |Eg|/|Cg|. Obvious is the difference between DBpedia and YAGO (despite their similar numbers of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually), while in YAGO it is large (as it is created automatically).

Comparing the number of instances with the number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs, such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities, but only a large fraction of them (more precisely, the classes with


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

5.1.6. Subjects and Objects
Evaluation method. The number of unique subjects

and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) in the subject position of N-Triples: Sg = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources in the object position of N-Triples, excluding literals: Og = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementarily, the number of literals is given as Og^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.
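The counts Sg, Og, and Og^lit can be sketched over N-Triples lines as follows. This is a simplified illustration, not the authors' tooling: resources and literals are told apart by the leading quote character, and the naive splitting would need a proper N-Triples parser for real dumps (e.g., literals containing spaces followed by datatype annotations):

```python
# Count unique subjects, unique non-literal objects, and unique
# literals over (simplified) N-Triples lines.
def subject_object_stats(ntriples_lines):
    subjects, objects, literals = set(), set(), set()
    for line in ntriples_lines:
        s, p, o = line.rstrip(" .\n").split(" ", 2)
        subjects.add(s)
        (literals if o.startswith('"') else objects).add(o)
    return len(subjects), len(objects), len(literals)

g = [
    '<ex:Alice> <ex:knows> <ex:Bob> .',
    '<ex:Alice> <ex:name> "Alice" .',
    '<ex:Bob> <ex:knows> <ex:Alice> .',
]
print(subject_object_stats(g))  # (2, 2, 1)
```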

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding number of unique ob-jects The number of unique objects is also presented inFig 9 Freebase shows the highest score in this regardOpenCyc again the lowest

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 9). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2
Summary of key statistics

                                               DBpedia       Freebase       OpenCyc    Wikidata      YAGO
Number of triples |{(s, p, o) ∈ g}|        411,885,960  3,124,791,156     2,412,520  748,530,833  1,001,461,792
Number of classes |Cg|                             736         53,092       116,822      302,280        569,751
Number of relations |Pg|                         2,819         70,902        18,028        1,874            106
No. of unique predicates |Pg^imp|               60,231        784,977           165        4,839         88,736
Number of entities |Eg|                      4,298,433     49,947,799        41,029   18,697,897      5,130,031
Number of instances |Ig|                    20,764,283    115,880,761       242,383  142,213,806     12,291,250
Avg. number of entities per class |Eg|/|Cg|    5,840.3          940.8          0.35         61.9            9.0
No. of unique subjects |Sg|                 31,391,413    125,144,313       261,097  142,278,154    331,806,927
No. of unique non-literals in obj. pos. |Og| 83,284,634   189,466,866       423,432  101,745,685     17,438,196
No. of unique literals in obj. pos. |Og^lit| 161,398,382  1,782,723,759   1,081,818  308,144,682    682,313,508


Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO. Facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing provenance information for statements). To that end, IDs (instead of blank nodes) which identify the triples are used in the first position. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to comply with the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics
Based on the evaluation results presented in the last

subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of the number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge are lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also, the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a burden.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs. MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain of media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations are used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to its entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy
The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3
Evaluation results for the KGs regarding the dimension Accuracy

               DB     FB    OC    WD    YA
m_synRDF       1      1     1     1     1
m_synLit       0.99   1     1     1     0.62
m_semTriple    0.99   <1    1     0.99  0.99

Syntactic validity of RDF documents (m_synRDF).

Evaluation method. For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator.76 Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena framework.77

Evaluation result. All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit).

Evaluation method. We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains – namely people, cities, and books – and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://www.w3.org/RDF/Validator/, requested on Mar 2, 2016.

77 See https://jena.apache.org/, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if distinct data types are provided.78 If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a regular expression which is created manually (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains so many literals.

Evaluation results. All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth: For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct.79 For YAGO, we detected around 519K syntactic errors (given 1M literal values), due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "-470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low.80
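A date check of the kind described above can be sketched with a regular expression. This is a simplified illustration, not the paper's Jena-based validation: the pattern covers plain YYYY-MM-DD values (optionally with a leading minus for BC years) and therefore rejects YAGO-style wildcard dates:

```python
import re

# Strict (simplified) xsd:date lexical form: -?YYYY-MM-DD.
XSD_DATE = re.compile(r"^-?\d{4}-\d{2}-\d{2}$")

def is_valid_xsd_date(value):
    return bool(XSD_DATE.match(value))

print(is_valid_xsd_date("1958-08-29"))  # True
print(is_valid_xsd_date("-470-##-##"))  # False (wildcards, 3-digit year)
```

The full xsd:date lexical space additionally allows years with more than four digits and timezone offsets, which this sketch ignores.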

Number of inhabitants: The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the digits 0–9, periods, and commas.

ISBN. The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. Our findings for the single KGs are as follows: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes.82 In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals in the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments were either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings.84
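A widely circulated ISBN pattern of the kind referenced in footnote 81 can be applied as below; it accepts ISBN-10/13 with or without an "ISBN" prefix and delimiters, but checks only the syntax, not the check digit. The two invalid cases are the Freebase errors from footnote 82:

```python
import re

# Syntax-only ISBN validation (no check-digit verification).
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?"
    r"(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$"
    r"|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)"
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"
)

def is_isbn(lexical: str) -> bool:
    return ISBN_RE.match(lexical) is not None

assert is_isbn("ISBN 978-0-306-40615-7")
assert not is_isbn("9789780307986931")  # 16 digits, as found in Freebase
assert not is_isbn("2940045143431")     # prefix 294 instead of 978/979
```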

Semantic validity of triples m_semTriple

Evaluation method. The semantic validity can be reliably measured by means of a reference data set which (i) contains, at least to some degree, the same facts as the KG, and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND),85 which is an authority file concerning especially persons and corporate bodies, and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result. We evaluated up to 400 facts per KG and observed discrepancies for only a few facts. For instance, Wikidata states as death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO, we encountered 3 errors each, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is hard to perform in those cases.

2. Contrary to expectations, often either no corresponding GND entry exists, or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can easily be found, but possibly wrong values within the interval are not detected.
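A minimal sketch of such an interval-based test case (interval bounds and data are made up for illustration; this is not the implementation of [34]):

```python
# Triples whose numeric value lies outside a plausible interval are flagged
# for manual inspection, mirroring the test-driven approach described above.
def outliers(triples, lower, upper):
    return [(s, v) for s, _, v in triples if not lower <= v <= upper]

heights = [("PersonA", "height_m", 1.78),
           ("PersonB", "height_m", 17.8),   # plausibly a unit error
           ("PersonC", "height_m", 0.02)]
assert outliers(heights, 0.4, 2.8) == [("PersonB", 17.8), ("PersonC", 0.02)]
```

Values inside the interval that are nevertheless wrong remain undetected, matching the limitation noted above.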

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%.86

5.2.2. Trustworthiness

The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level m_graph

Evaluation method. Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4
Evaluation results for the KGs regarding the dimension Trustworthiness

          DB    FB    OC    WD    YA
m_graph   0.5   0.5   1     0.75  0.25
m_fact    0.5   1     0     1     1
m_NoVal   0     1     0     1     0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results. The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase.87 However, new data is not simply inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically was considerably higher, and new data imports did not depend on community approval.

DBpedia and YAGO. The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki,88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level m_fact

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article,89 this provenance information is trivial, and the fulfillment degree is hence of a rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import data automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854).90 Note that "imported from" relations are used for automatic imports, but that statements with such a reference are not considered sourced ("data is not sourced").91 To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements,93 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are called Compound Value Types (CVTs) in Freebase, data of higher arity can be expressed [44].94

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5
Evaluation results for the KGs regarding the dimension Consistency

              DB    FB    OC    WD    YA
m_checkRestr  0     1     0     1     0
m_conClass    0.88  1     <1    1     0.33
m_conRelat    0.99  0.45  1     0.50  0.99

Indicating unknown and empty values m_NoVal

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and the relation owl:someValuesFrom.

Freebase supports the representation of unknown values and empty values by providing explicit relations for such cases.95 In YAGO, inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known); note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency

The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements m_checkRestr

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data for the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints m_conClass

95 E.g., freebase:freebase.valuenotation.has_no_value.

Evaluation method. For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.
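This check can be sketched as follows (a hedged illustration, not the authors' tooling; the resource names are made up):

```python
# m_conClass sketch: given the direct class memberships of each resource,
# report resources that violate an owl:disjointWith pair.
def disjointness_violations(types_of, disjoint_pairs):
    return {r for r, classes in types_of.items()
            if any(a in classes and b in classes for a, b in disjoint_pairs)}

types_of = {
    "dbr:Sequoia":  {"dbo:Plant"},
    "dbr:Oddity42": {"dbo:Plant", "dbo:Animal"},  # inconsistent instantiation
}
assert disjointness_violations(types_of, [("dbo:Plant", "dbo:Animal")]) == {"dbr:Oddity42"}
```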

Evaluation results. We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.

Consistency of statements w.r.t. relation constraints m_conRelat

Evaluation method. Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance in the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistency checks regarding object properties would require distinguishing between the Open World Assumption and the Closed World Assumption.
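The owl:FunctionalProperty part of this check can be sketched as below (a hedged illustration with made-up names, not the authors' code): a functional relation may hold at most one value per subject, so subjects with two or more distinct values count as inconsistent.

```python
from collections import defaultdict

# Collect the object values per (subject, functional property) pair and
# report the pairs with more than one distinct value.
def functional_violations(triples, functional_props):
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional_props:
            values[(s, p)].add(o)
    return {sp for sp, objs in values.items() if len(objs) > 1}

triples = [("ex:Alice", "ex:birthDate", "1980-01-01"),
           ("ex:Alice", "ex:birthDate", "1981-05-05"),  # second value
           ("ex:Bob",   "ex:birthDate", "1975-03-03")]
assert functional_violations(triples, {"ex:birthDate"}) == {("ex:Alice", "ex:birthDate")}
```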

Evaluation results. In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range. Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model, there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements.97 Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6
Evaluation results for the KGs regarding the dimension Relevancy

           DB  FB  OC  WD  YA
m_Ranking  0   1   0   1   0

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used, though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty. The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.

5.2.4. Relevancy

The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements m_Ranking

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7
Evaluation results for the KGs regarding the dimension Completeness

                DB    FB    OC    WD    YA
m_cSchema       0.91  0.76  0.92  1     0.95
m_cColumn       0.40  0.43  0     0.29  0.33
m_cPop          0.93  0.94  0.48  0.99  0.89
m_cPop (short)  1     1     0.82  1     0.90
m_cPop (long)   0.86  0.88  0.14  0.98  0.88

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources.98

5.2.5. Completeness

The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness m_cSchema

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online.99 It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

DBpedia. DBpedia shows a good score regarding Schema completeness; its schema is mainly limited due to the characteristics of how information is stored in and extracted from Wikipedia.

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree, but the class ginkgo, which is a subclass of trees. A reason for such gaps in the modeling is that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are covered considerably well in the DBpedia ontology. Some missing relations or modeling failures are due to the characteristics of Wikipedia infoboxes. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and is not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase. Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets the representation of facts on the instance level rather than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the classes for musicians, freebase:music.artist, and for sportsmen, freebase:sports.pro_athlete, are logically subclasses of the person class freebase:people.person, but are not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo.101 The ginkgo tree is not classified as a tree, but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising, given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID freebase:m.07j7r.

101 Freebase ID freebase:m.0htd3.


OpenCyc. In total, OpenCyc exhibits a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and focuses on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata. According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has considerably fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO. Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie; DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes via the domain and range of the relations. Expanding the YAGO schema by further, more fine-grained relations appears reasonable.

Column completeness m_cColumn

Table 8
Metric values of m_cCol for single class-relation-pairs

Relation          DB    FB    OC    WD    YA
Person-birthdate  0.48  0.48  0     0.70  0.77
Person-sex        -     0.57  0     0.94  0.64
Book-author       0.91  0.93  0     0.82  0.28
Book-ISBN         0.73  0.63  -     0.18  0.01

Evaluation method. For evaluating the KGs w.r.t. Column completeness, 25 class-relation-combinations102 were created for each KG, based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.
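The metric value for one class-relation pair can be sketched as the fraction of the class's instances having at least one value for the relation (a hedged illustration with made-up data, not the authors' code):

```python
# m_cColumn sketch for a single class-relation pair.
def column_completeness(instances, facts, relation):
    instances = list(instances)
    covered = sum(1 for i in instances if (i, relation) in facts)
    return covered / len(instances)

people = ["p1", "p2", "p3", "p4"]
facts = {("p1", "birthDate"), ("p3", "birthDate")}
assert column_completeness(people, facts, "birthDate") == 0.5
```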

Evaluation results. In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation-pairs which are well represented on the instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We note the following observations with respect to the single KGs:

DBpedia. DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25% (about 5K people). We can hence note that the extraction of data from the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase. Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also, the coverage of ISBNs for books is quite high (63.4%).

OpenCyc. OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It contains mainly taxonomic knowledge and only thinly spread instance facts.

102 The selection of class-relation-pairs depended on which class-relation-pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation-pairs were used if 25 pairs were not available in the respective KG.

Wikidata. Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people.103

YAGO. YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness m_cPop

Evaluation method. In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online,104 was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called short head) and two rather unknown entities (called long tail) for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species.105,106

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that other popularity measures, such as PageRank scores, may also be taken into account.

Evaluation results. All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities. Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO as entities. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities. First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: While most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata. A Wikidata entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measured that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs Cyc and ResearchCyc are apparently covered considerably better with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness

The evaluation results concerning the dimension Timeliness are presented in Table 9.


[Figure: population completeness per KG (DBpedia, Freebase, OpenCyc, Wikidata, YAGO), one bar per domain (People, Media, Organizations, Geography, Biology), y-axis from 0 to 1.]

Fig. 10. Population completeness regarding the different domains per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

            DB    FB    OC    WD    YA
m_Freq      0.5   0     0.25  1     0.25
m_Validity  0     1     0     1     1
m_Change    0     1     0     0     0

Timeliness frequency of the KG m_Freq

Evaluation results. The KGs are very diverse regarding the frequency with which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point, and even a criterion for exclusion, in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once or twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published.107 Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real-time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always the one published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its close-down and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012.109 To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible, both via the browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date of the next release has not been published.

Specification of the validity period of statements (m_Validity)

Evaluation results. Although representing the validity period of statements is obviously reasonable for many relations (for instance, the president's term of

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Table 10
Evaluation results for the KGs regarding the dimension Ease of understanding

             DB    FB    OC    WD    YA
m_Descr      0.70  0.97  1     <1    1
m_Lang       1     1     0     1     1
m_uSer       1     1     0     1     1
m_uURI       1     0.5   1     0     1

office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily supported.

DBpedia and OpenCyc do not realize any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997".
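To make the qualifier mechanism concrete, the following minimal sketch (our own illustration, not part of any KG's tooling) models a Wikidata-style statement whose validity period is bounded by "start time" (P580) and "end time" (P582) qualifiers. The dictionary layout is an assumption for illustration; the example IDs follow the Wikidata scheme (Q76 "Barack Obama", P39 "position held", Q11696 "President of the United States").

```python
from datetime import date

def make_statement(subject, prop, value, start=None, end=None):
    """Bundle a fact with optional validity-period qualifiers
    (P580 = start time, P582 = end time, as described in the text)."""
    qualifiers = {}
    if start is not None:
        qualifiers["P580"] = start
    if end is not None:
        qualifiers["P582"] = end
    return {"subject": subject, "property": prop,
            "value": value, "qualifiers": qualifiers}

def valid_at(statement, day):
    """Check whether a statement holds on a given day; missing
    qualifiers are treated as an unbounded validity period."""
    start = statement["qualifiers"].get("P580", date.min)
    end = statement["qualifiers"].get("P582", date.max)
    return start <= day <= end

# A president's term of office, the kind of example named above:
term = make_statement("Q76", "P39", "Q11696",
                      start=date(2009, 1, 20), end=date(2017, 1, 20))
```

Without such qualifiers, as in DBpedia or OpenCyc, a fact like a term of office can only be stated as timelessly true or omitted.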

Specification of the modification date of statements (m_Change)

Evaluation results. The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding

Description of resources (m_Descr)

Evaluation method. We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description.112

Evaluation result. For all KGs, the rule applies that in case there is no label available, usually there is also no description available. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of experimental nature and are most likely not used.113

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels.114

Labels in multiple languages (m_Lang)

Evaluation method. Here, we measure whether the KGs contain labels (rdfs:label) in other languages than English. This is done by means of the language annotations of literals, such as "de" for literals in German.

Evaluation results. DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages. We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation.115 Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.


coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages, such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.
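The per-language coverage measurement can be illustrated with a small helper (a simplified sketch of our own, not the evaluation tooling; the entity URIs and labels are invented). It counts which entities carry at least one rdfs:label with the requested language tag:

```python
import re

# Match N-Triples lines of the form: <subject> rdfs:label "..."@lang .
LABEL_RE = re.compile(
    r'^(<[^>]+>)\s+'
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"[^"]*"@([A-Za-z-]+)')

def language_coverage(ntriples_lines, entities, lang):
    """Fraction of entities with at least one rdfs:label in `lang`."""
    covered = set()
    for line in ntriples_lines:
        m = LABEL_RE.match(line)
        if m and m.group(2).lower() == lang:
            covered.add(m.group(1))
    return len(covered & set(entities)) / len(entities)

triples = [
    '<http://ex.org/e1> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .',
    '<http://ex.org/e1> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
    '<http://ex.org/e2> <http://www.w3.org/2000/01/rdf-schema#label> "Paris"@en .',
]
entities = ['<http://ex.org/e1>', '<http://ex.org/e2>']
```

Labels without a language tag are skipped, matching the assumption in footnote 115 that no language information is available for them.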

Understandable RDF serialization (m_uSer)

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable by humans.

Self-describing URIs (m_uURI)

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations.116
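The two paradigms can be told apart mechanically. The following check (our own simplification of the URI patterns, not part of the evaluation) treats Wikidata-style Q-/P-IDs as opaque and everything else as potentially descriptive:

```python
import re

# Opaque Wikidata-style entity/property URIs: .../entity/Q64, .../entity/P31
OPAQUE = re.compile(r"^http://www\.wikidata\.org/entity/[QP]\d+$")

def is_opaque(uri):
    """True for Wikidata-style numeric IDs, False for descriptive URIs."""
    return bool(OPAQUE.match(uri))
```

A descriptive URI such as http://dbpedia.org/resource/Berlin reveals its referent to a human reader, while an opaque ID like Q64 requires a label lookup.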

5.2.8. Interoperability

The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification (m_Reif)

Reification allows representing further information about single statements. In conclusion, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification. However, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples and only relations of higher arity

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11
Evaluation results for the KGs regarding the dimension Interoperability

             DB    FB    OC    WD    YA
m_Reif       0.5   0.5   0.5   0     0.5
m_iSerial    1     0     0.5   1     1
m_extVoc     0.61  0.11  0.41  0.68  0.13
m_propVoc    0.15  0     0.51  >0    0

are stored via n-ary relations.117 YAGO stores facts as N-Quads in order to be able to store meta information about facts, like provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.
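The quad-to-triple conversion described above can be sketched as follows (a simplification of our own that assumes URI-only terms; literals containing spaces would require a real N-Quads parser):

```python
def quads_to_triples(nquad_lines):
    """Drop the fourth component of each N-Quad (used by YAGO as a
    per-statement identifier), yielding plain N-Triples lines."""
    triples = []
    for line in nquad_lines:
        # Strip the trailing " ." and split into whitespace-separated terms.
        parts = line.rstrip(" .\n").split()
        s, p, o = parts[0], parts[1], parts[2]  # parts[3] is the statement ID
        triples.append(f"{s} {p} {o} .")
    return triples
```

After this step the per-statement meta information is lost, but the core facts remain queryable as ordinary triples, which is exactly the trade-off the text describes.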

Blank nodes are non-dereferencable anonymous resources. They are used by the Wikidata and OpenCyc data models.

Provisioning of several serialization formats (m_iSerial)

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary (m_extVoc)

Evaluation method. This criterion indicates the extent to which external vocabulary is used. For that, for each KG, we divide the number of triples containing external relations by the total number of triples in this KG.
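As an illustration of this ratio (a simplified sketch with invented triples, not the evaluation code), a predicate counts as external when it lies outside the KG's own namespace:

```python
def external_vocab_ratio(triples, kg_namespace):
    """Share of triples whose predicate is outside the KG's namespace."""
    external = sum(1 for _, p, _ in triples
                   if not p.startswith(kg_namespace))
    return external / len(triples)

triples = [
    # internal DBpedia-ontology predicate
    ("dbr:Berlin", "http://dbpedia.org/ontology/country", "dbr:Germany"),
    # external RDFS predicate
    ("dbr:Berlin", "http://www.w3.org/2000/01/rdf-schema#label", '"Berlin"@en'),
]
```

In this toy example one of two predicates is external, giving a ratio of 0.5.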

Evaluation results. DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for that fact: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e.,

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.


about half) are taken for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary (m_propVoc)

Evaluation method. This criterion determines the extent to which URIs of proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs,118 owl:equivalentClass (in Wikidata, wdt:P1709), and owl:equivalentProperty (in Wikidata, wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results. In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. We made the following individual findings:

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL.119 Regarding its relations, DBpedia links to Wikidata and schema.org.120 Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, but these links are only on instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external linking via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of all Wikidata classes are linked to equiva-

118 OpenCyc uses owl:sameAs both on schema and instance level. This is appropriate, as the OWL primer states "The built-in OWL property owl:sameAs links an individual to an individual" as well as "The owl:sameAs statements are often used in defining mappings between ontologies"; see https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

             DB    FB    OC    WD    YA
m_Deref      1     1     0.44  0.41  1
m_Avai       <1    0.73  <1    <1    1
m_SPARQL     1     1     0     1     0
m_Export     1     1     1     1     1
m_Negot      0.5   1     0     1     0
m_HTMLRDF    1     1     1     1     0
m_Meta       1     0     0     0     1

lent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org and achieves a linking coverage of 2.1% here. Although this is low, frequently used relations are linked.121

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby:. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility

The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources (m_Deref)

Evaluation method. We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object position of triples in each KG. We submitted HTTP requests with the HTTP accept header field set to application/rdf+xml in order to perform content negotiation.
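Such a dereferencing request can be sketched with Python's standard library (the URI is illustrative; no request is actually sent in this snippet):

```python
import urllib.request

def make_deref_request(uri):
    """Build a request that asks the server for RDF/XML via
    content negotiation, as in the evaluation setup described above."""
    return urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"})

req = make_deref_request("http://dbpedia.org/resource/Hamburg")
# Sending it would be: urllib.request.urlopen(req, timeout=10)
```

A server supporting content negotiation answers such a request with an RDF/XML document instead of the HTML page shown to browsers.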

Evaluation results. In case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that they fulfilled this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K due to the small number of unique predicates. We observed almost

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.


the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s" as in wdt:P1024s, used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in subject and object position of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503; e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferencable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG (m_Avai)

Evaluation method. We measured the availability of the officially hosted KGs with the monitoring service Pingdom.122 For each KG, an uptime test was set up which checked the availability of the resource "Hamburg" as representative resource for successful URI resolving (i.e., returning the status code HTTP 200) every minute over the time range of 60 days (Dec 18, 2015 – Feb 15, 2016).

Evaluation result. While the other KGs showed almost no outages and were on average online again after some minutes, YAGO outages took place frequently and lasted on average 3.5 hours.123 In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint (m_SPARQL)

The SPARQL endpoints of DBpedia and YAGO are

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

provided by a Virtuoso server,124 while the Wikidata SPARQL endpoint runs on Blazegraph.125 Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions. The maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front-end of the SPARQL endpoint crashed in case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export (m_Export)

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation (m_Negot)

We measure the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations (m_HTMLRDF)

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate"

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.


Table 13
Evaluation results for the KGs regarding the dimension License

               DB    FB    OC    WD    YA
m_macLicense   1     0     0     1     0

type="content type" href="URL"> elements in the HTML header.
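Extracting such links can be sketched with Python's standard HTML parser (the page snippet is invented for illustration):

```python
from html.parser import HTMLParser

class AlternateLinkParser(HTMLParser):
    """Collect <link rel="alternate" type=... href=...> entries,
    i.e., the RDF variants a KG's HTML page points to."""
    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates.append((a.get("type"), a.get("href")))

html = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="http://example.org/data/Berlin.rdf"/></head><body></body></html>')
parser = AlternateLinkParser()
parser.feed(html)
```

A crawler following such links can jump from the human-readable page straight to the machine-readable RDF document.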

Provisioning of metadata about the KG (m_Meta)

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file.126 DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License

The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing information (m_macLicense)

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC-BY-SA128 and the GNU Free Documentation License (GNU FDL).129 Wikidata embeds licensing information during the dereferencing of resources in the RDF document by linking with cc:license to the license CC0.130 YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC-BY.131 OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form.132

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, via the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

          DB    FB    OC    WD       YA
m_Inst    0.25  0     0.38  0 (0.9)  0.31
m_URIs    0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking

The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs (m_Inst)

Evaluation method. Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations,133 and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
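The counting step can be sketched as follows (our own helper with invented triples; localized DBpedia prefixes are treated as internal, as described below for the evaluation):

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def instances_with_external_sameas(triples, internal_prefixes):
    """Subjects having at least one owl:sameAs link whose object
    lies outside all of the KG's internal namespaces."""
    return {s for s, p, o in triples
            if p == SAME_AS and not o.startswith(tuple(internal_prefixes))}

triples = [
    # external link (GeoNames) -> counts
    ("http://dbpedia.org/resource/Berlin", SAME_AS,
     "http://sws.geonames.org/2950159/"),
    # links to localized DBpedia versions are internal -> do not count
    ("http://dbpedia.org/resource/Berlin", SAME_AS,
     "http://de.dbpedia.org/resource/Berlin"),
    ("http://dbpedia.org/resource/Hamburg", SAME_AS,
     "http://de.dbpedia.org/resource/Hamburg"),
]
internal = ["http://dbpedia.org/", "http://de.dbpedia.org/"]
```

The metric value then follows by dividing the size of the resulting set by the total number of instances.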

Evaluation result. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided, nor is a single corresponding proprietary relation available. Instead, Wikidata uses a separate proprietary relation (called "identifier") for each linked data set to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., "/m/01x3gpk"). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided as hyperlinks in the browser interface, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as an owl:sameAs relation, we would obtain around 12.2M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider

133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.


only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In case of OpenCyc, links to Cyc,134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances with at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now, we analyze the links to external URIs.

Evaluation method. External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
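The error classes named above can be distinguished with a small helper (our own sketch; timeouts are raised as exceptions by the HTTP client and are handled separately):

```python
def classify_status(code):
    """Map an HTTP status code to the error classes used above."""
    if 400 <= code <= 499:
        return "client error"   # e.g., 404 Not Found
    if 500 <= code <= 599:
        return "server error"   # e.g., 503 Service Unavailable
    if 200 <= code <= 399:
        return "ok"             # success or redirect
    return "other"
```

Counting how many external links fall into the "ok" class, relative to all checked links, yields the m_URIs score.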

Evaluation result. The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources, namely to wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation "reference URL" (wdt:P854), which among other relations states provenance information, belongs to the links pointing to external

134 I.e., sw.cyc.com.
135 See Interoperability of proprietary vocabulary in Section 5.2.8.

Web resources. Here, we were able to resolve around 95.5% without errors.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents. All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals. In general, the KGs achieve good scores regarding the Syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions, which do not follow a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values) due to the usage of wildcards in date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In case of Wikidata, some invalid literals, such as ISBN numbers, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN numbers) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples. All considered KGs scored well regarding this metric. This shows that the KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level. Based on the way in which data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level. Here, especially good values are achieved by Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one-third of the statements have provenance information attached. Note, however, that not every state-

136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).


ment in Wikidata requires a reference, and that it is hard to evaluate which statements lack such a reference.

6. Using unknown and empty values. Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements. Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints. Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints. The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases, the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements. Only Wikidata supports a ranking of statements. This is particularly worthwhile in case of statements whose validity is only temporally limited.

11. Schema completeness. Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtained results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard all exist in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness. DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of

each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness. Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG. Only Wikidata achieves the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements. In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements. Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources. YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that: by means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.137

18. Labels in multiple languages. YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization. DBpedia, Wikidata, and YAGO provide several understand-

137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1, etc., representing different engine variations.


able RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

20. Self-describing URIs. We can find mixed paradigms regarding the URI generation: DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (in part; classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification. DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats. Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary. DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary. We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; we can mention as a reason for that the fact that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources. Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferenceable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large amount of requests.

26. Availability of the KG. While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability: we measured around 100 outages for YAGO in a time interval of 8 weeks, taking 3.5 hours on average.

27. Provisioning of a public SPARQL endpoint. DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export. RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation. DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.
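The content-negotiation check behind criterion 29 can be sketched as follows. This is a minimal illustration, not the authors' measurement code: it assumes a KG passes when an HTTP request carrying an RDF `Accept` header is answered with an RDF media type. The helper and the set of accepted media types are our own names.

```python
# Sketch of a content-negotiation check: request a resource with an RDF
# Accept header and test whether the returned Content-Type is an RDF type.
import urllib.request

RDF_MIME_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
}

def negotiated_type(content_type_header):
    """Strip parameters such as charset from a Content-Type header value."""
    return content_type_header.split(";")[0].strip().lower()

def supports_rdf_negotiation(url, accept="text/turtle", timeout=30):
    """True if the server answers an RDF Accept header with an RDF media type."""
    req = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return negotiated_type(resp.headers.get("Content-Type", "")) in RDF_MIME_TYPES
```

Under this check, a server that answers with `text/plain` (as Freebase did) fails regardless of the requested type, while a `text/turtle; charset=UTF-8` answer passes.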

30. Linking HTML sites to RDF serializations. All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata. Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information. Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs. OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.
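The last observation can be made concrete with a small sketch: given a table mapping external-identifier properties to URI templates, Wikidata's identifier literals can be rewritten into owl:sameAs triples. Triples are modeled as plain 3-tuples; the property key, the URI template, and the example literal below are illustrative assumptions, not part of the survey.

```python
# Hedged sketch: turn external-identifier literals (as stored by Wikidata)
# into owl:sameAs links via a property-to-URI-template table.
OWL_SAMEAS = "owl:sameAs"

# Hypothetical mapping: external-identifier property -> URI template.
ID_TEMPLATES = {
    "wdt:P434": "https://musicbrainz.org/artist/{}",  # MusicBrainz artist ID
}

def materialize_sameas(triples):
    """Return one owl:sameAs triple per external-identifier literal."""
    new_triples = []
    for subj, pred, obj in triples:
        template = ID_TEMPLATES.get(pred)
        if template is not None:
            new_triples.append((subj, OWL_SAMEAS, template.format(obj)))
    return new_triples

# Example input: an entity with a MusicBrainz identifier stored as a literal.
kg = [("wd:Q1299", "wdt:P434", "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d")]
links = materialize_sameas(kg)
```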

34. Validity of external URIs. The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
  – Identifying the preselection criteria P
  – Assigning a weight w_i to each DQ criterion c_i ∈ C
Step 2: Preselection based on the Preselection Criteria
  – Manually selecting the KGs G_P that fulfill the preselection criteria P
Step 3: Quantitative Assessment of the KGs
  – Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
  – Calculating the fulfillment degree h(g) for each KG g ∈ G_P
  – Determining the KG g with the highest fulfillment degree h(g)
Step 4: Qualitative Assessment of the Result
  – Assessing the selected KG g w.r.t. qualitative aspects
  – Comparing the selected KG g with the other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.
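The dead-domain problem noted for owl:sameAs targets (criterion 34) can be detected with a small link checker. This is a sketch of the idea behind the m_URIs measurement, not the authors' tooling: a link counts as invalid when the host cannot be resolved (the domain no longer exists) or the server answers with an HTTP error; the function names are our own.

```python
# Classify external URIs and compute the share of valid links.
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Classify an external URI as 'valid', 'http-error', or 'unreachable'."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "valid" if resp.status < 400 else "http-error"
    except urllib.error.HTTPError:
        return "http-error"      # 4xx/5xx, e.g. a deleted RDF document
    except (urllib.error.URLError, OSError):
        return "unreachable"     # e.g. DNS failure: the domain no longer exists

def validity_ratio(results):
    """Share of valid links among all checked links."""
    return sum(r == "valid" for r in results) / len(results)
```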

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g_1, ..., g_n}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria and need to be selected dependent on the use case. The Timeliness frequency of the KG is an example of a quality criterion; the license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. After weighting the criteria, in Step 2 those KGs are discarded which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics), and, if necessary, an alternative KG can be selected for the given scenario.
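Steps 2 and 3 can be sketched in a few lines. This assumes the fulfillment degree h(g) of Section 3.1 is the weight-normalized sum of the metric values, which is consistent with the weighted averages reported in Table 15; all criterion names and values below are a toy configuration, not Table 15 data.

```python
# Sketch of Step 3: rank preselected KGs by their weighted fulfillment degree
# h(g) = sum_i(w_i * m_i(g)) / sum_i(w_i).
def fulfillment_degree(metric_values, weights):
    """Weighted average of the DQ metric values m_i(g) under user weights w_i."""
    total_weight = sum(weights[c] for c in metric_values)
    return sum(weights[c] * metric_values[c] for c in metric_values) / total_weight

def recommend(kgs, weights):
    """Return the KG with the highest fulfillment degree."""
    return max(kgs, key=lambda g: fulfillment_degree(kgs[g], weights))

# Toy configuration with three criteria (illustrative values only):
weights = {"mFreq": 3, "mcPop": 3, "mSPARQL": 1}
kgs = {
    "KG-a": {"mFreq": 1.0, "mcPop": 0.5, "mSPARQL": 1.0},
    "KG-b": {"mFreq": 0.5, "mcPop": 1.0, "mSPARQL": 0.0},
}
best = recommend(kgs, weights)
```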

Use case application. In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case. The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. To be able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag the articles based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.^138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is thus recommended by the framework.

^138 We assume that in this use case rather the dereferencing of HTTP URIs than the execution of SPARQL queries is desired.


Table 15. Framework with an example weighting which would be reasonable for a user setting as given in [33].

Dimension               Metric         DBpedia  Freebase  OpenCyc  Wikidata  YAGO    Example of User Weighting w_i
Accuracy                m_synRDF       1        1         1        1         1       1
                        m_synLit       0.994    1         1        1         0.624   1
                        m_semTriple    0.990    0.995     1        0.993     0.993   1
Trustworthiness         m_graph        0.5      0.5       1        0.75      0.25    0
                        m_fact         0.5      1         0        1         1       1
                        m_NoVal        0        1         0        1         0       0
Consistency             m_checkRestr   0        1         0        1         0       0
                        m_conClass     0.875    1         0.999    1         0.333   0
                        m_conRelat     0.992    0.451     1        0.500     0.992   0
Relevancy               m_Ranking      0        1         0        1         0       1
Completeness            m_cSchema      0.905    0.762     0.921    1         0.952   1
                        m_cCol         0.402    0.425     0        0.285     0.332   2
                        m_cPop         0.93     0.94      0.48     0.99      0.89    3
Timeliness              m_Freq         0.5      0         0.25     1         0.25    3
                        m_Validity     0        1         0        1         1       0
                        m_Change       0        1         0        0         0       0
Ease of understanding   m_Descr        0.704    0.972     1        0.9999    1       1
                        m_Lang         1        1         0        1         1       0
                        m_uSer         1        1         0        1         1       0
                        m_uURI         1        0.5       1        0         1       1
Interoperability        m_Reif         0.5      0.5       0.5      0         0.5     0
                        m_iSerial      1        0         0.5      1         1       1
                        m_extVoc       0.61     0.108     0.415    0.682     0.134   1
                        m_propVoc      0.150    0         0.513    0.001     0       1
Accessibility           m_Deref        1        0.437     1        0.414     1       2
                        m_Avai         0.9961   0.9998    1        0.9999    0.7306  2
                        m_SPARQL       1        0         0        1         1       1
                        m_Export       1        1         1        1         1       0
                        m_Negot        0.5      0         0        1         1       0
                        m_HTMLRDF      1        1         0        1         1       0
                        m_Meta         1        0         1        0         0       0
Licensing               m_macLicense   1        0         0        1         0       0
Interlinking            m_Inst         0.251    0         0.382    0         0.310   3
                        m_URIs         0.929    0.908     0.894    0.957     0.956   1

Unweighted average                     0.683    0.603     0.496    0.752     0.625
Weighted average                       0.701    0.493     0.556    0.714     0.648


4. Qualitative assessment: The high population completeness in general, and the high coverage of entities in the media domain in particular, give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity; thus, missing Wikidata entities can be added by the editors directly and are then available immediately.
The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discographies. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions and extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], introduced DQ criteria with corresponding metrics (e.g., [20,30]), and introduced criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (m_graph), Indicating unknown and empty values (m_NoVal), Check of schema restrictions during insertion of new statements (m_checkRestr), Creating a ranking of statements (m_Ranking), Timeliness frequency of the KG (m_Freq), Specification of the validity period of statements (m_Validity), and Availability of the KG (m_Avai), have not been proposed so far, to the best of our knowledge. In the following, we present more details of single existing approaches for Linked Data quality criteria.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (m_Descr) and Column completeness (m_cCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (m_SPARQL) and Provisioning of an RDF export (m_Export). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (m_propVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16. Overview of related work regarding data quality criteria for KGs

DQ Metric       [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]
m_synRDF        X X
m_synLit        X X X X
m_semTriple     X X X X
m_fact          X X
m_conClass      X X X
m_conRelat      X X X X X X
m_cSchema       X X
m_cCol          X X X X
m_cPop          X X
m_Change        X X
m_Descr         X X X X
m_Lang          X
m_uSer          X
m_uURI          X
m_Reif          X X X
m_iSerial       X
m_extVoc        X X
m_propVoc       X
m_Deref         X X X X
m_SPARQL        X
m_Export        X X
m_Negot         X X X
m_HTMLRDF       X
m_Meta          X X X
m_macLicense    X X X
m_Inst          X X X
m_URIs          X X

Flemming [20] introduces a framework for the quality assessment of Linked Data. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (m_Lang) and Validity of external URIs (m_URIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of data in DBpedia, YAGO2, UniProt, and several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated based on the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets; the authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce, with the system OntoQA, metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the total number of classes. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both the schema and the instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and their subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing the most frequent classes with the highest number of instances as a table. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverage of KGs for this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means, if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
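The deduplicating domain count just described can be sketched as follows. The class-to-domain table is an example mapping of our own; triples are modeled as plain 3-tuples.

```python
# Sketch: count distinct instances per domain, so an instance typed with
# several classes of the same domain (dbo:Place and dbo:PopulatedPlace)
# is counted only once.
CLASS_TO_DOMAIN = {
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_counts(type_triples):
    """Map each domain to the number of distinct instances typed in it."""
    members = {}  # domain -> set of instances
    for subj, _, cls in type_triples:
        domain = CLASS_TO_DOMAIN.get(cls)
        if domain is not None:
            members.setdefault(domain, set()).add(subj)
    return {domain: len(instances) for domain, instances in members.items()}

triples = [
    ("dbr:Karlsruhe", "rdf:type", "dbo:Place"),
    ("dbr:Karlsruhe", "rdf:type", "dbo:PopulatedPlace"),  # same instance, same domain
    ("dbr:The_Beatles", "rdf:type", "dbo:MusicalArtist"),
]
counts = domain_counts(triples)
```

Karlsruhe is counted once in the geography domain despite its two types, which is exactly the deduplication described above.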

8. Conclusion

Freely available knowledge graphs (KGs) have not been in the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing linked data quality assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, Berlin/Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016]

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016]

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016]

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer, Berlin/Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma thesis, Humboldt University of Berlin, 2011. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer, Berlin/Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Accessed July 20, 2015]

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009, Heraklion, pages 723–737. Springer, Berlin/Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7–9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC '10, pages 355–355. Springer, Berlin/Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. [Retrieved on Aug 29, 2016]

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.



Creating a ranking of statements. By means of this criterion, one can determine whether the KG supports a ranking of statements, by which the relative relevance of statements among other statements can be expressed. For instance, given the Wikidata entity Barack Obama (wdt:Q76) and the relation position held (wdt:P39), President of the United States of America (wdt:Q11696) has a preferred rank (wdo:PreferredRank) (until 2017), while older positions, which he holds no more, are ranked with a normal rank (wdo:NormalRank).

m_Ranking(g) = 1, if a ranking of statements is supported; 0, otherwise.

Note that this criterion refers to a characteristic of the KG and not to a characteristic of the system that hosts the KG.
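The binary metric m_Ranking can be sketched on triple level. Triples are plain 3-tuples here, and the rank predicate and rank values mimic the wdo: terms of the Wikidata example above; treat them as illustrative names rather than exact vocabulary.

```python
# Sketch of m_Ranking: 1 if any reified statement carries a rank, 0 otherwise.
RANK_VALUES = {"wdo:PreferredRank", "wdo:NormalRank", "wdo:DeprecatedRank"}

def m_ranking(triples):
    """Return 1 if the KG expresses statement ranks, 0 otherwise."""
    return 1 if any(o in RANK_VALUES
                    for _, p, o in triples if p == "wdo:rank") else 0

# Barack Obama's current and former "position held" statements, reified
# as statement nodes with ranks (statement node IDs are made up):
wikidata_sample = [
    ("wds:Q76-pos1", "wdo:rank", "wdo:PreferredRank"),  # President of the U.S.
    ("wds:Q76-pos2", "wdo:rank", "wdo:NormalRank"),     # an earlier position
]
```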

3.3.2. Completeness
Definition of dimension. Completeness is "the extent to which data are of sufficient breadth, depth, and scope for the task at hand" [47].

We include the following two aspects in this dimension, which are separate dimensions in Wang et al.'s framework:

– Appropriate amount of data: The appropriate amount of data is "the extent to which the quantity or volume of available data is appropriate" [47].

– Value-added: Value-added is "the extent to which data are beneficial and provide advantages from their use" [47].

Discussion. Pipino et al. [40] divide Completeness into:

1. Schema completeness, i.e., the extent to which classes and relations are not missing,

2. Column completeness, i.e., the extent to which values of relations on instance level – i.e., facts – are not missing, and

3. Population completeness, i.e., the extent to which entities are not missing.

The Completeness dimension is context-dependent and therefore belongs to the contextual category, because whether a KG is seen as complete depends on the use case scenario, i.e., on the given KG and on the information need of the user. As exemplified by Bizer [11], a list of German stocks is complete for an investor who is interested in German stocks, but it is not complete for

an investor who is looking for an overview of European stocks. The completeness is hence only assessable by means of a concrete use case at hand or with the help of a defined gold standard.

Definition of metric. We follow the above-mentioned distinction of Pipino et al. [40] and determine Completeness by means of the criteria Schema completeness, Column completeness, and Population completeness.

The fulfillment degree of a KG g w.r.t. the dimension Completeness is measured by the metrics m_cSchema, m_cCol, and m_cPop, which are defined as follows.

Schema completeness. By means of the criterion Schema completeness, one can determine the completeness of the schema w.r.t. classes and relations [40]. The schema is assessed by means of a gold standard. This gold standard consists of classes and relations which are relevant for the use case. For evaluating cross-domain KGs, we use as gold standard a typical set of cross-domain classes and relations. It comprises (i) basic classes such as people and locations in different granularities, and (ii) basic relations such as birth date and number of inhabitants. We define the schema completeness m_cSchema as the ratio of the number of classes and relations of the gold standard existing in g, noclat_g, to the number of all classes and relations in the gold standard, noclat:

mcSchema(g) =noclatgnoclat
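For illustration, the ratio can be computed directly from two sets of schema terms; the following sketch uses hypothetical gold-standard and KG terms, not the actual gold standard of our evaluation.

```python
# Sketch of the schema completeness metric m_cSchema.
# The gold-standard and KG schema terms below are hypothetical placeholders.
def m_c_schema(gold_standard, kg_schema_terms):
    """Ratio of gold-standard classes/relations existing in the KG schema."""
    found = gold_standard & kg_schema_terms          # noclat_g
    return len(found) / len(gold_standard)           # noclat_g / noclat

gold = {"Person", "City", "birthDate", "populationTotal"}
kg = {"Person", "City", "birthDate", "mayor"}
print(m_c_schema(gold, kg))  # 3 of 4 gold-standard terms found -> 0.75
```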

Column completeness: In the traditional database area (with fixed schema), by means of the Column completeness criterion one can determine the degree to which the relations of a class, which are defined on the schema level (each relation has one column), exist on the instance level [40]. In the Semantic Web and Linked Data context, however, we cannot presume any fixed relational schema on the schema level. The set of possible relations for the instances of a class is given at runtime by the set of relations used for the instances of this class. Therefore, we need to modify this criterion, as already proposed by Pipino et al. [40]. In the updated version, by means of the criterion Column completeness one can determine the degree to which the instances of a class use the same relations, averaged over all classes.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 11

Formally, we define the Column completeness metric m_cCol(g) as the ratio of the number of instances having class k and a value for the relation r, nokp, to the number of all instances having class k, nok. By averaging over all class-relation pairs which occur on instance level, we obtain a fulfillment degree regarding the whole KG:

m_cCol(g) = (1 / |H|) * Σ_{(k,p) ∈ H} nokp / nok

We thereby let H = {(k, p) ∈ (K × P) | k ∈ C_g ∧ ∃(x, p, o) ∈ g : p ∈ P_g^imp ∧ (x, rdf:type, k) ∈ g} be the set of all combinations of the considered classes K = {k_1, ..., k_n} and considered relations P = {p_1, ..., p_m}.

Note that there are also relations which are dedicated to the instances of a specific class, but which do not need to exist for all instances of that class. For instance, not all people need to have a relation hasChild or deathDate.21 For measuring the Column completeness, we selected only those relations for the assessment for which a value typically exists for all given instances.
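The averaging behind m_cCol can be sketched as follows; triples are modeled as plain (s, p, o) string tuples, and all class, relation, and instance names are illustrative.

```python
# Sketch of m_cCol: average, over class-relation pairs used on instance level,
# of the fraction of instances of the class that have a value for the relation.
def m_c_col(triples, class_relation_pairs):
    # Collect the instances of each class via rdf:type statements.
    instances_of = {}
    for s, p, o in triples:
        if p == "rdf:type":
            instances_of.setdefault(o, set()).add(s)
    ratios = []
    for k, r in class_relation_pairs:                 # pairs (k, p) in H
        insts = instances_of.get(k, set())            # nok instances
        if not insts:
            continue
        with_value = {s for s, p, o in triples if p == r and s in insts}
        ratios.append(len(with_value) / len(insts))   # nokp / nok
    return sum(ratios) / len(ratios) if ratios else 0.0

triples = [
    ("alice", "rdf:type", "Person"), ("bob", "rdf:type", "Person"),
    ("alice", "birthDate", "1980-01-01"),
]
print(m_c_col(triples, [("Person", "birthDate")]))  # 1 of 2 persons -> 0.5
```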

Population completeness: The Population completeness metric determines the extent to which the considered KG covers a basic population [40]. The assessment of the KG completeness w.r.t. a basic population is performed by means of a gold standard which covers both well-known entities (called the "short head", e.g., the n largest cities in the world according to the number of inhabitants) and little-known entities (called the "long tail", e.g., municipalities in Germany). We take all entities contained in our gold standard equally into account.

Let GS be the set of entities in the gold standard. Then we can define:

m_cPop(g) = |{e | e ∈ GS ∧ e ∈ E_g}| / |{e | e ∈ GS}|
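A minimal sketch of m_cPop, with hypothetical gold-standard entities covering short head and long tail:

```python
# Sketch of m_cPop: fraction of gold-standard entities contained in the KG.
# The entity names are hypothetical placeholders.
def m_c_pop(gold_entities, kg_entities):
    return len(gold_entities & kg_entities) / len(gold_entities)

gold = {"Berlin", "Paris", "Wanfried"}   # short head and long tail
kg = {"Berlin", "Paris", "London"}
print(m_c_pop(gold, kg))  # 2 of 3 gold-standard entities are in the KG
```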

3.3.3 Timeliness

Definition of dimension: Timeliness is "the extent to which the age of the data is appropriate for the task at hand" [47].

Discussion: Timeliness does not describe the creation date of a statement, but instead the time range since the last update or the last verification of the statement [39]. Due to the easy way of publishing data on the Web, data sources can be kept up-to-date more easily than traditional isolated data sources. This results in advantages for the consumer of Web data [39]. How Timeliness is measured depends on the application context: for some situations years are sufficient, while in other situations one may need days [39].

21 For an evaluation of predicting which relations are of this nature, see [1].

Definition of metric: The dimension Timeliness is determined by the criteria Timeliness frequency of the KG, Specification of the validity period of statements, and Specification of the modification date of statements. The fulfillment degree of a KG g w.r.t. the dimension Timeliness is measured by the metrics m_Freq, m_Validity, and m_Change, which are defined as follows.

Timeliness frequency of the KG: The criterion Timeliness frequency of the KG indicates how fast the KG is updated. We consider the KG RDF export here and differentiate between continuous updates, where the updates are always performed immediately, and discrete KG updates, where the updates take place in discrete time intervals. In case the KG edits are available online immediately, but the RDF export files are made available in discrete, varying update intervals, we consider the online version of the KG, since in the context of Linked Data it is sufficient that URIs are dereferenceable.

m_Freq(g) =
  1,    if continuous updates
  0.5,  if discrete periodic updates
  0.25, if discrete non-periodic updates
  0,    otherwise

Specification of the validity period of statements: Specifying the validity period of statements makes it possible to temporally limit the validity of statements. With this criterion we measure whether the KG supports the specification of start and possibly end dates of statements by providing suitable forms of representation.

m_Validity(g) =
  1, if the specification of validity periods is supported
  0, otherwise

Specification of the modification date of statements: The modification date discloses the point in time of the last verification of a statement. The modification date is typically represented via the relations schema:dateModified and dcterms:modified.

m_Change(g) =
  1, if the specification of modification dates for statements is supported
  0, otherwise


3.4 Representational Data Quality

Representational data quality "contains aspects related to the format of the data [...] and meaning of data" [47]. This category contains the two dimensions (i) Ease of understanding (i.e., regarding human-readability) and (ii) Interoperability (i.e., regarding machine-readability). The dimensions Interpretability, Representational consistency, and Concise representation, as additionally proposed by Wang et al. [47], are considered by us as part of the dimension Interoperability.

3.4.1 Ease of Understanding

Definition of dimension: The ease of understanding is "the extent to which data are clear without ambiguity and easily comprehended" [47].

Discussion: This dimension focuses on the understandability of a data source by a human data consumer. In contrast, the dimension Interoperability focuses on technical aspects. The understandability of a data source (here: a KG) can be improved by features such as descriptive labels and literals in multiple languages.

Definition of metric: The dimension Ease of understanding is determined by the criteria Description of resources, Labels in multiple languages, Understandable RDF serialization, and Self-describing URIs. The fulfillment degree of a KG g w.r.t. the dimension Ease of understanding is measured by the metrics m_Descr, m_Lang, m_uSer, and m_uURI, which are defined as follows.

Description of resources: Heath et al. [26,30] suggest describing resources in a human-understandable way, e.g., via rdfs:label or rdfs:comment. Within our framework, the criterion is measured as follows: given a sample of resources, we divide the number of resources in the KG for which at least one label or one description is provided (e.g., via rdfs:label, rdfs:comment, or schema:description) by the number of all considered resources in the local namespace.

m_Descr(g) = |{u | u ∈ U_g^local ∧ ∃(u, p, o) ∈ g : p ∈ P_lDesc}| / |{u | u ∈ U_g^local}|

P_lDesc is the set of implicitly used relations in g indicating that the value is a label or a description (e.g., P_lDesc = {rdfs:label, rdfs:comment}).
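A sketch of m_Descr over tuple-encoded triples; the choice of P_lDesc and the sample resources are illustrative.

```python
# Sketch of m_Descr: share of local resources with at least one label or
# description. P_lDesc and the sample triples are illustrative.
P_LDESC = {"rdfs:label", "rdfs:comment", "schema:description"}

def m_descr(triples, local_uris):
    described = {s for s, p, o in triples if s in local_uris and p in P_LDESC}
    return len(described) / len(local_uris)

local = {"ex:alice", "ex:bob"}
triples = [("ex:alice", "rdfs:label", "Alice"),
           ("ex:bob", "ex:knows", "ex:alice")]
print(m_descr(triples, local))  # only ex:alice is described -> 0.5
```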

Note that the result of the evaluation on the basis of entities is also interesting: DBpedia deviates noticeably here, since some entities (created via intermediate-node mappings) have no rdfs:label. We therefore keep the definition of the metric general (restricted to proprietary resources, i.e., resources in the same namespace), but perform the evaluation only on the basis of entities.

Labels in multiple languages: Resources in the KG are described in a human-readable way via labels, e.g., via rdfs:label or skos:prefLabel.22 The characteristic feature of skos:prefLabel is that this kind of label should be used at most once per resource; in contrast, rdfs:label has no cardinality restrictions, i.e., it can be used several times for a given resource. Labels are usually provided in English as the "basic language". The now introduced metric for the criterion Labels in multiple languages determines whether labels in languages other than English are provided in the KG.

m_Lang(g) =
  1, if labels are provided in English and at least one other language
  0, otherwise
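Assuming labels are available as (value, language tag) pairs, the check can be sketched as:

```python
# Sketch of m_Lang: checks whether labels exist in English and in at least
# one further language. Literals are modeled as (value, language-tag) pairs.
def m_lang(label_literals):
    langs = {tag for _value, tag in label_literals}
    return 1 if "en" in langs and len(langs) > 1 else 0

labels = [("Berlin", "en"), ("Berlino", "it")]
print(m_lang(labels))            # English plus one other language -> 1
print(m_lang([("Berlin", "en")]))  # English only -> 0
```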

Understandable RDF serialization: RDF/XML is the recommended RDF serialization format of the W3C. However, due to its syntax, RDF/XML documents are hard to read for humans. The understandability of RDF data by humans can be increased by providing RDF in other, more human-understandable serialization formats, such as N3, N-Triples, and Turtle. We measure this criterion by determining the supported serialization formats during the dereferencing of resources.

m_uSer(h_g) =
  1, if other RDF serializations than RDF/XML are available
  0, otherwise

Note that conversions from one RDF serialization format into another are easy to perform.

Self-describing URIs: Descriptive URIs contribute to a better human-readability of KG data. Sauermann et al.23 recommend using short, memorable URIs in the Semantic Web context, which are easier for humans to understand and remember than opaque URIs,24

22 Using the namespace http://www.w3.org/2004/02/skos/core#.

23 See https://www.w3.org/TR/cooluris/, requested on Mar 1, 2016.

24 For an overview of URI patterns, see https://www.w3.org/community/bpmlod/wiki/Best_practises_-_previous_notes, requested on Dec 27, 2016.


such as wdt:Q1040. The criterion Self-describing URIs is dedicated to evaluating whether self-describing URIs or generic IDs are used for the identification of resources.

m_uURI(g) =
  1,   if self-describing URIs are always used
  0.5, if self-describing URIs are partly used
  0,   otherwise

3.4.2 Interoperability

Interoperability is another dimension of the representational data quality category and subsumes Wang et al.'s aspects interpretability, representational consistency, and concise representation.

Definition of dimension: We define Interoperability along the subsumed dimensions of Wang et al.:

– Interpretability: Interpretability is "the extent to which data are in appropriate language and units and the data definitions are clear" [47].
– Representational consistency: Representational consistency is "the extent to which data are always presented in the same format and are compatible with previous data" [47].
– Concise representation: Concise representation is "the extent to which data are compactly represented without being overwhelming" [47].

Discussion regarding interpretability: In contrast to the dimension understandability, which focuses on the understandability of RDF KG data for the user as data consumer, interpretability focuses on the representation forms of information in the KG from a technical perspective. An example is the consideration whether blank nodes are used. According to Heath et al. [26], blank nodes should be avoided in the Linked Data context, since they complicate the integration of multiple data sources and since they cannot be linked to by resources of other data sources.

Discussion regarding representational consistency: In the context of Linked Data, it is best practice to reuse existing vocabulary for the creation of own RDF data. In this way, less data needs to be prepared for being published as Linked Data [26].

Discussion regarding concise representation: Heath et al. [26] made the observation that the RDF features (i) RDF reification,25 (ii) RDF collections and RDF containers, and (iii) blank nodes are not very widely used in the Linked Open Data context. According to Heath et al., those features should be avoided in order to simplify the processing of data on the client side. Even the querying of the data via SPARQL may become complicated if RDF reification, RDF collections, and RDF containers are used. We agree with that, but also point out that reification (implemented via RDF standard reification, n-ary relations, singleton properties, or named graphs) is inevitably necessary for making statements about statements.

25 In the literature, it is often not differentiated between reification in the general sense and reification in the sense of the specific

Definition of metric: The dimension Interoperability is determined via the following criteria:

– Avoiding blank nodes and RDF reification
– Provisioning of several serialization formats
– Using external vocabulary
– Interoperability of proprietary vocabulary

The fulfillment degree of a KG g w.r.t. the dimension Interoperability is measured by the metrics m_Reif, m_iSerial, m_extVoc, and m_propVoc, which are defined as follows.

Avoiding blank nodes and RDF reification: The use of RDF blank nodes, RDF reification, RDF containers, and RDF lists is often considered ambivalent. On the one hand, these RDF features are not very common, and they complicate the processing and querying of RDF data [26,30]. On the other hand, they are necessary in certain situations, e.g., when statements about statements should be made. We measure the criterion by evaluating whether blank nodes and RDF reification are used.

m_Reif(g) =
  1,   if neither blank nodes nor RDF reification are used
  0.5, if either blank nodes or RDF reification are used
  0,   otherwise

proposal described in the RDF standard (Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, available online at http://www.w3.org/TR/rdf-schema/, requested on Sep 2, 2016). For more information about reification and its implementation possibilities, we refer the reader to [27]. In this article, we use the term reification by default for the general sense, and standard reification or RDF reification for referring to the modeling of reification according to the RDF standard.

Provisioning of several serialization formats: The interpretability of the RDF data of a KG is increased if, besides the serialization standard RDF/XML, further serialization formats are supported for URI dereferencing.

m_iSerial(h_g) =
  1,   if RDF/XML and further formats are supported
  0.5, if only RDF/XML is supported
  0,   otherwise

Using external vocabulary: Using common vocabulary for representing and describing the KG data allows representing resources and relations between resources in the Web of Data in a unified way. This increases the interoperability of data [26,30] and allows comfortable data integration. We measure the criterion of using external vocabulary as the ratio of the number of triples with an external vocabulary relation in predicate position to the number of all triples in the KG.

m_extVoc(g) = |{(s, p, o) | (s, p, o) ∈ g ∧ p ∈ P_g^external}| / |{(s, p, o) ∈ g}|
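A minimal sketch of m_extVoc; it assumes prefixed predicate strings and approximates P_g^external by a prefix test against the KG's own namespaces:

```python
# Sketch of m_extVoc: fraction of triples whose predicate stems from an
# external vocabulary. The prefix test is a simplification for illustration.
def m_ext_voc(triples, local_prefixes):
    external = [t for t in triples
                if not any(t[1].startswith(p) for p in local_prefixes)]
    return len(external) / len(triples)

triples = [("ex:a", "foaf:name", "A"),
           ("ex:a", "ex:internalRel", "ex:b")]
print(m_ext_voc(triples, ("ex:",)))  # 1 of 2 predicates is external -> 0.5
```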

Interoperability of proprietary vocabulary: Linking on schema level means linking the proprietary vocabulary to external vocabulary. Proprietary vocabulary comprises the classes and relations which were defined in the KG itself. The interlinking to external vocabulary guarantees a high degree of interoperability [26]. We measure the interlinking on schema level by calculating the ratio to which classes and relations have at least one equivalency link (e.g., owl:sameAs, owl:equivalentProperty, or owl:equivalentClass) to classes and relations, respectively, of other data sources.

m_propVoc(g) = |{x ∈ P_g ∪ C_g | ∃(x, p, o) ∈ g : p ∈ P_eq ∧ o ∈ U ∧ o ∈ U_g^ext}| / |P_g ∪ C_g|

where P_eq = {owl:sameAs, owl:equivalentProperty, owl:equivalentClass}, and U_g^ext consists of all URIs in U_g which are external to the KG g, which means that h_g is not responsible for resolving these URIs.

3.5 Accessibility Category

Accessibility data quality refers to aspects of how data can be accessed. This category contains the three dimensions:

– Accessibility,
– Licensing, and
– Interlinking.

Wang's dimension access security is considered by us as not relevant in the Linked Open Data context, as we only take open data sources into account.

In the following, we go into the details of the mentioned data quality dimensions.

3.5.1 Accessibility

Definition of dimension: Accessibility is "the extent to which data are available or easily and quickly retrievable" [47].

Discussion: Wang et al.'s definition of Accessibility contains the aspects availability, response time, and data request. They are defined as follows:

1. The availability "of a data source is the probability that a feasible query is correctly answered in a given time range" [39]. According to Naumann [39], availability is an important quality aspect for data sources on the Web, since in the case of integrated systems (with federated queries) usually all data sources need to be available in order to execute the query. There can be different influencing factors regarding the availability of data sources, such as the time of day, the worldwide distribution of servers, planned maintenance work, and the caching of data. Linked Data sources can be available as SPARQL endpoints (for performing complex queries on the data) and via HTTP URI dereferencing. We need to consider both possibilities for this DQ dimension.

2. The response time characterizes the delay between the point in time when the query was submitted and the point in time when the query response is received [11]. Note that the response time depends on empirical factors, such as the query, the size of the indexed data, the data structure, the used triple store, the hardware, and so on. We do not consider the response time in our evaluations, since obtaining a comprehensive result here is hard.

3. In the context of Linked Data, data requests can be made (i) on SPARQL endpoints, (ii) on RDF dumps (export files), and (iii) on Linked Data APIs.

Definition of metric: We define the metric for the dimension Accessibility by means of metrics for the following criteria:

– Dereferencing possibility of resources
– Availability of the KG
– Provisioning of public SPARQL endpoint
– Provisioning of an RDF export
– Support of content negotiation
– Linking HTML sites to RDF serializations
– Provisioning of KG metadata

The fulfillment degree of a KG g w.r.t. the dimension Accessibility is measured by the metrics m_Deref, m_Avai, m_SPARQL, m_Export, m_Negot, m_HTMLRDF, and m_Meta, which are defined as follows.

Dereferencing possibility of resources: One of the Linked Data principles [9] is the dereferencing possibility of resources: URIs must be resolvable via HTTP requests, and useful information should thereby be returned. We assess the dereferencing possibility of the resources in the KG by analyzing, for each URI in the sample set (here: all URIs U_g), the HTTP response status code and by evaluating whether RDF data is returned. A successful dereferencing of a resource is given if HTTP status code 200 and an RDF document are returned.

m_Deref(h_g) = |dereferencable(U_g)| / |U_g|
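In practice this check issues an HTTP GET with an RDF Accept header for each sampled URI; the sketch below separates the network part from the pure success test and feeds it illustrative (status, content type) pairs instead of live responses.

```python
# Sketch of m_Deref: a dereferencing attempt counts as successful if the
# server answers with HTTP 200 and an RDF content type. The response pairs
# below are illustrative stand-ins for real HTTP results.
RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def is_successful(status, content_type):
    return status == 200 and content_type.split(";")[0].strip() in RDF_TYPES

def m_deref(responses):
    """responses: list of (status, content_type) pairs, one per sampled URI."""
    ok = sum(1 for st, ct in responses if is_successful(st, ct))
    return ok / len(responses)

sample = [(200, "text/turtle"), (200, "text/html"), (404, "text/html")]
print(m_deref(sample))  # 1 of 3 URIs dereference to RDF
```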

Availability of the KG: The Availability of the KG criterion indicates the uptime of the KG. It is an essential criterion in the context of Linked Data, since in the case of an integrated or federated query mostly all data sources need to be available [39]. We measure the availability of a KG by monitoring the ability to dereference its URIs over a period of time. This monitoring process can be done with the help of a monitoring tool such as Pingdom.26

m_Avai(h_g) = (number of successful requests) / (number of all requests)

Provisioning of public SPARQL endpoint: SPARQL endpoints allow the user to perform complex queries (potentially including many instances, classes, and relations) on the KG. This criterion indicates whether an official SPARQL endpoint is publicly available. There might be additional restrictions of this SPARQL endpoint, such as a maximum number of requests per time slice or a maximum runtime of a query; however, we do not measure these restrictions here.

26 See http://pingdom.com, requested on Mar 1, 2016.

m_SPARQL(h_g) =
  1, if a SPARQL endpoint is publicly available
  0, otherwise

Provisioning of an RDF export: If there is no public SPARQL endpoint available, or the restrictions of this endpoint are so strict that the user does not use it, an RDF export dataset (RDF dump) can often be used instead. This dataset can be used to set up a local private SPARQL endpoint. The criterion indicates whether an RDF export dataset is officially available.

m_Export(h_g) =
  1, if an RDF export is available
  0, otherwise

Support of content negotiation: Content negotiation (CN) allows the server to return RDF documents in the desired RDF serialization format during the dereferencing of resources. The HTTP protocol allows the client to specify the desired content type (e.g., RDF/XML) in the HTTP request, and the server to specify the returned content type in the HTTP response header (e.g., application/rdf+xml). In this way, the desired and the provided content types are matched as far as possible. It can happen that the server does not provide the desired content type. Moreover, it may happen that the server returns an incorrect content type; this may lead to serialized RDF data not being processed further. An example is RDF data which is declared as text/plain [26]. Hogan et al. [29] therefore propose to let KGs return the most specific content type possible. We measure the Support of content negotiation by dereferencing resources with different RDF serialization formats as the desired content type and by comparing the accept header of the HTTP request with the content type of the HTTP response.

m_Negot(h_g) =
  1,   if CN is supported and correct content types are returned
  0.5, if CN is supported but wrong content types are returned
  0,   otherwise
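The comparison of requested and returned content types can be sketched as follows; aggregating per-request matches into the three metric values (all correct, partly correct, none) is our simplified reading of the metric.

```python
# Sketch of the content negotiation check: the requested Accept header is
# compared with the Content-Type actually returned. The pairs are
# illustrative, and the all/any aggregation is a simplifying assumption.
def m_negot(exchanges):
    """exchanges: list of (accept_header, returned_content_type) pairs."""
    matches = [accept.split(";")[0] == returned.split(";")[0]
               for accept, returned in exchanges]
    if all(matches):
        return 1      # CN supported, correct content types returned
    if any(matches):
        return 0.5    # CN supported, but sometimes wrong content types
    return 0

good = [("text/turtle", "text/turtle"),
        ("application/rdf+xml", "application/rdf+xml")]
mixed = [("text/turtle", "text/turtle"),
         ("application/rdf+xml", "text/plain")]
print(m_negot(good), m_negot(mixed))  # -> 1 0.5
```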

Linking HTML sites to RDF serializations: Heath et al. [26] suggest linking any HTML description of a resource to RDF serializations of this resource in order to make the discovery of the corresponding RDF data easier (for Linked Data aware applications). For that purpose, the so-called autodiscovery pattern can be included in the HTML header. This pattern consists of the phrase link rel="alternate", the indication of the provided RDF content type, and a link to the RDF document.27 We measure the linking of HTML pages to RDF documents (i.e., resource representations) by evaluating whether the HTML representations of the resources contain links as described.

m_HTMLRDF(h_g) =
  1, if the autodiscovery pattern is used at least once
  0, otherwise
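A sketch of detecting the autodiscovery pattern with a regular expression; the pattern only covers the attribute order shown in the example of footnote 27 and is therefore a simplification.

```python
# Sketch of the autodiscovery check: look for a <link rel="alternate" ...>
# element pointing to an RDF/XML serialization in the HTML head. The
# regex-based detection and the sample page are simplifications.
import re

AUTODISCOVERY = re.compile(
    r'<link[^>]*rel=["\']alternate["\'][^>]*type=["\']application/rdf\+xml["\']',
    re.IGNORECASE)

def m_html_rdf(html_pages):
    return 1 if any(AUTODISCOVERY.search(page) for page in html_pages) else 0

page = ('<html><head><link rel="alternate" type="application/rdf+xml" '
        'href="company.rdf"/></head><body>...</body></html>')
print(m_html_rdf([page]))  # pattern found at least once -> 1
```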

Provisioning of KG metadata: In the light of the Semantic Web vision, where agents select and make use of appropriate data sources on the Web, also the meta-information about KGs needs to be available in a machine-readable format. The two important mechanisms to specify metadata about KGs are (i) using semantic sitemaps and (ii) using the VoID vocabulary28 [26]. For instance, the URI of the SPARQL endpoint can be assigned via void:sparqlEndpoint, and the RDF export URL can be specified with void:dataDump. Such metadata can be added as additional facts to the KG, or it can be provided as a separate VoID file. We measure the Provisioning of KG metadata by evaluating whether machine-readable metadata about the KG is available. Note that the provisioning of licensing information in a machine-readable format (which is also meta-information about the KG) is considered later on, in the data quality dimension License.

m_Meta(g) =
  1, if machine-readable metadata about g is available
  0, otherwise

3.5.2 License

Definition of dimension: Licensing is defined as "the granting of permission for a consumer to re-use a dataset under defined conditions" [49].

Discussion: The publication of licensing information about KGs is important for using KGs without legal concerns, especially in commercial settings. Creative Commons (CC)29 publishes several standard licensing contracts which define rights and obligations. These contracts are also popular in the Linked Data context. The most frequent licenses for Linked Data are CC-BY, CC-BY-SA, and CC0 [31]. CC-BY30 requires specifying the source of the data. CC-BY-SA31 requires, in addition, that if the data is re-published, it is published under the same legal conditions. CC032 defines the respective data as public domain and without any restrictions.

27 An example is <link rel="alternate" type="application/rdf+xml" href="company.rdf">.

28 See the namespace http://www.w3.org/TR/void/.

29 See http://creativecommons.org, requested on Mar 1, 2016.

It is noteworthy that most data sources in the Linked Open Data cloud do not provide any licensing information [31], which makes it difficult to use the data in commercial settings. Even if data is published under CC-BY or CC-BY-SA, it is often not used, since companies point to uncertainties regarding these contracts.

Definition of metric: The dimension License is determined by the criterion Provisioning machine-readable licensing information. The fulfillment degree of a KG g w.r.t. the dimension License is measured by the metric m_macLicense, which is defined as follows.

Provisioning machine-readable licensing information: Licenses define the legal frameworks under which the KG data may be used. Providing machine-readable licensing information allows users and applications to be aware of the license and to use the data of the KG in accordance with the legal possibilities [26,30]. Licenses can be specified in RDF via relations such as cc:license,33 dcterms:license, or dcterms:rights. The licensing information can be specified either in the KG as additional facts or separately in a VoID file. We measure the criterion by evaluating whether licensing information is available in a machine-readable format.

m_macLicense(g) =
  1, if machine-readable licensing information is available
  0, otherwise

3.5.3 Interlinking

Definition of dimension: Interlinking is the extent "to which entities that represent the same concept are

30 See https://creativecommons.org/licenses/by/4.0/, requested on Mar 1, 2016.

31 See https://creativecommons.org/licenses/by-sa/4.0/, requested on Mar 1, 2016.

32 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Mar 3, 2016.

33 Using the namespace http://creativecommons.org/ns#.


linked to each other, be it within or between two or more data sources" [49].

Discussion: According to Bizer et al. [12], DBpedia established itself as a hub in the Linked Data cloud due to its intensive interlinking with other KGs. This interlinking is, on the instance level, usually established via owl:sameAs links. However, according to Halpin et al. [24], those owl:sameAs links do not always interlink identical entities in reality. According to the authors, one reason might be that the KGs provide entries at different granularities. For instance, the DBpedia resource for Berlin (dbo:Berlin) links via owl:sameAs relations to three different resources in the KG GeoNames,34 namely (i) Berlin the capital,35 (ii) Berlin the state,36 and (iii) Berlin the city.37 Moreover, owl:sameAs relations are often created automatically by some mapping function; due to mapping errors, the precision is often below 100% [18].

Definition of metric: The dimension Interlinking is determined by the criteria:

– Interlinking via owl:sameAs
– Validity of external URIs

The fulfillment degree of a KG g w.r.t. the dimension Interlinking is measured by the metrics m_Inst and m_URIs, which are defined as follows.

Interlinking via owl:sameAs: The fourth Linked Data principle according to Berners-Lee [8] is the interlinking of data resources, so that the user can explore further information. According to Hogan et al. [30], the interlinking has a side effect: it does not only connect otherwise isolated KGs, but the number of incoming links of a KG also indicates the importance of the KG in the Linked Open Data cloud. We measure the interlinking on instance level38 by calculating the extent to which instances have at least one owl:sameAs link to external KGs.

34 See http://www.geonames.org, requested on Dec 31, 2016.

35 See http://www.geonames.org/2950159/berlin.html, requested on Feb 4, 2017.

36 See http://www.geonames.org/2950157/land-berlin.html, requested on Feb 4, 2017.

37 See http://www.geonames.org/6547383/berlin-stadt.html, requested on Feb 4, 2017.

38 The interlinking on schema level is already measured via the criterion Interoperability of proprietary vocabulary.

m_Inst(g) = |{x ∈ I_g \ (P_g ∪ C_g) | ∃(x, owl:sameAs, y) ∈ g ∧ y ∈ U_g^ext}| / |I_g \ (P_g ∪ C_g)|
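A sketch of m_Inst over tuple-encoded triples; the instance set and the externality test are illustrative placeholders.

```python
# Sketch of m_Inst: share of instances with at least one owl:sameAs link to
# an external KG. Instances and the external-URI test are illustrative.
def m_inst(triples, instances, is_external):
    linked = {s for s, p, o in triples
              if p == "owl:sameAs" and s in instances and is_external(o)}
    return len(linked) / len(instances)

triples = [("dbr:Berlin", "owl:sameAs", "http://www.geonames.org/2950159/"),
           ("dbr:Karlsruhe", "rdf:type", "dbo:City")]
instances = {"dbr:Berlin", "dbr:Karlsruhe"}
print(m_inst(triples, instances,
             lambda u: u.startswith("http://www.geonames.org/")))
# only dbr:Berlin is interlinked -> 0.5
```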

Validity of external URIs: The considered KG may contain outgoing links referring to RDF resources or to Web documents (non-RDF data). The linking to RDF resources is usually enabled by owl:sameAs, owl:equivalentProperty, and owl:equivalentClass relations; Web documents are linked via relations such as foaf:homepage and foaf:depiction. Linking to external resources always entails the problem that those links might become invalid over time. This can have different causes; for instance, the URIs may no longer be available. We measure the Validity of external URIs by evaluating the URIs from a URI sample set w.r.t. whether there is a timeout, a client error (HTTP response 4xx), or a server error (HTTP response 5xx).

m_URIs(g) = |{x ∈ A | resolvable(x)}| / |A|

where A = {y | ∃(x, p, y) ∈ g : p ∈ P_eq ∧ x ∈ U_g \ (C_g ∪ P_g) ∧ x ∈ U_g^local ∧ y ∈ U_g^ext} and resolvable(x) returns true if HTTP status code 200 is returned. P_eq is the set of relations used for linking to external sources; examples of such relations are owl:sameAs and foaf:homepage.

In case of an empty set A, the metric evaluates to 1.
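The metric, including the empty-set convention, can be sketched as follows; the HTTP statuses are illustrative stand-ins for real resolution attempts (None modeling a timeout).

```python
# Sketch of m_URIs: external URIs count as resolvable on HTTP 200; timeouts
# (modeled as None), 4xx, and 5xx responses count as failures. The status
# values are illustrative stand-ins for real HTTP requests.
def m_uris(status_by_uri):
    if not status_by_uri:          # empty set A: metric evaluates to 1
        return 1.0
    ok = sum(1 for status in status_by_uri.values() if status == 200)
    return ok / len(status_by_uri)

statuses = {"http://example.org/ok": 200,
            "http://example.org/gone": 404,
            "http://example.org/timeout": None}
print(m_uris(statuses), m_uris({}))  # 1 of 3 resolvable, and 1.0 for empty A
```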

3.6 Conclusion

In this section, we provided 34 DQ criteria which can be applied in the form of DQ metrics to KGs in order to assess those KGs w.r.t. data quality. The DQ criteria are classified into 11 DQ dimensions. These dimensions are themselves grouped into 4 DQ categories. In total, we have the following picture:

– Intrinsic category

  ∗ Accuracy
    ∗ Syntactic validity of RDF documents
    ∗ Syntactic validity of literals
    ∗ Semantic validity of triples
  ∗ Trustworthiness
    ∗ Trustworthiness on KG level
    ∗ Trustworthiness on statement level
    ∗ Using unknown and empty values
  ∗ Consistency
    ∗ Check of schema restrictions during insertion of new statements
    ∗ Consistency of statements w.r.t. class constraints
    ∗ Consistency of statements w.r.t. relation constraints

– Contextual category

  ∗ Relevancy
    ∗ Creating a ranking of statements
  ∗ Completeness
    ∗ Schema completeness
    ∗ Column completeness
    ∗ Population completeness
  ∗ Timeliness
    ∗ Timeliness frequency of the KG
    ∗ Specification of the validity period of statements
    ∗ Specification of the modification date of statements

– Representational data quality

  ∗ Ease of understanding
    ∗ Description of resources
    ∗ Labels in multiple languages
    ∗ Understandable RDF serialization
    ∗ Self-describing URIs
  ∗ Interoperability
    ∗ Avoiding blank nodes and RDF reification
    ∗ Provisioning of several serialization formats
    ∗ Using external vocabulary
    ∗ Interoperability of proprietary vocabulary

– Accessibility category

  ∗ Accessibility
    ∗ Dereferencing possibility of resources
    ∗ Availability of the KG
    ∗ Provisioning of public SPARQL endpoint
    ∗ Provisioning of an RDF export
    ∗ Support of content negotiation
    ∗ Linking HTML sites to RDF serializations
    ∗ Provisioning of KG metadata
  ∗ License
    ∗ Provisioning machine-readable licensing information
  ∗ Interlinking
    ∗ Interlinking via owl:sameAs
    ∗ Validity of external URIs

4. Selection of KGs

We consider the following KGs for our comparative evaluation:

– DBpedia: DBpedia39 is the most prominent KG in the LOD cloud [4]. The project was initiated by researchers from the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software. Since the first public release in 2007, DBpedia has been updated roughly once a year.40 By means of a dedicated open source extraction framework, DBpedia is created from information contained in Wikipedia, such as infobox tables, categorization information, geo-coordinates, and external links. Due to its role as the hub of the LOD cloud, DBpedia contains many links to other datasets in the LOD cloud, such as Freebase, OpenCyc, UMBEL,41 GeoNames, MusicBrainz,42 CIA World Factbook,43 DBLP,44 Project Gutenberg,45 DBtune Jamendo,46 Eurostat,47 Uniprot,48 and Bio2RDF.49,50 DBpedia has been used extensively in the Semantic Web research community, but has also become relevant in commercial settings: for instance, companies such as the BBC [33] and the New York Times [41] use DBpedia to organize their content. The version of DBpedia we analyzed is 2015-04.

39See httpdbpediaorg requested on Nov 1 201640There is also DBpedia live which started in 2009 and which

gets updated when Wikipedia is updated See httplivedbpediaorg requested on Nov 1 2016 Note however thatDBpedia live only provides a restricted set of relations compared toDBpedia Also the provisioning of data varies a lot While for sometime ranges DBpedia live provides data for each hour for other timeranges DBpedia live data is only available once a month

41See httpumbelorg requested on Dec 31 201642See httpmusicbrainzorg requested on Dec 31

201643See httpswwwciagovlibrary

publicationsthe-world-factbook requested on Dec31 2016

44See httpwwwdblporg requested on Dec 31 201645See httpswwwgutenbergorg requested on Dec

31 201646See httpdbtuneorgjamendo requested on Dec

31 201647See httpeurostatlinked-statisticsorg

requested on Dec 31 201648See httpwwwuniprotorg requested on Dec 31

201649See httpbio2rdforg requested on Dec 31 201650See a complete list of the links on the websites describing the sin-

gle DBpedia versions such as httpdownloadsdbpediaorg2016-04links (requested on Nov 1 2016)

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

– Freebase: Freebase^51 is a KG announced by Metaweb Technologies, Inc. in 2007 and was acquired by Google Inc. on July 16, 2010. In contrast to DBpedia, Freebase had provided an interface that allowed end-users to contribute to the KG by editing structured data. Besides user-contributed data, Freebase integrated data from Wikipedia, NNDB,^52 FMD,^53 and MusicBrainz.^54 Freebase uses a proprietary graph model for storing also complex statements. Freebase shut down its services completely on August 31, 2016; only the latest data dump is still available. Wikimedia Deutschland and Google integrate Freebase data into Wikidata via the Primary Sources Tool.^55 Further information about the migration from Freebase to Wikidata is provided in [44]. We analyzed the latest Freebase version, as of March 2015.

– OpenCyc: The Cyc^56 project was started in 1984 by the industry research and development consortium Microelectronics and Computer Technology Corporation. The aim of Cyc is to store – in a machine-processable way – millions of common sense facts, such as "Every tree is a plant." The main focus of Cyc has been on inferencing and reasoning. Since Cyc is proprietary, a smaller version of the KG, called OpenCyc,^57 was released under the open source Apache license, Version 2. In July 2006, ResearchCyc^58 was published for the research community, containing more facts than OpenCyc. We did not consider Cyc and ResearchCyc, since those KGs do not meet the chosen requirements, namely that the KGs are freely available and freely usable in any context. The version of OpenCyc we analyzed is 2012-05-10.

– Wikidata: Wikidata^59 is a project of Wikimedia Deutschland which started on October 30, 2012. The aim of the project is to provide data which can be used by any Wikimedia project, including Wikipedia. Wikidata does not only store facts, but also the corresponding sources, so that the validity of facts can be checked. Labels, aliases, and descriptions of entities in Wikidata are provided in almost 400 languages. Wikidata is a community effort, i.e., users collaboratively add and edit information. Also, the schema is maintained and extended based on community agreements. Wikidata is currently growing considerably due to the integration of Freebase data [44]. The version of Wikidata we analyzed is 2015-10.

51 See http://freebase.com, requested on Nov 1, 2016.
52 See http://www.nndb.com, requested on Dec 31, 2016.
53 See http://www.fashionmodeldirectory.com, requested on Dec 31, 2016.
54 See http://musicbrainz.org, requested on Dec 31, 2016.
55 See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, requested on Apr 8, 2016.
56 See http://www.cyc.com, requested on Dec 31, 2016.
57 See http://www.opencyc.org, accessed on Nov 1, 2016.
58 See http://research.cyc.com, requested on Dec 31, 2016.
59 See http://wikidata.org, accessed on Nov 1, 2016.

– YAGO: YAGO^60 – Yet Another Great Ontology – has been developed at the Max Planck Institute for Computer Science in Saarbrücken since 2007. YAGO comprises information extracted from Wikipedia (such as information from the categories, redirects, and infoboxes), WordNet [19] (such as information about synsets and hyponymies), and GeoNames.^61 The version of YAGO we analyzed is YAGO3, which was published in March 2015.

5. Comparison of KGs

5.1. Key Statistics

In the following, we present statistical commonalities and differences of the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. We thereby use the following key statistics:

– Number of triples
– Number of classes
– Number of relations
– Distribution of classes w.r.t. the number of their corresponding instances
– Coverage of classes with at least one instance per class
– Covered domains w.r.t. entities
– Number of entities
– Number of instances
– Number of entities per class
– Number of unique subjects
– Number of unique predicates
– Number of unique objects

In Section 7.2, we provide an overview of related work w.r.t. those key statistics.

60 See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads, accessed on Nov 1, 2016.

61 See http://www.geonames.org, requested on Dec 31, 2016.


5.1.1. Triples
Ranking of KGs w.r.t. number of triples: The number of triples (see Table 2) differs considerably between the KGs. Freebase is the largest KG with over 3.1B triples, while OpenCyc is the smallest KG with only 2.4M triples. The large size of Freebase can be traced back to the fact that large data sets, such as MusicBrainz, have been integrated into this KG. OpenCyc, in contrast, has been built purely manually by experts. In general, this indicates a correlation between the way of building up a KG and its size.

Size differences between DBpedia and YAGO: As both DBpedia and YAGO were created automatically by extracting semantically-structured information from Wikipedia, the significant difference between their sizes – in terms of triples – is particularly noteworthy. We can mention the following reasons: YAGO integrates the statements from the different language versions of Wikipedia in one single KG, while for the canonical DBpedia dataset (which is used in our evaluations) solely the English Wikipedia was used as information source. Besides that, YAGO contains contextual information and detailed provenance information. Contextual information is, for instance, the anchor texts of all links within Wikipedia; for representing the anchor texts, the relation yago:hasWikipediaAnchorText (330M triples in total) is used. The provenance information of single statements is stored in a reified form; in particular, the relations yago:extractionSource (161.2M triples) and yago:extractionTechnique (176.2M triples) are applied for this purpose.

Influence of reification on the number of triples: DBpedia, Freebase, Wikidata, and YAGO use some form of reification. Reification, in general, describes the possibility of making statements about statements. While reification has an influence on the number of triples for DBpedia, Freebase, and Wikidata, the number of triples in YAGO is not influenced by reification, since data is here provided in N-Quads.^62 This style of reification is called Named Graph [27]. The additional column (in comparison to triples) contains a unique ID of the statement, by which the triple becomes identifiable. For backward compatibility, the ID is commented out and therefore not imported into the triple store. Note, however, that transforming N-Quads to N-Triples leads to a high number of unique subjects concerning the set of all triples.

62 The idea of N-Quads is based on the assignment of triples to different graphs. YAGO uses N-Quads to identify statements per ID.

In case of DBpedia, Freebase, and Wikidata, reification is implemented by means of n-ary relations. An n-ary relation denotes a relation between more than two resources and is implemented via additional intermediate nodes, since in RDF only binary statements can be modeled [16,27]. In Freebase and DBpedia, data is mostly provided in the form of plain N-Triples, and n-ary relations are only used for data of higher arity.^63 Wikidata, in contrast, has the peculiarity that not only is every statement expressed with the help of an n-ary relation, but that, in addition, each statement is instantiated with wdo:Statement. This leads to about 74M additional instances, which is about one tenth of all triples in Wikidata.
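To make the effect of this modeling choice concrete, the following sketch contrasts a directly asserted fact with its n-ary, statement-instantiated counterpart. It is a toy illustration over plain tuples with hypothetical identifiers, not the actual Wikidata export format.

```python
# Sketch: how n-ary reification with instantiated statement nodes
# (Wikidata-style) inflates triple and instance counts.
# All identifiers are hypothetical placeholders.

def direct(s, p, o):
    """One fact as a single direct triple."""
    return [(s, p, o)]

def nary(s, p, o, stmt_id):
    """The same fact via an intermediate statement node that is
    additionally instantiated as wdo:Statement."""
    return [
        (s, p + "s", stmt_id),                   # link to the statement node
        (stmt_id, p + "v", o),                   # the actual value
        (stmt_id, "rdf:type", "wdo:Statement"),  # one extra instance per statement
    ]

plain = direct("wdt:Q76", "wdt:P31", "wdt:Q5")
reified = nary("wdt:Q76", "wdt:P31", "wdt:Q5", "wdt:Q76S123")

print(len(plain), len(reified))  # 1 vs. 3 triples for the same fact
```

One direct fact becomes three triples and one additional instance, which is exactly why statement-level reification shows up in both the triple count and the instance count.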

5.1.2. Classes
Methods for counting classes: The number of classes can be calculated in different ways: classes can be identified via rdfs:Class and owl:Class relations, or via rdfs:subClassOf relations.^64 Since Freebase does not provide any class hierarchy with rdfs:subClassOf relations, and since Wikidata does not instantiate classes explicitly as classes but instead only uses "subclass of" (wdt:P279) relations, the method of calculating the number of classes depends on the considered KG.

Ranking of KGs w.r.t. number of classes: Our evaluations revealed that YAGO contains the highest number of classes of all considered KGs; DBpedia, in contrast, has the fewest (see Table 2).
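The two counting methods just described can be sketched on a handful of schematic triples (toy identifiers, not an excerpt of any of the five KGs):

```python
# Sketch: two ways of counting classes in a KG, over schematic
# (subject, predicate, object) triples. Toy data only.

triples = [
    ("ex:Person", "rdf:type",        "owl:Class"),   # explicitly declared class
    ("ex:Artist", "rdfs:subClassOf", "ex:Person"),   # class only visible via hierarchy
    ("ex:obama",  "rdf:type",        "ex:Person"),
]

# Method 1: classes explicitly declared via rdfs:Class / owl:Class
declared = {s for s, p, o in triples
            if p == "rdf:type" and o in ("rdfs:Class", "owl:Class")}

# Method 2: classes appearing in rdfs:subClassOf relations
in_hierarchy = {t for s, p, o in triples if p == "rdfs:subClassOf"
                for t in (s, o)}

print(sorted(declared))      # only ex:Person is declared explicitly
print(sorted(in_hierarchy))  # the subclass axiom also reveals ex:Artist
```

Depending on which method a KG supports, the resulting class count differs, which is why the counting method has to be chosen per KG.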

Number of classes in YAGO and DBpedia: How does this gap between DBpedia and YAGO with respect to the number of classes arise, although both KGs were created automatically based on Wikipedia? For YAGO, the classes are extracted from the categories in Wikipedia, while the hierarchy of the classes is built with the help of WordNet synset relations. The DBpedia ontology, in contrast, is very small, since it is created manually based on the most frequently used infobox

63 In Freebase, Compound Value Types are used for reification [44]. In DBpedia, it is called Intermediate Node Mapping; see http://mappings.dbpedia.org/index.php/Template:IntermediateNodeMapping (requested on Dec 31, 2016).

64 The number of classes in a KG may also be calculated by taking all entity type relations (rdf:type and, in case of Wikidata, "instance of" (wdt:P31)) on the instance level into account. However, this would result only in a lower-bound estimation, as classes which have no instances would not be considered.


Fig. 1. Coverage of classes having at least one instance.

templates in Wikipedia. Besides those 736 classes, the DBpedia KG contains a further 444,895 classes which originate from the imported YAGO classes and which are published in the namespace yago. Those YAGO classes are – like the DBpedia ontology classes – interconnected via rdfs:subClassOf to form a taxonomy. In the evaluation of DBpedia, the YAGO classes are ignored, as they do not belong to the DBpedia ontology given as OWL file.

Coverage of classes with at least one instance: Fig. 1 shows for each KG the extent to which classes are instantiated, that is, for how many classes at least one instance exists. YAGO exhibits the highest coverage rate (82.6%), although it contains the highest number of classes among the KGs. This can be traced back to the fact that YAGO classes are chosen by a heuristic that considers Wikipedia leaf categories, which tend to have instances [43]. OpenCyc (with 6.5%) and Wikidata (5.4%) come last in the ranking. Wikidata has the second highest number of classes in total (see Table 2), out of which relatively few are used on the instance level. Note, however, that in some scenarios solely the schema-level information (including classes) of KGs is necessary, so that the low coverage of instances by classes is not necessarily an issue.
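The coverage rate itself is straightforward to compute; a minimal sketch over toy data (hypothetical identifiers) is:

```python
# Sketch: coverage of classes having at least one instance,
# computed over schematic triples. Toy data only.

triples = [
    ("ex:obama",  "rdf:type", "ex:Person"),
    ("ex:berlin", "rdf:type", "ex:City"),
]
classes = {"ex:Person", "ex:City", "ex:Organisation", "ex:Event"}

# classes that occur as the object of at least one rdf:type triple
instantiated = {o for s, p, o in triples if p == "rdf:type"} & classes
coverage = 100 * len(instantiated) / len(classes)

print(f"{coverage:.1f}% of classes have at least one instance")  # 50.0%
```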

Correlation between number of classes and number of instances: In Fig. 2 we can see a histogram of the classes with respect to the number of instances per class. That is, for each KG we can spot how many classes have a high number of instances and how many classes have a low number of instances. Note the logarithmic scale on both axes. The curves seem to follow power law distributions. For DBpedia, the line decreases consistently for the first 250 classes, before it decreases more than exponentially beyond class 250.

Table 1
Percentage of considered entities per KG for covered domains

                  DB    FB    OC    WD    YA
Reach of method   88%   92%   81%   41%   82%

5.1.3. Domains
All considered KGs are cross-domain, meaning that a variety of domains are covered in those KGs. However, the KGs often cover the single domains to a different degree. Tartir [45] proposed measuring the covered domains of ontologies by determining the usage degree of the corresponding classes: the number of instances belonging to one or more subclasses of the respective domain is compared to the number of all instances. In our work, however, we decided to evaluate the coverage of domains concerning the classes per KG via manual assignments of the most frequently used classes to the domains people, media, organizations, geography, and biology.^65 This list of domains was created by aggregating the most frequent domains in Freebase.

The manual assignment of classes to domains is necessary in order to obtain a consistent assignment of the classes to the domains across all considered KGs. Otherwise, the same classes in different KGs may be assigned to different domains. Moreover, in some KGs classes may otherwise appear in various domains simultaneously. For instance, the Freebase classes freebase:music.artist and freebase:people.person overlap in terms of their instances, and multiple domains (such as music and people) might be assigned to them.

As the reader can see in Table 1, our method to determine the coverage of domains – and hence the reach of our evaluation – includes about 80% of all entities of each KG, except for Wikidata. It is calculated as the ratio of the number of unique entities of all considered domains of a given KG divided by the number of all entities of this KG.^66 If the ratio were 100%, we would have been able to assign all entities of a KG to the chosen domains.
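The reach computation, including the set union mentioned in footnote 66 (entities may belong to several domains, so the union is used rather than the sum), can be sketched as follows with toy entity identifiers:

```python
# Sketch: "reach" of the domain assignment, as described above.
# Unique entities over all domains (set union) divided by all
# entities of the KG. Toy data with hypothetical entity IDs.

domains = {
    "people":    {"e1", "e2", "e3"},
    "media":     {"e3", "e4"},        # e3 belongs to two domains
    "geography": {"e5"},
}
all_entities = {"e1", "e2", "e3", "e4", "e5", "e6"}

# union, not sum: e3 must be counted only once
covered = set().union(*domains.values())
reach = 100 * len(covered) / len(all_entities)

print(f"reach: {reach:.0f}%")  # 83% -- e6 is assigned to no domain
```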

Fig. 3 shows the number of entities per domain in the different KGs, with a logarithmic scale. Fig. 4 presents

65 See our website for examples of classes per domain and per KG: http://km.aifb.kit.edu/sites/knowledge-graph-comparison (requested on Dec 31, 2016).

66 We used the number of unique entities of all domains and not the sum of the entities measured per domain, since entities may be in several domains at the same time.


Fig. 2. Distribution of classes w.r.t. the number of instances per KG.

Fig. 3. Number of entities per domain.

the relative coverage of each domain in each KG. It is calculated as the ratio of the number of entities in each domain to the total number of entities of the KG. A value of 100% means that all instances reside in one single domain.

The case of Freebase is especially outstanding here: 77% of all entities are located in the media domain. This fact can be traced back to large-scale data imports, such as from MusicBrainz. The class freebase:music.release_track alone accounts for 42% of the media entities. As shown in Fig. 3, Freebase provides the most entities in four out of the five domains when considering all KGs.


Fig. 4. Relative number of entities per domain.

In DBpedia and YAGO, the domain of people is the largest domain (50% and 34%, respectively). Peculiar is the higher coverage of YAGO regarding the geography domain compared to DBpedia. As one reason for that, we can point out the data import of GeoNames into YAGO.

Wikidata contains around 150K entities in the domain organizations. This is relatively few, considering that the total number of entities is around 18.7M, and considering the number of organizations in the other KGs. Note that even DBpedia provides more organization entities than Wikidata. The reason why Wikidata has so few organization entities is not fully comprehensible to us. However, we can point out that for our analysis we only considered Wikidata classes which appeared more than 6,000 times^67 and that about 16K classes were therefore not considered. It is possible that entities of the domain organizations belong to those rather rarely occurring classes.

5.1.4. Relations and Predicates
Evaluation method: In this article, we differentiate between relations and predicates (see also Section 2):

– Relations – as a short term for explicitly defined relations – refers to (proprietary) vocabulary defined on the schema level of a KG. We identify the set of relations of a KG as the set of those links which are explicitly defined as such via assignments (for instance, with rdf:Property) to classes. In Section 2, we used P_g to denote this set.
– In contrast, we use predicates to denote links used in the KG, independently of their introduction on the schema level. The set of unique predicates per KG, denoted as P^imp_g, is nothing else than the set of unique RDF terms on the predicate position of all triples in the KG.

It is important to distinguish the key statistics for relations from the key statistics for predicates, since they can differ considerably, depending on the degree to which relations are only defined on the schema level but not used on the instance level.

67 This number is based on heuristics. We focused on the 150 most instantiated classes and cut the long tail of classes having only few instances.
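The distinction between P_g (declared relations) and P^imp_g (used predicates) can be sketched on toy triples (hypothetical property names):

```python
# Sketch: relations (schema-level, explicitly declared) vs. predicates
# (RDF terms actually used on the predicate position). Toy data only.

triples = [
    ("dbo:birthDate", "rdf:type",   "rdf:Property"),  # declared, never used
    ("dbo:author",    "rdf:type",   "rdf:Property"),  # declared and used
    ("ex:book1",      "dbo:author", "ex:alice"),
    ("ex:book1",      "dbp:pages",  "321"),           # used, never declared
]

relations = {s for s, p, o in triples
             if p == "rdf:type" and o == "rdf:Property"}   # P_g
predicates = {p for s, p, o in triples}                    # P^imp_g

print(sorted(relations - predicates))  # declared but unused
print(sorted(predicates - relations))  # used but undeclared
```

The two set differences are exactly the cases discussed below: relations defined on the schema level but never used (e.g., in OpenCyc and Freebase) and predicates used without a schema-level declaration (e.g., DBpedia's raw infobox properties).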

Evaluation results:

Relations:
Ranking regarding relations: As presented in Table 2, Freebase exhibits by far the highest number of unique relations (around 785K) among the KGs. YAGO shows only 106 relations, which is the lowest value in this comparison. In the following, we point out further findings regarding the relations of the single KGs.

DBpedia: Regarding DBpedia relations, we need to distinguish between so-called mapping-based properties and non-mapping-based properties. Mapping-based properties are created by extracting the information from infoboxes in Wikipedia using manually created mappings. These mappings are specified in the DBpedia Mappings Wiki.^68 Mapping-based properties are contained in the DBpedia ontology and located in the namespace http://dbpedia.org/ontology/. We count 2,819 such relations for the considered DBpedia version 2015-04. Non-mapping-based properties (also called "raw infobox properties") are extracted from Wikipedia without the help of manually created mappings and, hence, without any manual adjustments. Therefore, they are generally of lower quality. We count 58,776 such unique relations. They reside in the namespace http://dbpedia.org/property/. Both mapping-based and non-mapping-based properties are instantiated in DBpedia with rdf:Property. We ignore the non-mapping-based properties for the calculation of the number of relations |P_g| (see Table 2), since, in contrast to DBpedia, in YAGO non-mapping-based properties are not instantiated. Note that the mapping-based properties and the non-mapping-based properties in DBpedia are not aligned^69 and may overlap until DBpedia version 2016-04.^70

Freebase: The high number of Freebase relations can be explained by two facts: 1. About a third of all relations in Freebase are duplicates, in the sense that they are declared by means of the owl:inverseOf relation as being inverse of other relations. An example is the relation freebase:music.artist.album and its inverse relation freebase:music.album.artist. 2. Freebase allowed users to introduce their own relations without any limits. These relations were originally in each user's namespace; so-called commons admins were able to approve those relations so that they got included into the Freebase commons schema.

OpenCyc: For OpenCyc, we measure 18,028 unique relations. We can assume that most of them are dedicated to statements on the schema level.

Wikidata: In Wikidata, a relatively small set of relations is provided. Note in this context that, despite the fact that Wikidata is curated by a community (just like Freebase), Wikidata community members cannot arbitrarily insert new relations, as was possible in Freebase; instead, relations first need to be proposed and then get accepted by the community if and only if certain

68 See http://mappings.dbpedia.org/index.php/Main_Page, accessed on Nov 4, 2016.

69 For instance, the DBpedia ontology contains dbo:birthName for the name of a person, while the non-mapping-based property set contains dbp:name, dbp:firstname, and dbp:alternativeNames.

70 For instance, dbp:alias and dbo:alias.

criteria are met.^71 One of those criteria is that each new relation is presumably used at least 100 times. This relation proposal process is a likely reason why, in relative terms, more relations are actually used in Wikidata than in Freebase.

YAGO: For YAGO, we measure the small set of 106 unique relations. Although relations are curated manually for both YAGO and DBpedia, the size of the relation set differs significantly between those KGs. Hoffart et al. [28] mention the following reasons for that:

1. Peculiarity of relations: The DBpedia ontology provides quite many special relations. For instance, there exists the relation dbo:aircraftFighter between dbo:MilitaryUnit and dbo:MeanOfTransportation.
2. Granularity of relations: Relations in the DBpedia ontology are more fine-grained than relations in YAGO. For instance, DBpedia contains the relations dbo:author and dbo:director, whereas in YAGO there is only the generic relation yago:created.
3. Date specification: The DBpedia ontology introduces several relations for dates. For instance, DBpedia contains the relations dbo:birthDate and dbo:birthYear for birth dates, while in YAGO only the relation yago:birthOnDate is used. Incomplete date specifications – for instance, if only the year is known – are specified in YAGO by wildcards, so that no multiple relations are needed.
4. Inverse relations: YAGO has no relations explicitly specified as being inverse. In DBpedia, we can find relations specified as inverse, such as dbo:parent and dbo:child.
5. Reification: YAGO introduces the SPOTL(X) format. This format extends the triple format "SPO" with a specification of Time, Location, and conteXt. In this way, no contextual relations are necessary (such as dbo:distanceToLondon or dbo:populationAsOf), which occur if the relations are closely aligned to Wikipedia template attribute names.

Frequency of the usage of relations: Fig. 5 shows the relative proportions of how often relations are used per KG, grouped into three classes. Surprisingly, DBpedia and Freebase exhibit a high number of relations which are not used at all on the instance level. In case of

71 See https://www.wikidata.org/wiki/Wikidata:Property_proposal, requested on Dec 31, 2016.


Fig. 5. Frequency of the usage of the relations per KG, grouped by (i) zero occurrences, (ii) 1–500 occurrences, and (iii) more than 500 occurrences in the respective KG.

OpenCyc, 99.2% of the defined relations are never used. We assume that those relations are used only within Cyc, the commercial version of OpenCyc. In case of Freebase, only 5% of the relations are used more than 500 times, and about 70% are not used at all. Analogously to the discussion regarding the number of Freebase relations, we can mention again the high number of defined owl:inverseOf relations and the high number of users' relation proposals as reasons for that.
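The grouping used in Fig. 5 amounts to bucketing each relation by its usage count; a minimal sketch with toy counts (hypothetical relation names) is:

```python
# Sketch: grouping relations by usage frequency into the three
# classes of Fig. 5 (0, 1-500, >500 occurrences). Toy counts only.
from collections import Counter

usage = {"r1": 0, "r2": 3, "r3": 480, "r4": 10_000, "r5": 0}

def bucket(n):
    return "0" if n == 0 else "1-500" if n <= 500 else ">500"

groups = Counter(bucket(n) for n in usage.values())
total = sum(groups.values())
shares = {b: 100 * c / total for b, c in groups.items()}

print(shares)  # {'0': 40.0, '1-500': 40.0, '>500': 20.0}
```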

Predicates:
Ranking regarding predicates: Freebase is here – like in the case of the ranking regarding relations – ranked first. The lowest number of unique predicates is provided by OpenCyc, which exhibits only 165 predicates. All KGs except OpenCyc provide more predicates than relations. Our single observations regarding the predicate sets are as follows:

DBpedia: DBpedia is ranked third in terms of the absolute number of predicates: about 60K predicates are used in DBpedia. The set of relations and the set of predicates vary considerably here, since also facts are extracted from Wikipedia infoboxes whose predicates are considered by us as being only implicitly defined and which, hence, occur only as predicates. These are the so-called non-mapping-based properties. Note that in the studied DBpedia version 2015-04, the set of explicitly defined relations (mapping-based properties) and the set of implicitly defined relations (non-mapping-based properties) overlap. An example is dbp:alias with dbo:alias.

Freebase: We can observe here a similar picture as for the set of Freebase relations. With about 785K unique predicates, Freebase exceeds the other KGs by far. Note, however, that 95% of the predicates (around 743K) are used only once, which puts the high number into perspective. Most of the predicates are keys in the sense of IDs and are used for internal modeling (for instance, freebase:key.user.adrianb).

OpenCyc: In contrast to the 18,028 unique relations, we measure only 164 unique predicates for OpenCyc. More predicates are presumably used in Cyc.

Wikidata: We measure more Wikidata predicates than Wikidata relations, since Wikidata predicates are created by modifying Wikidata relations. An example are the following triples, which express the statement that Barack Obama (wdt:Q76) is a human (wdt:Q5) via an intermediate node (wdt:Q76S123, abbreviated):

wdt:Q76 wdt:P31s wdt:Q76S123 .
wdt:Q76S123 wdt:P31v wdt:Q5 .

The relation extension "s" indicates that the RDF term in the object position is a statement. The "v" extension allows referring to a value (in Wikidata terminology). Besides those extensions, there is "r" to refer to a reference and the "q" extension to refer to a qualifier. In general, these relation extensions are used for realizing reification via n-ary relations. For that, intermediate nodes are used which represent statements [16].

YAGO: YAGO contains more predicates than DBpedia, since infobox attributes from the different language versions of Wikipedia are aggregated into one KG,^72 while for DBpedia separate localized KG versions are offered for non-English languages.

5.1.5. Instances and Entities
Evaluation method: We distinguish between instances I_g and entities E_g of a KG (cf. Section 2):

1. Instances belong to classes. They are identified by retrieving the subjects of all triples whose predicates indicate class affiliations.

72 The language of each attribute is encoded in the URI, for instance, yago:infobox/de/fläche and yago:infobox/en/areakm.

Fig. 6. Number of instances per KG.

2. Entities are real-world objects. This excludes, for instance, instantiated statements from being entities. Determining the set of entities is partially tricky. In DBpedia and YAGO, entities are determined as being instances of the class owl:Thing. In Freebase, entities are instances of freebase:common.topic, and in Wikidata instances of wdo:Item. In OpenCyc, cych:Individual corresponds to owl:Thing, but not all entities are classified in this way. Therefore, we approximately determine the set of entities in OpenCyc by manually classifying all classes having more than 300 instances, including at least one entity.^73 In this way, abstract classes such as cych:ExistingObjectType are neglected.

Ranking w.r.t. the number of instances: Table 2 and Fig. 6 show the number of instances per KG. We can see that Wikidata comprises the highest number of instances (142M) in total, and OpenCyc the fewest (242K).

Ranking w.r.t. the number of entities: Table 2 shows the ranking of KGs regarding the number of entities. Freebase contains by far the highest number of entities (about 49.9M). OpenCyc is at the bottom, with only about 41K entities.

Differences in number of entities: The reason why the KGs show quite varying numbers of entities lies in the information sources of the KGs. We illustrate this with the music domain as an example:

1. Freebase had been created mainly from data imports, such as from MusicBrainz. Therefore, entities in the domain of media – and especially song release tracks – are covered very well in Freebase: 77% of all entities are in the media domain (see Section 5.1.3), out of which 42% are release tracks.^74

73 For instance, cych:Individual, cych:Movie_CW, and cych:City.

Due to the large size and the world-wide coverage of entities in MusicBrainz, Freebase contains albums and release tracks of both English and non-English languages. For instance, regarding the English language, the album "Thriller" by Michael Jackson and its single "Billie Jean" are there, as well as rather unknown songs from the "Thriller" album, such as "The Lady in My Life". Regarding non-English languages, Freebase contains, for instance, songs and albums from Helene Fischer, such as "Lass' mich in dein Leben" and "Zaubermond"; also rather unknown songs, such as "Hab' den Himmel berührt", can be found.

2. In case of DBpedia, the English Wikipedia is the source of information. In the English Wikipedia, many albums and singles of English-speaking artists are covered – such as the album "Thriller" and the single "Billie Jean". Rather unknown songs, such as "The Lady in My Life", are not covered in Wikipedia. For many non-English artists, such as the German singer Helene Fischer, no music albums and no singles are contained in the English Wikipedia. In the corresponding language version of Wikipedia (and localized DBpedia version), this information is often available (for instance, the album "Zaubermond" and the song "Lass' mich in dein Leben"), but not the rather unknown songs, such as "Hab' den Himmel berührt".

3. For YAGO, the same situation as for DBpedia holds, with the difference that YAGO additionally imports entities from the different language versions of Wikipedia as well as data from sources such as GeoNames. However, the above-mentioned works ("Lass' mich in dein Leben", "Zaubermond", and "Hab' den Himmel berührt") of Helene Fischer are not in YAGO, although the song "Lass' mich in dein Leben" has existed in the German Wikipedia since May 2014 and although the used YAGO version 3 is based on the Wikipedia dump of June 2014.^75 Presumably, the YAGO extraction system was unable to extract any

74 Those release tracks are expressed via freebase:music.release_track.

75 See http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.

Fig. 7. Average number of entities per class per KG.

types for those entities, so that those entities were discarded.

4. Wikidata is supported by the community and contains music albums of English-speaking and non-English-speaking artists, even if they do not exist in Wikipedia. An example is the song "The Lady in My Life". Note, however, that Wikidata does not provide all works of an artist, such as of Helene Fischer.

5. OpenCyc contains only very few entities in the music domain. The reason is that OpenCyc focuses mainly on common-sense knowledge and not so much on facts about entities.

Average number of entities per class: Fig. 7 shows the average number of entities per class, which can be written as |E_g| / |C_g|. Obvious is the difference between DBpedia and YAGO (despite their similar number of entities). The reason for that is that the number of classes in the DBpedia ontology is small (as it is created manually), while in YAGO it is large (as it is created automatically).

Comparing number of instances with number of entities. Comparing the ratio of the number of instances to the number of entities for each KG, Wikidata exposes the highest difference. As a reason for that, we can state that each statement in Wikidata is modeled as an instance of wdo:Statement, leading to 74M additional instances. In other KGs such as DBpedia, statements are modeled without any dedicated statement assignment. OpenCyc also exposes a high ratio, since it contains mainly common-sense knowledge and not as many entities as the other KGs. Furthermore, for our analysis we do not regard 100% of the entities but only a large fraction of them (more precisely, the classes with

systems/research/yago-naga/yago/archive, requested on Dec 31, 2016.


Fig. 8. Ratio of the number of instances to the number of entities for each KG.

the most frequently occurring instantiations), since entities are not consistently instantiated in OpenCyc (see the beginning of Section 5.1.5).

5.1.6. Subjects and Objects
Evaluation method. The number of unique subjects

and unique objects can be a meaningful KG characteristic regarding the link structure within the KG and in comparison to other KGs. Especially interesting are differences between the number of unique subjects and the number of unique objects.

We measure the number of unique subjects by counting the unique resources (i.e., URIs and blank nodes) on the subject position of N-Triples: S_g = {s | (s, p, o) ∈ g}. Furthermore, we measure the number of unique objects by counting the unique resources on the object position of N-Triples, excluding literals: O_g = {o | (s, p, o) ∈ g ∧ o ∈ U ∪ B}. Complementary, the number of unique literals is given as O_g^lit = {o | (s, p, o) ∈ g ∧ o ∈ L}.

Ranking of KGs regarding the number of unique subjects. The number of unique subjects per KG is presented in Fig. 9. YAGO contains the highest number of different subjects, while OpenCyc contains the fewest.

Ranking of KGs regarding the number of unique objects. The number of unique objects is also presented in Fig. 9. Freebase shows the highest score in this regard, OpenCyc again the lowest.

Ranking of KGs regarding the ratio of the number of unique subjects to the number of unique objects. The ratios of the number of unique subjects to the number of unique objects vary considerably between the KGs (see Fig. 9). We can observe that DBpedia has 2.65 times more objects than subjects, while YAGO, on the other side, has 19 times more unique subjects than objects.


Table 2. Summary of key statistics

  Metric                                           DBpedia      Freebase       OpenCyc    Wikidata     YAGO
  Number of triples |(s, p, o) ∈ g|                411,885,960  3,124,791,156  2,412,520  748,530,833  1,001,461,792
  Number of classes |C_g|                          736          53,092         116,822    302,280      569,751
  Number of relations |P_g|                        2,819        70,902         18,028     1,874        106
  No. of unique predicates |P_g^imp|               60,231       784,977        165        4,839        88,736
  Number of entities |E_g|                         4,298,433    49,947,799     41,029     18,697,897   5,130,031
  Number of instances |I_g|                        20,764,283   115,880,761    242,383    142,213,806  12,291,250
  Avg. number of entities per class |E_g| / |C_g|  5,840.3      940.8          0.35       61.9         9.0
  No. of unique subjects |S_g|                     31,391,413   125,144,313    261,097    142,278,154  331,806,927
  No. of unique non-literals in obj. pos. |O_g|    83,284,634   189,466,866    423,432    101,745,685  17,438,196
  No. of unique literals in obj. pos. |O_g^lit|    161,398,382  1,782,723,759  1,081,818  308,144,682  682,313,508
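As a cross-check, the derived "average number of entities per class" row can be recomputed from the |E_g| and |C_g| rows of Table 2 (a small Python sketch; figures copied from the table):

```python
# Entity and class counts (|E_g|, |C_g|) per KG, taken from Table 2.
counts = {
    "DBpedia":  (4_298_433, 736),
    "Freebase": (49_947_799, 53_092),
    "OpenCyc":  (41_029, 116_822),
    "Wikidata": (18_697_897, 302_280),
    "YAGO":     (5_130_031, 569_751),
}

# Reproduces the "avg. number of entities per class" row of the table.
averages = {kg: entities / classes for kg, (entities, classes) in counts.items()}
for kg, avg in averages.items():
    print(f"{kg}: {avg:,.2f}")
```

The large gap between DBpedia (≈5,840 entities per class) and YAGO (≈9) discussed above falls directly out of these two rows.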


Fig. 9. Number of unique subjects and objects per KG. Note the logarithmic scale on the axis of ordinates.

The high number of unique subjects in YAGO is surprising and can be explained by the reification style used in YAGO: facts are stored as N-Quads in order to allow for making statements about statements (for instance, storing the provenance information of statements). To that end, IDs (instead of blank nodes) which identify the triples are used on the first position of N-Triples. They lead to 308M unique subjects, such as yago:id_6jg5ow_115_lm6jdp. In the RDF export of YAGO, the IDs which identify the triples are commented out in order to facilitate the N-Triples format. However, the statements about statements are also transformed to triples. In those cases, the IDs identifying the reified statements are in the subject position, leading to such a high number of unique subjects.

DBpedia contains considerably more owl:sameAs links to external resources than KGs like YAGO (29.0M vs. 3.8M links), leading to a bias of DBpedia towards a high number of unique objects.

5.1.7. Summary of Key Statistics
Based on the evaluation results presented in the last subsections, we can highlight the following insights:

1. Triples: All KGs are very large. Freebase is the largest KG in terms of number of triples, while OpenCyc is the smallest KG. We notice a correlation between the way of building up a KG and the size of the KG: automatically created KGs are typically larger, as the burdens of integrating new knowledge become lower. Datasets which have been imported into the KGs, such as MusicBrainz into Freebase, have a huge impact on the number of triples and on the number of facts in the KG. Also, the way of modeling data has a great impact on the number of triples. For instance, if n-ary relations are expressed in N-Triples format (as in the case of Wikidata), many intermediate nodes need to be modeled, leading to many additional triples compared to plain statements. Last but not least, the number of supported languages influences the number of triples.

2. Classes: The number of classes varies highly among the KGs, ranging from 736 (DBpedia) up to 300K (Wikidata) and 570K (YAGO). Despite its high number of classes, YAGO contains in relative terms the most classes which are actually used (i.e., classes with at least one instance). This can be traced back to the fact that heuristics are used for selecting appropriate Wikipedia categories as classes for YAGO. Wikidata, in contrast, contains many classes, but out of them only a small fraction


is actually used on the instance level. Note, however, that this is not necessarily a drawback.

3. Domains: Although all considered KGs are specified as cross-domain, domains are not equally distributed in the KGs. Also, the domain coverage among the KGs differs considerably. Which domains are well represented heavily depends on which datasets have been integrated into the KGs. MusicBrainz facts had been imported into Freebase, leading to a strong knowledge representation (77%) in the domain media in Freebase. In DBpedia and YAGO, the domain people is the largest, likely due to Wikipedia as the data source.

4. Relations and Predicates: Many relations are rarely used in the KGs. Only 5% of the Freebase relations are used more than 500 times, and about 70% are not used at all. In DBpedia, half of the relations of the DBpedia ontology are not used at all, and only a quarter of the relations are used more than 500 times. For OpenCyc, 99.2% of the relations are not used. We assume that they are used only within Cyc, the commercial version of OpenCyc.

5. Instances and Entities: Freebase contains by far the highest number of entities. Wikidata exposes relatively many instances in comparison to the entities, as each statement is instantiated, leading to around 74M instances which are not entities.

6. Subjects and Objects: YAGO provides the highest number of unique subjects among the KGs and also the highest ratio of the number of unique subjects to the number of unique objects. This is due to the fact that N-Quad representations need to be expressed via intermediate nodes and that YAGO concentrates on classes, which are linked by entities and other classes but which do not provide outlinks. DBpedia exhibits more unique objects than unique subjects, since it contains many owl:sameAs statements to external entities.

5.2. Data Quality Analysis

We now present the results obtained by applying the DQ metrics introduced in Sections 3.2–3.5 to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

5.2.1. Accuracy
The fulfillment degrees of the KGs regarding the Accuracy metrics are shown in Table 3.

Table 3. Evaluation results for the KGs regarding the dimension Accuracy

                DB    FB  OC  WD    YA
  m_synRDF      1     1   1   1     1
  m_synLit      0.99  1   1   1     0.62
  m_semTriple   0.99  <1  1   0.99  0.99

Syntactic validity of RDF documents (m_synRDF)

Evaluation method. For evaluating the Syntactic validity of RDF documents, we dereference the entity "Hamburg" as a resource sample in each KG. In the case of DBpedia, YAGO, Wikidata, and OpenCyc, there are RDF/XML serializations of the resource available, which can be validated by the official W3C RDF validator76. Freebase only provides a Turtle serialization. We evaluate the syntactic validity of this Turtle document by verifying whether the document can be loaded into an RDF model of the Apache Jena Framework77.

Evaluation result. All considered KGs provide syntactically valid RDF documents. In the case of YAGO and Wikidata, the RDF validator declares the used language codes as invalid, since the validator evaluates language codes in accordance with ISO 639. The criticized language codes are, however, contained in the newer standard ISO 639-3 and are actually valid.

Syntactic validity of literals (m_synLit)

Evaluation method. We evaluate the Syntactic validity of literals by means of the relations date of birth, number of inhabitants, and International Standard Book Number (ISBN), as those relations cover different domains (namely people, cities, and books) and as they can be found in all KGs. In general, domain knowledge is needed for selecting representative relations, so that a meaningful coverage is guaranteed.

Note that OpenCyc is not taken into account for this criterion. Although OpenCyc comprises around 1.1M literals in total, these literals are essentially labels and descriptions (given via rdfs:label and rdfs:comment), i.e., not aligned to specific data types. Hence, OpenCyc has no syntactically invalid literals and is assigned the metric value 1.

As long as a literal with a data type is given, its syntax is verified with the help of the function RDFDatatype.isValid(String) of the Apache Jena framework.

76 See https://w3.org/RDF/Validator, requested on Mar 2, 2016.

77 See https://jena.apache.org, requested on Mar 2, 2016.


Thereby, standard data types such as xsd:date can be validated easily, especially if different data types are provided78. If no data type is provided, or if the literal value is of type xsd:string, the literal is evaluated by a manually created regular expression (see below, depending on the considered relation). For each of the three relations, we created a sample of 1M literal values per KG, as long as the respective KG contains that many literals.
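A minimal stand-in for Jena's RDFDatatype.isValid for the case of xsd:date, which also rejects YAGO's wildcard dates (pure Python, illustrative only; time zones and other xsd:date extras are ignored):

```python
import re
from datetime import date

# Lexical shape of xsd:date: optional sign, year, month, day.
XSD_DATE = re.compile(r"^(-?)(\d{4,})-(\d{2})-(\d{2})$")

def is_valid_xsd_date(lexical: str) -> bool:
    """Check a lexical form against xsd:date (time zones ignored)."""
    m = XSD_DATE.match(lexical)
    if not m:
        return False  # e.g. YAGO wildcards like "1940-##-##"
    try:
        date(int(m.group(2)), int(m.group(3)), int(m.group(4)))
    except ValueError:
        return False  # e.g. month 13 or day 32
    return True

print(is_valid_xsd_date("1958-08-29"))  # True
print(is_valid_xsd_date("1940-##-##"))  # False
```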

Evaluation results. All KGs except YAGO performed very well regarding the Syntactic validity of literals.

Date of birth. For Wikidata, DBpedia, and Freebase, all verified literal values (1M per KG) were syntactically correct79. For YAGO, we detected around 519K syntactic errors (given 1M literal values) due to the usage of wildcards in the date values. For instance, the birth date of yago:Socrates is specified as "470-##-##", which does not correspond to the syntax of xsd:date. Obviously, the syntactic invalidity of literals is accepted by the YAGO publishers in order to keep the number of relations low80.

Number of inhabitants. The data types of the literal values regarding the number of inhabitants were valid in all KGs. For DBpedia, YAGO, and Wikidata, we evaluated the syntactic validity of the number of inhabitants by checking whether xsd:nonNegativeInteger, xsd:decimal, and xsd:integer were used as data types for the typed literals. In Freebase, no data type is specified. Therefore, we evaluated the values by means of a regular expression which allows only the decimals 0-9, periods, and commas.

ISBN. The ISBN is an identifier for books and magazines. The identifier can occur in various formats: with or without a preceding "ISBN", with or without delimiters, and with 10 or 13 digits. Gupta81 provided a regular expression for validating ISBNs in their different forms, which we used in our evaluation. All in all, most of the ISBNs were assessed as syntactically correct. The

78 In DBpedia, for instance, data for the relation dbo:birthDate is stored both as xsd:gYear and xsd:date.

79 Surprisingly, the Jena Framework assessed data values with a negative year (i.e., BC; e.g., "-600" for xsd:gYear) as invalid, despite the correct syntax.

80 In order to model the dates to the extent they are known, further relations would be necessary, such as using wasBornOnYear with range xsd:gYear and wasBornOnYearMonth with range xsd:gYearMonth.

81 See http://howtodoinjava.com/regex/java-regex-validate-international-standard-book-number-isbns, requested on Mar 1, 2016.

lowest fulfillment degree was obtained for DBpedia. We found the following for the single KGs: In Freebase, around 699K ISBN numbers were available. Out of them, 38 were assessed as syntactically incorrect. Typical mistakes were too long numbers and wrong prefixes82. In the case of Wikidata, 18 of around 11K ISBN numbers were syntactically invalid. However, some invalid numbers have meanwhile been corrected. This indicates that the Wikidata community does not only care about inserting new data, but also about curating given KG data. In the case of YAGO, we could only find 400 triples with the relation yago:hasISBN. Seven of the literals on the object position were syntactically incorrect. For DBpedia, we evaluated around 24K literals; 7,419 of them were assessed as syntactically incorrect. In many cases, comments next to the ISBN numbers in the infoboxes of Wikipedia led to an inaccurate extraction of data, so that the comments are either extracted as additional facts about ISBN numbers83 or together with the actual ISBN numbers as coherent strings84.
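The ISBN check can be reproduced with a regular expression of the kind credited to Gupta (transcribed from memory here, so treat it as an approximation): it accepts ISBN-10 and ISBN-13 in their common surface forms and rejects malformed numbers such as wrong prefixes or too many digits:

```python
import re

# ISBN-10/13 validation pattern (after Gupta); allows an optional
# "ISBN"/"ISBN-10"/"ISBN-13" prefix and hyphen or space delimiters.
ISBN_RE = re.compile(
    r"^(?:ISBN(?:-1[03])?:? )?"
    r"(?=[0-9X]{10}$"                       # bare ISBN-10
    r"|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$"  # delimited ISBN-10
    r"|97[89][0-9]{10}$"                    # bare ISBN-13
    r"|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)"  # delimited ISBN-13
    r"(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"
)

def looks_like_isbn(value: str) -> bool:
    """Syntactic check only; the ISBN checksum is not verified."""
    return ISBN_RE.match(value) is not None

print(looks_like_isbn("ISBN 978-0-306-40615-7"))  # True
print(looks_like_isbn("2940045143431"))           # False (prefix 294, not 978/979)
print(looks_like_isbn("9789780307986931"))        # False (16 digits)
```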

Semantic validity of triples (m_semTriple)

Evaluation method. The semantic validity can be reliably measured by means of a reference data set which (i) contains at least to some degree the same facts as the KG and (ii) is regarded as some kind of authority. We decided to use the Integrated Authority File (Gemeinsame Normdatei, GND)85, which is an authority file especially concerning persons and corporate bodies and which was created manually by German libraries. Due to the focus on persons (especially authors), we decided to evaluate a random sample of person entities w.r.t. the following relations: birth place, death place, birth date, and death date. For each of these relations, the corresponding relations in the KGs were determined. Then, a random sample of 100 person entities per KG was chosen. For each entity, we retrieved the facts with the mentioned relations and assessed manually whether a GND entry exists and whether the values of the relations match the values in the KG.

Evaluation result. We evaluated up to 400 facts per KG and observed discrepancies only for a few facts. For instance, Wikidata states as death date of

82 E.g., we found the 16-digit ISBN 9789780307986931 (cf. freebase:m.0pkny27) and the ISBN 2940045143431 with prefix 294 instead of 978 (cf. freebase:m.0v3xf7b).

83 See dbr:Prince_Caspian.

84 An example is "ISBN 0755111974 (hardcover edition)" for dbr:My_Family_and_Other_Animals.

85 See http://www.dnb.de/EN/Standardisierung/GND/gnd.html, requested on Sep 8, 2016.


"Anton Erkelenz" (wdt:Q589196) April 24, whereas GND states April 25. For DBpedia and YAGO we encountered 3, and for Wikidata 4 errors. Hence, those KGs were evaluated with 0.99. Note that OpenCyc has no values for the chosen relations and thus evaluates to 1.

During the evaluation, we identified the following issues:

1. For finding the right entry in GND, more information besides the name of the person is needed. This information is sometimes not given, so that entity disambiguation is hard to perform in those cases.

2. Contrary to assumptions, often either no corresponding GND entry exists or not many facts of the GND entity are given. In other words, GND is incomplete w.r.t. entities (cf. Population completeness) and relations (cf. Column completeness).

3. Values of different granularity need to be matched, such as an exact date of birth against the indication of a year only.

In conclusion, the evaluation of semantic validity is hard, even if a random sample set is evaluated manually. Meaningful differences among the KGs might be revealed only when a very large sample is evaluated, e.g., by using crowd-sourcing [23,48]. Another approach for assessing the semantic validity is presented by Kontokostas et al. [34], who propose a test-driven evaluation where test cases are created to evaluate triples semi-automatically. For instance, an interval specifies the valid height of a person, and all triples which lie outside of this interval are evaluated manually. In this way, outliers can easily be found, but possible wrong values within the interval are not detected.
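The interval-based test case just described can be sketched as follows (names and data are illustrative, not from any of the KGs):

```python
def interval_outliers(triples, relation, low, high):
    """Yield triples whose numeric object lies outside [low, high]."""
    for s, p, o in triples:
        if p == relation and not (low <= float(o) <= high):
            yield (s, p, o)

facts = [
    ("ex:Alice", "ex:height_m", "1.68"),
    ("ex:Bob",   "ex:height_m", "17.0"),  # implausible value, flagged
]

# Triples outside the interval go to manual evaluation; wrong values
# inside the interval (e.g. 1.90 instead of 1.68) remain undetected.
flagged = list(interval_outliers(facts, "ex:height_m", 0.4, 2.8))
print(flagged)  # [('ex:Bob', 'ex:height_m', '17.0')]
```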

Our findings appear to be consistent with the evaluation results of the YAGO developer team for YAGO2, where manually assessing 4,412 statements resulted in an accuracy of 98.1%86.

5.2.2. Trustworthiness
The fulfillment degrees of the KGs regarding the Trustworthiness criteria are shown in Table 4.

Trustworthiness on KG level (m_graph)

Evaluation method. Regarding the trustworthiness of a KG in general, we differentiate between the method

86 With a weighted averaging of 95%; see http://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/research/yago-naga/yago/statistics, requested on Mar 3, 2016.

Table 4. Evaluation results for the KGs regarding the dimension Trustworthiness

            DB   FB   OC  WD    YA
  m_graph   0.5  0.5  1   0.75  0.25
  m_fact    0.5  1    0   1     1
  m_NoVal   0    1    0   1     0

of how new data is inserted into the KG and the method of how existing data is curated.

Evaluation results. The KGs differ considerably w.r.t. this metric. OpenCyc obtains the highest score here, followed by Wikidata. In the following, we provide findings for the single KGs, listed by decreasing fulfillment score.

Cyc is edited (expanded and modified) exclusively by a dedicated expert group. The free version OpenCyc is derived from Cyc, and only a locally hosted version can be modified by the data consumer.

Wikidata is also curated and expanded manually, but by volunteers of the Wikidata community. Wikidata allows importing data from external sources such as Freebase87. However, new data is not just inserted, but is approved by the community.

Freebase was also curated by a community of volunteers. In contrast to Wikidata, the proportion of data imported automatically is considerably higher, and new data imports were not dependent on community approvals.

DBpedia and YAGO: The knowledge of both KGs is extracted from Wikipedia, but DBpedia differs from YAGO w.r.t. community involvement. Any user can engage (i) in mapping the Wikipedia infobox templates to the DBpedia ontology in the DBpedia mappings wiki88 and (ii) in the development of the DBpedia extraction framework.

Trustworthiness on statement level

We determine the Trustworthiness on statement level by evaluating whether provenance information for statements is used in the KGs. The picture is mixed:

DBpedia uses the relation prov:wasDerivedFrom to store the sources of the entities and their statements. However, as the source is always the corresponding Wikipedia article89, this provenance information is trivial and the fulfillment degree is hence of a rather formal nature.

87 Note that imports from Freebase require the approval of the community (see https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool). Besides that, there are bots which import automatically (see https://www.wikidata.org/wiki/Wikidata:Bots/de).

88 See http://mappings.dbpedia.org, requested on Mar 3, 2016.

YAGO uses its own vocabulary to indicate the source of information. Interestingly, YAGO stores per statement both the source (via yago:extractionSource, e.g., the Wikipedia article) and the used extraction technique (via yago:extractionTechnique, e.g., "Infobox Extractor" or "CategoryMapper"). The number of statements about sources is 161M and hence many times over the number of instances in the KG. The reason for that is that in YAGO the source is stored for each fact.

In Wikidata, several relations can be used for referring to sources, such as "imported from" (wdt:P143), "stated in" (wdt:P248), and "reference URL" (wdt:P854)90. Note that "imported from" relations are used for automatic imports, but statements with such a reference are not considered sourced ("data is not sourced")91. To source data, the other relations "stated in" and "reference URL" can be used. The number of all stored references in Wikidata92 is around 971K. Based on the number of all statements93, 74M, this corresponds to a coverage of around 1.3%. Note, however, that not every statement in Wikidata requires a reference according to the Wikidata guidelines. In order to be able to state how many references are actually missing, a manual evaluation would be necessary. However, such an evaluation would presumably be highly subjective.

Freebase uses proprietary vocabulary for representing provenance: via n-ary relations, which are in Freebase called Compound Value Types (CVTs), data of higher arity can be expressed [44]94.

OpenCyc differs from the other KGs in that it uses neither an external vocabulary nor a proprietary vocabulary for storing provenance information.

89 E.g., http://en.wikipedia.org/wiki/Hamburg for dbr:Hamburg.

90 All relations are instances of "Wikidata property to indicate a source" (wdt:Q18608359).

91 See https://www.wikidata.org/wiki/Property:P143, requested on Mar 3, 2016.

92 This is the number of instances of wdo:Reference.

93 This is the number of instances of wdo:Statement.

94 E.g., for a statement with the relation freebase:location.statistical_region.population, the source can be stored via freebase:measurement_unit.dated_integer.source.

Table 5. Evaluation results for the KGs regarding the dimension Consistency

                DB    FB    OC  WD    YA
  m_checkRestr  0     1     0   1     0
  m_conClass    0.88  1     <1  1     0.33
  m_conRelat    0.99  0.45  1   0.50  0.99

Indicating unknown and empty values (m_NoVal)

This criterion highlights the subtle data model of Wikidata and Freebase in comparison to the data models of the other KGs. Wikidata allows for storing unknown values and empty values (e.g., that "Elizabeth I of England" (wdt:Q7207) had no children). However, in the Wikidata RDF export, such statements are only indirectly available, since they are represented via blank nodes and via the relation owl:someValuesFrom.

YAGO supports the representation of unknown values and empty values by providing explicit relations for such cases95. Inexact dates are modeled by means of wildcards (e.g., "1940-##-##" if only the year is known). Note, however, the invalidity of such strings as date literals (see Section 5.2.1). Unknown dates are not supported by YAGO.

5.2.3. Consistency
The fulfillment degrees of the KGs regarding the Consistency criteria are shown in Table 5.

Check of schema restrictions during insertion of new statements (m_checkRestr)

The values of the metric m_checkRestr, indicating restrictions during the insertion of new statements, vary among the KGs. The web interfaces of Freebase and Wikidata verify during the insertion of new statements by the user whether the input is compatible with the respective data type. For instance, data of the relation "date of birth" (wdt:P569) is expected to be in a syntactically valid form. DBpedia, OpenCyc, and YAGO have no checks for schema restrictions during the insertion of new statements.

Consistency of statements w.r.t. class constraints (m_conClass)

Evaluation method. For evaluating the consistency of class constraints, we considered the relation owl:disjointWith, since this is the only relation which is used by more than half of the considered KGs. We only focused on direct instantiations here: if there is, for instance, the triple (dbo:Plant, owl:disjointWith, dbo:Animal), then there must not be a resource which is instantiated both as dbo:Plant and dbo:Animal.

95 E.g., freebase:freebase.valuenotation.has_no_value.

Evaluation results. We obtained mixed results here. Only Freebase, OpenCyc, and Wikidata perform very well.96

Freebase and Wikidata do not specify any constraints with owl:disjointWith. Hence, those two KGs have no inconsistencies w.r.t. class restrictions, and we can assign the metric value 1 to them. In the case of OpenCyc, 5 out of the 27,112 class restrictions are inconsistent. DBpedia contains 24 class constraints; three of them are inconsistent. For instance, over 1,200 instances exist which are both a dbo:Agent and a dbo:Place. YAGO contains 42 constraints, dedicated mainly to WordNet classes, which are mostly inconsistent.
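The class-constraint check can be sketched as follows (a hypothetical helper; direct rdf:type assertions and owl:disjointWith axioms are given as pairs, and class names echo the DBpedia example above):

```python
from collections import defaultdict

def disjointness_violations(type_assertions, disjoint_axioms):
    """Return (resource, classA, classB) tuples for resources that are
    directly typed with two classes declared disjoint."""
    classes_of = defaultdict(set)
    for resource, cls in type_assertions:
        classes_of[resource].add(cls)
    return [(r, a, b)
            for r, classes in classes_of.items()
            for a, b in disjoint_axioms
            if a in classes and b in classes]

types = [("ex:X", "dbo:Place"), ("ex:X", "dbo:Agent"), ("ex:Oak", "dbo:Plant")]
axioms = [("dbo:Place", "dbo:Agent"), ("dbo:Plant", "dbo:Animal")]
print(disjointness_violations(types, axioms))
# [('ex:X', 'dbo:Place', 'dbo:Agent')]
```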

Consistency of statements w.r.t. relation constraints (m_conRelat)

Evaluation method. Here, we considered the relations rdfs:range and owl:FunctionalProperty, as those are used in more than every second considered KG. rdfs:range specifies the expected type of an instance on the object position of a triple, while owl:FunctionalProperty indicates that a relation should be used at most once per resource. We only took datatype properties into account for this evaluation, since consistencies regarding object properties would require distinguishing between the Open World assumption and the Closed World assumption.

Evaluation results. In the following, we consider the fulfillment degrees for the relation constraints rdfs:range and owl:FunctionalProperty separately. In Table 5, we show the average of the fulfillment scores of each KG regarding rdfs:range and owl:FunctionalProperty. Note that the number of evaluated relation constraints varied from KG to KG, depending on how many relation constraints were available per KG.

Range. Wikidata does not use any rdfs:range restrictions. Within the Wikidata data model there is wdo:propertyType, but this does not indicate the exact allowed data type of a relation (e.g., wdo:propertyTypeTime can represent a year or an exact date). On the talk pages of Wikidata relations, users can indicate the allowed values of relations via "One of" statements97. Since "One of" statements are only listed on the property talk pages, and since not only entity types but also concrete instances are used as "One of" values, we do not consider those statements here.

96 Note that the sample size varies among the KGs (depending on how many owl:disjointWith statements are available per KG). Therefore, inconsistencies measured on a small set of owl:disjointWith facts become more visible.

Table 6. Evaluation results for the KGs regarding the dimension Relevancy

              DB  FB  OC  WD  YA
  m_Ranking   0   1   0   1   0

DBpedia obtains the highest measured fulfillment score w.r.t. the consistency of rdfs:range statements. An example of a range inconsistency is that the relation dbo:birthDate requires the data type xsd:date; in about 20% of those relations, the data type xsd:gYear is used though.

YAGO, Freebase, and OpenCyc contain range inconsistencies primarily because they specify designated data types via range relations which are not consistently used on the instance level. For instance, YAGO specifies proprietary data types such as yago:yagoURL and yago:yagoISBN. On the instance level, however, either no data type is used or the unspecific data type xsd:string.

FunctionalProperty. The restriction indicated by owl:FunctionalProperty is used by all KGs except Wikidata. On the talk pages about the relations in Wikidata, users can specify the cardinality restriction via setting the relation to "single"; however, this is not part of the Wikidata data model. The other KGs mostly comply with the usage restrictions of owl:FunctionalProperty. Noteworthy is that in Freebase, 99.9% of the inconsistencies obtained here are caused by the usage of the relations freebase:type.object.name and freebase:common.notable_for.display_name.
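Both relation-constraint checks can be sketched on datatype properties as follows (illustrative data only; (subject, predicate, value, datatype) tuples stand in for typed literals):

```python
from collections import Counter

def range_violations(triples, declared_range):
    """Triples whose literal datatype differs from the declared rdfs:range."""
    return [(s, p, dt) for s, p, _v, dt in triples
            if p in declared_range and dt != declared_range[p]]

def functional_violations(triples, functional_relations):
    """(subject, predicate) pairs using an owl:FunctionalProperty more than once."""
    uses = Counter((s, p) for s, p, _v, _dt in triples
                   if p in functional_relations)
    return [pair for pair, n in uses.items() if n > 1]

triples = [
    ("ex:Alice", "dbo:birthDate", "1986-05-01", "xsd:date"),
    ("ex:Bob",   "dbo:birthDate", "1990",       "xsd:gYear"),  # range violation
    ("ex:Bob",   "dbo:birthDate", "1991",       "xsd:gYear"),  # 2nd value: functional violation
]

print(range_violations(triples, {"dbo:birthDate": "xsd:date"}))
print(functional_violations(triples, {"dbo:birthDate"}))  # [('ex:Bob', 'dbo:birthDate')]
```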

5.2.4. Relevancy
The fulfillment degrees of the KGs regarding the Relevancy criteria are shown in Table 6.

Creating a ranking of statements (m_Ranking)

Only Wikidata supports the modeling of a ranking of statements. Each statement is ranked with "preferred rank" (wdo:PreferredRank), "normal rank" (wdo:NormalRank), or "deprecated rank" (wdo:DeprecatedRank). The preferred rank corresponds to the up-to-date value or the consensus of the Wikidata community w.r.t. this relation. Freebase does not provide any ranking of statements, entities, or relations. However, the meanwhile shut-down Freebase Search API provided a ranking for resources98.

97 See https://www.wikidata.org/wiki/Category:Properties_with_one-of_constraints for an overview, requested on Jan 29, 2017.

Table 7. Evaluation results for the KGs regarding the dimension Completeness

                  DB    FB    OC    WD    YA
  m_cSchema       0.91  0.76  0.92  1     0.95
  m_cColumn       0.40  0.43  0     0.29  0.33
  m_cPop          0.93  0.94  0.48  0.99  0.89
  m_cPop (short)  1     1     0.82  1     0.90
  m_cPop (long)   0.86  0.88  0.14  0.98  0.88

5.2.5. Completeness
The fulfillment degrees of the KGs regarding the Completeness criteria are shown in Table 7.

Schema completeness (m_cSchema)

Evaluation method. Since a gold standard for evaluating the Schema completeness of the considered KGs has not been published, we built one on our own. This gold standard is available online99. It is based on the data set used in Section 5.1.3, where we needed assignments of classes to domains, and comprises 41 classes as well as 22 relations. It is oriented towards the domains people, media, organizations, geography, and biology. The classes in the gold standard were aligned to corresponding WordNet synsets (using WordNet version 3.1) and were grouped into main classes.

Evaluation results. Generally, Wikidata performs optimally; DBpedia, OpenCyc, and YAGO also exhibit results which can be judged as acceptable for most use cases. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. The results in more detail are as follows:

DBpedia: DBpedia shows a good score regarding Schema completeness, and its schema is mainly limited

98 See https://developers.google.com/freebase/v1/search-cookbook#scoring-and-ranking, requested on Mar 4, 2016.

99 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

due to the characteristics of how information is stored in and extracted from Wikipedia:

1. Classes: The DBpedia ontology was created manually and covers all domains well. However, it is incomplete in the details and therefore appears unbalanced. For instance, within the domain of plants, the DBpedia ontology does not use the class tree but the class ginkgo, which is a subclass of trees. A reason for such gaps in the modeling is the fact that the ontology is created by means of the most frequently used infobox templates in Wikipedia.

2. Relations: Relations are considerably well covered in the DBpedia ontology. Some missing relations or modeling failures are due to the Wikipedia infobox characteristics. For example, to represent the gender of a person, the existing relation foaf:gender seems to fit. However, it is only modeled in the ontology as belonging to the class dbo:Language and not used on the instance level. Note that the gender of a person is often not explicitly mentioned in the Wikipedia infoboxes, but implicitly mentioned in the category names (for instance, "American male singers"). While DBpedia does not exploit this knowledge, YAGO does use it and provides facts with the relation yago:hasGender.

Freebase: Freebase shows a very ambivalent schema completeness. On the one hand, Freebase targets rather the representation of facts on the instance level than the representation of classes and their hierarchy. On the other hand, Freebase provides a vast amount of relations, leading to a very good coverage of the requested relations.

1. Classes: Freebase lacks a class hierarchy, and subclasses of classes are often in different domains (for instance, the class of musicians, freebase:music.artist, and the class of sportsmen, freebase:sports.pro_athlete, are logically subclasses of the class of people, freebase:people.person, but not explicitly stated as such), which makes it difficult to find suitable sub- and superclasses. Noteworthy, the biology domain contains no classes. This is due to the fact that classes are represented as entities, such as tree100 and ginkgo101. The ginkgo tree is not classified as a tree but by the generic class freebase:biology.organism_classification.

2. Relations: Freebase exhibits all relations requested by our gold standard. This is not surprising given the vast amount of available relations in Freebase (see Section 5.1.4 and Table 2).

100 Freebase ID: freebase:m.07j7r
101 Freebase ID: freebase:m.0htd3

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 35

OpenCyc: In total, OpenCyc exposes a quite high Schema completeness score. This is due to the fact that OpenCyc has been created manually and has its focus on generic and common-sense knowledge.

1. Classes: The ontology of OpenCyc covers both generic and specific classes, such as cych:SocialGroup and cych:LandTopographicalFeature. We can state that OpenCyc is complete with respect to the considered classes.

2. Relations: OpenCyc lacks some relations of the gold standard, such as the number of pages or the ISBN of books.

Wikidata: According to our evaluation, Wikidata is complete both with respect to classes and relations.

1. Classes: Besides frequently used generic classes such as "human" (wdt:Q5), also very specific classes exist, such as "landform" (wdt:Q271669) in the sense of a geomorphological unit, with over 3K instances.

2. Relations: Particularly remarkable is that Wikidata covers all relations of the gold standard, even though it has far fewer relations than Freebase. Thus, the Wikidata methodology of letting users propose new relations, discuss their outreach, and finally approve or disapprove them seems to be appropriate.

YAGO: Due to its concentration on modeling classes, YAGO shows the best overall Schema completeness fulfillment score among the KGs.

1. Classes: To create the set of classes in YAGO, the Wikipedia categories are extracted and connected to WordNet synsets. Since our gold standard is also aligned to WordNet synsets, we can measure a full completeness score for YAGO classes.

2. Relations: The YAGO schema does not contain many unique relations, but rather abstract relations which can be understood in different senses. The abstract relation names often make it difficult to infer their meaning. The relation yago:wasCreatedOnDate, for instance, can reasonably be used both for the foundation year of a company and for the publication date of a movie. DBpedia, in contrast, provides the relation dbp:foundationYear. Often, the meaning of YAGO relations is only fully understood after considering the associated classes, using the domain and range of the relations. Expanding the YAGO schema with further, more fine-grained relations appears reasonable.

Column completeness mcColumn

Evaluation method: For evaluating the KGs w.r.t. Column completeness, for each KG 25 class-relation combinations102 were created based on our gold standard created for measuring the Schema completeness. It was ensured that only those relations were selected for a given class for which a value typically exists for that class. For instance, we did not include the death date as a potential relation for living people.

Table 8
Metric values of mcCol for single class-relation pairs

Relation          DB    FB    OC    WD    YA
Person–birthdate  0.48  0.48  0     0.70  0.77
Person–sex        –     0.57  0     0.94  0.64
Book–author       0.91  0.93  0     0.82  0.28
Book–ISBN         0.73  0.63  –     0.18  0.01
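For a single class-relation pair, the metric can be sketched as the fraction of instances of the class that have at least one value for the relation. The following Python sketch illustrates this under simplified assumptions; the triple data and entity names are illustrative toy examples, not data from the evaluated KGs:

```python
# Sketch of the Column completeness metric for one class-relation pair:
# the share of instances of a class having at least one value for the
# given relation. Triples are (subject, predicate, object) strings.
def column_completeness(triples, instances_of_class, relation):
    """Fraction of class instances with >= 1 value for `relation`."""
    if not instances_of_class:
        return 0.0
    covered = {s for (s, p, o) in triples
               if p == relation and s in instances_of_class}
    return len(covered) / len(instances_of_class)

# Toy data: three persons, two of whom have a birth date.
persons = {"ex:alice", "ex:bob", "ex:carol"}
triples = [
    ("ex:alice", "ex:birthDate", "1970-01-01"),
    ("ex:bob",   "ex:birthDate", "1980-05-23"),
    ("ex:carol", "ex:name",      "Carol"),
]
score = column_completeness(triples, persons, "ex:birthDate")  # 2/3
```

In practice such counts would be obtained via SPARQL queries against each KG rather than in-memory sets; the sketch only captures the ratio being computed.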

Evaluation results: In general, no KG yields a metric score of over 0.43. As visible in Table 8, KGs often have some specific class-relation pairs which are well represented on the instance level, while the rest of the pairs are poorly represented. The well-represented pairs presumably originate either from column-complete data sets which were imported (cf. MusicBrainz in the case of Freebase) or from user edits focusing primarily on facts about entities of popular classes, such as people. We make the following observations with respect to the single KGs.

DBpedia: DBpedia fails regarding the relation sex for instances of the class Person, since it does not contain such a relation in its ontology. If we considered the non-mapping-based property dbp:gender instead (not defined in the ontology), we would gain a coverage of only 0.25 (about 5K people). We can hence note that the extraction of data out of the Wikipedia categories would be a further fruitful data source for DBpedia.

Freebase: Freebase surprisingly shows a very high coverage (92.7%) of the authors of books, given the basic population of 1.7M books. Note, however, that not only books are modeled under freebase:book.book, but also entities of other types, such as a description of the Lord of the Rings (see freebase:m.07bz5). Also the coverage of ISBNs for books is quite high (63.4%).

OpenCyc: OpenCyc breaks ranks, as mostly no values for the considered relations are stored in this KG. It

102 The selection of class-relation pairs depended on which class-relation pairs were available per KG. Hence, the choice varies from KG to KG. Also note that fewer class-relation pairs were used if 25 pairs were not available in the respective KG.


contains mainly taxonomic knowledge and only thinly spread instance facts.

Wikidata: Wikidata achieves a high coverage of birth dates (70.3%) and of gender (94.1%), despite the high number of 3M people103.

YAGO: YAGO obtains a coverage of 63.5% for gender relations, as it, in contrast to DBpedia, extracts this implicit information from Wikipedia.

Population completeness mcPop

Evaluation method: In order to evaluate the Population completeness, we need a gold standard consisting of a basic entity population for each considered KG. This gold standard, which is available online104,

was created on the basis of our gold standard used for evaluating the Schema completeness and the Column completeness. For its creation, we selected five classes from each of the five domains and determined two well-known entities (called short head) and two rather unknown entities (called long tail) for each of those classes. The exact entity selection criteria are as follows:

1. The well-known entities were chosen without temporal and location-based restrictions. To take the most popular entities per domain, we used quantitative statements. For instance, to select well-known athletes, we ranked athletes by the number of won Olympic medals; to select the most popular mountains, we ranked the mountains by their heights.

2. To select the rather unknown entities, we considered entities associated with both Germany and a specific year. For instance, regarding the athletes, we selected German athletes active in the year 2010, such as Maria Höfl-Riesch. The selection of rather unknown entities in the domain of biology is based on the IUCN Red List of Threatened Species105,106.

Selecting four entities per class and five classes per domain resulted in 100 entities to be used for evaluating the Population completeness.
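The metric itself reduces to the share of gold-standard entities that are present in a KG. A minimal sketch, with purely hypothetical entity identifiers:

```python
# Sketch of the Population completeness metric: the share of gold-standard
# entities (short head and long tail) found in a KG. Identifiers below are
# illustrative placeholders, not the actual gold standard.
def population_completeness(kg_entities, gold_standard):
    """Fraction of gold-standard entities contained in the KG."""
    found = sum(1 for entity in gold_standard if entity in kg_entities)
    return found / len(gold_standard)

gold = ["ex:Q5582", "ex:Q76", "ex:Q64", "ex:Q1726"]  # hypothetical gold entities
kg   = {"ex:Q5582", "ex:Q76", "ex:Q64"}              # hypothetical KG population
score = population_completeness(kg, gold)            # 0.75
```

With the gold standard above, the per-domain scores shown in Fig. 10 would be obtained by restricting `gold` to the entities of one domain at a time.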

103 These 3M instances form about 18.5% of all instances in Wikidata. See https://www.wikidata.org/wiki/Wikidata:Statistics, requested on Nov 7, 2016.

104 See http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 29, 2017.

105 See http://www.iucnredlist.org, requested on Apr 2, 2016.

106 Note that selecting entities by their importance or popularity is hard in general, and that also other popularity measures, such as PageRank scores, may be taken into account.

Evaluation results: All KGs except OpenCyc show good evaluation results. Since Wikidata also exhibits good evaluation results, the population degree apparently does not depend on the age or the insertion method of the KG. Fig. 10 additionally depicts the population completeness for the single domains for each KG. In the following, we first present our findings for well-known entities, before we go into the details of rather unknown entities.

Well-known entities: Here, all considered KGs achieve good results. DBpedia, Freebase, and Wikidata are complete w.r.t. the well-known entities in our gold standard. YAGO lacks some well-known entities, although some of them are represented in Wikipedia. One reason for this is that Wikipedia entities for which a WordNet class exists do not get imported into YAGO. For instance, there is no "Great White Shark" entity, only the WordNet class yago:wordnet_great_white_shark_101484850.

Not-well-known entities: First of all, it is not very surprising that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities, as the KGs are oriented towards general knowledge and not domain-specific knowledge. Secondly, two things are particularly peculiar concerning long-tail entities in the KGs: while most of the KGs obtain a score of about 0.88, Wikidata deflects upwards and OpenCyc deflects strongly downwards.

Wikidata exhibits a very high Population completeness degree for long-tail entities. This is a result of the central storage of interwiki links between different Wikimedia projects (especially between the different Wikipedia language versions) in Wikidata: an entry is added to Wikidata as soon as a new entity is added in one of the many Wikipedia language versions. Note, however, that in this way English-language labels for the entities are often missing. We measure that only about 54.6% (10.2M) of all Wikidata resources have an English label.

OpenCyc exhibits a poor population degree score of 0.14 for long-tail entities. OpenCyc's sister KGs, Cyc and ResearchCyc, are apparently considerably better covered with entities [36], leading to higher Population completeness scores.

5.2.6. Timeliness
The evaluation results concerning the dimension Timeliness are presented in Table 9.


Fig. 10. Population completeness regarding the different domains (People, Media, Organizations, Geography, Biology) per KG.

Table 9
Evaluation results for the KGs regarding the dimension Timeliness

           DB    FB   OC    WD   YA
mFreq      0.5   0    0.25  1    0.25
mValidity  0     1    0     1    1
mChange    0     1    0     0    0

Timeliness frequency of the KG mFreq

Evaluation results: The KGs are very diverse regarding the frequency in which they are updated, ranging from a score of 0 for Freebase (not updated anymore) to 1 for Wikidata (updates immediately visible and retrievable). Note that the Timeliness frequency of the KG can be a crucial point and a criterion for exclusion in the process of choosing the right KG for a given setting [17]. In the following, we outline some characteristics of the KGs with respect to their up-to-dateness.

DBpedia is created about once to twice a year and is not modified in the meantime. From September 2013 until November 2016, six DBpedia versions have been published107. Besides the static DBpedia, DBpedia Live108 has been continuously updated by tracking changes in Wikipedia in real time. However, it does not provide the full range of relations of DBpedia.

107 These versions are DBpedia 3.8, DBpedia 3.9, DBpedia 2014, DBpedia 2015-04, DBpedia 2015-10, and DBpedia 2016-04. The latest DBpedia version is always published online for dereferencing.

108 See http://live.dbpedia.org, requested on Mar 4, 2016.

Freebase had been updated continuously until its close-down and is not updated anymore.

OpenCyc has been updated less than once per year. The last OpenCyc version dates from May 2012109. To the best of our knowledge, Cyc and OpenCyc, respectively, are being developed further, but no exact date of the next version is known.

Wikidata provides the highest fulfillment degree for this criterion. Modifications in Wikidata are immediately visible via browser and via HTTP URI dereferencing. Hence, Wikidata falls into the category of continuous updates. Besides that, an RDF export is provided on a roughly monthly basis (either via the RDF export webpage110 or via own processing using the Wikidata Toolkit111).

YAGO has been updated less than once per year: YAGO3 was published in 2015, YAGO2 in 2011, and the interim version YAGO2s in 2013. A date for the next release has not been published.

Specification of the validity period of statements mValidity

Evaluation results: Although representing the validity period of statements is obviously reasonable for many relations (for instance, a president's term of

109 See http://sw.opencyc.org, requested on Nov 8, 2016.

110 See http://tools.wmflabs.org/wikidata-exports/rdf/exports, requested on Nov 23, 2016.

111 See https://github.com/Wikidata/Wikidata-Toolkit, requested on Nov 8, 2016.


Table 10
Evaluation results for the KGs regarding the dimension Ease of understanding

         DB    FB    OC   WD   YA
mDescr   0.70  0.97  1    <1   1
mLang    1     1     0    1    1
muSer    1     1     0    1    1
muURI    1     0.5   1    0    1

office), specifying the validity period of statements is in several KGs either not possible at all or only rudimentarily performed.

DBpedia and OpenCyc do not provide any specification possibility. In YAGO, Freebase, and Wikidata, the temporal validity period of statements can be specified. In YAGO, this modeling possibility is made available via the relations yago:occursSince, yago:occursUntil, and yago:occursOnDate. Wikidata provides the relations "start time" (wdt:P580) and "end time" (wdt:P582). In Freebase, Compound Value Types (CVTs) are used to represent relations with higher arity [44]. As part of this representation, validity periods of statements can be specified. An example is "Vancouver's population in 1997".

Specification of the modification date of statementsmChange

Evaluation results: The modification date of statements can only be specified in Freebase, but not in the other KGs. Together with the other criteria on Timeliness, this reflects that the considered KGs are mostly not sufficiently equipped with possibilities for modeling temporal aspects within and about the KG.

In Freebase, the date of the last review of a fact can be represented via the relation freebase:freebase.valuenotation.is_reviewed. In the DBpedia ontology, the relation dcterms:modified is used to state the date of the last revision of the DBpedia ontology. When dereferencing a resource in Wikidata, the latest modification date of the resource is returned via schema:dateModified. This, however, does not hold for statements. Thus, Wikidata is evaluated with 0, too.

5.2.7. Ease of Understanding
Description of resources mDescr

Evaluation method: We measured the extent to which entities are described. Regarding the labels, we considered rdfs:label for all KGs. Regarding the descriptions, the corresponding relations differ from KG to KG: DBpedia, for instance, uses rdfs:comment and dc:description, while Freebase provides freebase:common.topic.description112.

Evaluation result: For all KGs the rule applies that if there is no label available, there is usually also no description available either. The current metric could therefore (without significant restrictions) be applied to rdfs:label occurrences only.

YAGO, Wikidata, and OpenCyc contain a label for almost every entity. In Wikidata, the entities without any label are of an experimental nature and are most likely not used113.

Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Our manual investigations suggest that relations with higher arity are modeled by means of intermediate nodes, which have no labels114.

Labels in multiple languages mLang

Evaluation method: Here, we measure whether the KGs contain labels (rdfs:label) in languages other than English. This is done by means of the language annotations of literals, such as "de" for literals in German.
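The per-language coverage used below can be sketched as follows; the label triples are illustrative toy data, and the language tag is assumed to be available as a separate field (as obtained, e.g., when parsing literals of the form "Berlin"@de):

```python
# Sketch: compute label-language coverage from rdfs:label literals with
# language tags. `labels` is an iterable of (entity, literal, lang_tag);
# untagged literals would carry None as tag. Data is illustrative.
def language_coverage(labels, n_entities):
    """Map each language tag to the share of entities labeled in it."""
    entities_per_lang = {}
    for entity, _literal, tag in labels:
        entities_per_lang.setdefault(tag, set()).add(entity)
    return {tag: len(ents) / n_entities
            for tag, ents in entities_per_lang.items()}

labels = [
    ("wd:Q64", "Berlin", "en"),
    ("wd:Q64", "Berlin", "de"),
    ("wd:Q5",  "human",  "en"),
]
coverage = language_coverage(labels, n_entities=2)
# -> {'en': 1.0, 'de': 0.5}
```

Entities are counted at most once per language via the set, so multiple labels in the same language do not inflate the coverage.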

Evaluation results: DBpedia provides labels in 13 languages; further languages are provided in the localized DBpedia versions. YAGO integrates statements of the different language versions of Wikipedia into one KG and therefore provides labels in 326 different languages. Freebase and Wikidata also provide many languages (244 and 395 languages, respectively). Contrary to the other KGs, OpenCyc only provides labels in English.

Coverage of languages: We also measured the coverage of selected languages in the KGs, i.e., the extent to which entities have an rdfs:label with a specific language annotation115. Our evaluation shows that DBpedia, YAGO, and Freebase achieve a high coverage of more than 90% regarding the English language. In contrast to those KGs, Wikidata shows a relatively low

112 Human-readable resource descriptions may also be represented by other relations [15]. However, we focused on those relations which are commonly used in the considered KGs.

113 For instance, wdt:Q5127809 represents a game for the Nintendo Entertainment System, but no further information for an identification of the entity is available.

114 E.g., dbr:Nayim links via dbo:careerStation to 10 entities representing his career stations.

115 Note that literals such as rdfs:label do not necessarily have language annotations. In those cases, we assume that no language information is available.


coverage regarding the English language of only 54.6%, but a coverage of over 30% for further languages such as German and French. Wikidata is hence not only the most diverse KG in terms of languages, but also has the highest coverage regarding non-English languages.

Understandable RDF serialization muSer

The provisioning of understandable RDF serializations in the context of URI dereferencing leads to a better understandability for human data consumers. DBpedia, YAGO, and Wikidata provide N-Triples and N3/Turtle serializations. Freebase, in contrast, only provides a Turtle serialization. OpenCyc only uses RDF/XML, which is regarded as not easily understandable for humans.

Self-describing URIs muURI

We can observe two different paradigms of URI usage. On the one hand, DBpedia, OpenCyc, and YAGO rely on descriptive URIs and therefore achieve the full fulfillment degree. In DBpedia and YAGO, the URIs of the entities are determined by the corresponding English Wikipedia article; the mapping to the English Wikipedia is thus trivial. In the case of OpenCyc, two RDF exports are provided: one using opaque and one using self-describing URIs. The self-describing URIs are thereby derived from the rdfs:label values of the resources.

On the other hand, Wikidata and Freebase (the latter in part) rely on opaque URIs. Wikidata uses Q-IDs for resources ("items" in Wikidata terminology) and P-IDs for relations. Freebase uses self-describing URIs only partially, namely opaque M-IDs for entities and self-describing URIs for classes and relations116.

5.2.8. Interoperability
The evaluation results of the dimension Interoperability are presented in Table 11.

Avoiding blank nodes and RDF reification mReif

Reification allows representing further information about single statements. In summary, we can state that DBpedia, Freebase, OpenCyc, and YAGO use some form of reification; however, none of the considered KGs uses the RDF standard for reification. Wikidata makes extensive use of reification: every relation is stored in the form of an n-ary relation. In the case of DBpedia and Freebase, in contrast, facts are predominantly stored as N-Triples, and only relations of higher arity

116 E.g., freebase:music.album for the class of music albums and freebase:people.person.date_of_birth for the relation date of birth.

Table 11
Evaluation results for the KGs regarding the dimension Interoperability

          DB    FB    OC    WD    YA
mReif     0.5   0.5   0.5   0     0.5
miSerial  1     0     0.5   1     1
mextVoc   0.61  0.11  0.41  0.68  0.13
mpropVoc  0.15  0     0.51  >0    0

are stored via n-ary relations117. YAGO stores facts as N-Quads in order to be able to store meta information about facts, such as provenance information. When the quads are loaded into a triple store, the IDs referring to the single statements are ignored and the quads are converted into triples. In this way, most of the statements are still usable without the necessity to deal with reification.
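This quad-to-triple conversion can be sketched as follows. The sketch assumes the statement identifier occupies the graph (fourth) position of a whitespace-separated N-Quads line with URI-only terms; YAGO's actual export layout and literal handling are more involved:

```python
# Sketch: consume N-Quads as plain triples by dropping the fourth
# (graph / statement-id) component. Only handles URI-only toy lines;
# literals containing spaces would need a real N-Quads parser.
def quad_to_triple(line):
    """Return (subject, predicate, object), ignoring the statement id."""
    parts = line.rstrip().rstrip(".").split()
    s, p, o = parts[0], parts[1], parts[2]  # parts[3] (the id) is dropped
    return (s, p, o)

quad = "<yago:Albert_Einstein> <yago:wasBornIn> <yago:Ulm> <id_42> .\n"
triple = quad_to_triple(quad)
# -> ('<yago:Albert_Einstein>', '<yago:wasBornIn>', '<yago:Ulm>')
```

The retained statement id (`parts[3]`) is exactly what a reification-aware consumer would keep instead of discard, e.g., to attach provenance.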

Blank nodes are non-dereferenceable anonymous resources. They are used in the Wikidata and OpenCyc data models.

Provisioning of several serialization formats miSerial

DBpedia, YAGO, and Wikidata fulfill the criterion of provisioning several RDF serialization formats to the full extent, as they provide data in RDF/XML and several other serialization formats during URI dereferencing. In addition, DBpedia and YAGO provide further RDF serialization formats (e.g., JSON-LD, Microdata, and CSV) via their SPARQL endpoints. Freebase is the only KG providing RDF only in Turtle format.

Using external vocabulary mextV oc

Evaluation method: This criterion indicates the extent to which external vocabulary is used. For that, for each KG we divide the occurrence number of triples with external relations by the number of all relation occurrences in this KG.
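The ratio can be sketched over the predicates of a KG's triples; the namespaces and triples below are illustrative toy data:

```python
# Sketch of the external-vocabulary ratio: the share of predicate
# occurrences coming from vocabularies outside the KG's own namespaces.
# Namespace prefixes and triples are illustrative.
def external_vocab_ratio(triples, own_namespaces):
    """Fraction of predicate occurrences using external vocabulary."""
    predicates = [p for (_s, p, _o) in triples]
    if not predicates:
        return 0.0
    external = [p for p in predicates
                if not any(p.startswith(ns) for ns in own_namespaces)]
    return len(external) / len(predicates)

triples = [
    ("dbr:Berlin", "dbo:country",    "dbr:Germany"),
    ("dbr:Berlin", "rdfs:label",     "Berlin"),
    ("dbr:Berlin", "owl:sameAs",     "wd:Q64"),
    ("dbr:Berlin", "dbo:population", "3500000"),
]
ratio = external_vocab_ratio(triples, own_namespaces=("dbo:", "dbp:"))
# -> 0.5 (rdfs:label and owl:sameAs are external)
```

Counting predicate occurrences rather than unique predicates matches the weighting implied by the evaluation method, where frequently used external relations contribute more.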

Evaluation results: DBpedia uses 37 unique external relations from 8 different vocabularies, while the other KGs mainly restrict themselves to the external vocabularies RDF, RDFS, and OWL.

Wikidata reveals a high external vocabulary ratio, too. We can mention two obvious reasons for that: 1. Information in Wikidata is provided in a huge variety of languages, leading to 85M rdfs:label and 140M schema:description literals. 2. Wikidata makes extensive use of reification. Out of the 140M triples used for instantiations via rdf:type, about 74M (i.e.,

117 See Section 5.1.1 for more details w.r.t. the influence of reification on the number of triples.


about half) are used for instantiations of statements, i.e., for reification.

Interoperability of proprietary vocabulary mpropV oc

Evaluation method: This criterion determines the extent to which URIs of the proprietary vocabulary are linked to external vocabulary via equivalence relations. For each KG, we measure which classes and relations are linked via owl:sameAs118, owl:equivalentClass (in Wikidata: wdt:P1709), and owl:equivalentProperty (in Wikidata: wdt:P1628) to external vocabulary. Note that other relations, such as rdfs:subPropertyOf, could be taken into account; however, in this work we only consider equivalence relations.

Evaluation results: In general, we obtained low fulfillment scores regarding this criterion; OpenCyc shows the highest value. We made the following single findings.

Regarding its classes, DBpedia reaches a relatively high interlinking degree of about 48.4%. Classes are thereby linked to FOAF, Wikidata, schema.org, and DUL119. Regarding its relations, DBpedia links to Wikidata and schema.org120. Only 6.3% of the DBpedia relations are linked to external vocabulary.

Freebase only provides owl:sameAs links in the form of a separate RDF file, but these links are only on the instance level. Thus, the KG is evaluated with 0.

In OpenCyc, about half of all classes exhibit at least one external link via owl:sameAs. Internal links to resources of sw.cyc.com, the commercial version of OpenCyc, were ignored in our evaluation. The considered classes are mainly linked to FOAF, UMBEL, DBpedia, and linkedmdb.org; the relations mainly to FOAF, DBpedia, Dublin Core Terms, and linkedmdb.org. The relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs (see, e.g., Medelyan et al. [38]).

Regarding the classes, Wikidata provides links mainly to DBpedia. Considering all Wikidata classes, only 0.1% of them are linked to equiva-

118 OpenCyc uses owl:sameAs both on the schema and the instance level. This is appropriate, as the OWL reference states: "The built-in OWL property owl:sameAs links an individual to an individual", as well as: "The owl:sameAs statements are often used in defining mappings between ontologies". See https://www.w3.org/TR/2004/REC-owl-ref-20040210/#sameAs-def (requested on Feb 4, 2017).

119 See http://www.ontologydesignpatterns.org/ont/dul/DUL.owl, requested on Jan 11, 2017.

120 E.g., dbo:birthDate is linked to wdt:P569 and schema:birthDate.

Table 12
Evaluation results for the KGs regarding the dimension Accessibility

          DB   FB    OC    WD    YA
mDeref    1    1     0.44  0.41  1
mAvai     <1   0.73  <1    <1    1
mSPARQL   1    1     0     1     0
mExport   1    1     1     1     1
mNegot    0.5  1     0     1     0
mHTMLRDF  1    1     1     1     0
mMeta     1    0     0     0     1

lent external classes. This may be due to the high number of classes in Wikidata in general. Regarding the relations, Wikidata provides links in particular to FOAF and schema.org, and achieves here a linking coverage of 2.1%. Although this is low, frequently used relations are linked121.

YAGO contains around 553K owl:equivalentClass links to classes within the DBpedia namespace dby. However, as the YAGO classes (and their hierarchy) were also imported into DBpedia (using the namespace http://dbpedia.org/class/yago/), we do not count those owl:equivalentClass links in YAGO as external links for YAGO.

5.2.9. Accessibility
The evaluation results of the dimension Accessibility are presented in Table 12.

Dereferencing possibility of resources mDeref

Evaluation method: We measured the dereferencing possibilities of resources by trying to dereference URIs containing the fully-qualified domain name of the KG. For that, we randomly selected 15K URIs in the subject, predicate, and object positions of triples in each KG. We submitted HTTP requests with the HTTP Accept header field set to application/rdf+xml in order to perform content negotiation.
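The core of such a check is to build a request with an RDF Accept header and to classify the content type of the response. A minimal sketch using Python's standard library is shown below; the DBpedia URI is illustrative, and no request is actually sent in this snippet:

```python
# Sketch of the dereferencing check: ask for RDF/XML via content
# negotiation and test whether the announced content type is an RDF
# serialization. The URI is illustrative; urlopen() is not called here.
import urllib.request

RDF_CONTENT_TYPES = {"application/rdf+xml", "text/turtle",
                     "application/n-triples"}

def build_rdf_request(uri):
    """HTTP request asking for RDF/XML via the Accept header."""
    return urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"})

def looks_like_rdf(content_type):
    """True if the Content-Type header announces an RDF serialization."""
    return content_type.split(";")[0].strip() in RDF_CONTENT_TYPES

req = build_rdf_request("http://dbpedia.org/resource/Karlsruhe")
# A full check would call urllib.request.urlopen(req) and inspect the
# status code and the response's Content-Type header with looks_like_rdf.
```

Classifying only on the media type (before any `;charset=` parameter) mirrors how content negotiation results are typically judged.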

Evaluation results: In the case of DBpedia, OpenCyc, and YAGO, all URIs were dereferenced successfully and returned appropriate RDF data, so that these KGs fulfilled this criterion completely. For DBpedia, 45K URIs were analyzed; for OpenCyc, only around 30K due to the small number of unique predicates. We observed almost

121 Frequently used relations with stated equivalence to external relations are, e.g., wdt:P31, linked to rdf:type, and wdt:P279, linked to rdfs:subClassOf.


the same picture for YAGO, namely no notable errors during dereferencing.

For Wikidata, which also does not contain that many unique predicates, we analyzed around 35K URIs. Note that predicates which are derived from relations using a suffix (e.g., the suffix "s", as in wdt:P1024s, is used for predicates referring to a statement) could not be dereferenced at all. Furthermore, the blank nodes used for reification cannot be dereferenced.

Regarding Freebase, mainly all URIs in the subject and object positions of triples could be dereferenced. Some resources were not resolvable even after multiple attempts (HTTP server error 503; e.g., freebase:m.0156q). Surprisingly, server errors also appeared while browsing the website freebase.com, so that data was partially not available. Regarding the predicate position, many URIs are not dereferenceable due to server errors (HTTP 503) or due to unknown URIs (HTTP 404). Note that if a large number of Freebase requests is performed, an API key from Google is necessary. In our experiments, the access was blocked after a few thousand requests. Hence, we can point out that without an API key the Freebase KG is only usable to a limited extent.

Availability of the KG mAvai

Evaluation method: We measured the availability of the officially hosted KGs with the monitoring service Pingdom122. For each KG, an uptime test was set up which checked the availability of the resource Hamburg, as a representative resource for successful URI resolving (i.e., returning the status code HTTP 200), every minute over a time range of 60 days (Dec 18, 2015–Feb 15, 2016).

Evaluation result: While the other KGs showed almost no outages and were back online after some minutes on average, YAGO outages took place frequently and lasted on average 3.5 hours123. In the given time range, four outages took longer than one day. Based on these insights, we recommend using a local version of YAGO for time-critical queries.

Availability of a public SPARQL endpoint mSPARQL

The SPARQL endpoints of DBpedia and YAGO are

122 See https://www.pingdom.com, requested on Mar 2, 2016. The HTTP requests of Pingdom are executed by various servers, so that caching is prevented.

123 See the diagrams per KG on our website (http://km.aifb.kit.edu/sites/knowledge-graph-comparison, requested on Jan 31, 2017).

provided by a Virtuoso server124, the Wikidata SPARQL endpoint via Blazegraph125. Freebase and OpenCyc do not provide an official SPARQL endpoint. However, an endpoint for the MQL query language was available for the Freebase KG.

Especially regarding the Wikidata SPARQL endpoint, we observed access restrictions: the maximum execution time per query is set to 30 seconds, but there is no limitation regarding the number of returned rows. However, the front end of the SPARQL endpoint crashed in the case of large result sets with more than 15M rows. Although public SPARQL endpoints need to be prepared for inefficient queries, the time limit of Wikidata may impede the execution of reasonable queries.

Provisioning of an RDF export mExport

All considered KGs provide RDF exports as downloadable files. The format of the data differs from KG to KG; mostly, data is provided in N-Triples and Turtle format.

Support of content negotiation mNegot

We measured the support of content negotiation regarding the serialization formats RDF/XML, N3/Turtle, and N-Triples. OpenCyc does not provide any content negotiation; only RDF/XML is supported as content type. Therefore, OpenCyc does not fulfill the criterion of supporting content negotiation.

The endpoints for DBpedia, Wikidata, and YAGO correctly returned the appropriate RDF serialization format and the corresponding HTML representation of the tested resources. Freebase does currently not provide any content negotiation; only the content type text/plain is returned.

Noteworthy is also that, regarding the N-Triples serialization, YAGO and DBpedia require the accept header text/plain and not application/n-triples. This is due to the usage of Virtuoso as endpoint. For DBpedia, the forwarding to http://dbpedia.org/data/[resource].ntriples does not work; instead, the HTML representation is returned. Therefore, the KG is evaluated with 0.5.

Linking HTML sites to RDF serializations mHTMLRDF

All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations by means of <link rel="alternate"

124 See https://virtuoso.openlinksw.com, requested on Dec 28, 2016.

125 See https://www.blazegraph.com, requested on Dec 28, 2016.


Table 13
Evaluation results for the KGs regarding the dimension License

             DB  FB  OC  WD  YA
mmacLicense  1   0   0   1   0

type="[content type]" href="[URL]"> elements in the HTML header.

Provisioning of metadata about the KG mMeta

For this criterion, we analyzed whether KG metadata is available, such as in the form of a VoID file126. DBpedia integrates the VoID vocabulary directly in its KG127 and provides information such as the SPARQL endpoint URL and the number of all triples. OpenCyc reveals the current KG version number via owl:versionInfo. For YAGO, Freebase, and Wikidata, no meta information could be found.

5.2.10. License
The evaluation results of the dimension License are shown in Table 13.

Provisioning machine-readable licensing informationmmacLicense

DBpedia and Wikidata provide licensing information about their KG data in machine-readable form. For DBpedia, this is done in the ontology via the predicate cc:license, linking to CC BY-SA128 and the GNU Free Documentation License (GNU FDL)129. Wikidata embeds licensing information during the dereferencing of resources in the RDF document, by linking with cc:license to the license CC0130. YAGO and Freebase do not provide machine-readable licensing information; however, their data is published under the license CC BY131. OpenCyc embeds licensing information into the RDF document during dereferencing, but not in machine-readable form132.

126 See https://www.w3.org/TR/void/, requested on Apr 7, 2016.

127 See http://dbpedia.org/void/page/Dataset, requested on Mar 5, 2016.

128 See http://creativecommons.org/licenses/by-sa/3.0/, requested on Feb 4, 2017.

129 See http://www.gnu.org/copyleft/fdl.html, requested on Feb 4, 2017.

130 See http://creativecommons.org/publicdomain/zero/1.0/, requested on Feb 4, 2017.

131 See http://creativecommons.org/licenses/by/3.0/, requested on Feb 4, 2017.

132 License information is provided as plain text, among further information, with the relation rdfs:comment.

Table 14
Evaluation results for the KGs regarding the dimension Interlinking

        DB    FB    OC    WD       YA
mInst   0.25  0     0.38  0 (0.9)  0.31
mURIs   0.93  0.91  0.89  0.96     0.96

5.2.11. Interlinking
The evaluation results of the dimension Interlinking are shown in Table 14.

Linking via owl:sameAs mInst

Evaluation method: Given all owl:sameAs triples in each KG, we queried all those subjects thereof which are instances, but neither classes nor relations133, and where the resource in the object position of the triple is an external source, i.e., not belonging to the namespace of the KG.
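The resulting metric, i.e., the share of instances with at least one external owl:sameAs link, can be sketched as follows; instances, triples, and namespaces are illustrative toy data:

```python
# Sketch of the interlinking metric: the share of instances having at
# least one owl:sameAs link whose target lies outside the KG's own
# namespace. Data is illustrative.
def sameas_interlinking(triples, instances, own_namespace):
    """Fraction of instances with >= 1 external owl:sameAs link."""
    if not instances:
        return 0.0
    linked = {s for (s, p, o) in triples
              if p == "owl:sameAs"
              and s in instances
              and not o.startswith(own_namespace)}
    return len(linked) / len(instances)

instances = {"dbr:Berlin", "dbr:Karlsruhe", "dbr:Ulm", "dbr:Mainz"}
triples = [
    ("dbr:Berlin",    "owl:sameAs", "wd:Q64"),        # external: counts
    ("dbr:Karlsruhe", "owl:sameAs", "dbr:Karlsruhe2"),  # internal: ignored
]
score = sameas_interlinking(triples, instances, own_namespace="dbr:")
# -> 0.25
```

Restricting the subjects to instances (excluding classes and relations) corresponds to the separation from the schema-level criterion mentioned in footnote 133.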

Evaluation result: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. We can therefore confirm the statement by Bizer et al. [12] that DBpedia has established itself as a hub in the Linked Data cloud.

In DBpedia, there are about 5.2M instances with at least one owl:sameAs link. Links to localized DBpedia versions (e.g., de.dbpedia.org) were counted as internal links and hence not considered here. In total, one-fourth of all instances have at least one owl:sameAs link.

In Wikidata, neither owl:sameAs links are provided nor is a corresponding proprietary relation available. Instead, Wikidata uses for each linked data set a proprietary relation (called identifier) to indicate equivalence. For example, the M-ID of a Freebase instance is stored via the relation "Freebase identifier" (wdt:P646) as a literal value (e.g., /m/01x3gpk). So far, links to 426 different data sources are maintained in this way.

Although the equivalence statements in Wikidata can be used to generate corresponding owl:sameAs statements, and although the stored identifiers are provided in the browser interface as hyperlinks, there are no genuine owl:sameAs links available. Hence, Wikidata is evaluated with 0. If we view each equivalence relation as an owl:sameAs relation, we obtain around 122M instances with owl:sameAs statements. This corresponds to 86% of all instances. If we consider only entities instead of instances (since there are many instances due to reification), we obtain a coverage of 65%. Note, however, that although the linked resources provide relevant content, the resources are not always RDF documents, but are instead HTML web pages. Therefore, we cannot easily subsume all identifiers (equivalence statements) under owl:sameAs.

^133 The interlinking on schema level is already covered by the criterion Interoperability of proprietary vocabulary.

M. Färber et al. / Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO
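The conversion of Wikidata's identifier literals into owl:sameAs candidates can be sketched as follows. The property URI for "Freebase identifier" (wdt:P646) is taken from the text; the target URI template and the helper itself are illustrative assumptions, since in practice one template per supported data source would be maintained:

```python
# Sketch: derive owl:sameAs candidate triples from Wikidata's identifier
# statements (stored as literals, e.g., a Freebase M-ID via wdt:P646).
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

URI_TEMPLATES = {
    # "Freebase identifier" (wdt:P646); the target URI pattern is an assumption
    "http://www.wikidata.org/prop/direct/P646": "https://www.freebase.com{id}",
}

def identifier_to_sameas(subject, identifier_prop, literal_id):
    """Map one identifier statement to an owl:sameAs candidate triple."""
    template = URI_TEMPLATES.get(identifier_prop)
    if template is None:
        return None  # no URI template known for this identifier property
    return (subject, OWL_SAMEAS, template.format(id=literal_id))

triple = identifier_to_sameas(
    "http://www.wikidata.org/entity/Q42",
    "http://www.wikidata.org/prop/direct/P646",
    "/m/01x3gpk")
```

As the text notes, such generated links are only candidates: many targets are HTML pages rather than RDF documents, so they cannot all be treated as genuine owl:sameAs links.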

YAGO has around 3.6M instances with at least one owl:sameAs link. However, most of them are links to DBpedia, based on common Wikipedia articles. If those links are excluded, YAGO contains mostly links to GeoNames and would be evaluated with just 0.01.

In the case of OpenCyc, links to Cyc,^134 the commercial version of OpenCyc, were considered as being internal. Still, OpenCyc has the highest fulfillment degree, with around 40K instances having at least one owl:sameAs link. As mentioned earlier, the relatively high linking degree of OpenCyc can be attributed to dedicated approaches of linking OpenCyc to other KGs.^135

Validity of external URIs (m_URIs)

Regarding the dimension Accessibility, we already analyzed the dereferencing possibility of resources in the KG namespace. Now, we analyze the links to external URIs.

Evaluation method: External links include owl:sameAs links as well as links to non-RDF-based Web resources (e.g., via foaf:homepage). We measure errors such as timeouts, client errors (HTTP response 4xx), and server errors (HTTP response 5xx).
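Assuming each external link has already been probed once, the metric reduces to the fraction of links that answered without one of the listed errors. A sketch over recorded probe results (the probing itself, e.g., via HTTP HEAD requests, is assumed to have happened beforehand):

```python
# Sketch of the m_URIs computation: given one probe result per external
# link (an int HTTP status code, or the string "timeout"), a link counts
# as valid unless it timed out or answered with a 4xx client error or a
# 5xx server error.
def m_uris(probe_results):
    """Fraction of external links that resolved without errors."""
    def is_valid(result):
        return result != "timeout" and not 400 <= result < 600
    valid = sum(1 for r in probe_results.values() if is_valid(r))
    return valid / len(probe_results)

# Toy example: 2 of the 4 links resolve without errors.
results = {
    "http://example.org/a": 200,
    "http://example.org/b": 301,        # a redirect is not an error
    "http://example.org/c": 404,        # client error
    "http://example.org/d": "timeout",  # timed out
}
```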

Evaluation result: The external links are in most cases valid for all KGs. All KGs obtain a metric value between 0.89 and 0.96.

DBpedia stores provenance information via the relation prov:wasDerivedFrom. Since almost all links refer to Wikipedia, 99% of the resources are available.

Freebase achieves high metric values here, since it contains owl:sameAs links mainly to Wikipedia. Also, Wikipedia URIs are mostly resolvable.

OpenCyc contains mainly external links to non-RDF-based Web resources, namely to wikipedia.org and w3.org.

YAGO also achieves high metric values, since it provides owl:sameAs links only to DBpedia and GeoNames, whose URIs do not change.

For Wikidata, the relation reference URL (wdt:P854), which states provenance information among other relations, belongs to the links linking to external Web resources. Here, we were able to resolve around 95.5% without errors.

^134 I.e., sw.cyc.com.
^135 See Interoperability of proprietary vocabulary in Sec. 5.2.8.

Noticeable is that DBpedia and OpenCyc contain many owl:sameAs links to URIs whose domains do not exist anymore.^136 One solution for such invalid links might be to remove them if they have been invalid for a certain time span.

5.2.12. Summary of Results

We now summarize the results of the evaluations presented in this section.

1. Syntactic validity of RDF documents: All KGs provide syntactically valid RDF documents.

2. Syntactic validity of literals: In general, the KGs achieve good scores regarding the syntactic validity of literals. Although OpenCyc comprises over 1M literals in total, these literals are mainly labels and descriptions which do not follow a special format. For YAGO, we detected about 519K syntactic errors (given 1M literal values), due to the usage of wildcards in date values. Obviously, the syntactic invalidity of literals is accepted by the publishers in order to keep the number of relations low. In the case of Wikidata, some invalid literals, such as ISBN numbers, have been corrected in newer versions of Wikidata. This indicates that knowledge in Wikidata is curated continuously. For DBpedia, comments next to the values to be extracted (such as ISBN numbers) in the infoboxes of Wikipedia led to inaccurately extracted values.

3. Semantic validity of triples: All considered KGs scored well regarding this metric. This shows that the KGs can in general be used without concerns regarding correctness. Note, however, that evaluating the semantic validity of facts is very challenging, since a reliable ground truth is needed.

4. Trustworthiness on KG level: Based on the way data is imported and curated, OpenCyc and Wikidata can be trusted the most.

5. Trustworthiness on statement level: Here, especially good values are achieved for Freebase, Wikidata, and YAGO. YAGO stores per statement both the source and the extraction technique, which is unique among the KGs. Wikidata also supports storing the source of information, but only around one-third of the statements have provenance information attached. Note, however, that not every statement in Wikidata requires a reference and that it is hard to evaluate which statements lack such a reference.

^136 E.g., http://rdfabout.com, http://www4.wiwiss.fu-berlin.de/factbook, and http://wikicompany.org (requested on Jan 11, 2017).

6. Using unknown and empty values: Wikidata and Freebase support the indication of unknown and empty values.

7. Check of schema restrictions during insertion of new statements: Since Freebase and Wikidata are editable by community members, simple consistency checks are made during the insertion of new facts in the user interface.

8. Consistency of statements w.r.t. class constraints: Freebase and Wikidata do not specify any class constraints via owl:disjointWith, while the other KGs do.

9. Consistency of statements w.r.t. relation constraints: The inconsistencies of all KGs regarding the range indications of relations are mainly due to inconsistently used data types (e.g., xsd:gYear is used instead of xsd:date). Regarding the constraint of functional properties, the relation owl:FunctionalProperty is used by all KGs except Wikidata; in most cases the KGs comply with the usage restrictions of this relation.

10. Creating a ranking of statements: Only Wikidata supports a ranking of statements. This is particularly worthwhile for statements which are valid only for a limited period of time.

11. Schema completeness: Wikidata shows the highest degree of schema completeness. Also for DBpedia, OpenCyc, and YAGO, we obtain results which are presumably acceptable in most cross-domain use cases. While DBpedia classes were sometimes missing in our evaluation, the DBpedia relations were covered considerably well. OpenCyc lacks some relations of the gold standard, but the classes of the gold standard all exist in OpenCyc. While the YAGO classes are peculiar in the sense that they are connected to WordNet synsets, it is remarkable that YAGO relations are often kept very abstract, so that they can be applied in different senses. Freebase shows considerable room for improvement concerning the coverage of typical cross-domain classes and relations. Note that Freebase classes belong to different domains; hence, it is difficult to find related classes if they are not in the same domain.

12. Column completeness: DBpedia and Freebase show the best column completeness values, i.e., in those KGs the predicates used by the instances of each class are on average frequently used by all of those class instances. We can name data imports as one reason for this.

13. Population completeness: Not very surprising is the fact that all KGs show a higher degree of completeness regarding well-known entities than regarding rather unknown entities. Especially Wikidata shows an excellent performance for both well-known and rather unknown entities.

14. Timeliness frequency of the KG: Only Wikidata achieves the highest fulfillment degree for this criterion, as it is continuously updated and as the changes are immediately visible and queryable by users.

15. Specification of the validity period of statements: In YAGO, Freebase, and Wikidata, the temporal validity period of statements (e.g., a term of office) can be specified.

16. Specification of the modification date of statements: Only Freebase keeps the modification dates of statements. Wikidata provides the modification date of the queried resource during URI dereferencing.

17. Description of resources: YAGO, Wikidata, and OpenCyc contain a label for almost every entity. Surprisingly, DBpedia shows a relatively low coverage w.r.t. labels and descriptions (only 70.4%). Manual investigations suggest that the intermediate node mapping template is the main reason for that. By means of this template, intermediate nodes are introduced and instantiated, but no labels are provided for them.^137

18. Labels in multiple languages: YAGO, Freebase, and Wikidata support hundreds of languages regarding their stored labels. Only OpenCyc contains labels merely in English. While DBpedia, YAGO, and Freebase show a high coverage regarding the English language, Wikidata does not have such a high coverage regarding English, but instead covers other languages to a considerable extent. It is hence not only the most diverse KG in terms of languages, but also the KG which contains the most labels for languages other than English.

19. Understandable RDF serialization: DBpedia, Wikidata, and YAGO provide several understandable RDF serialization formats. Freebase only provides the understandable format RDF/Turtle. OpenCyc relies only on RDF/XML, which is considered as being not easily understandable for humans.

^137 An example is dbr:Volkswagen_Passat_(B1), which has dbo:engine statements to the intermediate nodes Volkswagen_Passat_(B1)__1 etc., representing different engine variations.

20. Self-describing URIs: We can find mixed paradigms regarding the URI generation: DBpedia, YAGO, and OpenCyc rely on descriptive URIs, while Wikidata and Freebase (in part; classes and relations are identified with self-describing URIs) use generic IDs, i.e., opaque URIs.

21. Avoiding blank nodes and RDF reification: DBpedia, Wikidata, YAGO, and Freebase are the KGs which use reification, i.e., which formulate statements about statements. There are different ways of implementing reification [27]: DBpedia, Wikidata, and Freebase use n-ary relations, while YAGO uses N-Quads, creating so-called named graphs.

22. Provisioning of several serialization formats: Many KGs provide RDF in several serialization formats. Freebase is the only KG providing data in the serialization format RDF/Turtle only.

23. Using external vocabulary: DBpedia and Wikidata show high degrees of external vocabulary usage. In DBpedia, the RDF, RDFS, and OWL vocabularies are used. Wikidata has a high external vocabulary ratio, since there exist many language labels and descriptions (modeled via rdfs:label and schema:description). Also, due to instantiations of statements with wdo:Statement for reification purposes, the external relation rdf:type is used a lot.

24. Interoperability of proprietary vocabulary: We obtained low fulfillment scores regarding this criterion. OpenCyc shows the highest value; one reason for this is that half of all OpenCyc classes exhibit at least one owl:sameAs link. While DBpedia has equivalence statements to external classes for almost every second class, only 6.3% of all relations have equivalence relations to relations outside the DBpedia namespace. Wikidata shows a very low interlinking degree of classes to external classes and of relations to external relations.

25. Dereferencing possibility of resources: Resources in DBpedia, OpenCyc, and YAGO can be dereferenced without considerable issues. Wikidata uses predicates derived from relations that are not dereferencable at all, as well as blank nodes. For Freebase, we measured a quite considerable amount of dereferencing failures due to server errors and unknown URIs. Note also that Freebase required an API key for a large number of requests.

26. Availability of the KG: While all other KGs showed almost no outages, YAGO shows a noteworthy instability regarding its online availability: we measured around 100 outages for YAGO in a time interval of 8 weeks, lasting on average 3.5 hours.

27. Provisioning of a public SPARQL endpoint: DBpedia, Wikidata, and YAGO provide a SPARQL endpoint, while Freebase and OpenCyc do not. Noteworthy is that the Wikidata SPARQL endpoint has a maximum execution time of 30 seconds per query. This might be a bottleneck for some queries.

28. Provisioning of an RDF export: RDF exports are available for all KGs and are provided mostly in N-Triples and Turtle format.

29. Support of content negotiation: DBpedia, Wikidata, and YAGO correctly return RDF data based on content negotiation. Both OpenCyc and Freebase do not support any content negotiation: while OpenCyc only provides data in RDF/XML, Freebase only returns data with text/plain as content type.

30. Linking HTML sites to RDF serializations: All KGs except OpenCyc interlink the HTML representations of resources with the corresponding RDF representations.

31. Provisioning of KG metadata: Only DBpedia and OpenCyc integrate metadata about the KG in some form: DBpedia has the VoID vocabulary integrated, while OpenCyc reveals the current KG version as machine-readable metadata.

32. Provisioning machine-readable licensing information: Only DBpedia and Wikidata provide licensing information about their KG data in machine-readable form.

33. Interlinking via owl:sameAs: OpenCyc and YAGO achieve the best results w.r.t. this metric, but DBpedia has by far the most instances with at least one owl:sameAs link. Based on this resource interlinkage, DBpedia is justifiably called the Linked Data hub. Wikidata does not provide owl:sameAs links, but stores identifiers as literals that could be used to generate owl:sameAs links.

34. Validity of external URIs: The links to external Web resources are valid in most cases for all KGs. DBpedia and OpenCyc contain many owl:sameAs links to RDF documents on domains which do not exist anymore; those links could be deleted.

Step 1: Requirements Analysis
– Identifying the preselection criteria P
– Assigning a weight w_i to each DQ criterion c_i ∈ C

Step 2: Preselection based on the Preselection Criteria
– Manually selecting the KGs G_P that fulfill the preselection criteria P

Step 3: Quantitative Assessment of the KGs
– Calculating the DQ metric m_i(g) for each DQ criterion c_i ∈ C
– Calculating the fulfillment degree h(g) for each KG g ∈ G_P
– Determining the KG g* with the highest fulfillment degree h(g)

Step 4: Qualitative Assessment of the Result
– Assessing the selected KG g* w.r.t. qualitative aspects
– Comparing the selected KG g* with the other KGs in G_P

Fig. 11. Proposed process for using our KG recommendation framework.

6. KG Recommendation Framework

We now propose a framework for selecting the most suitable KG (or a set of suitable KGs) for a given concrete setting, based on a given set of KGs G = {g_1, ..., g_n}. To use this framework, the user needs to go through the steps depicted in Fig. 11.

In Step 1, the preselection criteria and the weights for the criteria are specified. The preselection criteria can be both quality criteria and general criteria, and need to be selected depending on the use case. The Timeliness frequency of the KG is an example of a quality criterion; the license under which a KG is provided (e.g., the CC0 license) is an example of a general criterion. In Step 2, those KGs are neglected which do not fulfill the preselection criteria. In Step 3, the fulfillment degrees of the remaining KGs are calculated, and the KG with the highest fulfillment degree is selected. Finally, in Step 4, the result can be assessed w.r.t. qualitative aspects (besides the quantitative assessments using the DQ metrics), and, if necessary, an alternative KG can be selected for the given scenario.
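Step 3 amounts to a weighted average of the metric values, h(g) = Σ w_i · m_i(g) / Σ w_i. The sketch below reproduces the weighted scores of DBpedia and Wikidata from Table 15 (only metrics with nonzero example weight are listed, as the others do not contribute; the dictionary-based layout is ours):

```python
# Sketch of Step 3: fulfillment degree h(g) as the weighted average of the
# DQ metric values. Weights and metric values are the example configuration
# from Table 15; metrics with weight 0 are omitted.
WEIGHTS = {
    "synRDF": 1, "synLit": 1, "semTriple": 1, "fact": 1, "Ranking": 1,
    "cSchema": 1, "cCol": 2, "cPop": 3, "Freq": 3, "Descr": 1, "uURI": 1,
    "iSerial": 1, "extVoc": 1, "propVoc": 1, "Deref": 2, "Avai": 2,
    "SPARQL": 1, "Inst": 3, "URIs": 1,
}

METRICS = {
    "DBpedia": {
        "synRDF": 1, "synLit": 0.994, "semTriple": 0.990, "fact": 0.5,
        "Ranking": 0, "cSchema": 0.905, "cCol": 0.402, "cPop": 0.93,
        "Freq": 0.5, "Descr": 0.704, "uURI": 1, "iSerial": 1,
        "extVoc": 0.61, "propVoc": 0.150, "Deref": 1, "Avai": 0.9961,
        "SPARQL": 1, "Inst": 0.251, "URIs": 0.929,
    },
    "Wikidata": {
        "synRDF": 1, "synLit": 1, "semTriple": 0.993, "fact": 1,
        "Ranking": 1, "cSchema": 1, "cCol": 0.285, "cPop": 0.99,
        "Freq": 1, "Descr": 0.9999, "uURI": 0, "iSerial": 1,
        "extVoc": 0.682, "propVoc": 0.001, "Deref": 0.414, "Avai": 0.9999,
        "SPARQL": 1, "Inst": 0, "URIs": 0.957,
    },
}

def h(metrics, weights):
    """Weighted fulfillment degree: sum(w_i * m_i(g)) / sum(w_i)."""
    total_weight = sum(weights.values())
    return sum(weights[c] * metrics[c] for c in weights) / total_weight
```

With the example weighting, this reproduces the weighted averages in Table 15: 0.701 for DBpedia and 0.714 for Wikidata.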

Use case application: In the following, we show how to use the KG recommendation framework in a particular scenario. The use case is based on the usage of DBpedia and MusicBrainz for the project BBC Music, as described in [33].

Description of the use case: The publisher BBC wants to enrich news articles with fact sheets providing relevant information about musicians mentioned in the articles. In order to obtain more details about the musicians, the user can leave the news section and access the musicians section, where detailed information is provided, including a short description, a picture, the birth date, and the complete discography of each musician. To be able to integrate the musicians' information into the articles and to enable such a linking, editors shall tag each article based on a controlled vocabulary.

The KG recommendation framework can be applied as follows:

1. Requirements analysis:

– Preselection criteria: According to the scenario description [33], the KG in question should (i) be actively curated and (ii) contain an appropriate amount of media entities. Given these two criteria, a satisfactory and up-to-date coverage of both old and new musicians is expected.

– Weighting of DQ criteria: Based on the preselection criteria, an example weighting of the DQ metrics for our use case is given in Table 15. Note that this is only one example configuration, and the assignment of the weights is subjective to some degree. Given the preselection criteria, the criterion Timeliness frequency of the KG and the criteria of the DQ dimension Completeness are emphasized. Furthermore, the criteria Dereferencing possibility of resources and Availability of the KG are important, as the KG shall be available online, ready to be queried.^138

2. Preselection: Freebase and OpenCyc are not considered any further, since Freebase is not being updated anymore and since OpenCyc contains only around 4K entities in the media domain.

3. Quantitative assessment: The overall fulfillment score for each KG is calculated based on the formula presented in Section 3.1. The result of the quantitative KG evaluation is presented in Table 15. By weighting the criteria according to the constraints, Wikidata achieves the best rank, closely followed by DBpedia. Based on the quantitative assessment, Wikidata is recommended by the framework.

^138 We assume that in this use case the dereferencing of HTTP URIs rather than the execution of SPARQL queries is desired.


Table 15
Framework with an example weighting which would be reasonable for a user setting as given in [33]

Dimension               Metric          DBpedia   Freebase  OpenCyc   Wikidata  YAGO      Weight w_i
Accuracy                m_synRDF        1         1         1         1         1         1
                        m_synLit        0.994     1         1         1         0.624     1
                        m_semTriple     0.990     0.995     1         0.993     0.993     1
Trustworthiness         m_graph         0.5       0.5       1         0.75      0.25      0
                        m_fact          0.5       1         0         1         1         1
                        m_NoVal         0         1         0         1         0         0
Consistency             m_checkRestr    0         1         0         1         0         0
                        m_conClass      0.875     1         0.999     1         0.333     0
                        m_conRelat      0.992     0.451     1         0.500     0.992     0
Relevancy               m_Ranking       0         1         0         1         0         1
Completeness            m_cSchema       0.905     0.762     0.921     1         0.952     1
                        m_cCol          0.402     0.425     0         0.285     0.332     2
                        m_cPop          0.93      0.94      0.48      0.99      0.89      3
Timeliness              m_Freq          0.5       0         0.25      1         0.25      3
                        m_Validity      0         1         0         1         1         0
                        m_Change        0         1         0         0         0         0
Ease of understanding   m_Descr         0.704     0.972     1         0.9999    1         1
                        m_Lang          1         1         0         1         1         0
                        m_uSer          1         1         0         1         1         0
                        m_uURI          1         0.5       1         0         1         1
Interoperability        m_Reif          0.5       0.5       0.5       0         0.5       0
                        m_iSerial       1         0         0.5       1         1         1
                        m_extVoc        0.61      0.108     0.415     0.682     0.134     1
                        m_propVoc       0.150     0         0.513     0.001     0         1
Accessibility           m_Deref         1         0.437     1         0.414     1         2
                        m_Avai          0.9961    0.9998    1         0.9999    0.7306    2
                        m_SPARQL        1         0         0         1         1         1
                        m_Export        1         1         1         1         1         0
                        m_Negot         0.5       0         0         1         1         0
                        m_HTMLRDF       1         1         0         1         1         0
                        m_Meta          1         0         1         0         0         0
Licensing               m_macLicense    1         0         0         1         0         0
Interlinking            m_Inst          0.251     0         0.382     0         0.310     3
                        m_URIs          0.929     0.908     0.894     0.957     0.956     1

Unweighted Average                      0.683     0.603     0.496     0.752     0.625
Weighted Average                        0.701     0.493     0.556     0.714     0.648


4. Qualitative assessment: The high population completeness in general and the high coverage of entities in the media domain in particular give Wikidata an advantage over the other KGs. Furthermore, Wikidata does not require that there is a Wikipedia article for each entity. Thus, missing Wikidata entities can be added by the editors directly and are then available immediately. The use case also requires retrieving detailed information about the musicians from the KG, such as a short description and a discography. DBpedia tends to store more of that data, especially w.r.t. discographies. A specialized database like MusicBrainz provides even more data about musicians than DBpedia, as it is not limited to the Wikipedia infoboxes. While DBpedia does not provide any links to MusicBrainz, Wikidata stores around 120K equivalence links to MusicBrainz that can be used to pull in more data. In conclusion, Wikidata, especially in combination with MusicBrainz, seems to be an appropriate choice for the use case. In this case, the qualitative assessment confirms the result of the quantitative assessment.

The use case shows that our KG recommendation framework enables users to find the most suitable KG and is especially useful in giving an overview of the most relevant criteria when choosing a KG. However, applying our framework to the use case also showed that, besides the quantitative assessment, there is still a need for a deep understanding of the commonalities and differences of the KGs in order to make an informed choice.

7. Related Work

7.1. Linked Data Quality Criteria

Zaveri et al. [49] provide a conceptual framework for the quality assessment of linked data based on quality criteria and metrics, which are grouped into quality dimensions and categories, and which are based on the framework of Wang et al. [47]. Our framework is also based on Wang's dimensions, extended by the dimensions Consistency [11], Licensing, and Interlinking [49]. Furthermore, we reintroduce the dimensions Trustworthiness and Interoperability as collective terms for multiple dimensions.

Many published DQ criteria and metrics are rather abstract. We, in contrast, selected and developed concrete criteria which can be applied to any KG in the Linked Open Data cloud. Table 16 shows which of the metrics introduced in this article have already been used to some extent in the existing literature. In summary, related work mainly proposed generic guidelines for publishing Linked Data [26], DQ criteria with corresponding metrics (e.g., [20,30]), and criteria without metrics (e.g., [40,29]). 27 of the 34 criteria introduced in this article have been introduced or supported in one way or another in earlier works. The remaining seven criteria, namely Trustworthiness on KG level (m_graph), Indicating unknown and empty values (m_NoVal), Check of schema restrictions during insertion of new statements (m_checkRestr), Creating a ranking of statements (m_Ranking), Timeliness frequency of the KG (m_Freq), Specification of the validity period of statements (m_Validity), and Availability of the KG (m_Avai), have not been proposed so far, to the best of our knowledge. In the following, we present single existing approaches for Linked Data quality criteria in more detail.

Pipino et al. [40] introduce the criteria Schema completeness, Column completeness, and Population completeness in the context of databases. We introduce those metrics for KGs and apply them, to the best of our knowledge for the first time, to the KGs DBpedia, Freebase, OpenCyc, Wikidata, and YAGO.

OntoQA [45] introduces criteria and corresponding metrics that can be used for the analysis of ontologies. Besides simple statistical figures, such as the average number of instances per class, Tartir et al. also introduce criteria and metrics similar to our DQ criteria Description of resources (m_Descr) and Column completeness (m_cCol).

Based on a large-scale crawl of RDF data, Hogan et al. [29] analyze quality issues of published RDF data. Later, Hogan et al. [30] introduce further criteria and metrics based on Linked Data guidelines for data publishers [26]. Whereas Hogan et al. crawl and analyze many KGs, we analyze a selected set of KGs in more detail.

Heath et al. [26] provide guidelines for Linked Data, but do not introduce criteria or metrics for the assessment of Linked Data quality. Still, the guidelines can easily be translated into relevant criteria and metrics. For instance, "Do you refer to additional access methods?" leads to the criteria Provisioning of a public SPARQL endpoint (m_SPARQL) and Provisioning of an RDF export (m_Export). Also, "Do you map proprietary vocabulary terms to other vocabularies?" leads to the criterion Interoperability of proprietary vocabulary (m_propVoc). Metrics that are based on the guidelines of Heath et al. can also be found in other frameworks [30,20].


Table 16
Overview of related work regarding data quality criteria for KGs

DQ Metric       [40] [45] [29] [26] [20] [22] [30] [48] [2] [34]
m_synRDF        X X
m_synLit        X X X X
m_semTriple     X X X X
m_fact          X X
m_conClass      X X X
m_conRelat      X X X X X X
m_cSchema       X X
m_cCol          X X X X
m_cPop          X X
m_Change        X X
m_Descr         X X X X
m_Lang          X
m_uSer          X
m_uURI          X
m_Reif          X X X
m_iSerial       X
m_extVoc        X X
m_propVoc       X
m_Deref         X X X X
m_SPARQL        X
m_Export        X X
m_Negot         X X X
m_HTMLRDF       X
m_Meta          X X X
m_macLicense    X X X
m_Inst          X X X
m_URIs          X X

Flemming [20] introduces a framework for the assessment of Linked Data quality. This framework measures the Linked Data quality based on a sample of a few RDF documents. Based on a systematic literature review, criteria and metrics are introduced. Flemming introduces the criteria Labels in multiple languages (m_Lang) and Validity of external URIs (m_URIs) for the first time. The framework is evaluated on a sample of RDF documents of DBpedia. In contrast to Flemming, we evaluate the whole KG DBpedia and also four other widely used KGs.

SWIQA [22] is a quality assessment framework introduced by Fürber et al. that provides criteria and metrics for the dimensions Accuracy, Completeness, Timeliness, and Uniqueness. In this framework, the dimension Accuracy is divided into Syntactic validity and Semantic validity, as proposed by Batini et al. [6]. Furthermore, the dimension Completeness comprises Schema completeness, Column completeness, and Population completeness, following Pipino et al. [40]. In this article, we make the same distinction, but in addition distinguish between RDF documents, RDF triples, and RDF literals for evaluating the Accuracy, since we consider RDF KGs.

TripleCheckMate [35] is a framework for Linked Data quality assessment using a crowdsourcing approach for the manual validation of facts. Based on this approach, Zaveri et al. [48] and Acosta et al. [2,3] analyze both the syntactic and semantic accuracy as well as the consistency of data in DBpedia.

Kontokostas et al. [34] present the test-driven evaluation framework RDFUnit for assessing Linked Data quality. This framework is inspired by the paradigm of test-driven software development. The framework introduces 17 SPARQL templates of tests that can be used for analyzing KGs w.r.t. Accuracy and Consistency. Note that those tests can also be used for evaluating external constraints that exist due to the usage of external vocabulary. The framework is applied by Kontokostas et al. on a set of KGs, including DBpedia.

7.2. Comparing KGs by Key Statistics

Duan et al. [14], Tartir et al. [45], and Hassanzadeh et al. [25] can be mentioned as the most similar related work regarding the evaluation of KGs using the key statistics presented in Section 5.1.

Duan et al. [14] analyze the structuredness of the data in DBpedia, YAGO2, UniProt, and several benchmark data sets. To that end, the authors use simple statistical key figures that are calculated on the basis of the corresponding RDF dumps. In contrast to that approach, we use SPARQL queries to obtain the figures, thus not limiting ourselves to the N-Triples serialization of RDF dump files. Duan et al. claim that simple statistical figures are not sufficient to gain fruitful findings when analyzing the structuredness and differences of RDF datasets; the authors therefore propose, in addition, a coherence metric. Accordingly, we analyze not only simple statistical key figures, but further analyze the KGs w.r.t. data quality using 34 DQ metrics.

Tartir et al. [45] introduce, with the system OntoQA, metrics that can be used for analyzing ontologies. More precisely, it can be measured to which degree the schema-level information is actually used on instance level. An example of such a metric is the class richness, defined as the number of classes with instances divided by the number of classes without instances. SWETO, TAP, and GlycO are used as showcase ontologies.

Tartir et al. [45] and Hassanzadeh et al. [25] analyze how domains are covered by KGs on both schema and instance level. For that, Tartir et al. introduce the measure importance as the number of instances per class and its subclasses. In our case, we cannot use this approach, since Freebase has no class hierarchy. Hassanzadeh et al. analyze the coverage of domains by listing, in a table, the most frequent classes with the highest number of instances. This gives only little overview of the covered domains, since instances can belong to multiple classes in the same domain, such as dbo:Place and dbo:PopulatedPlace. For determining the domain coverages of the KGs in this article, we therefore adapt the idea of Hassanzadeh et al. by manually mapping the most frequent classes to domains and deleting duplicates within the domains. That means: if an instance is instantiated both as dbo:Place and dbo:PopulatedPlace, the instance will be counted only once in the domain geography.
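The de-duplication just described can be sketched as follows: classes are mapped manually to domains, and each instance is counted at most once per domain, even if it is typed with several classes of that domain. The class-to-domain mapping shown is a small illustrative excerpt, not the full mapping used in the article:

```python
# Sketch of the adapted domain-coverage counting: each instance is counted
# only once per domain, even if it is instantiated with several classes of
# that domain (e.g., both dbo:Place and dbo:PopulatedPlace -> geography).
CLASS_TO_DOMAIN = {  # illustrative excerpt of the manual mapping
    "dbo:Place": "geography",
    "dbo:PopulatedPlace": "geography",
    "dbo:MusicalArtist": "media",
}

def domain_coverage(type_assertions):
    """type_assertions: iterable of (instance, class) pairs."""
    seen = set()   # (instance, domain) pairs already counted
    counts = {}
    for instance, cls in type_assertions:
        domain = CLASS_TO_DOMAIN.get(cls)
        if domain is not None and (instance, domain) not in seen:
            seen.add((instance, domain))
            counts[domain] = counts.get(domain, 0) + 1
    return counts

assertions = [
    ("dbr:Berlin", "dbo:Place"),
    ("dbr:Berlin", "dbo:PopulatedPlace"),  # same domain: not counted again
    ("dbr:Madonna", "dbo:MusicalArtist"),
]
coverage = domain_coverage(assertions)
```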

8. Conclusion

Freely available knowledge graphs (KGs) have not been in the focus of any extensive comparative study so far. In this survey, we defined a range of aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, OpenCyc, Wikidata, and YAGO along these aspects, and proposed a framework as well as a process to enable readers to find the most suitable KG for their settings.

References

[1] M. Acosta, E. Simperl, F. Flöck, and M. Vidal. HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 111–118. ACM, 2015.

[2] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, and J. Lehmann. Crowdsourcing Linked Data Quality Assessment. In The Semantic Web – ISWC 2013, pages 260–276. Springer, 2013.

[3] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, and J. Lehmann. Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. Semantic Web, 2016.

[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[5] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer, Berlin Heidelberg, 2013.

[6] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino. Methodologies for Data Quality Assessment and Improvement. ACM Comput. Surv., 41(3):16:1–16:52, July 2009.


[7] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider. OWL Web Ontology Language Reference. https://www.w3.org/TR/2004/REC-owl-ref-20040210/, 2004. [Online; accessed 06-Apr-2016].

[8] T. Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[9] T. Berners-Lee. Linked Data Is Merely More Data. http://www.w3.org/DesignIssues/LinkedData.html, 2006. [Online; accessed 28-Feb-2016].

[10] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):29–37, May 2001.

[11] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. VDM Publishing, 2007.

[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA, 2014. ACM.

[14] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pages 145–156, 2011.

[15] B. Ell, D. Vrandečić, and E. Simperl. Labels in the Web of Data. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 162–176. Springer Berlin Heidelberg, 2011.

[16] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference, ISWC 2014, pages 50–65. Springer, 2014.

[17] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2017. To be published.

[18] M. Färber, C. Menne, and A. Rettinger. A Linked Data Wrapper for CrunchBase. Semantic Web Journal, 2017. To be published.

[19] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[20] A. Flemming. Qualitätsmerkmale von Linked-Data-veröffentlichenden Datenquellen (Quality characteristics of linked data publishing datasources). Diploma Thesis, Humboldt University of Berlin. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/diploma_seminar_thesis/Diplomarbeit_Annika_Flemming.pdf, 2011.

[21] G. Freedman and E. G. Reynolds. Enriching Basal Reader Lessons with Semantic Webbing. Reading Teacher, 33(6):677–684, 1980.

[22] C. Fürber and M. Hepp. SWIQA – A Semantic Web Information Quality Assessment Framework. In Proceedings of the 19th European Conference on Information Systems (ECIS 2011), volume 15, page 19, 2011.

[23] R. Guns. Tracing the origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[24] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In The Semantic Web – ISWC 2010: 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, pages 305–320. Springer Berlin Heidelberg, 2010.

[25] O. Hassanzadeh, M. J. Ward, M. Rodriguez-Muro, and K. Srinivas. Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases – An Empirical Study. In Proceedings of the 10th International Workshop on Ontology Matching, collocated with the 14th International Semantic Web Conference, ISWC 2015, 2015.

[26] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.

[27] D. Hernández, A. Hogan, and M. Krötzsch. Reifying RDF: What Works Well With Wikidata? In Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 14th International Semantic Web Conference, pages 32–47, 2015.

[28] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, volume 628, 2010.

[30] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web, 14:14–44, 2012.

[31] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There's No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. [Online; accessed 20-Jul-2015].

[32] J. M. Juran, F. M. Gryna, and R. S. Bingham, editors. Quality Control Handbook. McGraw-Hill, 1974.

[33] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737. Springer Berlin Heidelberg, 2009.

[34] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.

[35] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Knowledge Engineering and the Semantic Web – 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013, Proceedings, pages 265–272. Springer, 2013.

[36] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An Introduction to the Syntax and Content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI – Association for the Advancement of Artificial Intelligence, 2006.

[37] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, and C. Batini. Managing data quality in cooperative information systems. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 486–502. Springer, 2002.

[38] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Wikipedia and Artificial Intelligence: An Evolving Synergy, Papers from the 2008 AAAI Workshop, page 65, 2008.

[39] F. Naumann. Quality-Driven Query Answering for Integrated Information Systems, volume 2261. Springer Science & Business Media, 2002.

[40] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.

[41] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume Part II, ISWC'10, pages 355–355. Springer Berlin Heidelberg, 2010.

[42] A. Singhal. Introducing the Knowledge Graph: things, not strings. https://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html, 2012. [Online; accessed 29-Aug-2016].

[43] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[44] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, pages 1419–1428, 2016.

[45] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[46] R. Y. Wang, M. P. Reddy, and H. B. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3):349–372, 1995.

[47] R. Y. Wang and D. M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5–33, 1996.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[49] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2015.
