
Semantic Web 1 (2015) 1–5
IOS Press

A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Michael Färber *,**, Basil Ell, Carsten Menne, and Achim Rettinger

Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe, Germany

Abstract. In recent years, several noteworthy large, crossdomain, and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we first define aspects according to which KGs can be analyzed. Next, we analyze and compare the above-mentioned KGs along those aspects and finally propose a method for finding the most suitable KG for a given setting.

Keywords: Knowledge Graph, Comparison, DBpedia, Freebase, OpenCyc, Wikidata, YAGO

1. Introduction

The idea of the Semantic Web is that of publishing and querying knowledge on the Web in a semantically structured way. According to Guns [27], the term “Semantic Web” was already being used in fields such as educational psychology before it became prominent in computer science. Freedman and Reynolds [25], for instance, describe “semantic webbing” as organizing information and relationships in a visual display. Berners-Lee presented his idea of using typed links as a vehicle of semantics for the first time at the World Wide Web Fall 1994 Conference under the heading “Semantics,” and under the heading “Semantic Web” in 1995 [27].

The idea of a Semantic Web was introduced to a wider audience by Berners-Lee in 2001 [11]. According to his vision, the traditional Web as a Web of Documents should be extended to a Web of Data where not only documents and links between documents, but any entity (e.g., a person or organization) and any relation between entities (e.g., isSpouseOf) can be represented on the Web.

When it comes to realizing the idea of the Semantic Web, knowledge graphs (KGs) are currently seen as one of the most essential components. We define a knowledge graph as a knowledge base (KB) (defined as the combination of an ontology and instances of the classes in the ontology [59, p. 739]) consisting of a large number of facts about entities. Besides domain-specific KGs, openly available KGs often cover general, i.e., encyclopedic/crossdomain, knowledge, as DBpedia exemplifies. This makes KGs widely applicable: not only a small set of users – as in the case of expert systems – benefit from using the stored structured knowledge (e.g., via specific search interfaces of expert systems), but any person with access to the Web can benefit, e.g., by using Web search functionalities where semantic queries against a KG extend traditional information retrieval queries on documents.

In this survey, we focus on those KGs (i) that are freely accessible and freely usable, (ii) that incorporate the Semantic Web standards to some extent, such as modeling with RDF¹ and querying with SPARQL,² and (iii) that do not cover special domains such as the biomedical domain, but instead cover general knowledge (often also called crossdomain or encyclopedic knowledge).

* Corresponding author. E-mail: [email protected].
** This work was carried out with the support of the German Federal Ministry of Education and Research (BMBF) within the Software Campus project SUITE (Grant 01IS12051).

1570-0844/15/$27.50 © 2015 – IOS Press and the authors. All rights reserved

Thus, out of scope are KGs which are not openly available, such as the Google Knowledge Graph,³ the Google Knowledge Vault [21], and the Facebook Graph,⁴ as well as KGs which are not based on Semantic Web standards at all or are only accessible via an API (see WolframAlpha⁵). Also excluded are unstructured or weakly structured knowledge collections.

For selecting the KGs for analysis, we regarded all datasets which were registered at the online dataset catalog⁶ http://datahub.io and which were tagged as “crossdomain”. Besides that, we took further datasets into consideration which fulfilled the above-mentioned requirements (e.g., Wikidata). In total, we nominated DBpedia, Freebase, Cyc, Wikidata, and YAGO as KGs for our comparison.

In this paper, we give a systematic overview of these KGs in their current versions and discuss how the facts of these KGs are modeled, stored, and queried. Note that the focus of this survey is not the life cycle of KGs on the Web or in enterprises; for that topic, we refer the reader to [8].

Besides juxtaposing the characteristics of the KGs, we provide a recipe for users who are interested in using one of the mentioned KGs in a research or industrial setting, but who are unsure which KG to choose for their concrete setting.

The main contributions of this survey are:

1. We define 35 aspects (characteristics) according to which KGs can be analyzed.

2. We analyze DBpedia, Freebase, Cyc, Wikidata, and YAGO along these aspects.

3. We propose a checklist which enables users to find the most suitable KG for their needs.

¹ See http://www.w3.org/RDF/ (accessed June 16, 2015).
² See http://www.w3.org/TR/rdf-sparql-query/ (accessed June 16, 2015).
³ See http://www.google.com/insidesearch/features/search/knowledge.html
⁴ See https://developers.facebook.com/docs/graph-api
⁵ See http://products.wolframalpha.com/api/
⁶ This catalog is also used for registering Linked Open Data datasets.

The organization of this survey is as follows:

– In Section 2 we describe the genesis of semantic data models and provide a definition for both semantic data models and graph models, since KGs are realizations of both models.

– In Section 3 we describe aspects by which knowledge graphs can be analyzed.

– In Section 4 we describe the knowledge graphs we analyze.

– In Section 5 we analyze the knowledge graphs along the aspects listed in Section 3.

– In Section 6 we present a guideline to assess the knowledge graphs according to the user’s setting.

– In Section 7 we outline current limitations of KGs.

– In Section 8 we glance over the possible future of the KGs and of the Semantic Web.

– In Section 9 we conclude the survey.

2. Semantic Data Models and Graph Models

Two data model types are especially relevant with respect to KBs and, hence, to KGs: semantic data models and graph data models. In this section, we first describe the genesis of semantic data models and show how both semantic data models and graph data models have been defined. For an in-depth introduction to semantic data models and graph data models, the interested reader is referred to [47] and [5], respectively.

2.1. Genesis of Semantic Data Models

The evolution from database (DB) design toward KB design is coupled with increasing abstraction layers. In the early stages of DB design, data models were conceptually close to the physical layer of data storage. After the basic physical models, hierarchical models [63] became prominent. They were superseded conceptually by network models [62] and later by relational models [17]. Contrary to the hierarchical and network models, the relational model was no longer located on the primitive record level, but between the physical and logical level, although still affiliated to the record level.

The first essential landmarks of semantic data modeling arose in the mid-seventies, when databases increasingly supported the user’s view on the data. New paradigms then enriched semantic data modeling over time. The following concepts are notable in this respect:

1. The idea of data independence (see [16]) states that data is not modeled in the way required by the storage architecture, but according to the user’s application (i.e., having entities and relationships among them). This application-oriented paradigm was accompanied by the emergence of new programming languages such as C (developed between 1969 and 1973) and Smalltalk, which were more abstract than traditional, data storage-oriented languages such as COBOL.

2. The idea of semantic “injection”/enrichment (see for instance [54] and [56]): Although semantics was encoded into data models only on a low level (i.e., for single data items) up to the mid-seventies, new approaches considered semantic relations between data items and how to model interrelational dependencies. Implementing rules based on these interrelational dependencies made consistency checks possible. The following notable modeling approaches are steps in the evolution of semantic data modeling:

   (a) Schmid [54] introduced the idea of modeling basic semantic properties that entities of a certain class (e.g., person) may have, as well as relationships that entities of certain classes may have (e.g., the has-spouse relationship).

   (b) Smith [56] introduced generalization and aggregation as new forms of abstraction: a) Generalization is used to express similarities (see Figure 1a) and is modeled between classes: one class (e.g., carnivores) is a subclass of another class (e.g., animals) and shares properties with the superclass. b) An aggregation is the composition of an object from a set of objects. The aggregation class (see Class in Figure 1b) stands as a whole unit in place of its components (in Figure 1b, Instructor and Course).

   (c) Brodie [14] introduced classification and association relationships as further modeling approaches: a) Classification means assigning a class to an entity (e.g., Markus is a person). b) An association is a relationship between classes and describes the connections between classes in terms of the shared semantics and structure. Any aggregation or composition is an association.

[Figure 1. (a) Generalization: the class Animals with the subclasses Carnivores and Herbivores. (b) Aggregation: the class Class composed of Instructor and Course.]

Fig. 1. Examples for generalization and aggregation.

3. Later, in the 1980s, object-oriented models [36] appeared: data was considered as collections of objects of specific classes. In parallel to the rise of object-oriented models, graph models appeared. With these models, users were able to represent the inherent graph structure of data.

4. Afterwards, other models such as semi-structured models [15] and the XML model [13] were proposed.

5. The Resource Description Framework (RDF) was originally published by the W3C as a recommendation in 1999 [40] and in a new version in 2004 [37]. In 2014, RDF 1.1 [20] was published. RDF forms the basis of the semantic graph model as we consider it in this survey.
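The forms of abstraction named in item 2 (generalization, aggregation, and classification) can be illustrated with a minimal Python sketch. The class names follow Fig. 1; the class name TaughtClass, the attribute names, and the example values are assumptions made for illustration only and do not stem from the cited works.

```python
from dataclasses import dataclass


# Generalization: Carnivore and Herbivore are subclasses of Animal and
# share (inherit) the properties of the superclass.
@dataclass
class Animal:
    name: str


@dataclass
class Carnivore(Animal):
    pass


@dataclass
class Herbivore(Animal):
    pass


# Aggregation: a TaughtClass (the "Class" node of Fig. 1b, renamed here to
# avoid confusion with the Python keyword) stands as one unit in place of
# its components, an instructor and a course.
@dataclass
class Instructor:
    name: str


@dataclass
class Course:
    title: str


@dataclass
class TaughtClass:
    instructor: Instructor
    course: Course


# Classification: assigning a class to an entity by instantiating it.
lion = Carnivore(name="lion")
taught = TaughtClass(Instructor("Markus"), Course("Semantic Web"))

print(isinstance(lion, Animal))   # generalization: a carnivore is an animal
print(taught.instructor.name)     # aggregation: access one component
```

Object-oriented languages cover these three forms directly; association as described by Brodie would additionally require relationships declared between the classes themselves.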

2.2. Definition of the Semantic Data Model

According to Hammer [29], semantic data models are characterized as data models adhering to the following principles:

1. A database is a collection of entities that correspond to actual objects in the application environment.

2. Entities in the database are organized into classes.

3. Classes may be interconnected.

4. Entities and classes are characterized by relations (called attributes by Hammer), and relations may interconnect entities.

5. Relations can be derived from other relations via entailment.
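The five principles above can be sketched with a small set of facts and a naive entailment loop. This is only an illustrative sketch: the relation names type, subClassOf, and hasSpouse, the two entailment rules, and all entities are assumptions chosen for this example, not part of Hammer's definition.

```python
# Facts are (subject, relation, object) tuples; all names are illustrative.
facts = {
    ("Markus", "type", "Person"),        # principles 1 and 2: entity in a class
    ("Person", "subClassOf", "Agent"),   # principle 3: interconnected classes
    ("Markus", "hasSpouse", "Anna"),     # principle 4: relation between entities
}


def entail(facts):
    """Derive new relations until nothing changes (principle 5)."""
    derived = set(facts)
    while True:
        new = set()
        for s, p, o in derived:
            if p == "hasSpouse":
                # Assumed rule: hasSpouse is symmetric.
                new.add((o, "hasSpouse", s))
            if p == "type":
                # Assumed rule: class membership propagates to superclasses.
                for c, q, d in derived:
                    if q == "subClassOf" and c == o:
                        new.add((s, "type", d))
        if new <= derived:
            return derived
        derived |= new


closure = entail(facts)
print(("Markus", "type", "Agent") in closure)
print(("Anna", "hasSpouse", "Markus") in closure)
```

The loop recomputes all rule applications in every round, which is acceptable for a handful of facts but would not scale to the KG sizes discussed in this survey.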

Well-known examples of the semantic data model are the entity-relationship model [16] and RDF.⁷ Further examples are the IFO model [1] and SDM [30].

⁷ See http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/ (accessed July 22, 2015).


2.3. Definition of the Graph Data Model

When deciding on a data model, the choice typically depends on (i) the modeling domain, (ii) the end users, and (iii) the hardware and software constraints. Codd [18] distinguishes three components that each data model possesses: 1) the set of data structure types, 2) the set of operators or inference rules, and 3) the set of integrity rules. Angles and Gutierrez [5] define the graph data model along this categorization:

1. Data structure types: In the graph data model, instance data and/or schema information is represented by graphs or by data structures which generalize the notion of a graph.

2. Operators and inference rules: Data is manipulated via graph transformation operators (see [28]).

3. Integrity rules: Rules can be constructed which ensure data consistency.

In graph data models, the semantics comes into play as follows: (i) graph nodes are interpreted as entities or values; (ii) typed relations between nodes are interpreted as facts about the involved entities; and (iii) a schema is introduced by assigning types to instances and introducing relations among classes. The information focus supported by the graph data model therefore encompasses the schema, the instances, and the relations.

Established graph models are GOOD [28], GMOD [4], G-Log [46], Gram [3], and RDF. They all use a labeled directed graph for modeling both the schema and the instance level.

KGs, as we consider them in this survey, are realizations of both semantic data models and graph models, since the KGs are characterized by having a set of entities, a set of classes, a set of relations between entities, and a set of relations between entities and classes.
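To make this characterization concrete, the following sketch stores schema and instance level in one labeled directed graph and then derives the four sets named above. The edge labels type and subClassOf and all node names are assumptions made for this example; real KGs use their own vocabularies.

```python
# A labeled directed graph as a set of (source, label, target) edges.
# All identifiers are made up for this example.
edges = {
    ("Karlsruhe", "type", "City"),
    ("Germany", "type", "Country"),
    ("Karlsruhe", "locatedIn", "Germany"),
    ("City", "subClassOf", "Place"),
    ("Country", "subClassOf", "Place"),
}

# Labels that carry schema-level meaning in this sketch.
SCHEMA_LABELS = {"type", "subClassOf"}

# Classes: every target of a "type" edge and every node on a "subClassOf" edge.
classes = {o for s, p, o in edges if p == "type"} | {
    n for s, p, o in edges if p == "subClassOf" for n in (s, o)
}

# Entities: every node that has been assigned a class.
entities = {s for s, p, o in edges if p == "type"}

# Relations between entities (instance level).
entity_relations = {(s, p, o) for s, p, o in edges if p not in SCHEMA_LABELS}

# Relations between entities and classes (classification).
entity_class_relations = {(s, p, o) for s, p, o in edges if p == "type"}

print(sorted(classes))
print(sorted(entities))
```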

3. Criteria for Comparison

Several works compared semantic data models:

– Brodie [14] categorized semantic data models into: (i) classical models, (ii) mathematical models, (iii) irreducible data models, (iv) static semantic hierarchy models, and (v) dynamic semantic hierarchy models.

– Tsichritzis and Lochovsky [64] categorized semantic data models into: (i) traditional models, (ii) entity-relationship models, (iii) binary models, (iv) semantic network models, and (v) infological data models.

– Hull and King [31] compared semantic data models according to the features they provide, such as aggregation, grouping, printable, object-valued, and multi-valued.

– Kerschberg et al. [35] analyzed data models according to mathematical foundations, terminology, and semantic levels of abstraction, and distinguished between graph-theoretic and set-theoretic models.

These approaches consider neither current KGs with their data models nor ontologies. Also, the criteria used for comparing semantic data models are very abstract, since a wide range of data models are compared against each other. Besides these works, there are approaches which explicitly analyze (and sometimes assess) ontologies, but not KGs. Some of them are mentioned in the following (see also [12] for an overview of ontology evaluation):

– Tartir et al. [61] introduced the approach OntoQA, by which ontology schemas and their populations (i.e., KBs) can be analyzed through a set of metrics so that key characteristics of an ontology schema can be highlighted. The analysis focused on numerically expressible characteristics of ontologies, and results regarding the ontologies SWETO, TAP, and GlycO were presented.

– Lozano-Tello et al. [42] proposed ONTOMETRIC, which allows users to measure the suitability of existing ontologies with regard to the requirements of their systems. ONTOMETRIC presents a generic methodology and does not analyze specific ontologies.

– Vrandecic et al. [66] differentiated between structural and ontological metrics and provided principled means for defining metrics that take the semantics of the ontology appropriately into account.

– Poveda-Villalon et al. [48] present a tool called OOPS!, by which an RDF document describing an ontology can be analyzed. Potential pitfalls that could lead to modeling errors are then presented to the user.

Since these approaches focus only on ontologies, we cannot compare the datasets used. Also, the criteria for comparison are different from ours, since we do not focus only on the schema. To the best of our knowledge, a systematic comparison of openly available knowledge graphs has not been carried out so far. Therefore, we systematically analyze and compare knowledge graphs according to aspects of the following categories:

– General information: What general properties does the KG have?

– Format and representation: How are facts represented, stored, and queried?

– Genesis and usage: How was the KG created and how is it used?

– Entities: How are entities represented and described in the KG?

– Relations: How are relations represented and described in the KG?

– Schema: What are the features of the schema of the KG?

– Particularities: What particularities (special features) does the underlying data model of the KG have?

In the following, we list the criteria we use for comparing the different KGs, grouped by the categories mentioned.

3.1. General Information

We use the following aspects to collect general information about the KGs:

1. Homepage: The URL where the KG can be accessed.

2. Current version: The version of the knowledge base we consider in this survey.

3. Languages: What languages (e.g., English) are used in the KG on schema and instance level?

4. Covered domains: Which domains are covered by the KG? Are there any fields in which the KB is populated only rudimentarily?

5. License: Under which license is the content of the KG provided?

3.2. Format and Representation

For comparing the different approaches for representing, storing, and querying knowledge, we use the following aspects:

1. Fact representation: Facts can be represented as triples, quadruples, or similar.

2. Dataset formats: The data storage format (e.g., JSON) in which data is provided.

3. Dynamicity: Is the KG updated continuously (dynamic KG) or are only fixed versions of the KG offered (static KG)?

4. HTTP lookup: Is machine-readable information about resources available via live HTTP lookup (i.e., querying on demand in order to follow the Linked Data principles [9], so that no export functionality or file download is needed)?

5. RDF export: Is data available as an RDF export, either via files or via a SPARQL endpoint?

6. Software for data storage: Which software is used for storing and querying the KG?

7. Query language (online): Each KG may provide one or several query languages in which queries against the KG are formulated.

8. Size of schema and instance graph: How many classes and relations are in the KG, how many facts, and how many unique instances?
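The HTTP lookup aspect (item 4) can be sketched as an HTTP GET request that asks for a machine-readable representation via the Accept header. The snippet below only builds the request and does not send it; the resource URI and the media type text/turtle are assumptions chosen for illustration, and whether a particular KG answers such a request with RDF is exactly what this aspect examines.

```python
import urllib.request

# Hypothetical example resource; any Linked Data URI could be used here.
RESOURCE_URI = "http://dbpedia.org/resource/Karlsruhe"


def build_lookup_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request asking for RDF."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


request = build_lookup_request(RESOURCE_URI)
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))

# To actually perform the lookup (requires network access):
# with urllib.request.urlopen(request) as response:
#     turtle = response.read().decode("utf-8")
```

Keeping the network call commented out makes the sketch runnable offline; the interesting part for the comparison is only which representation the server returns for such a request.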

3.3. Genesis and Usage

The following aspects address where the facts stored in the KGs come from and where they are applied:

1. Provenance of facts: Is the KG content derived from unstructured or semi-structured data by information extraction techniques or is it gathered manually by users and/or bots?

2. Quality assurance of facts: Are there any restrictions or constraints regarding the quality of stored knowledge? If the correctness of facts is ensured, how and with what precision is this performed?

3. Software projects: Which software projects make use of the KG?

4. Influence on other LOD datasets: Which other KG-building initiatives take the KG as a starting point?

3.4. Entities

The following aspects address the characteristics of the entities in the KGs:

1. Entity reference: What kind of IDs are used to refer to entities?

2. LOD registration: Is the dataset registered at http://datahub.io as part of the Linked Open Data (LOD) cloud?⁸

3. LOD linkage: Are entities linked to entities of other KGs in the LOD cloud?

4. Entity relevance: Is the ordering or ranking of entities according to some function, such as a relevance function, supported?

5. Description of entities: Are entities described in a human-readable way within the KG, e.g., via textual descriptions? What format is used for that?

⁸ The Linked Open Data (LOD) cloud is a collection of datasets published on the Web following the Linked Data principles [9]. The LOD cloud project originated from the W3C Linking Open Data project (see http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) and currently consists of almost 1000 datasets; see http://lod-cloud.net/.

3.5. Relations

The following aspects address the characteristics of the relations in the KGs:

1. Relation reference: What kind of IDs are used to refer to relations?

2. Relation relevance: Is the ordering or ranking of relations according to relevance supported? In this way, relations can be declared as more important, for instance because they are more relevant to most users than other relations.

3. Description of relations: Are relations described in a human-readable way within the KG, e.g., via textual descriptions? What format is used for that?

3.6. Schema

The characteristics of the schema of the different KGs can be addressed by the following aspects:

1. Schema restrictions: Is a fixed schema used or can the schema be extended by users?

2. Schema constraints: Are there any schema constraints which need to be observed? May the KG contain data that is inconsistent with the schema? For example, if – according to the (logical) schema constraints – an entity may only occur once as the subject of a certain relation such as has-spouse, but occurs within the KG several times as subject with different objects (which are explicitly defined as different entities), then the KG contains data which violates the schema.

3. Hierarchy and network of relations: Does the KG contain relations among relations, e.g., a taxonomy of relations (sub-relation, super-relation) or other types of relations (e.g., inverse relation)?

4. External vocabulary: Is external vocabulary (classes or relations that belong to existing schemas, e.g., schemas from datasets in the LOD cloud) reused in the KG?

5. Description of concepts: Are concepts described within the KG, e.g., via textual descriptions? What format is used for that?

6. Forms of abstraction: Is classification, generalization, aggregation, or association supported?

7. Data types: Which data types are used in the KG?
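The has-spouse example given under schema constraints (item 2) can be turned into a simple consistency check. The sketch below assumes that has-spouse is declared functional (at most one distinct object per subject) and that different names denote different entities; the facts themselves are invented for illustration.

```python
from collections import defaultdict

# Illustrative facts; "has-spouse" is assumed to be functional, i.e., each
# subject may have at most one distinct object.
facts = [
    ("Anna", "has-spouse", "Markus"),
    ("Peter", "has-spouse", "Maria"),
    ("Peter", "has-spouse", "Julia"),   # violates the assumed constraint
    ("Peter", "born-in", "Karlsruhe"),
]
FUNCTIONAL_RELATIONS = {"has-spouse"}


def find_violations(facts, functional_relations):
    """Return {(subject, relation): objects} for every violated constraint."""
    objects = defaultdict(set)
    for s, p, o in facts:
        if p in functional_relations:
            objects[(s, p)].add(o)
    return {key: objs for key, objs in objects.items() if len(objs) > 1}


violations = find_violations(facts, FUNCTIONAL_RELATIONS)
print(violations)
```

A KG that accepts the third fact without complaint contains data that is inconsistent with its schema in the sense described above; a KG that enforces the constraint would reject it.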

3.7. Particularities

Here, particularities of the analyzed KGs are identified which are not covered by the other aspects. Aspects anticipated before the analysis were:

1. Temporal aspects: Facts that may change over time (e.g., a country’s president) may be annotated with the time when the fact was valid (e.g., a time interval). Furthermore, other time-related information about a fact may be stored, such as the point in time when the fact was added to the KG or when the fact was updated.

2. Source of facts: Does the KG store where its knowledge was retrieved from (e.g., the document it was extracted from)?

3. Reification: Is it possible to represent statements about statements? Reification here means having a means of referring to a statement via an identifier, thus making it possible to formulate statements about statements.
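The three aspects above fit together: once a statement has its own identifier (reification), temporal information and the source can be attached to it as further statements. The following sketch illustrates this idea; the class Statement, the relation names, and all values are hypothetical and do not reflect the data model of any of the analyzed KGs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """A fact with its own identifier, so other facts can refer to it."""
    id: str
    subject: str
    relation: str
    object: str


# All identifiers, relation names, and values are illustrative.
fact = Statement("stmt1", "ExampleCountry", "hasPresident", "Jane Doe")

# Statements about the statement: the subject is the statement identifier.
annotations = [
    (fact.id, "validFrom", "2000-01-01"),                      # temporal aspect
    (fact.id, "validUntil", "2004-12-31"),                     # temporal aspect
    (fact.id, "source", "http://example.org/some-document"),   # source of fact
]


def describe(statement, annotations):
    """Collect all annotations attached to one statement."""
    return {p: o for s, p, o in annotations if s == statement.id}


meta = describe(fact, annotations)
print(meta["validFrom"], meta["validUntil"])
print(meta["source"])
```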

4. Selection of KGs

We consider the following knowledge graphs for ourcomparative evaluation:

– DBpedia: DBpedia9 is the most popular andprominent KG in the LOD cloud [7]. The projectwas initiated by researchers from the Free Uni-versity of Berlin and the University of Leipzig,in collaboration with OpenLink Software. Sincethe first public release in 2007, DBpedia is up-dated roughly once a year.10 DBpedia is cre-

9See http://dbpedia.org10There is also DBpedia live which started in 2009 and which

is updated when Wikipedia is updated. See http://live.dbpedia.org/.

Page 7: Semantic Web 1 (2015) 1–5 IOS Press A Comparative Survey ... · / A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO modeling with RDF1 and querying with SPARQL,2

M. Färber et al. / A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO 7

ated from automatically-extracted structured in-formation contained in the Wikipedia, such asfrom infobox tables, categorization information,geo-coordinates, and external links. Due to itsrole as the hub of LOD, DBpedia contains manylinks to other datasets in the LOD cloud suchas Freebase, OpenCyc, UMBEL,11 GeoNames,12

Musicbrainz,13 CIA World Factbook,14 DBLP,15

Project Gutenberg,16 DBtune Jamendo,17 Euro-stat,18 Uniprot,19 and Bio2RDF.20 DBpedia isused extensively in the Semantic Web researchcommunity, but is also relevant in commercialsettings: companies use it to organize their con-tent, such as the BBC [38] and the New YorkTimes [52].

– Freebase: Freebase21 is a KG announced byMetaweb Technologies, Inc. in 2007 and wasacquired by Google Inc. on July 16, 2010. Incontrast to DBpedia, Freebase had provided aninterface that allowed end-users to contributeto the KG by editing structured data. Besidesuser-contributed data, Freebase integrated datafrom Wikipedia, NNDB,22 FMD,23 and Mu-sicBrainz.24 Freebase uses a proprietary graphmodel for storing also complex statements. OnDecember 16, 2014, the Freebase team announcedthat Freebase will shutdown its services on June30, 2015. Wikimedia Deutschland and Googleplan to integrate Freebase data into Wikidata inthe near future – a tool for that will be developeduntil August 2015 – and to close the Freebasewebsite earliest three months later.25

11See http://umbel.org/12See http://www.geonames.org/13See http://musicbrainz.org/14See https://www.cia.gov/library/

publications/the-world-factbook/15See http://www.dblp.org16See https://www.gutenberg.org/17See http://dbtune.org/jamendo/18See http://eurostat.linked-statistics.org/19See http://www.uniprot.org/20See http://bio2rdf.org/21See http://freebase.com/22See http://www.nndb.com23See http://www.fashionmodeldirectory.com/24See http://musicbrainz.org/25See https://plus.google.com/u/0/

109936836907132434202/posts/bu3z2wVqcQcand https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase.

– OpenCyc: The Cyc26 project started in 1984 aspart of Microelectronics and Computer Technol-ogy Corporation. The aim of Cyc is to store (ina machine-processable way) millions of commonsense facts such as “Every tree is a plant.” Whilethe focus of Cyc in the first decades was on infer-encing and reasoning, more recent work puts a fo-cus on human-interaction such as building ques-tion answering systems based on Cyc. Since Cycis proprietary, a smaller version of the KG calledOpenCyc27 was released under the open sourceApache license. In July 2006, ResearchCyc28 waspublished for the research community, containingmore facts than OpenCyc.

– Wikidata: Wikidata29 is a project of WikimediaDeutschland which started on October 30, 2012.The aim of the project is to provide data whichcan be used by any Wikipedia project, includingWikipedia.Wikidata does not only store facts, but also thecorresponding sources, so that the validity of factscan be checked. Labels, aliases, and descriptionsof entities in Wikidata are provided in more than350 languages. Wikidata is a community effort,i.e., users collaboratively add and edit informa-tion. Also, the schema is maintained and extendedbased on community agreements. In the near fu-ture, Wikidata will grow due to the integration ofFreebase data.

– YAGO: YAGO – Yet Another Great Ontology– has been developed at the Max Planck In-stitute for Computer Science in Saarbrückensince 2007. YAGO comprises information ex-tracted from the Wikipedia (e.g., categories, redi-rects, infoboxes), WordNet[23] (e.g., synsets, hy-ponymy), and GeoNames.30 As of March 24,2015, YAGO3 is available.31

5. Comparison

In Tables 1–7 we summarize our comparison of the knowledge graphs listed in Section 4 using the aspects given in Section 3. An online version is also available at http://kg-wiki.org. In the following subsections, we provide a detailed analysis in terms of the different aspects of the introduced evaluation categories.

26See http://www.cyc.com/
27See http://www.opencyc.org/
28See http://research.cyc.com/
29See http://wikidata.org/.
30See www.geonames.org/
31See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/

5.1. Comparison of General KB Information

The following findings are notable with regard to general information of the considered KGs (cf. Table 1; information about the homepage and the considered KG version is not discussed any further):

– Language support: Most KGs either support only the English language (such as OpenCyc), or languages other than English are added on top. For Freebase and YAGO, entity and property labels for additional languages are provided.
Remarkable in this context is Wikidata, both in terms of the number of languages supported and in terms of its language-agnostic KG model (cf. the identifiers for entities and relations, which intentionally consist of a character and a number).

– Covered domains: Since we restricted our KG choice to cross-domain KGs, all considered KGs contain general knowledge, for instance general information about instances of persons such as Barack Obama. Specific domains such as the biomedical domain, as well as common sense knowledge (class relationships such as “A human has two legs” or “A child is a human”) and linguistic knowledge (relationships between linguistic concepts such as “to compose is a synonym of to write”), were excluded from this survey.
Although the scope of the considered KGs is broad and unrestricted in nature, we can make statements about the “relative filling degrees” (in terms of number of entities or number of statements) with respect to particular parts of the KGs and, hence, about the maturity of the considered KGs:
Firstly, Wikidata is still in a start-up phase in the sense that not all subdomains (indicated by the classes) are covered in depth. Wikidata is especially well populated in fields such as “Person” and biological entities, but provides only rudimentary information about entities in fields such as society. All other KGs can be classified as mature, since they have not only existed for a rather long time, but are also well positioned in all general domains.

Secondly, OpenCyc can be seen as mature, but consists largely of schema information and is, in terms of its entities and hence in the sense of a KG, rather a collection of entities belonging to different classes. Hence, OpenCyc is well suited for reasoning, but less so for entity retrieval purposes.

– License: Data of all considered KGs except Wikidata is licensed under the Creative Commons Attribution 3.0 license,32 which means that the data may be used in private and commercial settings and may be modified by the user.
In the case of Wikidata, all structured data of the main namespace and the property namespace is licensed under Creative Commons CC0,33 while text in all other namespaces of Wikidata is available under the Creative Commons Attribution/Share-Alike License.34 The Creative Commons CC0 license allows the publisher to waive as many rights as legally possible and is especially used for databases.
In summary, we can state that all considered KGs can be used free of charge, but in return appropriate credit has to be given and the same license has to be used for further usage.
Interesting in the context of KG licenses is the study of Jain et al. [32], who examined the applicability of well-known Linked Data datasets for commercial applications. The authors conclude that the crucial point is not the technical issues of deploying and using Linked Data datasets, but the legal aspects. Often, the license under which a Linked Data dataset can be reused is not specified by the data providers.

5.2. Comparison of Format and Representation

– Fact representation: The KGs DBpedia and OpenCyc store facts as single triples and do not regard additional meta-information about facts such as the confidence of the triples being correct or temporal information related to the facts (e.g., the validity time).

32See https://creativecommons.org/licenses/by/3.0/.

33See https://creativecommons.org/publicdomain/zero/1.0/.

34See https://creativecommons.org/licenses/by-sa/3.0/ and https://www.wikidata.org/wiki/Wikidata:Database_download/en#License.


Table 1: Comparison of the KGs regarding their general information.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Homepage | http://dbpedia.org | http://freebase.com | http://opencyc.org | http://wikidata.org | http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/ |
| Current version | DBpedia 2015-04 | continuously updated until Mar 31, 2015 (a) | OpenCyc 4.0 | Cont. updated since Oct 2012 | YAGO3 |
| Languages | “Main” DBpedia is English (properties etc.), but linked localized versions are available in 125 languages (localized are textual descriptions such as rdfs:label, rdfs:comment, dbpedia-owl:abstract; there are also links to local versions of Wikipedia) | human-readable IDs are in English, but every entity and property has an i18n in many languages | English | Almost every language (by community), even dialects | All entity names are from English Wikipedia, some rdfs:label values have different languages |
| Covered domains | General knowledge | General knowledge, very broad, sometimes deep | Common sense | General knowledge | General knowledge |
| License (content) | Creative Commons Attribution-ShareAlike 3.0, GNU Free Documentation License | Creative Commons Attribution Only | Creative Commons Attribution 3.0 | Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication | Creative Commons Attribution 3.0 |

(a) Google announced that it would close Freebase on June 30, 2015. However, currently (July 30, 2015) it is still available.


Table 2: Comparison of the KGs regarding format and representation.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Fact representation | Triple | Triple with confidence values | Triple | Every entity has multiple statements. Each statement has one or more references and one claim. The claim consists of a property and a value, accompanied by qualifiers | “SPOTL tuple” (SPO triple with time and location) |
| Dataset formats | RDF (nt, nq, ttl) | RDF (nt) | RDF (owl), proprietary file format | JSON, XML, SQL, and RDF (a) | RDF (ttl), TSV |
| Dynamicity | Static | Continuously updated | Static | Continuously updated | Static |
| HTTP lookup | lookup on demand | lookup on demand | lookup on demand | lookup on demand (but incomplete) | lookup on demand |
| RDF export | as files and via SPARQL | as files | as OWL file | as files (by third party) and SPARQL (by third party) | as files and via SPARQL |
| Software for data storage | Virtuoso Universal Server | Graphd | Cyc Knowledge Server | Wikibase | rudimentary query interface, rudimentary browser and demos |
| Query language (online) | SPARQL (http://dbpedia.org/sparql) | MQL (Metaweb Query Language; https://freebase.com/query) | CycL Language | Wikibase-API, SPARQL (third party) | SPARQL (http://lod2.openlinksw.com/sparql) |
| Size of schema and instance graph | 4.58 million entities, 685 classes, 1,079 object properties, 1,600 datatype properties, 116 specialized datatype properties | 1.9 billion triples | 239k terms, 2 million triples, 47k links to DBpedia | 63.2 million statements | >10 million entities, >120 million facts |

(a) https://tools.wmflabs.org/wikidata-exports/rdf/


In contrast to these KGs stand Freebase, YAGO, and Wikidata: For each triple, Freebase also stores a confidence value. The authors of YAGO use the so-called SPOTL(X) tuples for representing spatio-temporally enhanced facts (with the elements subject (S), predicate (P), object (O), time (T), and location (L)). For Wikidata, a model is used where each statement consists of a claim that something is the case and a list of references providing evidence for that claim.

– Dataset formats: For all considered KGs, data is available in RDF format: Data from DBpedia and Freebase is available in the form of RDF files,35 data from OpenCyc is available as OWL files,36 and YAGO is available as both TSV files and RDF files.37

Wikidata has a special position here: It is developed on the basis of Wikibase,38 a proprietary data model for Wikidata. This data model is not per se based on the RDF format. However, unofficial RDF export files of Wikidata39 (and some SPARQL endpoints40) are provided.

– Dynamicity: Many KGs are static in the sense that they are not continuously updated. One reason is that some KGs such as DBpedia are created by computationally expensive information extraction processes. Therefore, DBpedia is static; however, DBpedia Live, a derived version of DBpedia, is continuously updated. For that, Wikipedia provides an OAI-PMH update stream, by means of which 84 articles are analyzed per minute.41

Freebase and Wikidata are dynamic KGs, since their data is maintained by a user community. In the case of these KGs, even the schema is extended by the users.

35See http://wiki.dbpedia.org/Downloads2014, https://developers.google.com/freebase/data, http://tools.wmflabs.org/wikidata-exports/rdf/.
36See http://sw.opencyc.org/.
37See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/.

38https://www.mediawiki.org/wiki/Wikibase/DataModel

39See http://tools.wmflabs.org/wikidata-exports/rdf/ and [22].

40See https://www.wikidata.org/wiki/Wikidata:Data_access/en.

41See http://wiki.dbpedia.org/online-access/DBpediaLive.

– HTTP Lookup: For all considered KGs, data is made available via HTTP lookups on demand: Given a resource (of a KG which is part of the LOD cloud) identified by an HTTP URI, data about this resource can be obtained by dereferencing this URI.42 Typically, the returned information is made available using W3C standards such as RDF. The idea of dereferencing is a crucial point of the Semantic Web vision: In this way, agents can traverse the LOD graph (i.e., follow links within and across single LOD datasets) and gather the information which they need and which is available in the LOD cloud.
HTTP lookups on demand are possible for all KGs considered in this survey, thus allowing for data exports.

– RDF export: Besides being available via HTTP lookups, data from the KGs is also made available as files. The idea here is that the KG data can be downloaded and processed locally instead of being retrieved via HTTP lookups on demand; this includes parsing the files directly or importing the data into an appropriate database such as a triple store. In this way, queries can be set up locally and the hardware load is on the client side.

– Data storage software: Data is stored using different systems: While DBpedia uses Virtuoso Universal Server43 and its available RDF dumps can be loaded into any triple store (such as Virtuoso or 4store44), all other considered KGs (Freebase, OpenCyc, Wikidata, and YAGO3) are, due to the different data models they use internally, based on proprietary software systems. However, the provided RDF dumps of these KGs can be loaded into any triple store.

– Query language (online): Although all considered KGs support RDF as a data format and are available online for HTTP lookups, not all online versions of the KGs offer a SPARQL endpoint: Only DBpedia and YAGO can be queried in this way.45 For Wikidata, several unofficial SPARQL endpoints are available.46 If clients want to query the KGs Freebase and OpenCyc by means of SPARQL, and thereby rely on a W3C Recommendation instead of a proprietary query language, they need to load the provided RDF data into a triple store. This procedure is suggested by the authors of many KGs in case of permanent, extensive querying, since in this way the users provide the hardware resources themselves. Furthermore, it should be noted that in the case of OpenCyc the language CycL47 was developed to enable expressive reasoning.

42See http://tools.ietf.org/html/rfc3986#section-1.2.2 for more information about dereferencing URIs.
43See http://virtuoso.openlinksw.com/
44See http://4store.org/
45See http://dbpedia.org/sparql and http://lod2.openlinksw.com/sparql.

– Size of schema and instance graph: In our analysis, we only took KGs into consideration which are already widely used and which are representative of open, semantically structured datasets on the Web. All considered KGs are therefore large and a comparison of size per se is not reasonable, since the KGs often cover different domains to a different extent or emphasize different levels of knowledge. Note that the numbers given in the table with respect to the KG size are not directly comparable. A complex fact may be represented and counted as one statement in one KG but represented and counted as multiple statements in another KG.
As already outlined above under Covered domains, we can state that all considered KGs fulfill the requirements of being a KG. Wikidata and OpenCyc stand out: While Wikidata is not mature in all areas, but very focused on instances, the primary focus of OpenCyc is schema information; however, it contains many instances and is therefore numbered among the KGs.
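The fact representations discussed in this subsection, and the remark that statement counts are not directly comparable across KGs, can be illustrated with a minimal Python sketch. All class names, property names, and values below are hypothetical and chosen for illustration only; none is taken from the actual APIs or data of the KGs. The sketch models a plain triple, a triple with a confidence value, a SPOTL-style tuple, and a Wikidata-style statement, and then flattens the statement into plain triples through an auxiliary statement node (an encoding assumed here, not an official export format).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Triple:
    """Plain subject-predicate-object fact."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class ScoredTriple:
    """Triple plus a confidence value."""
    triple: Triple
    confidence: float


@dataclass(frozen=True)
class SpotlTuple:
    """Triple enriched with optional time (T) and location (L)."""
    triple: Triple
    time: Optional[str] = None
    location: Optional[str] = None


@dataclass
class Statement:
    """Claim (property, value) with qualifiers and references."""
    subject: str
    prop: str
    value: str
    qualifiers: Dict[str, str] = field(default_factory=dict)
    references: List[str] = field(default_factory=list)


def flatten(stmt: Statement) -> List[Triple]:
    """Encode one statement as several plain triples via a statement node."""
    node = f"_:stmt/{stmt.subject}/{stmt.prop}"
    triples = [
        Triple(stmt.subject, stmt.prop, node),
        Triple(node, "value", stmt.value),
    ]
    triples += [Triple(node, q, v) for q, v in stmt.qualifiers.items()]
    triples += [Triple(node, "reference", r) for r in stmt.references]
    return triples


# Hypothetical example data, not taken from any KG.
stmt = Statement(
    subject="ex:SomeCity",
    prop="ex:population",
    value="300000",
    qualifiers={"ex:pointInTime": "2015"},
    references=["http://example.org/census"],
)
triples = flatten(stmt)
print(len(triples))  # the single statement is counted as four triples here
```

Under this assumed encoding, a KG that counts statements reports one fact where a triple-based KG reports four, which is one reason why the size figures should not be compared directly.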

5.3. Comparison of Genesis and Usage

– Provenance of facts: For covering knowledge about general domain entities, as done primarily by DBpedia, Freebase, Wikidata, and YAGO, Wikipedia content is exploited to some degree with the help of information extraction tools. For creating a more formal-logical representation of knowledge, experts need to be consulted, as the case of Cyc/OpenCyc demonstrates. In the case of Freebase, provenance data stored for facts includes, for example, the IDs of the users who added the facts. For Wikidata, references can be attached to each statement (consisting, e.g., of a property and a value, such as (country, Germany)); these reveal the source, and therefore indirectly the trustworthiness, of the statement.

46See https://www.wikidata.org/wiki/Wikidata:Data_access for an overview and http://wdqs-beta.wmflabs.org/ for an example.
47See http://www.cyc.com/documentation/ontologists-handbook/cyc-basics/syntax-cycl/.

– Quality assurance: The quality assurance of facts can be aligned with the two general ways of fact provisioning (see also the aspect Provenance of facts): (i) Knowledge for the KG is extracted automatically from a source such as Wikipedia. In that case, no quality assurance check is implemented, but a posteriori evaluations confirmed a sufficiently high average accuracy across the KG YAGO [43]. (ii) Knowledge is gathered by user contributions. In those cases (see Freebase, Cyc, and Wikidata) no fact consistency checks are applied, but the correctness is based on the trustworthiness of the contributors.
In general, it is not possible to prefer one of these two ways a priori. Using solely approach (i) is only feasible if the information is already available in semi-structured formats (as in the case of Wikipedia and DBpedia), so that the proportion of incorrect facts in the KG is kept small.

– Software projects: All considered KGs are exploited in many ways in research projects of universities and in industry, so that we only present projects which are commonly known in the community. Notable is in particular the project IBM Watson,48 which uses several of the considered KGs (namely, DBpedia and YAGO so far). For a description of applications of Linked Data in general in industry, we refer to the use cases listed by the W3C.49

– Influence on other LOD datasets: Data of the individual KGs has been reused in other data sources of the LOD cloud, especially in datasets which focus more on the integration of multiple datasets than on building a genuine knowledge base of their own (see UMBEL50 and BabelNet51 as examples).

48See http://www.aaai.org/Magazine/Watson/watson.php

49See http://www.w3.org/2001/sw/sweo/public/UseCases.

50See http://www.umbel.org/.
51See http://babelnet.org/.


Table 3: Comparison of the KGs regarding genesis and usage.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Provenance of facts | Automatically extracted from Wikipedia | Initially from Wikipedia, MusicBrainz etc.; new information is gathered by algorithms, the Freebase team, and the community | Filled by experts | Data is maintained by users and bots | Wikipedia, WordNet, GeoNames gazetteer, Wikidata |
| Quality assurance of facts | quality depends on Wikipedia content and on extraction algorithms / template mappings | trusted “Freebase experts” keep an eye on changes, scripts look for incorrect data | no | Users should only add verifiable information from sources such as books, scientific publications, or newspaper articles; as in the original Wikipedia, data is controlled by the community | no assurance, evaluation of >95 % correctness for YAGO2 |
| Software projects | DBpedia Spotlight (a), Wikipedia Miner (b), IBM Watson (c) | Google Knowledge Vault, Bing | Terrorism Knowledge Base | some first prototypes | YAGO NAGA, IBM Watson, Broccoli (d) |
| Influence on other LOD datasets | Freebase, YAGO | Wikidata | UMBEL | YAGO3 | SUMO (e), DBpedia, UMBEL, Freebase |

(a) See http://spotlight.dbpedia.org/
(b) See http://wikipedia-miner.cms.waikato.ac.nz/
(c) See http://www.ibm.com/smarterplanet/us/en/ibmwatson/
(d) See broccoli.informatik.uni-freiburg.de
(e) See http://www.adampease.org/OP/


Fig. 2. owl:sameAs relations between the considered KGs (nodes: DBpedia, YAGO, Freebase, Wikidata, and OpenCyc).

In this context, the relation owl:sameAs is worth mentioning: This relation indicates that the two resources linked by an owl:sameAs relation refer to the same real-world object, even though they might have different URIs in the different datasets. In this way, further information about the resources can be retrieved from linked datasets and no information needs to be copied to other datasets.
The most important sets of owl:sameAs relations between the considered KGs are as follows (see also Figure 2): DBpedia and YAGO link to each other; DBpedia additionally links to Freebase and Wikidata; Wikidata links to Freebase; OpenCyc links to DBpedia; DBpedia also links to Cyc, but these links are broken.52
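How an agent might use owl:sameAs links to collect all identifiers of one real-world object can be sketched as follows. This is a minimal illustration, assuming the links are available as pairs of URIs and treating owl:sameAs as symmetric and transitive; the function names and the link list are hypothetical, and the example URIs only follow the identifier patterns discussed in this survey. The second helper applies the domain replacement for Cyc URIs, assuming the OpenCyc domain sw.opencyc.org.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def same_as_closure(start: str, links: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return all URIs reachable from `start` via owl:sameAs links,
    treating the relation as symmetric and transitive."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        for other in neighbours.get(uri, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def repair_cyc_uri(uri: str) -> str:
    """Rewrite the unreachable sw.cyc.com domain to sw.opencyc.org."""
    return uri.replace("http://sw.cyc.com", "http://sw.opencyc.org", 1)


# Illustrative link list (assumed for this sketch, not exported from the KGs).
links = [
    ("http://dbpedia.org/resource/Karlsruhe",
     "http://yago-knowledge.org/resource/Karlsruhe"),
    ("http://dbpedia.org/resource/Karlsruhe",
     "http://www.wikidata.org/entity/Q1040"),
    ("http://www.wikidata.org/entity/Q1040",
     "http://rdf.freebase.com/ns/m.0qb1z"),
]
cluster = same_as_closure("http://rdf.freebase.com/ns/m.0qb1z", links)
```

Starting from the Freebase identifier, the closure also contains the DBpedia, YAGO, and Wikidata URIs, although Freebase is not linked to YAGO directly in the example list.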

5.4. Comparison of Entities

– Entity reference: Since all considered KGs were shaped by the vision of the Semantic Web, entities do not only have unique IDs, but also URIs by which they can be referred to. While DBpedia, Freebase, Cyc, and YAGO provide human-readable IDs, Freebase and Cyc additionally operate with opaque URIs. Wikidata only provides entity IDs which consist of “Q” followed by a number in order to be language-agnostic. The labels for the entities are stored in Wikidata separately.
As outlined by Berners-Lee in 1998 in “Cool URIs don’t change” [10], URIs should be designed with three things in mind: simplicity, stability, and manageability. In the context of KGs where each entity has a URI, well-chosen URIs become even more important. Sauermann and Cyganiak [53] present so-called 303 URIs (which are human readable) and hash URIs (which are not human readable) for the Semantic Web. Both forms have their advantages and disadvantages.

52HTTP requests of URIs with the domain http://sw.cyc.com result in a DNS error, but these URIs are dereferenceable if the domain is replaced by http://sw.opencyc.org.

– LOD registration: Publishing a dataset according to the Linked Data principles already implies that this dataset is part of the LOD cloud. Besides that, there are Linked Open Data registration portals such as http://datahub.io where LOD datasets can be registered and, hence, found quickly.
All KGs considered in this survey are published in RDF and are part of the LOD cloud. Besides that, until the submission of this survey, all considered KGs except Wikidata were also registered at http://datahub.io as part of the LOD cloud.53

– LOD linkage: Most of the considered KGs link their entities to entities of other datasets in the LOD cloud. DBpedia and Freebase are remarkable here for their high degree of connectivity with other LOD datasets. DBpedia is justifiably called the hub of the LOD cloud [41,50].

– Entity relevance: In some scenarios it is helpful to rank or order entities based on some importance and/or relevance score (e.g., to find the most well-known football players or politicians). In the past, several approaches were presented which calculate scores for entities. However, currently only Freebase provides relevance scores for entities, which were created by using the link counts in Freebase and Wikipedia.54

– Description of entities: It can be difficult for users to figure out which entity is meant by a given ID or URI, especially if the ID is a mostly numerical value due to the constraint of the knowledge representation being language-independent. In such cases, a textual description of the entities is important. While some KGs offer textual descriptions via special properties (see DBpedia and Freebase) or fields within the data model (as in the case of Wikidata), YAGO does not offer any entity description and OpenCyc offers one only for a fraction of the entities.
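The separation of language-agnostic IDs from labels, as described for Wikidata above, can be sketched as a simple lookup with a fallback language. This is a minimal illustration: the in-memory dictionary, the function name, and the fallback policy are assumptions made for this sketch and do not reflect the Wikibase API.

```python
from typing import Dict

# Hypothetical in-memory store: opaque entity ID -> language code -> label.
LABELS: Dict[str, Dict[str, str]] = {
    "Q1040": {"en": "Karlsruhe", "de": "Karlsruhe", "ru": "Карлсруэ"},
}


def display_name(entity_id: str, lang: str, fallback: str = "en") -> str:
    """Return the label in `lang`, else in the fallback language,
    else the opaque ID itself."""
    labels = LABELS.get(entity_id, {})
    if lang in labels:
        return labels[lang]
    return labels.get(fallback, entity_id)


print(display_name("Q1040", "ru"))  # label stored for the requested language
print(display_name("Q1040", "fr"))  # no French label stored, falls back to English
print(display_name("Q999999999", "en"))  # unknown entity, returns the raw ID
```

The last call shows why textual descriptions matter: without a stored label, the user sees only the numerical ID.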


Table 4: Comparison of the KGs regarding entities.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Entity reference | URI (based on Wikipedia page title), e.g., http://dbpedia.org/resource/Karlsruhe | MID and sometimes ID (MID is permanent and consistent with changes, but not human-readable as IDs are), e.g., /m/0qb1z | unique ID, English ID, e.g., T6OkHdRS-eUiqO5n8NA1g, Karlsruhe | unique ID, e.g., Q1040 | URI (based on Wikipedia page title), e.g., http://yago-knowledge.org/resource/Karlsruhe |
| LOD registration | yes | yes | yes | no | yes |
| LOD linkage | Links to Freebase, OpenCyc, YAGO, UMBEL, GeoNames, etc. | Links to BBC Music, Geospecies | Links to DBpedia | Some entities have links to Freebase and MusicBrainz | Links to DBpedia |
| Entity relevance | no | Entities have relevance scores (calculated by link counts in Freebase and Wikipedia) | no | no | no |
| Description of entities | yes, via property http://dbpedia.org/ontology/abstract | yes, via property /common/topic/description; often there is an image /common/topic/image | Textual description in comment field | Description field for every entity, no special property | no |


Table 5: Comparison of the KGs regarding relations.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Relation reference | URI (two property prefixes: ontology and property) | ID and MID (see entity referencing in Table 4) | Phrase (e.g., IsA) | ID | URI |
| Relation relevance | no | no | not mentioned | yes, e.g., by date | no |
| Description of relations | All properties have a label, some properties have a link to a description | yes, via /common/topic/description; mostly one sentence, e.g., /film/producer/films_executive_produced: “Films this person and served as an executive producer on” | no | yes, specific field in every property | properties have a property “hasGloss” with a pattern of usage |


5.5. Comparison of Relations

– Relation reference: Analogously to entity references, relations in the KGs are represented as IDs or URIs.

– Relation relevance: In this respect, Wikidata is particularly noteworthy. Since each statement stored in Wikidata has some meta-information such as a timestamp attached to it, properties of items in Wikidata can be ordered with respect to these meta-information aspects.

– Description of relations: Freebase, Wikidata, and YAGO provide a textual description for all relations, so that the meaning of properties can be derived. YAGO stores how the properties are expressed in free text, while the other KGs provide custom-built textual descriptions. DBpedia provides a label and a link to a definition only for some properties. Cyc and BabelNet do not provide any description at all.

5.6. Comparison of Schema

– Schema restrictions: In KGs where the data is derived via automated extraction methods, both the set of classes and the set of relations are fixed. In the case of KGs where end users can contribute (see Wikidata and Freebase), the schema is fixed, but can be extended.

– Schema constraints: Constraints regarding the schema become relevant when new facts are added to the KG (see integrity constraints and data consistency as conceptual keystone of any graph data model according to [18]). None of the considered KGs use significant constraints when facts are added: In the case of DBpedia, the type is fixed during the mapping process. No further constraints are given. For YAGO, a type checker and constraint checker is provided. Wikidata has no constraint check tool, but users can report constraint violations. OpenCyc has no constraints since the facts are created manually by experts. No information about constraints was found in the case of Freebase.

– Hierarchy and network of relations: Only OpenCyc implements a hierarchy of relations [19]. Notable is, however, that DBpedia properties are automatically extracted from Wikipedia, leading to many properties whose meaning is not given,55 remains unclear,56 or which semantically overlap with other properties.57

53See http://datahub.io/group/lodcloud.
54See http://wiki.freebase.com/wiki/Search_Cookbook#Scoring_and_Ranking

– External vocabulary: DBpedia and YAGO use vocabularies from other datasets (DBpedia uses owl, xsd, rdfs, rdf, foaf, dc, skos, umbel58; YAGO uses skos, umbel, rdfs, and rdf), while Freebase and Cyc only use their own vocabulary. Wikidata also uses its own vocabulary, but sometimes links to external vocabulary via the “equivalent property” property.

– Description of classes: DBpedia, Freebase, Cyc, and Wikidata provide human-readable descriptions of their classes (DBpedia uses dbpedia-owl:abstract and rdfs:comment; YAGO only uses rdfs:label, no description; Freebase uses the relation /common/topic/description; OpenCyc has a comment relation; Wikidata provides a description, which is exported in the RDF dumps as schema:description).

– Forms of abstraction: As outlined in Section 2.1, classification, generalization, aggregation, and association are among the most important methods to model in a more abstract way. All considered KGs support the modeling of generalization, classification, and association. Freebase and Wikidata also support aggregation.

– Data types: The KGs either do not support any data types for literal values, but just store strings (as in the case of Cyc), or they support simple data types such as a subset of the XML Schema data types (see DBpedia, Freebase, Wikidata, and YAGO; a typical data type is xsd:integer). The highest number of data types is used by DBpedia59 and Freebase.60
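What a type check under schema constraints might look like when a new fact is added can be sketched as follows. This is a minimal illustration under stated assumptions: the schema, the type assertions, and all identifiers with the ex: prefix are hypothetical, and the function is not the checker shipped with YAGO or any other KG. It only verifies that the subject and object of a fact belong to the classes the property requires.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical schema: property -> (required subject class, required object class).
SCHEMA: Dict[str, Tuple[str, str]] = {
    "ex:bornIn": ("ex:Person", "ex:Place"),
}

# Hypothetical type assertions: entity -> set of classes.
TYPES: Dict[str, Set[str]] = {
    "ex:Alice": {"ex:Person"},
    "ex:Karlsruhe": {"ex:Place"},
}


def violations(fact: Tuple[str, str, str]) -> List[str]:
    """Return human-readable constraint violations for one (s, p, o) fact."""
    s, p, o = fact
    if p not in SCHEMA:
        return [f"unknown property {p}"]
    domain, range_ = SCHEMA[p]
    problems = []
    if domain not in TYPES.get(s, set()):
        problems.append(f"{s} is not a {domain}")
    if range_ not in TYPES.get(o, set()):
        problems.append(f"{o} is not a {range_}")
    return problems


ok = violations(("ex:Alice", "ex:bornIn", "ex:Karlsruhe"))
swapped = violations(("ex:Karlsruhe", "ex:bornIn", "ex:Alice"))
```

A KG could reject facts with a non-empty violation list at insertion time, or merely report them, which is closer to the Wikidata practice of letting users report constraint violations.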

55An example for that is http://dbpedia.org/property/s.

56An example for that is http://dbpedia.org/property/useEw%25_

57An example for that is http://dbpedia.org/property/develop, http://dbpedia.org/property/developer, and http://dbpedia.org/property/develops.

58See http://lov.okfn.org/dataset/lov/vocabs/dbpedia-owl

59See http://mappings.dbpedia.org/index.php/DBpedia_Datatypes.

60See https://wikidata.org/wiki/Special:ListDatatypes.


Table 6: Comparison of the KGs regarding the schema.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Schema restrictions | yes, DBpedia ontology; in addition non-mapping-based properties with no fixed domain | yes, fixed schema is used, but users are allowed to edit and expand the schema | properties are fixed | properties are fixed, but users are allowed to edit and expand the schema | not mentioned, >560k classes |
| Schema constraints | inconsistent data may occur | not mentioned | no constraints | no constraint checks, users can report constraint violations | type and constraint checker provided |
| Network or hierarchy of relations | no, but many fuzzy relations whose meaning is unclear or which overlap with other relations | no | yes | not mentioned | no |
| External vocabulary | yes, FOAF, RDFS, OWL, YAGO, UMBEL | no | no | no, but sometimes an “equivalent property” property links to external vocabulary | yes, RDFS, OWL, WordNet |
| Description of classes | DBpedia ontology for concepts, in OWL | classes are described in /common/topic/description (e.g., https://www.freebase.com/computer/computer?props=) | often there is a comment for a collection | yes, just as in entities | no |
| Forms of abstraction | generalization, classification, association | generalization, classification, aggregation, association | generalization, classification, association | generalization, classification, aggregation, association | generalization, classification, association |
| Data types | xsd datatypes, different other datatypes for currencies etc. (a) | datatypes such as date-time, int, float, media_type, etc. | not used | only simple datatypes such as time, timezone, coordinates | simple datatypes such as xsd:integer |

(a) See http://mappings.dbpedia.org/index.php/DBpedia_Datatypes


5.7. Comparison of Particularities

Besides general KG information and information about the storage of instances, relations, and the schema of the KGs, there are several aspects of data modeling which can be found only in the models of particular KGs. These aspects are:

– Temporal aspects: There are three types of temporal aspects which can be attached to facts in a KG (see [57]): the valid time, i.e., the point in time or time span in which the fact is valid; the insertion time, i.e., the time at which the fact is or was inserted into the KG; and a relevance time aspect, i.e., the point in time or time interval which is relevant for the user’s application. Only two of the KGs support the storage of temporal aspects in addition to pure facts. The data model of Wikidata allows users to store the time interval in which a statement holds true. In this way, facts which are valid only for specific time spans, such as election periods, can be stored. YAGO also supports attaching temporal information, such as the occurrence date, to a fact.

– Source of facts: For reasons of traceability, the source (reference) of facts can be stored together with the facts. Knowing where the facts are derived from might be important for the user to assess the validity and trustworthiness of a fact. Wikidata and YAGO are the only KGs where the storage of the source of facts is both supported by the data model and used by the users.61

The other KGs do not store the source. In the case of DBpedia, the source of facts is obvious, namely Wikipedia, and does not need to be stored. Cyc is created completely by experts. It can be assumed that the source of facts is not stored here, since all facts are considered reliable.

– Reification: RDF reification was intended as a mechanism for making provenance statements and other statements about RDF triples [67]. Regarding our KGs, only Cyc and YAGO use reification to some extent: Cyc allows the reification of literals. In the case of YAGO, time and location are attached to facts by reification.
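To make the reification mechanism more concrete, the following minimal Python sketch expresses one fact as a statement resource using the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) and then attaches a time and a source to it. The `ex:` identifiers, the properties `ex:validDuring` and `ex:source`, and the population value are hypothetical placeholders; the sketch does not reproduce the actual serialization used by YAGO or Cyc.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Namespace of the standard RDF vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id: str, fact: Triple) -> List[Triple]:
    """Describe `fact` as an rdf:Statement resource named `statement_id`."""
    subject, predicate, obj = fact
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


# Hypothetical fact and identifiers, chosen only for illustration.
fact = ("ex:Berlin", "ex:population", "3500000")
triples = reify("ex:stmt1", fact)

# Statements about the statement: a valid time and a source (assumed properties).
triples.append(("ex:stmt1", "ex:validDuring", "2015"))
triples.append(("ex:stmt1", "ex:source", "ex:someReference"))

for triple in triples:
    print(triple)
```

The sketch also shows why reification is costly: one fact plus two annotations already requires six triples.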

61It can be noted that many facts of Wikidata are derived from Wikipedia, so that in many cases the Wikipedia URL is the only source. For YAGO, the source of facts is provided in a separate file.

6. Assessment of KGs

Based on Tables 1–7, we created a matrix (see Table 8) in which the most important aspects in which the KGs differ (extracted from Tables 1–7) are formulated as yes-no questions. These questions are meant to guide users who want to choose, among the KGs, those that best fit their purposes.

For assessing the KGs, a score is calculated for each KG. This “fulfillment score” can be calculated as the number of questions for which the answer given for the desired KG matches the answer of the KG in question. More sophisticated scoring functions are also possible, in which the matches for specific questions are weighted higher. In the end, the KG which has achieved the highest score is the KG favored by this framework.
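The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the KG names, question keys, and answers below are made-up placeholders and not a copy of Table 8, and the optional `weights` argument is one possible way to realize the weighted variant mentioned in the text.

```python
from typing import Dict, List, Optional, Tuple

# Answers per KG for a few yes-no questions (True = yes, False = no).
# Illustrative placeholders only, not the values of Table 8.
KG_ANSWERS: Dict[str, Dict[str, bool]] = {
    "KG_A": {"q1": True, "q2": True, "q3": False},
    "KG_B": {"q1": False, "q2": True, "q3": True},
}


def fulfillment_score(
    desired: Dict[str, bool],
    actual: Dict[str, bool],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum the weights (default 1.0) of all questions whose desired answer matches."""
    weights = weights or {}
    return sum(
        weights.get(question, 1.0)
        for question, wanted in desired.items()
        if actual.get(question) == wanted
    )


def rank_kgs(
    desired: Dict[str, bool],
    kg_answers: Dict[str, Dict[str, bool]],
    weights: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float]]:
    """Return (KG, score) pairs sorted by descending fulfillment score."""
    scores = {
        kg: fulfillment_score(desired, answers, weights)
        for kg, answers in kg_answers.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# The user's desired answers; questions the user does not care about are omitted.
desired = {"q1": True, "q2": True, "q3": False}
print(rank_kgs(desired, KG_ANSWERS))
```

Leaving a question out of `desired` means it does not influence the score, which corresponds to a user who has no preference on that aspect.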

7. Limitations of KGs

Peckham and Maryanski [47] argued that semantic data models will be used widely when they perform sufficiently well in real-world settings (especially in enterprises). We can argue that there are already some KG applications and many Linked Open Data datasets available. Examples of organizations where Linked Open Data is used are the BBC,62 Best Buy,63 and the German National Library.64

However, there are several limitations of the KGs and, hence, of the Semantic Web in its current uptake, which became apparent during the analysis of the considered KGs and which we would therefore like to emphasize:

1. Domain specificity limitation: During the process of selecting the KGs for comparison, it became apparent that many KBs in the LOD cloud are either closely aligned to the general domain Wikipedia covers, since Wikipedia is used as the knowledge source in these cases, or (i) focus on specific domains (cf. the lexical databases WordNet and BabelNet) and/or (ii) cover more schema information than instance information, so that they cannot be called KGs any more (cf. the common sense KBs ConceptNet

62See http://www.bbc.co.uk/ontologies

63See http://www.bestbuy.com/

64See http://www.dnb.de/DE/Service/DigitaleDienste/LinkedData/linkeddata_node.html


Table 7: Comparison of the KGs regarding their particularities.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| Temporal aspects | no | no | no | yes, the valid time of facts (e.g., the population for different points in time) | yes, e.g., the time of occurrence |
| Source of facts | no (all from Wikipedia) | not mentioned | no | yes, mostly | yes (in file yagoSources) |
| Reification | currently not exploited | not mentioned | reification of literals | no, decision against it | yes, time and location attached via reification |


Table 8: Decision matrix for KG selection.

| | DBpedia | Freebase | OpenCyc | Wikidata | YAGO3 |
|---|---|---|---|---|---|
| 1. Is the available KG continuously updated (no fixed versions)? | (X)a | X | - | X | - |
| 2. Are other languages than English supported? | X | X | - | X | X |
| 3. Is it rather an instance KG than a schema KG? | X | X | - | X | X |
| 4. Is data available via HTTP lookup? | X | X | X | (X)b | X |
| 5. Is an official SPARQL endpoint provided? | X | - | - | - | X |
| 6. Is some data quality level ensured (manually or via manual approval)? | - | X | X | X | (X)c |
| 7. Is the KG part of the LOD cloud according to http://datahub.io? | X | X | X | - | X |
| 8. Do the entities have any ordering or ranking? | - | X | - | - | - |
| 9. Are entity descriptions available? | X | X | X | X | - |
| 10. Do the relations have any ordering or ranking? | - | - | - | X | - |
| 11. Are relation descriptions available? | X | X | - | X | X |
| 12. Is the occurrence time of facts stored? | - | - | - | X | X |
| 13. Is the source of facts stored? | - | - | - | X | - |
| 14. Is reification supported and done in practice? | - | - | X | - | X |
| 15. Are concept descriptions available? | - | X | X | X | - |
| 16. Are any standard data types used for literals? | X | X | - | X | X |

a Continuous updates are available for DBpedia live, but not for DBpedia.

b Via HTTP lookups the user can only retrieve the labels and the Wikipedia categories of Wikidata items.

c An evaluation of the quality was only performed for YAGO2, but not for YAGO3.


and UMBEL). Although there is a considerable number of KGs which are freely available and focus on specific domains, most domains with potential use cases are not covered by KGs. Regarding the prediction by Peckham and Maryanski [47], according to which semantic data models are used mainly for the management of scientific, engineering, and manufacturing data, we can state: This data exists, but most of it is not semantically enriched and/or not available in the RDF format, and where it is, it has few attributes. Future projects on the so-called “Industry 4.0” (including concepts of cyber-physical systems, the Internet of Things, and the Internet of Services) aim at changing this.

2. Limitations of modeling time aspects: Concepts for modeling dynamic and/or temporal aspects within semantic (graph) data models have been developed since the rise of semantic data models (see [58,57] for an overview of temporal database theory). Snodgrass and Ahn [57] distinguish between transaction time (the time when data is stored), valid time (the time during which the data is useful or valid), and user-defined time (additional time information to be stored) as the dimensions of representing temporal data in databases. Despite research on temporal data models, dynamic and temporal data models have not become prevalent so far. Even today, most semantic graph data models encompass neither the temporal characteristics of knowledge facts nor the spatio-temporally grounded representation of events. Also, Rula et al. [51] showed that the amount of temporal information available in the Linked Open Data cloud is still very small. One reason might be that adding temporal aspects multiplies the number of statements and may also complicate the situation for users and software developers who write queries, since queries become more complex. Keeping things simple, and neglecting temporal aspects, is the often chosen mantra for building scalable environments such as KBs.

3. KG Population: The Semantic Web suffers from the difficulties of transforming text and other, mainly unstructured data into RDF. Today's KGs already have some potential and can be applied to many settings; however, they depend on the supply of structured data from external sources. Knowledge extraction tools and ontology learning tools are the key to building KGs.

4. Limitations regarding the Linked Open Data cloud (partly based on [34]):

(a) Lack of Conceptual Description of Datasets: In order to identify the domain a specific LOD dataset covers, a human expert is needed. There is currently no standard mechanism or dataset description interface which states that, e.g., MusicBrainz is about music-related information while Geonames is about geographical information. As a result, there is no overview of which datasets exist and which can be used in a certain setting. There are some attempts [49,2] to describe LOD datasets,65 but they do not focus on the conceptual or semantic level, but instead on statistical information or a prose description. Since the LOD cloud consists of datasets which were published under the Linked Data principles, nobody knows the complete picture of the LOD cloud. Even the well-known LOD cloud diagram66 is only a particular perspective on the Web of Data, and many other valid perspectives are possible.67 Other approaches automatically assess, annotate, and index linked datasets, e.g., by extracting topic annotations for arbitrary Linked Data datasets [24]; but these tools have not yet become widely used.

(b) Lack of LOD Schema Alignment: Links between LOD datasets are almost exclusively on the level of instances. There are only a few approaches or good practices for mapping concepts at the schema level of the LOD cloud. Although ontology matching has been widely studied in the Semantic Web area and its tools usually produce strict mappings between concepts such as equivalence and subsumption, the situation in the case of Linked Data is difficult: Even though concepts may have a strong semantic similarity, they are not necessarily equivalent. One example of such an inconsistency is the fact that dbpedia:Actor denotes professional actors, while the concept movie:actor of LinkedMDB68 means a person who plays a role in a movie, but who is not a stage actor. The UMBEL ontology was developed to connect schemas used by LOD datasets. However, UMBEL does not take the individual usage patterns of the concepts into account [45]. Notable approaches for finding schema-level links between LOD datasets are provided by Nikolov et al. [45] and Jain et al. [33].

65See also the LOD data catalogs http://datahub.io and http://linkeddatacatalog.dws.informatik.uni-mannheim.de.

66See http://lod-cloud.net/.

67See http://lod-cloud.net/.

(c) Lack of Expressivity: Publishing Linked Data in the LOD cloud is done for rapid data releases and for relying on the Web of Data as a washing machine (cleaning data over time) [6]. However, in Linked Data the rich features of OWL are rarely used. Although entities can be interlinked between datasets with owl:sameAs relations, there is no automatic constraint check or reasoning about whether the entities in different datasets contain incoherent information. The city of Berlin, for instance, can have a different population size in DBpedia and in Geonames, and this discrepancy is not resolved. This task remains a burden for the data consumers.
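The kind of check that is left to data consumers in the owl:sameAs case can be sketched as follows. This is a minimal Python illustration, not an existing tool: the two datasets are modeled as plain dictionaries, the identifiers (`a:Berlin`, `b:Berlin`, etc.) and all population figures are hypothetical placeholders, and `find_conflicts` is a name introduced only for this example.

```python
from typing import Dict, List, Tuple

# A dataset maps an entity identifier to its property values.
Dataset = Dict[str, Dict[str, object]]


def find_conflicts(
    left: Dataset,
    right: Dataset,
    same_as: List[Tuple[str, str]],
    prop: str,
) -> List[Tuple[str, str, object, object]]:
    """Report sameAs-linked entity pairs whose values for `prop` disagree."""
    conflicts = []
    for left_id, right_id in same_as:
        left_value = left.get(left_id, {}).get(prop)
        right_value = right.get(right_id, {}).get(prop)
        # Only compare when both datasets actually state a value.
        if left_value is not None and right_value is not None and left_value != right_value:
            conflicts.append((left_id, right_id, left_value, right_value))
    return conflicts


# Placeholder data: two datasets describing the same cities.
dataset_a: Dataset = {
    "a:Berlin": {"population": 3_400_000},
    "a:Karlsruhe": {"population": 300_000},
}
dataset_b: Dataset = {
    "b:Berlin": {"population": 3_500_000},
    "b:Karlsruhe": {"population": 300_000},
}
# Pairs of identifiers assumed to be linked via owl:sameAs.
links = [("a:Berlin", "b:Berlin"), ("a:Karlsruhe", "b:Karlsruhe")]

conflicts = find_conflicts(dataset_a, dataset_b, links, "population")
print(conflicts)
```

A real consumer would additionally have to decide which value to trust, for example by taking the source or the valid time of each value into account.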

8. Outlook

Future work on KGs and Semantic Web technologies might focus on the following areas:

– There are new approaches to modeling knowledge in a semantically structured form, against the background of the lessons learned from 15 years of ontology engineering. One example of such a new approach is the design and use of so-called ontology design patterns [26]: An ontology design pattern is a reusable solution to a recurrent modeling problem. The focus is on reusing existing components, since ontologies and ontology components have been reused only to a very limited extent so far.

– There might be new forms of KGs and KBs which do not focus on the storage of entities and their relations, but instead on other things such as events [39,55,65]. Papers published in recent years indicate that event-centric KGs will become more important and also widely applicable.

68See http://www.linkedmdb.org/.

– Recently, there has been noticeable progress towards constructing KGs automatically. This is necessary, since constructing and/or populating KGs with the help of experts or of open communities does not scale to the extent that is needed for most applications. For instance, in Freebase the place of birth relation was missing for 71% of all people instances, although this relation was mandatory according to the schema [68]. Also, Suh et al. [60] showed that the growth of Wikipedia has been slowing down. Consequently, automatic knowledge base construction (AKBC) methods have been attracting more attention [44]. Noteworthy in this context is the approach of statistical inference in KGs. Predictive models are trained on known facts from the KG. From these, unknown facts are derived and compared to “noisy” facts extracted from external, often unstructured sources such as the Web. New facts are added to the KG if they are supported by both models with a certain confidence. This methodology is used, for instance, in Google’s Knowledge Vault project [21].
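The acceptance rule described in the last item can be sketched as follows. The text only states that a new fact is added when both models support it with a certain confidence; the fixed threshold, the function name `accept_new_facts`, and all facts and confidence values below are assumptions made for illustration and do not reflect the actual scoring used in Knowledge Vault.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str, str]


def accept_new_facts(
    prior_confidence: Dict[Fact, float],
    extraction_confidence: Dict[Fact, float],
    threshold: float = 0.7,
) -> Set[Fact]:
    """Keep candidate facts that both models support with at least `threshold`."""
    accepted = set()
    for fact, prior in prior_confidence.items():
        # A fact missing from the extraction model counts as unsupported.
        extracted = extraction_confidence.get(fact, 0.0)
        if prior >= threshold and extracted >= threshold:
            accepted.add(fact)
    return accepted


# Hypothetical candidate facts with made-up confidence values.
f1 = ("ex:PersonA", "ex:placeOfBirth", "ex:CityX")
f2 = ("ex:PersonB", "ex:placeOfBirth", "ex:CityY")
f3 = ("ex:PersonC", "ex:placeOfBirth", "ex:CityZ")

# Confidence from a model trained on known facts of the KG.
prior_confidence = {f1: 0.9, f2: 0.9, f3: 0.2}
# Confidence from extraction over external, often unstructured sources.
extraction_confidence = {f1: 0.8, f2: 0.3, f3: 0.95}

print(accept_new_facts(prior_confidence, extraction_confidence))
```

Only `f1` passes here, because the other two candidates are supported by just one of the two models.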

9. Conclusion

Freely available knowledge graphs (KGs) have not been the focus of any extensive comparative study so far. In this survey, we defined aspects according to which KGs can be analyzed. We analyzed and compared DBpedia, Freebase, Cyc, Wikidata, and YAGO along these aspects and proposed a checklist to enable readers to find the most suitable KG for their settings. We discussed the essential issues current KGs are confronted with and took a brief look at the possible future of the Semantic Web.

References

[1] S. Abiteboul and R. Hull. IFO: A Formal Semantic Database Model. ACM Trans. Database Syst., 12(4):525–565, Nov. 1987.

[2] K. Alexander. Describing Linked Datasets – On the Design and Usage of voiD, the ’Vocabulary Of Interlinked Datasets’. WWW 2009 Workshop: Linked Data on the Web, 2009.

[3] B. Amann and M. Scholl. Gram: A Graph Data Model and Query Languages. In Proceedings of the ACM Conference on Hypertext, ECHT ’92, pages 201–211, New York, NY, USA, 1992. ACM.


[4] M. Andries, M. Gemis, J. Paredaens, I. Thyssens, and J. V. den Bussche. Concepts for Graph-Oriented Object Manipulation. In Proceedings of the 3rd International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’92, pages 21–38, London, UK, 1992. Springer-Verlag.

[5] R. Angles and C. Gutierrez. Survey of Graph Database Models. ACM Computing Surveys, 40(1):1:1–1:39, 2 2008.

[6] S. Auer. Creating Knowledge out of Interlinked Data: Making the Web a Data Washing Machine. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS ’11, pages 4:1–4:8, New York, NY, USA, 2011. ACM.

[7] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC 2007/ASWC 2007, pages 722–735. Springer, 2007.

[8] S. Auer, J. Lehmann, A.-C. Ngonga Ngomo, and A. Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. In Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 1–90. Springer Berlin Heidelberg, 2013.

[9] T. Berners-Lee. Linked Data – Design issues. http://www.w3.org/Designissues/LinkedData.html. accessed May 15, 2015.

[10] T. Berners-Lee. Cool URIs don’t change. Technical report, World Wide Web Consortium, 1998.

[11] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):29–37, 5 2001.

[12] J. Brank, M. Grobelnik, and D. Mladenic. A survey of ontology evaluation techniques. In Proceedings of the conference on data mining and data warehouses (SiKDD 2005), pages 166–170, 2005.

[13] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0, W3C Recommendation 10. http://www.w3.org/TR/1998/REC-xml-19980210. accessed July 31, 2015.

[14] M. L. Brodie. On the Development of Data Models. In M. L. Brodie, J. Mylopoulos, and J. W. Schmidt, editors, On Conceptual Modelling, Topics in Information Systems, pages 19–47. Springer New York, 1984.

[15] P. Buneman. Semistructured Data. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’97, pages 117–121, New York, NY, USA, 1997. ACM.

[16] P. P.-S. Chen. The Entity-relationship Model: Toward a Unified View of Data. ACM Trans. Database Syst., 1(1):9–36, Mar. 1976.

[17] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6):377–387, June 1970.

[18] E. F. Codd. Data Models in Database Management. SIGPLAN Not., 16(1):112–114, June 1980.

[19] J. Curtis, J. Cabral, and D. Baxter. On the Application of the Cyc Ontology to Word Sense Disambiguation. In FLAIRS Conference, pages 652–657, 2006.

[20] R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. 2014. accessed July 30, 2015.

[21] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 601–610, New York, NY, USA, 2014. ACM.

[22] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandecic. Introducing Wikidata to the Linked Data Web. In Proceedings of the 13th International Semantic Web Conference (ISWC’14), LNCS. Springer, 2014.

[23] C. Fellbaum. WordNet – An Electronic Lexical Database. MIT Press, 1998.

[24] B. Fetahu, S. Dietze, B. P. Nunes, D. Taibi, and M. A. Casanova. Profiling of Linked Datasets using Structured Descriptions. In The 12th International Semantic Web Conference (ISWC2013), 2013.

[25] G. Freedman and E. Reynolds. Enriching basal reader lessons with semantic webbing. Reading Teacher, 33(6):677–684, 1980.

[26] A. Gangemi. Ontology Design Patterns for Semantic Web Content. In Y. Gil, E. Motta, V. Benjamins, and M. Musen, editors, The Semantic Web – ISWC 2005, volume 3729 of Lecture Notes in Computer Science, pages 262–276. Springer Berlin Heidelberg, 2005.

[27] R. Guns. Tracing the Origins of the Semantic Web. Journal of the American Society for Information Science and Technology, 64(10):2173–2181, 2013.

[28] M. Gyssens, J. Paredaens, and D. van Gucht. A Graph-oriented Object Database Model. In Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’90, pages 417–424, New York, NY, USA, 1990. ACM.

[29] M. Hammer and D. McLeod. Database Description with SDM: A Semantic Database Model. ACM Trans. Database Syst., 6(3):351–386, Sept. 1981.

[30] M. Hammer and D. McLeod. The Semantic Data Model: A Modelling Mechanism for Data Base Applications. In Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, SIGMOD ’78, pages 26–36, New York, NY, USA, 1978. ACM.

[31] R. Hull and R. King. Semantic Database Modeling: Survey, Applications, and Research Issues. ACM Comput. Surv., 19(3):201–260, Sept. 1987.

[32] P. Jain, P. Hitzler, K. Janowicz, and C. Venkatramani. There’s No Money in Linked Data. http://corescholar.libraries.wright.edu/cse/240, 2013. accessed July 20, 2015.

[33] P. Jain, P. Hitzler, A. P. Sheth, K. Verma, and P. Z. Yeh. Ontology Alignment for Linked Open Data. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC’10, pages 402–417, Berlin, Heidelberg, 2010. Springer-Verlag.

[34] P. Jain, P. Hitzler, P. Z. Yeh, K. Verma, and A. P. Sheth. Linked Data Is Merely More Data. In AAAI Spring Symposium: linked data meets artificial intelligence, volume 11, 2010.

[35] L. Kerschberg, A. C. Klug, and D. Tsichritzis. A Taxonomy of Data Models. In Systems for Large Data Bases, pages 43–64. North Holland & IFIP, 1976.

[36] W. Kim. Object-Oriented Databases: Definition and Research Directions. IEEE Transactions on Knowledge and Data Engineering, 2(3):327–341, 1990.


[37] G. Klyne and J. J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210, 2004. accessed July 20, 2015.

[38] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 723–737, Berlin, Heidelberg, 2009. Springer-Verlag.

[39] E. Kuzey, J. Vreeken, and G. Weikum. A Fresh Look on Knowledge Bases: Distilling Named Events from News. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, pages 1689–1698, New York, NY, USA, 2014. ACM.

[40] O. Lassila and R. R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/, 1999. accessed July 4, 2015.

[41] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 2012.

[42] A. Lozano-Tello and A. Gómez-Pérez. Ontometric: A method to choose the appropriate ontology. Journal of Database Management, 2(15):1–18, 2004.

[43] F. Mahdisoltani, J. Biega, and F. M. Suchanek. YAGO3: A Knowledge Base from Multilingual Wikipedias. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.

[44] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A Review of Relational Machine Learning for Knowledge Graphs: From Multi-Relational Link Prediction to Automated Knowledge Graph Construction. arXiv preprint arXiv:1503.00759, 2015.

[45] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Overcoming Schema Heterogeneity between Linked Semantic Repositories to Improve Coreference Resolution. In A. Gómez-Pérez, Y. Yu, and Y. Ding, editors, The Semantic Web, volume 5926 of Lecture Notes in Computer Science, pages 332–346. Springer Berlin Heidelberg, 2009.

[46] J. Paredaens, P. Peelman, and L. Tanca. G-Log: A Graph-Based Query Language. IEEE Trans. on Knowl. and Data Eng., 7(3):436–453, June 1995.

[47] J. Peckham and F. Maryanski. Semantic Data Models. ACM Comput. Surv., 20(3):153–189, Sept. 1988.

[48] M. Poveda-Villalón, M. Suárez-Figueroa, and A. Gómez-Pérez. Did You Validate Your Ontology? OOPS! In E. Simperl, B. Norton, D. Mladenic, E. Della Valle, I. Fundulaki, A. Passant, and R. Troncy, editors, The Semantic Web: ESWC 2012 Satellite Events, volume 7540 of Lecture Notes in Computer Science, pages 402–407. Springer Berlin Heidelberg, 2015.

[49] B. Quilitz and U. Leser. Querying Distributed RDF Data Sources with SPARQL. In Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC’08, pages 524–538, Berlin, Heidelberg, 2008. Springer-Verlag.

[50] M. A. Rodriguez. A graph analysis of the Linked Data cloud. arXiv preprint arXiv:0903.0194, 2009. accessed July 31, 2015.

[51] A. Rula, M. Palmonari, A. Harth, S. Stadtmüller, and A. Maurino. On the Diversity and Availability of Temporal Information in Linked Open Data. In The Semantic Web – ISWC 2012, volume 7649 of Lecture Notes in Computer Science, pages 492–507. Springer Berlin Heidelberg, 2012.

[52] E. Sandhaus. Semantic Technology at the New York Times: Lessons Learned and Future Directions. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part II, ISWC’10, pages 355–355, Berlin, Heidelberg, 2010. Springer-Verlag.

[53] L. Sauermann and R. Cyganiak. Cool URIs for the Semantic Web. W3C Note, http://www.w3.org/TR/2008/NOTE-cooluris-20081203/, 12 2008. accessed July 10, 2015.

[54] H. A. Schmid and J. R. Swenson. On the Semantics of the Relational Data Model. In Proceedings of the 1975 ACM SIGMOD International Conference on Management of Data, SIGMOD ’75, pages 211–223, New York, NY, USA, 1975. ACM.

[55] R. Segers, P. Vossen, M. Rospocher, L. Serafini, E. Laparra, and G. Rigau. ESO: a Frame based Ontology for Events and Implied Situations. Proceedings of Maplex2015, 2015.

[56] J. M. Smith and D. C. P. Smith. Database Abstractions: Aggregation and Generalization. ACM Trans. Database Syst., 2(2):105–133, June 1977.

[57] R. Snodgrass and I. Ahn. A Taxonomy of Time Databases. In Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data, SIGMOD ’85, pages 236–246, New York, NY, USA, 1985. ACM.

[58] R. T. Snodgrass. Temporal databases. In Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, volume 639 of Lecture Notes in Computer Science, pages 22–64. Springer Berlin Heidelberg, 1992.

[59] S. Staab and R. Studer. Handbook on Ontologies. Springer Publishing Company, Incorporated, 2nd edition, 2009.

[60] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The Singularity is Not Near: Slowing Growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, WikiSym ’09, pages 8:1–8:10, New York, NY, USA, 2009. ACM.

[61] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-meza. OntoQA: Metric-based ontology quality analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, 2005.

[62] R. W. Taylor and R. L. Frank. CODASYL Data-Base Management Systems. ACM Comput. Surv., 8(1):67–103, Mar. 1976.

[63] D. C. Tsichritzis and F. H. Lochovsky. Hierarchical Data-Base Management: A Survey. ACM Comput. Surv., 8(1):105–123, Mar. 1976.

[64] D. C. Tsichritzis and F. H. Lochovsky. Data Models. Prentice Hall Professional Technical Reference, 1982.

[65] P. Vossen, T. Caselli, and Y. Kontzopoulou. Storylines for structuring massive streams of news. ACL-IJCNLP 2015, pages 40–49, 2015.

[66] D. Vrandecic and Y. Sure. How to Design Better Ontology Metrics. In E. Franconi, M. Kifer, and W. May, editors, The Semantic Web: Research and Applications, volume 4519 of Lecture Notes in Computer Science, pages 311–325. Springer Berlin Heidelberg, 2007.

[67] E. R. Watkins and D. A. Nicole. Named Graphs as a Mechanism for Reasoning About Provenance. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, Frontiers of WWW Research and Development - APWeb 2006, volume 3841 of Lecture Notes in Computer Science, pages 943–948. Springer Berlin Heidelberg, 2006.

[68] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin. Knowledge Base Completion via Search-based Question Answering. In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, pages 515–526, New York, NY, USA, 2014. ACM.

