
Undefined 1 (2009) 1–5, IOS Press

From hyperlinks to Semantic Web properties using Open Knowledge Extraction
Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Valentina Presutti a,∗, Andrea Giovanni Nuzzolese a, Sergio Consoli a, Diego Reforgiato Recupero a and Aldo Gangemi a,b
a STLab, Institute of Cognitive Sciences and Technologies, National Research Council, via San Martino della Battaglia 44, 00185, Roma, Italy
E-mail: [email protected]
b LIPN, Université Paris 13 - Sorbonne Cité - CNRS
E-mail: {andrea.nuzzolese, sergio.consoli, diego.reforgiato}@istc.cnr.it, [email protected]

Abstract. Open information extraction approaches are useful but insufficient alone for populating the Web with machine readable information, as their results are not directly linkable to, and immediately reusable from, other Linked Data sources. This work proposes a novel Open Knowledge Extraction approach that performs unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information. The method is based on the hypothesis that hyperlinks (either created by humans or by knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces (i.e. the sentences that embed the hyperlinks) and formalised as Semantic Web triples and ontology axioms. Experimental evaluations conducted with the help of crowdsourcing confirm this hypothesis, showing very high performance. A demo of Open Knowledge Extraction is available at http://wit.istc.cnr.it/stlab-tools/legalo.

Keywords: open knowledge extraction, open information extraction, abstractive summarisation, link semantics, relation extraction.

1. Introducing Open Knowledge Extraction

The vision of the Semantic Web is to populate the Web with machine understandable data so that intelligent agents are able to automatically interpret its content - just like humans do by inspecting Web content - and assist users in performing a significant number of tasks, relieving them of cognitive overload.

The Linked Data movement [2] kicked off the vision by realising a key bootstrap in publishing machine understandable information mainly taken from structured data (typically databases) or semi-structured data (e.g. Wikipedia infoboxes). However, most of the Web content consists of natural language text; hence a main challenge is to extract as much relevant knowledge as possible from this content, and publish it in the form of Semantic Web triples. This work aims at solving this problem by extracting relational knowledge that is “hidden” in hyperlinks, which can be either defined manually by humans (e.g. Wikipedia pagelinks) or created automatically by Knowledge Extraction (KE) systems (e.g. a KE system can automatically add links to Wikipedia pages).

* Corresponding author. E-mail: [email protected]

Current knowledge extraction (KE) systems address very well the task of linking pieces of text to Semantic Web entities (e.g. owl:sameAs) by means of named entity linking methods, e.g. NERD1 [43], FOX2, conTEXT3 [26], DBpedia Spotlight4, Stanbol5, TAGME [13], Babelfy [31]. Some of them (e.g. NERD) also perform sense tagging, i.e. adding knowledge about entity types (rdf:type).

0000-0000/09/$00.00 © 2009 – IOS Press and the authors. All rights reserved

Nevertheless, it is desirable to enrich Web content with other semantic relations than owl:sameAs and rdf:type, i.e. factual relations between entities. A pragmatic trace of a factual relation between two entities is the presence of a hyperlink, which is associated with its linguistic trace, i.e. the text surrounding the hyperlink. In fact, when we include a link in a Web page, we usually have in mind a semantic relation between something we are referring to within the page, i.e. the subject, and something referred to by the target page, i.e. the object, and the text where the hyperlink is embedded typically contains an explanation of what that relation is. For example, a link to “Usenet” in the Wikipedia page of “John McCarthy”6 suggests a semantic relation between those two entities, which is explained by the sentence: “McCarthy often commented on world affairs on the Usenet forums”.

Besides common sense, this hypothesis is also supported by a previous study [35], which describes the extraction of encyclopedic knowledge patterns for DBpedia types, based on links between Wikipedia pages. A user study showed that hyperlinks between Wikipedia pages determine relevant descriptive contexts for DBpedia entities at the type level, which suggests that these links mirror relevant semantic relations between entities.

A hyperlink in a Web page can be produced either by a human or by a KE system (e.g., by linking a piece of text to a Wikipedia page, which in turn refers to a Semantic Web entity, i.e. a DBpedia entity). If a KE system recognises two or more entities in a sentence, there is a possibility that the sentence expresses some relation between them. For example, the following sentence:

The New York Times reported that John McCarthy died. He invented the programming language LISP.

can be automatically enriched using a KE system by linking the pieces of text “The New York Times”,

1 http://nerd.eurecom.fr
2 http://aksw.org/Projects/FOX.html
3 http://context.aksw.org/app/
4 http://dbpedia-spotlight.github.com/demo
5 http://stanbol.apache.org
6 Cf. http://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)

“John McCarthy”, and “LISP” to the Wikipedia pages wikipedia:The_New_York_Times7, wikipedia:John_McCarthy_(computer_scientist) and wikipedia:Lisp_(programming_language) (respectively), resulting in the following:

The New York Times reported that John McCarthy died. He invented the programming language LISP.

In this example, the three hyperlinks identify entities that are relevantly related by factual relations: “John McCarthy” with “The New York Times”, and “John McCarthy” with “LISP”. Revealing the semantics of hyperlinks (either defined by humans or by KE systems) has a high potential impact on the amount of Web knowledge that can be published in machine readable form.

In the Semantic Web era such factual relations should be expressed as RDF triples where subjects, objects, and predicates have a URI (except for literal objects and blank nodes), and predicates are formalised as RDF/OWL properties, in order to facilitate their reuse and alignment to existing vocabularies, and for example to annotate hyperlinks with RDFa, within HTML anchor tags.

While subjects and objects are mostly directly resolved on existing Semantic Web entities, predicates are to be defined by performing “paraphrasing”, a summarisation task able to abstract over the text (when needed) in order to design labels that are as close as possible to what a human would design in a Linked Data vocabulary. In this respect, [42] distinguishes between extractive and abstractive summarisation approaches. Extractive methods select pieces of text from the original source in order to define a summary (i.e. they rely only on the available text), while abstractive techniques ideally rely on modelling the text, and then combining it with other resources and language generation techniques for generating a summary. Abstractive methods are usually applied to large documents with the aim of producing a meaningful summary of their content.

This work proposes to apply the guiding principle of abstractive techniques to open information extraction as a novel contribution. Open information extraction refers to an open domain and unsupervised extraction paradigm. Existing open information extraction approaches are mainly extractive, hence showing a complementary nature to what we present in this

7 wikipedia: stands for http://it.wikipedia.org/wiki/


Table 1. Comparison between relations resulting from extractive and abstractive approaches for the sentence “John Stigall received a Bachelor of arts from the State University of New York at Cortland”.

Subject              | Predicate                           | Object                                            | Approach
John Stigall         | received                            | a Bachelor of arts                                | extractive
John Stigall         | received                            | from the State University of New York at Cortland | extractive
dbpedia:John_Stigall | myprop:receive_academic_degree      | dbpedia:Bachelor_of_arts                          | abstractive
dbpedia:John_Stigall | myprop:receive_academic_degree_from | dbpedia:State_University_of_New_York              | abstractive

paper. They mostly focus on breaking text in meaningful pieces for building resources of relational patterns (e.g. PATTY [32]8, Wisenet [30]9), in some cases disambiguated on external semantic resources such as WordNet10. Others focus on extracting facts, which are represented as simplified strings between entities (e.g. Open Information Extraction (OIE) [28]11) that are not given a Semantic Web identity.

Knowledge extraction for the Semantic Web should instead include an abstractive step, which exploits a formal semantic representation of text, and produces output that is compliant with Semantic Web principles and requirements. The method described in this paper demonstrates a novel unsupervised, open domain and abstractive paradigm, driven by Semantic Web principles and requirements, called open knowledge extraction (OKE). For example, given the sentence:

John Stigall received a Bachelor of arts from the State University of New York at Cortland.

Table 1 compares the relations resulting from an extractive approach (such as OIE [28]12) - the first two rows - and from an abstractive approach - the last two rows. The abstractive results exemplify the expected output of an OKE method. The main difference is that with the abstractive approach, subjects and objects are identified as Semantic Web entities, and the predicate is as close as possible to what a human would define for a Linked Data vocabulary, possibly using terms that are not mentioned in the original text. In addition to what Table 1 shows, the predicate would be formally defined in terms of OWL axioms and possibly aligned with existing Semantic Web vocabularies.

8 https://d5gate.ag5.mpi-sb.mpg.de/pattyweb/
9 http://lcl.uniroma1.it/wisenet/
10 http://wordnet.princeton.edu/
11 http://openie.cs.washington.edu/
12 Notice that this is the output of OIE for this sentence.

1.1. Contribution

Open information extraction approaches are useful but insufficient alone for populating the Web with machine readable information, as their results are not directly linkable to, and immediately reusable from, other Linked Data sources. The main contributions of this work are:

– a novel Open Knowledge Extraction (OKE) approach that performs unsupervised, open domain, and abstractive knowledge extraction from text for producing directly usable machine readable information;

– an implementation of OKE, named Legalo13, that, given an English sentence, produces a set of RDF triples representing relevant factual relations expressed in the sentence, the predicates of which are formally defined in terms of OWL axioms;

– an evaluation of Legalo performed on a number of sample sentences taken from a corpus14 of validated sentences that provide evidence of factual relations. The results have been evaluated with the help of crowdsourcing and the creation of a gold standard, all showing high values of precision, recall, and accuracy;

– a discussion highlighting the current limits of the approach and possible ways of improving it, including an informal comparison of the proposed method with one of the main existing open information extraction tools.

Additionally, the paper includes a brief description of a specific implementation of OKE, specialised for extracting the semantics of Wikipedia pagelinks, which has been evaluated in [40] showing promising results.

13 The Legalo demo can be played at http://wit.istc.cnr.it/stlab-tools/legalo/

14 https://code.google.com/p/relation-extraction-corpus/


The paper is structured as follows: Section 2 presents the novel OKE method. Section 3 describes the data sources that were used for developing a demonstrator of OKE and for its evaluation, while Section 4 describes Legalo, an implementation of OKE that has been evaluated with the help of crowdsourcing, as described in Section 5. Section 6 discusses the evaluation results, the limits of the method and possible ways to improve it, and informally compares Legalo with Open Information Extraction (OIE) [28]. Section 7 discusses relevant research work and, finally, Section 8 summarises the contribution of this work and briefly indicates future work.

2. A method for generating Semantic Web properties from text

OKE addresses the following capabilities:

– to assess if a natural language sentence provides evidence of a relevant relation between a given pair of entities, which are identified by hyperlinks; relevant here means that there are enough explicit traces in the sentence to support the existence of a (conceptual) relation;

– to generate a predicate for this relation, with a label that is as close as possible to what a human would define in a Linked Data vocabulary;

– to formalise this relation as an OWL object property with TBox axioms (conceptual level), as well as to produce ABox axioms (factual level) using that property.

More formally:

Definition 1. (Relevant relation)
Let s be a natural language textual sentence embedding some hyperlinks, and (esubj, eobj) a pair of entities mentioned in s, where esubj and eobj are the target entities referred to by two hyperlinks in s. ϕs(esubj, eobj) is a relevant relation between esubj and eobj, expressed in s, with esubj being the subject of ϕ and eobj being its object. Λ ≡ {λ1, ..., λn} is a set of Linked Data labels generated by humans for ϕs(esubj, eobj). Finally, λ′ is a label generated by OKE for ϕs(esubj, eobj).

OKE is able to assess the existence of ϕs(esubj, eobj) and to generate a label λ′ equal or very similar to λi ∈ Λ. The OKE method is based on six main steps:

1. (abstractive step) internal formal representation of the sentence;
2. assessment of the existence of a relevant relation between pairs of entities identified in s, according to the content of the sentence;
3. (extractive step) extraction of relevant terms for the predicate;
4. (abstractive step) generation of the predicate label;
5. (abstractive step) formal definition of the predicate within the scope of its linguistic evidence and formal representation;
6. alignment (whenever possible) to existing Semantic Web properties.

2.1. Frame-based formal representation of a sentence

Given the capabilities described in Section 2, i.e. the assessment of the existence of ϕs(esubj, eobj) and the generation of λ′ (cf. Definition 1), for both tasks OKE relies on a set of rules to be applied to a frame-based formal representation G of the sentence s (cf. Definition 2). G is an RDF graph designed following a frame-based approach, where nodes represent entities mentioned in s.

Definition 2. (Frame-based graph)
Let s be a natural language text sentence and G = (V, E) an RDF (directed, multi-) graph modelling a frame-based formal representation of s, where V ≡ {v0, ..., vn} is the set of nodes (i.e. subjects and objects from RDF triples) in G, E ≡ {edge1, ..., edgen} is the set of edges (i.e. RDF triples) in G, where edgei = (vi−1, p, vi) is a triple connecting vi−1 and vi with the RDF property p in G, and vi ∈ V is the node in G representing the entity ei mentioned in s.

Frame Semantics [14] is a formal theory of meaning: its basic idea is that humans can better understand the meaning of a single word by knowing the relational knowledge associated to that word. For example, the sense of the word buy can be clarified in a certain context or task by knowing about the situation of a commercial transfer that involves certain individuals playing specific roles, e.g. a seller, a buyer, goods, money, etc.

In this work, frames are usually expressed by verbs or other linguistic constructions, and their occurrences in a sentence are represented as RDF n-ary relations, all being instances of some type of event or situation (e.g. myont:buy_1 rdf:type myont:Buy), which is in turn represented as a subclass of dul:Event15. Intuitively, dul:Event is the top category of all frames expressed by verbs. In the context of this paper, the terms frame occurrence and event occurrence are used as synonyms. Entities that are mentioned in s are represented as individuals or classes, depending on their nature, which (ideally) have a type, defined based on the information available in the sentence s. When appropriate, entities are represented as arguments of n-ary relations, according to the role they play in the corresponding frame occurrence. The role of an entity in an event occurrence can be expressed either by a preposition, e.g. “Rico Lebrun taught at the Chouinard Art Institute”, or it can be abstracted from the text and represented by reusing the set of thematic roles defined by VerbNet [44], e.g. Rico Lebrun is the agent of the event occurrence “teach” in the above sample sentence.

Fig. 1. Frame-based formal representation for the sentence: “The New York Times reported that John McCarthy died.”

A formal and detailed discussion of the theory behind the frame-based formal representation of knowledge extracted from text, as used by OKE, is beyond the scope of this paper. This modelling approach and its founding theories are extensively described in [14,41,34]. However, an example may be useful to convey the intuition behind the theory. Figure 1 shows a frame-based representation of the sentence:

The New York Times reported that John McCarthy died.

15 The prefix dul: stands for http://www.ontologydesignpatterns.org/ont/dul/dul.owl#

The knowledge extracted from the sentence s is formalised as a set of RDF triples G16. Two entities can be identified in this sentence, i.e. “New York Times” and “John McCarthy”, represented in G as the individuals fred:New_York_Times and fred:John_McCarthy, respectively. Two frame occurrences can be identified in the sentence: one expressed by (an inflected form of) the verb report and the other expressed by the verb die. These frame occurrences are represented as n-ary relations, i.e. fred:report_1 and fred:die_1, both being instances of classes (fred:Report and fred:Die respectively) that are of type dul:Event. Let us consider the event occurrence fred:report_1. Its arguments are: (i) fred:New_York_Times, which plays an agentive role in this event occurrence, formally expressed by the predicate vn.role:Agent17, and (ii) fred:John_McCarthy, who plays a passive role, formalised by the predicate vn.role:Theme, both VerbNet thematic roles.
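The core of Figure 1 can be sketched as a handful of plain triples (prefixed names abbreviated as strings; the role edge attached to fred:die_1 is an assumption, since the text only details the arguments of fred:report_1):

```python
# Illustrative subset of the triples in G for "The New York Times reported
# that John McCarthy died." (the full graph produced by FRED is richer).
g = {
    ("fred:report_1", "rdf:type", "fred:Report"),
    ("fred:Report", "rdfs:subClassOf", "dul:Event"),
    ("fred:die_1", "rdf:type", "fred:Die"),
    ("fred:Die", "rdfs:subClassOf", "dul:Event"),
    ("fred:report_1", "vn.role:Agent", "fred:New_York_Times"),
    ("fred:report_1", "vn.role:Theme", "fred:John_McCarthy"),
    ("fred:die_1", "vn.role:Theme", "fred:John_McCarthy"),  # assumed role
}

# Frame (event) occurrences are the instances of classes under dul:Event
event_classes = {s for s, p, o in g
                 if p == "rdfs:subClassOf" and o == "dul:Event"}
occurrences = sorted(s for s, p, o in g
                     if p == "rdf:type" and o in event_classes)
print(occurrences)  # ['fred:die_1', 'fred:report_1']
```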

2.2. Relevant relation assessment

The OKE method for assessing whether a relevant relation ϕs(esubj, eobj) exists in s between a pair of entities (esubj, eobj) relies on the analysis of the semantic structure of G. Firstly, ϕs(esubj, eobj) is assumed to hold only if there is at least one path in G connecting vsubj and vobj, i.e. the nodes representing esubj and eobj in G, regardless of the edge direction in G. This is formally expressed by Axiom 1, given Definition 3.

16 The figure is derived from the output of FRED [41] (see Section 4), the component providing the frame-based formal representation within the OKE implementation. The prefix fred: stands for a local configurable namespace.
17 Prefix vn.role: stands for http://www.ontologydesignpatterns.org/ont/vn/abox/role/, which defines all VerbNet [44] thematic roles.

Definition 3. (Graph path)
G′ = (V, E′) is the undirected version of G = (V, E). A path P(vsubj, vobj) = [v0, edge1, ..., edgen, vn] with v0 = vsubj and vn = vobj is any sequence alternating nodes and edges in G′ connecting vsubj to vobj, or vice versa. The set Psetsubj,obj ≡ {edge1, v1, ..., edgen} includes all edges and nodes in P(vsubj, vobj) excluding vsubj and vobj.

Axiom 1. (ϕ assessment: necessary condition)
ϕs(esubj, eobj) ⇒ ∃P(vsubj, vobj)

If P(vsubj, vobj) exists, OKE distinguishes whether P(vsubj, vobj) contains an event occurrence or not. If P(vsubj, vobj) does not contain any event occurrence, then the existence of P(vsubj, vobj) is a sufficient condition for the existence of ϕs(esubj, eobj) (cf. Axiom 2).

Axiom 2. (Assessment of ϕ: sufficient condition without event occurrences)
ϕs(esubj, eobj) ⇐ ∃P(vsubj, vobj) such that ∀vi ∈ Psetsubj,obj, ¬dul:Event(vi)

In the other case, i.e. when the path includes an event occurrence, the OKE method states that ϕs(esubj, eobj) exists if esubj is the subject of the event verb in the sentence. In the graph G this means that the node vsubj representing esubj in G participates in the event occurrence with an agentive role. This is formalised by Axiom 3, given Definition 4.

Definition 4. (Agentive roles)
Let f be a node of G such that dul:Event(f). Role ≡ {ρ1, ..., ρn} is the set of possible roles participating in f, AgRole ≡ {ρm1, ..., ρmm} is the set of VerbNet agentive roles, with AgRole ⊆ Role, and ρ(f, vsubj) is a role connecting the event occurrence f to its participant vsubj (the node representing esubj) in s.

Axiom 3. (Assessment of ϕ with event occurrences: sufficient condition)
ϕs(esubj, eobj) ⇐ ∃P(vsubj, vobj) and ∃f ∈ Psetsubj,obj such that dul:Event(f) and ρ(f, vsubj) ∈ AgRole
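The interplay of Axioms 1-3 can be sketched in a few lines of plain Python (the toy graph, the helper name phi_holds, and the reduced agentive-role set are all illustrative stand-ins, not Legalo's code, which operates on the full FRED graph):

```python
from collections import deque

AGENTIVE = {"vn.role:Agent"}  # stand-in for the full VerbNet AgRole set

def phi_holds(triples, events, v_subj, v_obj):
    """Assess ϕ_s(e_subj, e_obj): Axiom 1 (a path exists), then Axiom 2
    (no event on the path) or Axiom 3 (v_subj plays an agentive role)."""
    adj, roles = {}, {}
    for s, p, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
        roles[(s, o)] = p
    # Axiom 1 (necessary): BFS for an undirected path v_subj .. v_obj
    prev, queue = {v_subj: None}, deque([v_subj])
    while queue and v_obj not in prev:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in prev:
                prev[w] = v
                queue.append(w)
    if v_obj not in prev:
        return False
    path, v = [], v_obj
    while v is not None:
        path.append(v)
        v = prev[v]
    interior = set(path[1:-1])
    # Axiom 2 (sufficient): no event occurrence along the path
    if not interior & events:
        return True
    # Axiom 3 (sufficient): an event f on the path has v_subj as its agent
    return any(roles.get((f, v_subj)) in AGENTIVE for f in interior & events)

# "The New York Times reported that John McCarthy died."
G = [("fred:report_1", "vn.role:Agent", "fred:New_York_Times"),
     ("fred:report_1", "vn.role:Theme", "fred:John_McCarthy")]
print(phi_holds(G, {"fred:report_1"}, "fred:New_York_Times", "fred:John_McCarthy"))  # True
print(phi_holds(G, {"fred:report_1"}, "fred:John_McCarthy", "fred:New_York_Times"))  # False
```

Note the asymmetry in the last two calls: the relation is asserted only when the candidate subject plays the agentive role, exactly as Axiom 3 requires.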

This axiom is based on results from linguistic typology (e.g. [8]), by which SVO (Subject-Verb-Object) languages such as English almost always have an explicit (or explicitable) subject. This subject is formalised in a frame-based representation of s by means of an agentive role. Based on this observation, the OKE method states that ϕs(esubj, eobj) exists if esubj is the subject of a verb in s. This axiom is potentially restrictive with respect to the idea of a relevant relation expressed in a sentence, which may consider any pair of entities as related just because they are mentioned in the same sentence. In fact, this idea is quite difficult to implement, since relations between pairs of entities that play e.g. oblique roles18 in a frame occurrence are hard to paraphrase even for a human. For example, consider the sentence:

After a move to Southern California in 1938, Rico Lebrun taught at the Chouinard Art Institute and then at the Disney Studios.

the frame-based representation of this sentence, depicted in Figure 2, identifies Rico Lebrun as the agent of a “teach” frame occurrence, while Southern California, Chouinard Art Institute, and Disney Studios participate in it with oblique roles. This sentence expresses three relevant relations: one between Rico Lebrun and Chouinard Art Institute, one between Rico Lebrun and Disney Studios, and another between Rico Lebrun and Southern California. All those relations can be easily summarised and represented as RDF triples, typically selected by OKE from the knowledge extracted from this sentence.

On the other hand, while it is correct to state that Chouinard Art Institute and Disney Studios co-participate in an occurrence of the frame “teach”, it is far from straightforward to paraphrase the meaning of this relation. E.g., one might say that Chouinard Art Institute and Disney Studios are both places where Rico Lebrun used to teach, but this paraphrase is not easily reconstructable from the text, and needs a stronger language generation approach, which has not been tackled for the moment. OKE may still represent this relation inferentially as a generic co-participation relation, which is however too generic to be considered relevant.

For this reason, the investigation of paraphrases of relations between entities co-participating in an event with oblique roles is left to further study. An interesting analysis of this problem, which could suggest new work directions, is discussed in [9].

18 Oblique roles are neither agentive nor passive, e.g. “manner”, “location”, etc.

Fig. 2. Frame-based formal representation for the sentence: “After a move to Southern California in 1938, Rico Lebrun taught at the Chouinard Art Institute and then at the Disney Studios”. OKE will select the pairs of entities (fred:Rico_lebrun, fred:Chouinard_art_institute), (fred:Rico_lebrun, fred:Disney_studios), and (fred:Chouinard_art_institute, fred:Southern_California).

2.3. Combining extractive and abstractive design for property label generation

As far as the generation of λ′ is concerned (cf. Definition 1), OKE combines extractive with abstractive techniques [42]. This means that it both reuses the terms in the text (extractive) and generates other terms derived from a semantic analysis of the text (abstractive). To this aim, OKE uses the semantic information provided by the frame-based representation G of the sentence s, which is further enriched with knowledge retrieved from external semantic resources. The resources DBpedia [3], Schema.org19, WiBi [15], and VerbNet [44] are used as examples as they are included in the OKE implementation presented in this work, i.e. Legalo (cf. Section 4). In particular, DBpedia is used for resolving (disambiguating) the nodes {vi} ∈ V that represent the entities {ei} in the sentence s, on Linked Data. WiBi and Schema.org are used for retrieving the types of these entities, and the labels of WiBi types are used in the generation of λ′. From VerbNet, as anticipated in Section 2.1, OKE obtains the thematic roles played by the DBpedia entities participating in a frame occurrence. Additionally, VerbNet is also used for disambiguating the sense of frame occurrences. For example, consider the sentence:

19 http://schema.org/

In February 2009 Evile began the pre-production process for their second album with Russ Russell.

Figure 3 shows the enriched frame-based formal representation of this sentence. The graph does not show WiBi types, but they are actually retrieved by the OKE implementation for each resolved DBpedia entity. Two entities are resolved on DBpedia, i.e. dbpedia:Evil and dbpedia:Russ_Russell, and two frame occurrences are identified, i.e. fred:begin_1 and fred:process_1. Furthermore, each node is assigned a type that, when possible, is aligned to existing Linked Data vocabularies. For example, dbpedia:Evil has type schema.org:MusicGroup20, and the entity fred:album_1 (representing the album mentioned in the sentence) is typed by the taxonomy fred:SecondAlbum rdf:type fred:Album. Following Axiom 1 and Axiom 3 (cf. Section 2.2), OKE will select from the graph of Figure 3 the pair of (DBpedia) entities: dbpedia:Evil, dbpedia:Russ_Russell.

The OKE design strategy for generating predicate labels is based on three main generative rules (GR). The first one concerns the concatenation of the labels that are used in the shortest path connecting the two nodes, including the labels of the edges and the labels of the node types in the path. This rule is defined by GR 1. It is important to remark that the path used as a reference for generating the predicate label is the one connecting the nodes vsubj and vobj, and not the corresponding resolved DBpedia entities.

20 Prefix schema.org: stands for http://schema.org

Fig. 3. Frame-based formal representation for the sentence: “In February 2009 Evile began the pre-production process for their second album with Russ Russell”. The graph is enriched with verb senses to disambiguate frame types, DBpedia entity resolutions, thematic roles played by DBpedia entities participating in frame occurrences, and entity types.

GR 1. (Labels concatenation) Given a pair (vsubj, vobj):

– identify the shortest path(s) P(vsubj, vobj) connecting vsubj and vobj;

– extract all labels (matching sentence terms) of the edges in the path;

– extract all labels of the most general types21 of the nodes that compose the path, except the types of vsubj and vobj;

– concatenate the extracted labels following their alternating sequence in P(vsubj, vobj).

Hence, referring to Figure 3, OKE will produce a predicate label λ = “begin process for album with” for expressing ϕs(Evil, Russ Russell). Notice that the only labels included in the concatenation are those with prefix fred:, meaning that they are extracted from s.
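GR 1 can be pictured as a traversal that alternates edge labels and intermediate node-type labels along the shortest path. The following is a minimal sketch, not Legalo's implementation: the list-based path representation and the type table are illustrative assumptions.

```python
# Sketch of GR 1 (labels concatenation); the data structures are
# hypothetical, not Legalo's internal ones. A path is a list that
# alternates nodes and edges: [n0, e0, n1, e1, ..., nk].
def gr1_label(path, node_type_label, edge_label):
    """Concatenate edge labels and intermediate node-type labels."""
    parts = []
    for i, elem in enumerate(path):
        if i % 2 == 1:                           # odd positions are edges
            parts.append(edge_label(elem))
        elif 0 < i < len(path) - 1:              # intermediate nodes only:
            parts.append(node_type_label(elem))  # skip vsubj and vobj
    return " ".join(p for p in parts if p)

# Path for "Evile began the pre-production process ... with Russ Russell":
# Evil --begin--> process_1 --for--> album_1 --with--> Russ_Russell
path = ["Evil", "begin", "process_1", "for", "album_1", "with", "Russ_Russell"]
types = {"process_1": "process", "album_1": "album"}
label = gr1_label(path, lambda n: types.get(n, ""), lambda e: e)
# label == "begin process for album with"
```

The alternation in the loop mirrors the "alternating sequence" requirement of the rule, and the two guards implement the exclusion of the types of vsubj and vobj.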

The second rule for generating predicate labels takes into account the possible presence of an event occurrence in the path connecting the pair (vsubj, vobj). Intuitively, in this case the path is a tree, rooted in an event occurrence, i.e. a node f such that dul:Event(f). The labels in this case are extracted only from the path starting from f and ending in vobj (referred to as the right branch of the tree), including also the label of the type of f. The rationale behind this rule is that the right branch of the tree including the root event (i.e. its type) provides the relevant information expressing the relation between the two nodes, according to an empirical observation conducted on a sample of ∼200 cases.

21 If a node is typed by a taxonomy, the most general type in the taxonomy is extracted.

For example, consider the (excerpt of the) frame-based representation of the sentence “Joey Foster Ellis has published on The New York Times, and The Wall Street Journal.” shown in Example 2.1.

Example 2.1. (Path including an event)

fred:publish_1 rdf:type fred:Publish;

vn.role:Agent fred:Joey_Foster_Ellis;

fred:on fred:New_York_Times;

fred:on fred:Wall_Street_Journal .

fred:Publish rdfs:subClassOf dul:Event .

fred:Joey_Foster_Ellis

owl:sameAs dbpedia:Joey_Foster_Ellis .

fred:New_York_Times

owl:sameAs dbpedia:The_New_York_Times .

fred:Wall_Street_Journal

owl:sameAs dbpedia:Wall_Street_Journal.

Following GR 1 and applying this additional rule to the selected pair: dbpedia:Joey_Foster_Ellis, dbpedia:Wall_Street_Journal leads to a predicate λ = “publish on” for ϕs(Joey Foster Ellis, Wall Street Journal). Additionally, if the right branch of the tree path is of length 1 and the only edge is a passive role, i.e. vobj participates with a passive role in f, the label of the WiBi type of vobj is concatenated to the predicate label. The rationale behind this rule is that when vsubj and vobj play respectively an agentive and a passive role in an event occurrence, the resulting predicate label following only GR 1 would be too generic; hence adding the WiBi type label makes the property label more specific and informative.

For example, a frame-based representation of the sentence “Elton John plays the piano” is given in Example 2.2:

Example 2.2. (Right branch of tree path with onlypassive role)

fred:play_1 rdf:type fred:Play;

vn.role:Agent fred:Elton_John;

vn.role:Theme fred:piano_1 .

fred:Elton_John

owl:sameAs dbpedia:Elton_John .

fred:piano_1 rdf:type dbpedia:Piano .

fred:Play rdfs:subClassOf dul:Event .

dbpedia:Piano

rdf:type wibi:MusicalInstrument .

If we apply the additional rules described so far to the pair (dbpedia:Elton_John, dbpedia:Piano), we obtain a label λ = “play musical instrument” for ϕs(Elton John, piano), which is more informative than a simple “play” that would result without adding the WiBi type label of dbpedia:Piano. This rule is defined by GR 2.

GR 2. (Path including event occurrences) Given a selected pair (vsubj, vobj) and the shortest path P(vsubj, vobj) connecting them, if P(vsubj, vobj) is a tree rooted in f, such that dul:Event(f):

– extract P(f, vobj) from P(vsubj, vobj);

– extract all edge labels in P(vsubj, vobj) that match with terms extracted from s;

– for each vi (including f and excluding vobj) in P(f, vobj) extract the label of its most general type;

– concatenate the extracted labels following their alternating sequence in P(f, vobj);

– if P(f, vobj) has only one edge (length = 1), and this edge identifies a VerbNet passive role, then extract the WiBi type of vobj and append it to the label concatenation.

The third rule for predicate label generation complements GR 1 and GR 2 by associating VerbNet roles to labels. Such labels have been defined top-down by analysing VerbNet thematic roles and their usage examples. The rule is defined in GR 3.

GR 3. (Thematic role labels) If a path contains a VerbNet thematic role, replace its label with an empty one, unless the role is associated with a non-empty label according to the following scheme:

vn.role:Actor1 -> “with”
vn.role:Actor2 -> “with”
vn.role:Beneficiary -> “for”
vn.role:Instrument -> “with”
vn.role:Destination -> “to”
vn.role:Topic -> “about”
vn.role:Source -> “from”

For example, consider the (excerpt of the) frame-based representation of the sentence “Lincoln’s wife suspects that John Wilkes Booth and Andrew Johnson conspired to kill Lincoln.” shown in Example 2.3.

Example 2.3. (Thematic roles associated with labels)

fred:conspire_1 rdf:type fred:Conspire;

vn.role:Actor1 fred:Andrew_Johnson;

vn.role:Actor2 fred:John_Wilkes_Booth .

fred:Conspire rdfs:subClassOf dul:Event .

fred:Andrew_Johnson

owl:sameAs dbpedia:Andrew_Johnson .

fred:John_Wilkes_Booth

owl:sameAs dbpedia:John_Wilkes_Booth .

fred:Lincoln

owl:sameAs dbpedia:Abraham_Lincoln .

By applying GR 1, 2 and 3 to the path connecting the pair: dbpedia:Andrew_Johnson, dbpedia:John_Wilkes_Booth OKE generates a label λ = “conspire with” for ϕs(Andrew Johnson, John Wilkes Booth). The mapping scheme (role <-> label) is an evolving resource, which improves based on the periodic evaluation of OKE implementation results.

2.4. Formalisation of extracted knowledge

Given a textual sentence s and its frame-based formal representation G, by following the generative rules described in Section 2.3, Legalo generates a label λ for each relation ϕs(esubj, eobj) that it is able to identify in s, based on the shortest path P(vsubj, vobj) connecting (vsubj, vobj) in G (cf. Definitions 1, 2, and 3). These labels constitute the basis for automatically generating a set of RDF triples that can be used for semantically annotating the hyperlinks included in s; additionally, this set of triples provides a (formalised) summary of s.



The aim of the formalisation step is to favour the reuse of the extracted knowledge by representing it as RDF triples, by augmenting it with informative annotations and axiomatisation, and by linking it to existing Semantic Web data. In particular, the formalisation step addresses the following tasks:

– producing an RDF triple (vsubj, pλ, vobj) for each hyperlink in s associated with eobj, such that ϕs(esubj, eobj) exists in s, where pλ is a predicate having label λ, vsubj is the node in G representing esubj, and vobj is the node in G representing eobj;

– formally defining pλ: its domain and range, and possibly other OWL axioms that specify its formal semantics;

– annotating each triple (vsubj, pλ, vobj) with information about its linguistic evidence, i.e. the sentence s;

– annotating each triple and predicate with information about the frame-based formal representation from which they were extracted.

RDF triples can be used for annotating hyperlinks, e.g. with RDFa; OWL axiomatisation supports ontology reuse; and scope annotations (i.e. linguistic evidence and formal representation) support reuse in relation extraction systems, e.g. relation extraction based on distant supervision [29,1].

Locality of produced predicates. The OKE method works on the assumption that each generated predicate and its associated formalisation are valid in the conceptual scope identified by the sentence s. This means that s identifies the scope of predicate name definitions, i.e. the namespace of a predicate depends on s. Pragmatically, this is implemented in Legalo by including the checksum of s in the predicate namespace. This strong locality constraint may lead to producing a high number of potentially equivalent properties (i.e. having the same intensional meaning) defined as if they were different. This issue is tackled by formalising all predicates with domain and range axioms having values, i.e. classes, from external (open domain) resources, as well as by keeping the binding between a predicate, its linguistic evidence, i.e. s, and its formal representation source, i.e. G; the latter contains information about the disambiguated senses of the verbs, i.e. frame occurrences, used in s. All these features allow, on the one hand, to inspect a specific property for understanding its meaning, e.g. in case of manual reuse, and on the other, to automatically reconcile predicates by computing a similarity measure based on them. In this paper, we focus on the generative part of the problem, i.e. generating usable labels for predicates and producing their formal definition, while we leave the reconciliation task to future work.

Fig. 4. Frame-based formal representation for the sentence: “The New York Times reported that John McCarthy died. He invented the programming language LISP.”

Fig. 5. Legalo’s triples produced from the sentence: “The New York Times reported that John McCarthy died. He invented the programming language LISP.”
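The locality mechanism can be pictured as deriving the predicate namespace from a checksum of the sentence. The sketch below is an assumption-laden illustration: the base URI and the choice of MD5 as checksum are hypothetical, and the actual namespace pattern used by Legalo may differ.

```python
import hashlib

# Illustrative base URI; not necessarily the one used by Legalo.
BASE = "http://wit.istc.cnr.it/stlab-tools/legalo/"

def predicate_namespace(sentence, base=BASE):
    """Derive a sentence-local namespace from a checksum of the sentence,
    so that predicates generated from different sentences live in
    different namespaces (the strong locality constraint)."""
    digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
    return base + digest + "/"

s = ("The New York Times reported that John McCarthy died. "
     "He invented the programming language LISP.")
ns = predicate_namespace(s)
# Predicates from the same sentence share one namespace; the "same"
# predicate generated from a different sentence gets another one,
# which is why a later reconciliation step is needed.
```

The checksum makes namespace generation deterministic and collision-resistant without requiring any global registry of sentences.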

RDF factual statements. For each hyperlink in s associated with a true assessment of ϕs(esubj, eobj) (cf. Axioms 1, 2, and 3), OKE produces at least one RDF triple. As explained in Section 2.3, the nodes vsubj and vobj in G representing esubj and eobj are resolved on DBpedia, which ensures that all produced triples are linked to the Linked Data cloud. The predicate is formalised as an OWL object property having λ′ as label and an ID derived by transforming λ′ according to the CamelCase notation22.
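The label-to-ID transformation (cf. also footnote 22) can be sketched as a simple lowerCamelCase conversion; the function name is invented for illustration.

```python
def camel_case_id(label):
    """Turn a multi-word predicate label into a lowerCamelCase ID,
    following the common Linked Data convention for object properties:
    first word lower-cased, subsequent words capitalised."""
    words = label.split()
    if not words:
        return ""
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])

# e.g. the label generated for the LISP example:
camel_case_id("invent programming language")  # -> "inventProgrammingLanguage"
```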

For example, consider the enriched frame-based formal representation of the sentence

The New York Times reported that John McCarthy died. He invented the programming language LISP.

depicted in Figure 4. OKE (i.e. its implementation) produces the triples depicted in Figure 5, according to the generative rules GR 1, 2, and 3, where the prefix legalo: is a namespace defined using the checksum of the sentence s. Notice that Figure 5 shows the WiBi types23 for the resolved DBpedia entities.

OWL property formalisation. For each generated property, OKE produces an additional set of OWL axioms that formally define it. The predicate formalisation states that the predicate is an OWL object property, and includes domain and range axioms, whose values are defined according to the WiBi types assigned to vsubj and vobj. In case of multi-typing of an entity, the value is the union of all types. In case a WiBi type is not available, the default type is owl:Thing. Example 2.4 shows the axioms formalising domain and range of the properties shown in Figure 5.

Example 2.4. (Domain and range axioms.)

legalo:reportDie a owl:ObjectProperty ;

rdfs:domain wibi:Newspaper ;

rdfs:range wibi:Computer_scientist .

22 According to a common Linked Data convention, using the CamelCase notation for OWL object properties makes the first term of the ID start with lower case, e.g. “invent programming language” -> inventProgrammingLanguage.

23 http://www.wibitaxonomy.org/

legalo:inventProgrammingLanguage

a owl:ObjectProperty ;

rdfs:domain wibi:Computer_scientist ;

rdfs:range wibi:Programming_language ;

rdfs:subPropertyOf legalo:invent .

As the reader may notice, an additional rdfs:subPropertyOf axiom is included in the formal definition of legalo:inventProgrammingLanguage. In fact, if a predicate is derived with GR 2, meaning that vsubj and vobj participate in an event with, respectively, an agentive and a passive role, then OKE also generates a more general property based on the event type, and produces a rdfs:subPropertyOf axiom. We remind that in these cases, the rule requires generating a specialised property label by appending the WiBi type of vobj to the label of the event type. Example 2.4 shows one of these cases. All properties produced by OKE are derived from a formal representation G of the sentence s, meaning that G provides their formal scope. Based on this principle, OKE produces an additional set of triples, which formalise the generated properties with reference to G. As stated by GR 1 and 2, there are two main types of paths from which the properties derive. In the first case, the path connecting vsubj and vobj does not include any event node. In this case, OKE produces an OWL property chain axiom stating that the generated property is implied by the chain of properties participating in the path, where each property of the path is formalised with domain and range axioms according to the locality of G. The same concept applies to the case of a path that includes an event node: similarly, OKE produces a property chain axiom. However, in this case the path has two different directions in G. For these types of paths OKE defines the concepts of left branch path, i.e. the one connecting the event node with vsubj, and right branch path, i.e. the one connecting the event node with vobj. For example, in Figure 4 the path P connecting fred:John_Mccarthy with fred:Lisp includes an event, i.e. fred:invent_1. Hence P is a tree whose root is this event node. The left branch path of P is the one connecting fred:invent_1 with fred:John_Mccarthy, while the right branch path of P is the one connecting fred:invent_1 with fred:Lisp. In order to define a property chain axiom OKE needs to define the inverses of all properties in the left branch of P. However, these branch paths may contain properties defined by VerbNet, i.e. thematic roles, which are independent of the event they are associated with in the scope of G, i.e. they are general domain properties. In other words, these properties do not carry any information about the event included in the path, which is relevant as far as the formal semantics of the OKE generated property is concerned. OKE tackles this issue by defining a local thematic role property for each VerbNet role participating in the event included in the path. For example, let us consider the property legalo:inventProgrammingLanguage in Figure 5. Its reference path includes the two (thematic role) properties vn.role:Agent and vn.role:Product. OKE generates two new properties, legalo:AgentInvent and legalo:ProductInvent, defined as sub-properties of vn.role:Agent and vn.role:Product, respectively. Given these two new properties, the axioms produced for formalising the generated property legalo:inventProgrammingLanguage are given in Example 2.5.

Fig. 6. The grounding vocabulary used for annotating the generated triples and properties with information about their linguistic and formal representation scope.

Example 2.5. (Property Chain Axiom when the con-necting path includes an event.)

legalo:inventProgrammingLanguage
    a owl:ObjectProperty ;
    owl:propertyChainAxiom
        ( [ owl:inverseOf legalo:AgentInvent ]
          legalo:ProductInvent ) .

legalo:AgentInvent
    rdfs:subPropertyOf vn.role:Agent .

legalo:ProductInvent
    rdfs:subPropertyOf vn.role:Product .

Scope annotations. Finally, OKE annotates all generated properties and triples with information related to the linguistic and formal representation scopes from which they were derived. To this aim a specific OWL ontology has been defined, named grounding24, depicted in Figure 6. This ontology reuses Earmark25, a vocabulary for annotating textual content, and semiotics26, a content ontology pattern that encodes a basic semiotic theory. Earmark defines the class earmark:Docuverse, which represents any container of strings that may appear in a document. In the context of OKE this class can be used for representing the sentence s. The semiotics content pattern defines three main classes: Expression, Meaning, and Reference (the semiotic triangle). The class Expression is also reused for representing the sentence s. As for the annotation of the linguistic scope of an RDF triple, the grounding vocabulary defines the more specific concept of “linguistic evidence”. In fact, according to the axioms defined in Section 2.2 and the generative rules defined in Section 2.3, the sentence s provides evidence of the relation ϕs(esubj, eobj), which is formalised by an RDF triple (vsubj, pλ, vobj). The concept of “linguistic evidence” is represented by the class LinguisticEvidence, which specialises both earmark:Docuverse and semiotics:Expression. The OWL property that relates an RDF triple generated by OKE to its linguistic evidence is hasLinguisticEvidence. Additionally, the class FrameBasedFormalModel is defined for representing the concept of frame-based formal representation of a textual sentence, described in detail in Section 2.1. This class is instantiated by the graph G representing s, which provides the formal scope for all generated properties and triples. The property derivedFromFormalRepresentation of the grounding ontology connects an OKE generated property, as well as an RDF triple, with the graph G from which they were derived. As an example, let us consider the sentence represented by the graph in Figure 4 and the generated RDF triple of the property legalo:inventProgrammingLanguage depicted in Figure 5. The scope annotations shown in Example 2.6 are generated.

24 The vocabulary can be downloaded from http://ontologydesignpatterns.org/cp/owl/grounding.owl

25 http://www.essepuntato.it/2008/12/earmark

26 http://www.ontologydesignpatterns.org/cp/owl/semiotics.owl

Example 2.6. (Scope annotations of an OKE property)

legalo:sentence

a grounding:LinguisticEvidence ;

earmark:hasContent "The New York Times

reported that John McCarthy died. He invented

the programming language LISP." .

[] a owl:Axiom ;

grounding:hasLinguisticEvidence legalo:sentence;

owl:annotatedProperty

legalo:inventProgrammingLanguage ;

owl:annotatedSource

dbpedia:John_McCarthy_(computer_scientist);

owl:annotatedTarget

dbpedia:Lisp_(programming_language) .

legalo:inventProgrammingLanguage

a owl:ObjectProperty ;

grounding:derivedFromFormalRepresentation

krgraph:52f88ca22 ;

grounding:definedFromLinguisticEvidence

legalo:sentence .

The first two axioms simply create an individual of type LinguisticEvidence for representing the sentence. The second group of axioms annotates the RDF triple for “John McCarthy invented Lisp” with its linguistic evidence. Finally, the legalo:inventProgrammingLanguage property is annotated with both its linguistic and its formal scope.

2.5. Alignment to Semantic Web vocabularies

This step has the goal of aligning the generated properties to existing Semantic Web ones. The idea is to maximise reuse and linking of extracted knowledge to existing Linked Data. The OKE method does not define a specific procedure for addressing this task; rather, this step is included in the method to emphasise its importance. The current version of the OKE prototype, named Legalo (cf. Section 4), implements a simple string matching technique based on the Levenshtein distance measure. The implementation of more sophisticated approaches for aligning OKE properties to existing vocabularies is part of future work. Relevant related work includes ontology matching techniques such as [12] (cf. the Ontology Alignment Evaluation Initiative27). A possible strategy is to apply state-of-the-art techniques in ontology matching, exploiting the information and features provided by the formalisation step (cf. Section 2.4).

3. Data sources

In the context of this work a number of data sources were used for different purposes.

3.1. Abstractive property design

– WiBi [15] is a Wikipedia bitaxonomy: a refined, rich and high quality taxonomy that integrates Wikipedia pages and categories. WiBi is used as a reference semantic resource for property label design. WiBi types are exploited in the OKE “paraphrasing” task (i.e. the abstractive step) when the extracted terms are too general to be informative enough. Details on how WiBi types are used are given in Section 2.3.

– VerbNet [44] is the largest domain-independent hierarchical verb lexicon available for English. It is organized into verb classes. Each verb class is described by thematic roles, selectional restrictions on the arguments, and frames. We map a subset of VerbNet thematic roles to specific prepositions, which are used in the OKE paraphrasing task. Details on this mapping are given in Section 2.3.

3.2. Properties alignment to existing Semantic Web vocabularies

Given a sentence including a link, OKE creates an RDF property synthesising the link’s semantics. When possible, it is desirable to align this property to existing Semantic Web properties. The current implementation of OKE uses three semantic resources for addressing this task:

– Watson28 [10] is a service that provides access to Semantic Web knowledge, in particular ontologies;

27 http://oaei.ontologymatching.org/

28 http://watson.kmi.open.ac.uk/WatsonWUI/



– Linked Open Vocabularies (LOV)29 is an aggregator of Linked Open vocabularies (including DBpedia), and provides services for accessing their data;

– Never-Ending Language Learning (NELL)30 [6] is a machine learning system that extracts structured data from unstructured Web pages and stores it in a knowledge base. It has run continuously since 2010. From the learnt facts, the NELL team has derived an ontology of categories and properties: it includes 548 properties at the moment31.

3.3. Evaluation

– Relation extraction corpus. In order to evaluate Legalo, i.e. the current implementation of OKE, a corpus for relation extraction developed at Google Research32 was used. Each dataset composing the corpus contains Wikipedia snippets evaluated by at least five raters to assess whether the snippet provides evidence for a specific relation. Section 5 provides additional details on how this corpus was used in the context of this work;

– Wikipedia and Wikipedia Pagelinks. Wikipedia is a collaboratively built encyclopedia on the Web. Currently, English Wikipedia contains ∼4.6M articles33. Each Wikipedia page refers to one entity: these entities are represented in DBpedia, the RDF version of Wikipedia. The Wikipedia Pagelinks dataset34 represents internal links between DBpedia instances as they occur in their corresponding Wikipedia pages. This dataset counts ∼149.7M dbpo:wikiPageWikiLink triples (as of DBpedia 2014). A subset of Wikipedia pages and their pagelinks35 was used as a testing sample for evaluating the specialised version of the OKE implementation for typing Wikipedia pagelinks.

29 http://lov.okfn.org/dataset/lov/

30 http://rtw.ml.cmu.edu/rtw/

31 http://nell-ld.telecom-st-etienne.fr/

32 https://code.google.com/p/relation-extraction-corpus/downloads/list

33 Source: http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia, November 2014

34 http://wiki.dbpedia.org/Downloads39

35 The previous version of the Wikipedia Pagelink dataset (DBpedia 3.9) was used in the evaluation, which counted ∼136.6M triples, as the evaluation was performed before the release of DBpedia 2014.

4. Legalo prototype

In this section, an implementation of OKE, named Legalo, is presented36. It is based on a pipeline of components and data sources, executed in the sequence illustrated in Figure 7.

1. FRED: Semantic Web machine reader. The core component of the system is FRED [41], a Semantic Web machine reader able to produce an RDF/OWL frame-based representation of a text. It integrates, transforms, improves, and abstracts the output of several NLP tools. It performs deep semantic parsing by reusing Boxer [4], which in turn uses a statistical parser (C&C) producing Combinatory Categorial Grammar trees, and thousands of heuristics that exploit existing lexical resources and gazetteers to generate structures according to Discourse Representation Theory (DRT) [25]. The basic NLP tasks performed by Boxer, and reused by FRED, include: event detection, semantic role labeling, first-order logic representation of predicate-argument structures, logical operator scoping (called boxing), modality detection, and tense representation.

FRED reengineers DRT/Boxing discourse representation structures according to SW and Linked Data design practices in order to represent events, role labeling, and boxing as typed n-ary logical patterns in RDF/OWL. The main class for typing events in FRED is dul:Event37. In addition, some variables created by Boxer as discourse referents are reified as individuals when they refer to something that has a role in the formal semantics of the sentence. For example, cat_1 is the reification of the x variable in the first-order predication Cat(x) extracted from the sentence “The cat is on the mat”.

Linguistic Frames [34], Ontology Design Patterns [20], open data, and various vocabularies are reused throughout FRED’s pipeline in order to resolve, align, or enrich extracted data and ontologies. The most used include: VerbNet38, for disambiguation of verb-based events; WordNet-RDF39 and OntoWordNet [19] for the alignment of classes to WordNet and DOLCE; DBpedia for the resolution and/or disambiguation of named entities, as well as for enriching the graph with existing facts known to hold between those entities; schema.org (among others) for typing the recognized named entities. For Named Entity Recognition (NER) and Resolution (a.k.a. Entity Linking) FRED relies on TAGME [13], an algorithmic NER resolver to Wikipedia that heavily uses sentence and Wikipedia context to disambiguate named entities.

36 A demo of Legalo is available at http://wit.istc.cnr.it/stlab-tools/legalo/

37 Prefix dul: stands for http://www.ontologydesignpatterns.org/ont/dul/dul.owl#

38 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

39 http://www.w3.org/TR/wordnet-rdf/

Fig. 7. Pipeline implemented by Legalo for generating Semantic Web properties for semantic annotation of hyperlinks based on their linguistic trace, i.e. the natural language sentence including the hyperlinks. Numbers indicate the order of execution of a component in the pipeline. Edges indicate input/output flows. (*) denotes tools developed in this work, which are part of this paper’s contribution.

All figures depicted in Section 2 show examples of FRED outputs: the reader may want to consider Figure 3, which shows the RDF/OWL graph for the sentence “In February 2009 Evile began the pre-production process for their second album with Russ Russell”, as a representative output of FRED.

Additionally, FRED reuses the Earmark vocabulary and annotation method [37] for annotating text segments with the resources from its graphs40. For example, in the example sentence of Figure 3, the term “Evil”, starting at the text span “17” and ending at the text span “22”, denotes the entity fred:Evil in the FRED graph G. This information is formalised with the following triples41:

fred:offset_17_22_Evil
    a pos:PointerRange ;
    rdfs:label "Evil"^^xmls:string ;
    semio:denotes fred:Evil ;
    pos:begins "17"^^xmls:nonNegativeInteger ;
    pos:ends "22"^^xmls:nonNegativeInteger .

40 These triples are not returned in the graph-view result of FRED at http://wit.istc.cnr.it/stlab-tools/fred/; they are returned with all other serialization output options.

41 Prefix pos: stands for http://www.essepuntato.it/2008/12/earmark#, semio: stands for http://ontologydesignpatterns.org/cp/owl/semiotics.owl#, and xmls: stands for http://www.w3.org/2001/XMLSchema#

2. Entity pair selection. This component is in charge of detecting the resolved entities and associating them with their lexical surface in s. This is done by querying FRED text span annotations. Another task of this component is, for each pair of detected entities (vsubj, vobj), to assess the existence of ϕs between them. In other words, this component checks the existence of paths between vsubj and vobj (cf. Axiom 1), selects the shortest one, and verifies if there are event nodes in the selected path. If so, it verifies if vsubj participates in the event occurrence with an agentive role (cf. Axiom 3). All selected pairs and associated paths are passed to the next component.
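The path check performed by this component can be pictured as a breadth-first search over an undirected rendering of G. The sketch below is a minimal illustration with a toy adjacency-list graph, not FRED's actual data structures or API.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS shortest path in an adjacency-list graph {node: [neighbours]}.
    Returns the node sequence, or None if the nodes are not connected."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: no candidate relation for this pair

# Toy undirected rendering of part of Figure 4:
g = {
    "John_Mccarthy": ["invent_1"],
    "invent_1": ["John_Mccarthy", "Lisp"],
    "Lisp": ["invent_1"],
}
shortest_path(g, "John_Mccarthy", "Lisp")
# -> ["John_Mccarthy", "invent_1", "Lisp"]: the selected path passes
#    through the event node invent_1, so the GR 2 case applies.
```

Once the shortest path is found, checking for event nodes along it (and for an agentive role of vsubj) is a simple scan of the returned sequence.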

3. RDF/OWL writer. This component is in charge of generating a predicate for each pair of entities received in input from the previous component, by applying the generative rules described in Section 2.3 to its associated path. In addition, this component implements two more modules: the “Property matcher” and the “Formaliser”.

The “Property matcher” is in charge of finding alignments between the generated predicate and existing Semantic Web vocabularies. As described in Section 3, three main sources are used for retrieving semantic property candidates. For assessing their similarity with the generated predicate, a string matching algorithm was implemented, which computes the Levenshtein distance metric [33] between the IDs of the two predicates. Of course, this component is not intended to be a contribution to advance the state of the art in ontology matching; its goal is to contribute to a complete implementation of OKE and to provide a possible baseline for comparing results with future improved versions.
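As a baseline, the matcher's similarity can be sketched with a textbook dynamic-programming Levenshtein distance over predicate IDs; this is a standard implementation, not Legalo's code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings,
    using a rolling row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# e.g. levenshtein("kitten", "sitting") == 3
# Comparing a generated predicate ID with a candidate vocabulary ID:
levenshtein("inventProgrammingLanguage", "invents")
```

Lower distances indicate better candidate alignments; a threshold on the (possibly length-normalised) distance decides whether an alignment is proposed.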

Finally, the RDF/OWL writer includes the component “Formaliser”. This component implements the formalisation step of the method (cf. Section 2.4). It is in charge of producing the triples summarising the relation expressed in s, which can be used for annotating the corresponding hyperlink, of generating OWL axioms defining domain and range of the generated predicates, and finally of annotating the produced triples and predicates with scope information.

Legalo for typing Wikipedia pagelinks. A specialised version of Legalo for typing Wikipedia pagelinks (Legalo-Wikipedia)42 was presented in [40]. It relies on Legalo as its core component and specialises it with two additional features: (i) a sentence extractor specialised for Wikipedia, and (ii) a subject resolver specialised for Wikipedia. A detailed description of this implementation can be found in [40]. Briefly, Legalo-Wikipedia takes as input a DBpedia entity and retrieves all its pagelinks triples (from the Pagelinks DBpedia dataset). For each pagelink triple it extracts all Wikipedia snippets containing a hyperlink corresponding to the triple (sentence extractor). It then selects all the snippets that contain a lexicalisation of the Wikipedia page subject (subject resolver), by relying on the DBpedia Lexicalizations Dataset43. For example, the wikipage wp:Ron_Cobb includes a link to wp:Sydney in the sentence:

"In 1972, Cobb moved to Sydney, Australia, where his work appeared in alternative magazines such as The Digger."

This sentence will be selected and stored as it contains the term "Cobb", which is a lexicalization of dbpedia:Ron_Cobb. The same wikipage includes a link to wp:Los_Angeles_Free_Press in the sentence:

"Edited and published by Art Kunkin, the Los Angeles Free Press was one of the first of the underground newspapers of the 1960s, noted for its radical politics."

This sentence will be discarded as it does not include any lexicalization of dbpedia:Ron_Cobb. This procedure is needed for identifying pagelinks that actually convey a semantic factual relation between the Wikipedia page subject and the target of the pagelink. Once the snippets are identified, they are passed to Legalo as input together with their associated entity pair.
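The subject-resolver filtering described above can be sketched as a simple containment test over the known lexicalizations of the page subject. The lexicalization list and the case-insensitive matching below are illustrative assumptions, not the actual lookup against the DBpedia Lexicalizations Dataset.

```python
def select_snippets(snippets, lexicalizations):
    """Keep only the snippets that mention at least one known
    lexicalization of the Wikipedia page subject."""
    return [s for s in snippets
            if any(lex.lower() in s.lower() for lex in lexicalizations)]

snippets = [
    "In 1972, Cobb moved to Sydney, Australia.",
    "Edited and published by Art Kunkin, the Los Angeles Free Press was "
    "one of the first of the underground newspapers of the 1960s.",
]
# Hypothetical lexicalizations of dbpedia:Ron_Cobb
print(select_snippets(snippets, ["Ron Cobb", "Cobb"]))
# → ['In 1972, Cobb moved to Sydney, Australia.']
```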

5. Results and evaluation

The specialised version of Legalo for typing Wikipedia pagelinks has been previously evaluated. For the sake of completeness, these results are summarised in Section 5.2 (for additional details, the reader can refer to [40]). With the help of crowdsourcing, an additional, more extensive evaluation of the current implementation of Legalo was performed, which allowed a better assessment of its performance and open issues. This section reports the results of this evaluation in terms of precision, recall, and accuracy.

42A demo is available at http://wit.istc.cnr.it/stlab-tools/legalo/wikipedia

43http://wiki.dbpedia.org/Datasets/NLP?v=yqj

5.1. Legalo working hypotheses

Legalo is based on two working hypotheses.

Hypothesis 1 (Relevant relation assessment). Legalo is able to assess if, given a sentence s, a relevant relation exists which holds between two entities, according to the content of s:

∃ϕ.ϕs(esubj , eobj)

This means that if s contains evidence of a relevant relation between esubj and eobj, then Legalo returns a true value; otherwise it returns false.

Hypothesis 2 (Usable predicate generation). Legalo is able to generate a usable predicate λ′ for a relevant relation ϕs between two entities, expressed in a sentence s: given λ′, a label generated by Legalo for ϕs, and λi, a label generated by a human for ϕs, the following holds (cf. Definition 1):

λ′ ≅ λi, λi ∈ Λ

which means that the label λ′ generated by Legalo is equal or very similar to a label λi that a human would define in a Linked Data vocabulary for representing ϕs in a particular textual occurrence.

This section reports the evaluation of Legalo based on the validation of Hypothesis 1 and Hypothesis 2.

Evaluation sample. As evaluation data, a corpus Crel−extraction for relation extraction developed at Google Research44 was used. The corpus comprises five datasets, each dedicated to a specific relation: place of birth, attending or graduating from an institution, place of death, date of death, and degree of education. Each dataset includes a snippet from Wikipedia, a pair (subject, object) of Freebase entities, and at least five user judgments that indicate whether the snippet contains a sentence providing evidence of a referenced relation (e.g., place of death) between the given pair of entities. It is important to remark that the Wikipedia snippets included in the corpus contain more than one sentence, and may therefore provide evidence of relations other than the ones for which they were evaluated. Based on this observation, the corpus has also been used for evaluating Legalo on its ability to assess the existence of non-pre-defined relations.

44https://code.google.com/p/relation-extraction-corpus/downloads/list

The evaluation was performed using a subset of Crel−extraction. More specifically, three evaluation datasets were derived from Crel−extraction and used for performing different experimental tasks.

– Cinstitution: a sample of 130 randomly selected snippets extracted from the file of Crel−extraction dedicated to evidence of relations expressing "attending or graduating from an institution". Legalo was executed on all 130 snippets, including in its input the pair of Freebase entities associated with the snippet in Cinstitution. For each snippet Legalo always gave an output, either one or more predicates or "no relation evidence" (i.e. a false value);

– Ceducation: a sample of 130 randomly selected snippets extracted from the file of Crel−extraction dedicated to evidence of relations expressing "obtaining a degree of education". Legalo was executed on all 130 snippets, including in its input the pair of Freebase entities associated with the snippet in Ceducation. For each snippet Legalo always gave an output, either one or more predicates or "no relation evidence";

– Cgeneral: a sample of 60 randomly selected snippets extracted from Crel−extraction, 15 snippets from each file (excluding "date of death", as Legalo only deals with object properties for the moment). The snippets were broken into single sentences and pre-processed with Tagme [13]45 in order to enrich them with hyperlinks referring to Wikipedia pages (i.e. DBpedia entities): 186 sentences with at least two recognised DBpedia entities were derived. In total, Legalo produced 867 results, of which 262 were predicates and 605 "no relation evidence". Notice that the high number of false values is not surprising, as in many cases a single sentence may contain a high number of entities, and Legalo had to assess the existence of ϕ on all possible combinations of pairs.

45http://tagme.di.unipi.it/


The resulting triples, predicate formalisations, and scope annotations are accessible via a Virtuoso SPARQL endpoint46.

There are several works demonstrating that crowdsourcing can be successfully used for building and evaluating semantic resources [16,47,36]. Following these experiences, Legalo was evaluated with the help of crowdsourcing. Five different types of crowdsourced tasks were defined:

1. assessing if a sentence s is evidence of the referenced relation (i.e. either "institution" or "education") between two given entities esubj and eobj mentioned in s - based on data from Cinstitution and Ceducation, respectively;

2. assessing if a sentence s is evidence of any relation between two given entities esubj and eobj mentioned in s - based on data from Cgeneral;

3. judging if a predicate λ′ generated by a machine adequately expresses (i.e. is a good summarisation of) a specific relation (i.e. either "institution" or "education") between two given entities esubj and eobj mentioned in s, according to the content of s - based on data from Cinstitution and Ceducation, respectively;

4. judging if a predicate λ′ generated by a machine adequately expresses (i.e. is a good summarisation of) any relation expressed by the content of s, between two given entities esubj and eobj mentioned in s - based on data from Cgeneral;

5. creating a phrase λ that summarises the relation expressed by the content of s, between two given entities esubj and eobj mentioned in s - based on data from Cgeneral.

Tasks 1 and 2 were used for validating Hypothesis 1. The results of these two tasks were then combined with those from Tasks 3 and 4 for validating Hypothesis 2. Finally, Task 5 was used for comparing the similarity between λ values generated by humans and λ′ values generated by Legalo, validating Hypothesis 2 from a different perspective.

It is important to remark that Task 1 duplicates information already available in Crel−extraction: this choice was driven by the need of using smaller datasets (samples of Crel−extraction), as the Legalo evaluation experiments needed to address different evaluation tasks47. Hence, the corpus samples were re-evaluated on the evidence task, in order to ensure a high reliability of the judgements48.

46Legalo results can be inspected at http://wit.istc.cnr.it:8894/sparql. The reader can submit a pre-defined default query for retrieving an overview of the dataset.

The Crowdflower platform49 was used for conducting the crowdsourcing experiments. All tasks included a set of "gold questions" used for computing a trust score t for each worker. Workers first had to perform their job on 7 test questions, and only those reaching t > 0.7 were allowed to continue50. Given the strongly subjective nature of Task 5, only for this task a lower trust score t > 0.6 was considered acceptable. Each run of a job for a worker contained 4 questions, and workers were free to stop contributing at any time. Each question was performed by at least three workers, in order to allow the computation of inter-rater agreement. Besides the initial test questions, in order to keep monitoring workers' reliability, each job contained one test question. Results from test questions were excluded from the computation of performance measures (i.e., precision, recall, accuracy, agreement).

For tasks 1 and 2, judgements were expressed as "yes" or "no" answers. For tasks 3 and 4, judgements could be expressed on a scale of three values: Agree (corresponding to a value of 1 when computing relevance measures), Partly Agree (corresponding to a value of 0.5), and Disagree (corresponding to a value of 0). Task 5 was completely open. The confidence measure, provided by CrowdFlower, measures the inter-rater agreement between workers weighted by their trust values, hence indicating both agreement and quality of judgements at the same time. It is computed as described in Definition 551, and an example is given in Example 5.1:

Definition 5. (Confidence score) Given a task unit u, a set of possible judgements {ji}, with i = 1, ..., n, a set of trust scores, each representing a rater, {tk}, with k = 1, ..., m, tsum = ∑k=1..m tk the sum of the trust scores of the raters giving judgements on u, and trust(ji) the sum of the tk values of the raters that chose judgement ji, the confidence score confidence(ji, u) for judgement ji on the task unit u is computed as follows:

confidence(ji, u) = trust(ji) / tsum

47Using the whole Crel−extraction corpus for the different tasks was not feasible, given the limited time available as well as the cost of a possible crowdsourcing experiment on such a huge amount of data.

48A preliminary analysis of Crel−extraction revealed that some judgements were incorrect, which can be irrelevant on large numbers but can bias the results on smaller sets.

49http://www.crowdflower.com/

50The value range of t is [0, 1]; the higher the score, the more reliable the worker.

51http://success.crowdflower.com/customer/portal/articles/1295977

Example 5.1. (Confidence score for evidence judgement) Table 2 shows the judgements of three raters on the same task unit, where the possible judgements are "yes" and "no". tsum = 0.95 + 0.89 + 0.98 = 2.82

Table 2
Example of confidence score computation for a task unit.

Task unit     Judgement   t
582275117     yes         0.95
582275117     no          0.89
582275117     yes         0.98

confidence("yes", 582275117) = (0.95 + 0.98) / 2.82 = 0.68

confidence("no", 582275117) = 0.89 / 2.82 = 0.31

When aggregating results for a task unit, the judgement with the highest confidence score is selected. Notice that confidence(ji, u) = 1 when all raters give the same judgement.
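Definition 5 translates directly into code. The sketch below recomputes Example 5.1; note that the paper's 0.31 for the "no" judgement reflects truncation rather than rounding.

```python
def confidence(judgements):
    """judgements: list of (judgement, trust) pairs for one task unit.
    Returns a dict mapping each judgement j to trust(j) / t_sum."""
    t_sum = sum(t for _, t in judgements)
    trust = {}
    for j, t in judgements:
        trust[j] = trust.get(j, 0.0) + t  # sum trust per judgement value
    return {j: s / t_sum for j, s in trust.items()}

# Values from Example 5.1 (task unit 582275117)
scores = confidence([("yes", 0.95), ("no", 0.89), ("yes", 0.98)])
print(round(scores["yes"], 2), round(scores["no"], 2))
# → 0.68 0.32
```

Aggregation then simply keeps the judgement with the highest confidence score; when all raters agree, the winning score is 1.0.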

Evaluation of Hypothesis 1. Table 3 shows the results of the evaluation of Hypothesis 1, i.e. Legalo's ability to assess whether a sentence s provides evidence of a relation ϕs between two entities (esubj, eobj). Task 1 was designed for evaluating this capability on specific relations, while Task 2 was designed for evaluating it on any relation. Each row shows the performance results for a specific run of the task, indicating the type of relation tackled and the crowdsourced task. The most important and informative measure in this case is precision. In fact, the value of recall is always 1.0 and accuracy always equals precision. This happens because Legalo always provides either a true or a false value and raters can only answer "yes" or "no". In other words, for this task there can be neither true negatives nor false negatives.

Evaluation of Hypothesis 2. Table 4 shows the results of the evaluation of Hypothesis 2, i.e. Legalo's ability to generate usable predicates for summarising relations between entities, according to the content of a sentence. Task 3 was designed for evaluating this capability on specific properties, while Task 4 was designed for evaluating it on any property. Each row shows the performance results, indicating the type of relation tackled and the crowdsourced task. The results for the "institution" relation and for "any" relation are computed both on the overall set of results and on a subset that ensured a higher confidence rate (i.e., only results with confidence(ji, u) > 0.65 are included). As for the evaluation of the "institution" relation, the subset of results with high confidence is 68% of the whole evaluation dataset, while for "any" relation it is 76%.

Table 3
Results of Legalo's performance in assessing the evidence of relations between entity pairs in a given sentence s. Performance measures are computed on the judgements collected in Tasks 1 and 2 based on data from Cinstitution, Ceducation, and Cgeneral.

Task   Relation      Precision   Recall   F-measure   Accuracy   Confidence
2      Any           0.84        1.0      0.91        0.84       0.82
1      Education     0.87        1.0      0.93        0.87       0.96
1      Institution   0.84        1.0      0.91        0.84       0.94

Table 4
Results of Legalo's performance in producing a usable label for relations between entity pairs in a given sentence. Performance measures are computed on the judgements collected in Tasks 3 and 4 based on data from Cinstitution, Ceducation, and Cgeneral.

Task                       Relation      Precision   Recall   F-measure   Accuracy   Confidence
3                          Education     0.92        0.91     0.91        0.85       0.80
3                          Institution   0.65        0.91     0.76        0.62       0.59
3 (high confidence only)   Institution   0.74        0.89     0.81        0.68       0.71
4                          Any           0.68        0.90     0.78        0.71       0.64
4 (high confidence only)   Any           0.73        0.87     0.80        0.75       0.76

Finally, Hypothesis 2 was also evaluated by computing a similarity score between human-created predicates and Legalo-generated ones. Task 5 was performed for collecting at least three labels λi for each triple (s, esubj, eobj). Surprisingly, the average confidence value on this task was not that low (0.59). We compared the Legalo predicate λ′ for a triple (s, esubj, eobj) with all λi created by the users for that triple. Two different similarity measures were computed: a string similarity score based on the Jaccard distance measure52, and a semantic similarity measure based on the SimLibrary framework [38]53. The latter is a semantic similarity score that extends string similarity with measures exploiting external semantic resources such as WordNet, MeSH or the Gene Ontology. The average Jaccard similarity score between Legalo labels and human ones is 0.63, while the SimLibrary score is 0.80 (the value range of both scores is [0, 1]; the higher the score, the more similar the two phrases). Before computing the similarity, a pre-processing step was performed with the aim of transforming all verbs to their base form and removing all auxiliary verbs from human predicates. The Stanford CoreNLP framework54 was used to compute the lemma and POS tag of each term in the phrase. This lemmatisation step was necessary in order to ensure a fair comparison of labels based on string similarity, as currently Legalo uses only base verb forms.

52Given two strings s1 and s2, where c1 and c2 are the two character sets of s1 and s2, the Jaccard similarity is defined as Jsim(s1, s2) = Jsim(c1, c2) = |c1 ∩ c2| / |c1 ∪ c2|.

53http://simlibrary.wordpress.com/

54http://nlp.stanford.edu/software/corenlp.shtml
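The character-set Jaccard measure described in footnote 52 is straightforward to reproduce; the example labels below are illustrative, not taken from the evaluation data.

```python
def jaccard(s1, s2):
    """Jaccard similarity over the character sets of two labels:
    |c1 ∩ c2| / |c1 ∪ c2| (cf. footnote 52)."""
    c1, c2 = set(s1), set(s2)
    return len(c1 & c2) / len(c1 | c2)

# Hypothetical Legalo label vs. a human-created label (after lemmatisation)
print(round(jaccard("receive degree from", "received a degree from"), 2))
# → 0.92
```

Because the measure operates on character sets, it is forgiving of word order and minor morphological variation, which is also why the lemmatisation pre-processing matters.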

Evaluating the alignment with existing Semantic Web vocabularies. The matching process performed against LOV55, NELL56 [6], and Watson57 [10] returned a number of proposed alignments between predicates generated by Legalo and existing properties in Linked Data vocabularies. In order to accept an alignment and include it in the formalisation of a Legalo property pnew, a threshold dmin = 0.70 on the computed similarity score (i.e., a normalised, difference-percentage-based Levenshtein distance58) was set, i.e. only alignments between properties having d > 0.70 were kept for the evaluation. All alignments satisfying this requirement were included in the formalisation of the properties generated during this study59.

The alignment procedure was executed on 629 Legalo properties pnew. For 250 pnew, it produced at least one alignment to a Semantic Web property psw with d > 0.70. Three raters independently judged the resulting alignments on a scale of three values (Agree, Neutral, Disagree), based on the available metadata of psw, i.e., comments, labels, domain and range. Table 5 shows the results of the user-based evaluation of the alignments between pnew and psw. The three raters independently judged the proposed alignments very accurate (precision 0.84) with a high inter-rater agreement (Kendall's W 0.76). Although it was not possible to compute recall for this evaluation, the low percentage of proposed alignments (only 40%) and the simple method applied suggest that there is considerable room for improvement. This evaluation and the implemented method are to be considered a baseline for future work on this specific task.

55http://lov.okfn.org/dataset/lov/

56http://rtw.ml.cmu.edu/rtw/

57http://watson.kmi.open.ac.uk/WatsonWUI/

58http://bit.ly/1qd45AQ

59All triples, property formalisations, and alignments can be retrieved at http://wit.istc.cnr.it:8894/sparql.

Table 5
Evaluation results on the accuracy of the alignment between pnew and psw.

# pnew with at least one psw   Total # of (pnew, psw)   Levenshtein threshold   Precision   Kendall's W
250                            693                      0.7                     0.84        0.76

5.2. Results and evaluation of Legalo applied to Wikipedia pagelinks

A previous study [40] described the evaluation of a specialised implementation of Legalo for dealing with Wikipedia pagelinks, i.e. Legalo-Wikipedia. In this section the results of this evaluation are reported. The specialised version of Legalo for Wikipedia pagelinks is available online60. This version of Legalo works under the assumption that a pagelink in a Wikipedia page corresponds to a semantic relation between the subject of the Wikipage containing the link and the entity referred to by the target page of the link. The linguistic trace of a pagelink is extracted from the text surrounding the hyperlink, and for each pair (s, hyperlink) Legalo generates a predicate if a path between the two entities exists in the graph G that formally represents s. The main difference between Legalo and its Wikipedia-specialised version is that in the latter the subject of the predicate is always given, and there is a high probability that it is correct, thanks to the design principles that guide the writing of Wikipedia pages. By contrast, Legalo has to tackle the identification of the subject of a relation, which can be any of the entities recognised in s. Furthermore, the evaluation experiment of Legalo-Wikipedia was performed by Linked Data experts; hence comparing the new results with the previous ones provides insights on the usability of the generated predicates regardless of the expertise of the evaluators.

60http://wit.istc.cnr.it/stlab-tools/legalo-wikipedia/

The evaluation results of Legalo-Wikipedia are published as RDF data and accessible through a SPARQL endpoint61.

The evaluated sample set consisted of 629 pairs (s, hyperlink), each associated with a FRED graph G. Legalo was executed on this corpus and generated 629 predicates (referred to as pnew from now on). The user-based evaluation involved three raters, computer science researchers familiar with Linked Data but not with Legalo. They independently judged the results of Legalo based on two separate tasks, using a five-value Likert scale (Strongly Agree, Agree, Partly Agree, Disagree, Strongly Disagree). When computing performance measures the scale was reduced to three values: Strongly Agree and Agree were associated with a value of 1, Partly Agree with 0.5, and Disagree and Strongly Disagree with 0.

The results of the user-based evaluation of pnew are reported in Table 6. The three raters independently judged that the generated predicates pnew were very well designed and accurate (F-measure 0.83) in capturing the semantics of their associated pagelinks according to the content of the sentence s, with a high inter-rater agreement (Kendall's W 0.73).

61http://isotta.cs.unibo.it:9191/sparql

Table 6
Evaluation results on the accuracy of pnew.

Number of pnew   Precision   Recall   F-measure   Kendall's W62
629              0.72        0.97     0.83        0.73

6. Discussion

Evaluation results. The results of the crowdsourced tasks demonstrate that the Legalo method has high performance (average F-measure = 0.92) on the assessment of the existence of ϕs(vsubj, vobj) (cf. Hypothesis 1). These results are very satisfactory, especially compared with the performance results of Legalo-Wikipedia [40], where this aspect was not tackled and the existence of ϕs was partly ensured by the nature of the input data (cf. Section 5.2).

As far as Hypothesis 2 (i.e. the usability of the generated λ′) is concerned, Legalo also shows very satisfactory performance in this case. An impressive result is the high average value of the semantic similarity score (0.80) between user-created predicates and Legalo-generated ones. This result confirms the hypothesis discussed in [40] that the Legalo design strategy is good at producing predicates that are very close to what a human would create for a Linked Data vocabulary. In the context of this work, this hypothesis can be extended to the capability of summarising such relations in a way very close to what a generic user would do. This result is very promising from the perspective of evolving Legalo into a summarisation tool, which is one of the envisioned directions of research.

Improving the usability of generated predicates. Nevertheless, there is room for improvement, which is better understood by analysing the results of the evaluation of Hypothesis 2, i.e. the usability of the generated λ′. Results show that Legalo also has high performance on this task, with an average F-measure of 0.82; however, inspecting the different relevance measures, it emerges that while recall is very high on all tasks (0.90 on average), average accuracy is 0.73 and average precision is 0.75. Although these are very satisfactory performances, it is worth identifying the cases that cause the generation of less usable or even bad results. An insight is that lower precision and accuracy are registered especially in the generation of predicates for "institution" relations (accuracy 0.62, precision 0.65) and for "any" relations (accuracy 0.71, precision 0.68), while for "education" relations these measures show significantly higher values (accuracy 0.85, precision 0.92). This turns out to be an important lesson learnt. In fact, the less satisfactory precision seems due to the fact that many "institution" relations between two entities (X, Z) are described in the form "X received his Y from institution Z" (or similar), i.e. a ternary relation, which in a frame-based representation G corresponds to something like:

:receive_1 vn.role:Agent :X ;
           vn.role:Theme :Y ;
           vn.role:Source :Z .

Currently, based on this representation, Legalo would generate a predicate by following the path connecting X to Z, hence without considering the information on Y. The resulting predicate in this case would be "receive from", while a more informative and usable one would clearly be, e.g., "receive degree from", assuming that the type of Y is degree63. This case can be easily generalised by exploiting the semantic information about the thematic role that Y plays in participating in the event receive_1. In fact, a representation pattern can be recognised here: when participating in the event receive_1, X plays an agentive role (as expected from Axiom 3), Y plays a passive role, and Z plays an oblique role. The type of an entity playing a passive role, i.e. Y in this case, is relevant information as far as the relation between an entity playing an agentive role and another playing an oblique role in an event is concerned. This pattern can be generalised to relations other than "institution", which explains the similar behaviour of Legalo in the two tasks focusing on assessing the usability of predicates for "institution" and "any" relations.

Another example that shows this pattern is given by the sentence,

“Hassan Husseini became an organizer for theCommunist Party.”

taken from the dataset Cgeneral. In this case, the representation is the following:

:become_1 vn.role:Agent :Hassan_Husseini ;
          vn.role:Patient :Organizer ;
          :for :Communist_Party .

and Legalo would produce the predicate "become for". By applying the newly suggested generative rule, the generated predicate would instead be the more informative and usable "become organizer for". This type of observation leads to the definition of additional generative rules that refine OKE and its implementation towards a highly probable improvement in precision and accuracy. New rules are currently being implemented based on the data collected in the different tasks; hence the reader may find an evolved and better version of the Legalo demo64 shortly.

63The term degree is an example of a possible type for Y; however, whatever the type of Y, including it in the predicate would make the predicate much more informative and usable.
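The refined generative rule discussed above, splicing the label of a passive-role participant (Theme or Patient) into the predicate, can be sketched as follows. The role dictionary and label conventions are illustrative assumptions, not Legalo's actual data structures.

```python
def generate_predicate(event_label, roles, preposition):
    """Sketch of the refined generative rule: if the event has a
    passive-role participant (Theme/Patient), include its label
    between the event label and the oblique preposition."""
    passive = roles.get("Theme") or roles.get("Patient")
    parts = [event_label]
    if passive:
        parts.append(passive.lower())
    parts.append(preposition)
    return " ".join(parts)

# "X received his degree from institution Z"
print(generate_predicate("receive", {"Theme": "Degree"}, "from"))
# → receive degree from

# "Hassan Husseini became an organizer for the Communist Party"
print(generate_predicate("become", {"Patient": "Organizer"}, "for"))
# → become organizer for
```

When no passive participant is present, the rule degrades gracefully to the current behaviour (e.g. "move to").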

Alignment to existing Semantic Web properties. As for the alignment procedure, there is also space for improvement, since this task was addressed by computing a simple Levenshtein distance. More sophisticated alignment methods, such as those from the Ontology Alignment Evaluation Initiative65, or other approaches for entity linking, such as SILK66 [22], can be investigated for enhancing the alignment results. An interesting result is that our alignments are good in terms of precision, although all properties matched with a distance score > 0.70 came only from Watson67 [10] and LOV68. We observed that almost all properties retrieved from NELL69 [6] had an editing distance < 0.70, hence almost none of them were judged appropriate. This reinforces the hypothesis that the OKE generative rules simulate very well the results of human property creation, i.e. the generated property names are cognitively well designed. In fact, Watson and LOV are repositories of authored Semantic Web ontologies and vocabularies, while NELL properties result from an artificial concatenation of automatically learnt categories.

As for the alignment recall, it was not possible to compute a standard recall metric because it is impossible to enumerate the false negatives, i.e., all existing Semantic Web properties that would match pnew but that we did not retrieve. The relatively high number of missing properties suggests on the one hand that a more sophisticated alignment method is needed. On the other hand, combining this result with the high accuracy of pnew and of the proposed alignments between pnew and psw, it is reasonable to hypothesise that many cases reveal a lack of intensional coverage in Semantic Web vocabularies, and that OKE can help fill this gap.

Comparison to Open Information Extraction. Extracting, discovering, or summarising relations from text is not an easy task. Natural language is very subtle in providing forms that can express, allude to, or entail relations, and syntax offers complex solutions to relate explicitly named entities, anaphoras to mentioned or alluded entities, concepts, and entire phrases, let alone tacit knowledge. Table 7 shows some kinds of (formalisable) relations that can be derived from text.

A full-fledged analysis of those texts is possible to a certain extent, especially if associated with background knowledge (as FRED does), but the conciseness and directness of hyperlinking based on binary relations is often lost. Hence the importance of tools like Legalo, which are able to reconstruct binary relations from complex machine reading graphs.

Table 7
Sample sentences involving non-trivial relations, expressed in a generic logical form.

sentence                                                argument#1                 binary relation   argument#2
Mr. Miller, 25, entered North Korea                     Mr._Miller                 enter             North_Korea
seven months ago.
He was charged with unruly behavior.                    Mr._Miller                 charge_with       x:unruly_behavior
North Korean officials suspected he was                 y:North_Korean_officials   suspect           try(he, (get_inside(he, z:Korea's_feared_prison_camp)))
trying to get inside one of the country's
feared prison camps.

64http://wit.istc.cnr.it/stlab-tools/legalo/

65http://oaei.ontologymatching.org/

66http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/

67http://watson.kmi.open.ac.uk/WatsonWUI/

68http://lov.okfn.org/dataset/lov/

69http://rtw.ml.cmu.edu/rtw/

It would be natural to compare the results of Legalo to relation extraction systems, but this would require manipulating their output, which is beyond the scope of this work. What follows is an explanation of the difficulties involved.

A state-of-the-art tool like Open Information Extraction (OIE, [28]) applies an extractive approach to relation extraction, and solves the problem by extracting segments that can be assimilated to the subject, predicate, and object of a triple. As reported in [18], accuracy was not very high with the version of OIE implemented as the ReVerb tool, but it has improved considerably recently. However, the segments that are extracted, though useful, are not always intuitively reusable as formal RDF properties or individuals. Table 8 shows one case of a very complex segment #3, i.e. "with a West angry over Russia's actions in Ukraine", which is a phrase that must be further analysed in order to be formalised, typically leading to multiple triples; and another case of a complex segment #2, i.e. "developed a passion for the native flora of the arid West Darling region identifying", which is not easily transformable into an RDF property.

The research presented here intends to go beyond text segmentation, by using an abstractive approach that selects paths in RDF graphs in order to generate RDF properties. The difference between the two approaches is striking, and leads to results that are difficult to compare. Table 9 shows two of the examples from Table 8 (the third one has no resolvable entity in the object position), as they are extracted and formalized by Legalo.
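To make the contrast concrete, the path-based label generation can be illustrated with a small sketch. This is not Legalo's actual implementation (which relies on FRED graphs and subgraph-pattern heuristics): the path encoding, the function name, and the example path are hypothetical, chosen only so that the output matches the legalo:faceWithAngryOverActionLocatedIn property shown in Table 9.

```python
def property_label(path_labels):
    """Concatenate edge labels along a graph path into one camelCase name.

    Each label is assumed (hypothetically) to use underscore-separated
    words, e.g. "face_with"; the first word stays lowercase, as in
    Linked Data property-naming conventions.
    """
    words = [w for label in path_labels for w in label.split("_")]
    head, *tail = words
    return head + "".join(w.capitalize() for w in tail)


# A hypothetical path between dbpedia:Vladimir_Putin and dbpedia:Ukraine,
# encoded as the sequence of edge labels traversed.
path = ["face_with", "angry_over", "action", "located_in"]
print(property_label(path))  # faceWithAngryOverActionLocatedIn
```

The point of the sketch is only the abstractive step: the property name is assembled from graph labels along a path, rather than copied verbatim from a text segment.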

For the reasons described above, this work has not attempted a direct comparison in terms of accuracy between OIE and Legalo: it would have required the transformation and formalization of OIE text segments into individuals and properties, and arbitrary choices on how to formalize complex segments. In the end, what would be measured is not their outputs, but the authors' ability to redesign OIE's output. For those interested in attempts to reuse heterogeneous NLP outputs for formal knowledge extraction, see [18].

7. Related Work

The work presented here can be categorised as formal binary relation discovery and labeling from arbitrary walks in connected fully-labeled multi-digraphs. In practice, this means that it is not just relation extraction (relations are extracted by FRED [41], and Legalo reuses them): Legalo discovers complex relations that summarise information encoded in several nodes and edges of the graph (RDF graphs are in fact connected, fully-labeled multi-digraphs). It considers certain paths along arbitrary directions of edges, aggregates some of the existing labels, and concatenates them in order to provide property names that are typical of Linked Data vocabularies, finally axiomatizing the properties with domain, range, subproperty, and property chain axioms.
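As a rough illustration of the final axiomatisation step, the following sketch emits Turtle for a generated property with domain, range, and property chain axioms. Only the OWL/RDFS vocabulary terms and the legalo:quarterbackOf name (from Table 9) come from the source; the domain, range, and chain members are hypothetical placeholders, and this is a sketch rather than Legalo's actual serialisation code.

```python
def axiomatise(prop, domain, range_, chain):
    """Render a generated property as Turtle with domain, range, and a
    property chain axiom (OWL 2 owl:propertyChainAxiom)."""
    lines = [
        f"{prop} a owl:ObjectProperty ;",
        f"    rdfs:domain {domain} ;",
        f"    rdfs:range {range_} ;",
        f"    owl:propertyChainAxiom ( {' '.join(chain)} ) .",
    ]
    return "\n".join(lines)


ttl = axiomatise(
    "legalo:quarterbackOf",     # generated property name (from Table 9)
    "dbpedia:Person",           # hypothetical domain
    "dbpedia:Sport",            # hypothetical range
    [":roleOf", ":teamPlays"],  # hypothetical chain of existing graph edges
)
print(ttl)
```

The property chain axiom captures the intuition that the generated property summarises a path of existing edges: any pair of entities connected by the chain is also connected by the new property.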

In other words, Legalo tries to answer the following question: what is the relation that links two (possibly distant) entities in an RDF graph?

Not much in the literature is directly comparable, but work from two related fields can be contrasted with what Legalo does: relation extraction and automatic summarisation.

The closest works in relation extraction include Open Information Extraction (e.g. [28,32]), relation extraction exploiting Linked Data [46,26], and question answering on Linked Data [27].

Relation extraction. The main antecedent to Open Information Extraction is probably the 1999 Open Mind Common Sense project [45], which adopted an ante-litteram crowdsourcing and games-with-a-purpose approach to populate a large informal knowledge base of facts expressed in triplet-based natural language. The crowd was left substantially free to express the subject, predicate, and object of a triplet, but during its evolution, forms started stabilizing, or were learnt by machine learning algorithms. Currently Open Mind is being merged with several other repositories in ConceptNet [21].

segment #1 | segment #2 | segment #3 | sentence
Eugene Nickerson | was quarterback of | the football team and captain | At St. Mark's School in Southborough, Massachusetts, Eugene Nickerson was quarterback of the football team and captain of the hockey team.
President Vladimir Putin | faced | with a West angry over Russia's actions in Ukraine | President Vladimir Putin, faced with a West angry over Russia's actions in Ukraine, has been boosting ties to the East.
Florence May Harding | developed a passion for the native flora of the arid West Darling region identifying | plants | Early in life Florence May Harding developed a passion for the native flora of the arid West Darling region, collecting and identifying plants.

Table 8. Some relations extracted by OIE from sample sentences.

rdf:subject | rdf:property | rdf:object | sentence
dbpedia:Eugene_Nickerson | legalo:quarterbackOf | dbpedia:American_Football | At St. Mark's School in Southborough, Massachusetts, Eugene Nickerson was quarterback of the football team and captain of the hockey team.
dbpedia:Vladimir_Putin | legalo:faceWithAngryOverActionLocatedIn | dbpedia:Ukraine | President Vladimir Putin, faced with a West angry over Russia's actions in Ukraine, has been boosting ties to the East.

Table 9. Two sample extractions by Legalo from the same sentences as in Table 8.

Open Information Extraction (aka Machine Reading), as it is currently known in the NLP community, performs bootstrapped (i.e. started by learning from a small set of seed examples, and then recursively and incrementally applied to a huge corpus, cf. [11]), open-domain, and unsupervised information extraction. E.g. OIE70 is based on learning frequent triplet patterns from a huge shallow parsing of the Web, in order to create a huge knowledge base of triplets composed of text chunks.

This idea (on a smaller scale) was explored in [7], with the goal of resolving predicates to, or enlarging, a biomedical ontology. On the contrary, OIE extracts binary relations by segmenting the texts into triplets. However, there is usually no attempt to resolve the subjects and objects of those triplets, nor to disambiguate or harmonize the predicates used in the triples. Since predicates are not formally represented, they are hardly reusable, e.g. for annotating links with RDFa tags. See Section 6 for a comparison between OIE and Legalo, which shows the difficulty of even designing a comparison test.

Overall, Open Information Extraction looks like a component for extractive summarization (see below). In [32], named entity resolution is used to resolve the subjects and objects, and there is an attempt to build a taxonomy of predicates, which are encoded as lexico-syntactic patterns rather than typical predicates.

Another important Open Information Extraction project is Never Ending Language Learning (NELL)71 [6], a learning tool that since 2010 has been processing the Web to build an evolving knowledge base of facts, categories and relations. In this case there is a (shallow) attempt to build a structured ontology of recognized entities and predicates from the facts learnt by NELL. In this work, NELL is used in an attempt to align the semantic relations resulting from Legalo to the NELL ontology.

The main difference between approaches such as OIE and NELL, and Legalo is that the former focus mainly on extracting direct relations between entities, while Legalo focuses on revealing the semantics of relations between entities that can be: a) directly linked, b) implicitly linked, c) suggested by the presence of links in Web pages, d) indirectly linked, i.e. expressed by longer paths or n-ary relations. Legalo's novelty also resides in performing property label generation. From the acquisition perspective, Legalo is not bootstrapped, but it is open-domain and unsupervised.

70 http://openie.cs.washington.edu/
71 http://rtw.ml.cmu.edu/rtw/

Relation extraction and question answering targeted at Linked Data are quite different from both Open Information Extraction and Legalo: they are oriented towards formal knowledge, but they are not bootstrapped, open-domain, or unsupervised. They typically use a finite vocabulary of predicates (e.g. from the DBpedia ontology), and use their extensional interpretation in data (e.g. DBpedia) either to link two entities recognized in some text (as in [46,26]), or to find an answer to a question from which some entities have been recognized (as in [27]). The domain is therefore limited to the coverage of the vocabulary, and distant supervision is provided by the background knowledge (e.g. [1]). A growing repository of relationships extracted with this specific-domain, distantly supervised approach is sar-graphs [46].

Automatic summarisation. Automatic summarisation deserves a short discussion, since ultimately Legalo's relation discovery can be used as a component for that application task. According to [42], the main goal of a summary is to present the main ideas from one or more documents in less space, typically less than half of one document. Different categorisations of summaries have been proposed: topic-based, indicative, generic, etc., but the most relevant distinction seems to be between "extracts" and "abstracts". Extracts are summaries created by reusing portions of the input text verbatim, while abstracts are created by reformulating or regenerating the extracted content. An extraction step is needed in any case, but while extracts compress the text by squeezing out unimportant material and fusing the reused portions, abstracts typically model the text by accessing external information, applying frames, deep parsing, etc., eventually generating a summary that in principle could contain no word in common with the original text.

Extractive summarisation is now in mass usage, e.g. in the snippets provided by search engines. It has serious limits, of course, because the size and relevance of the extracts can be questionable, and not as accurate as a human would be.

Legalo can be considered closer to abstractive summarisation, since it can be used to build frame-based abstractive summaries of texts, consisting in binary relation discovery, which can then be filtered for relevance. The current implementation of Legalo is not designed with abstractive summarisation in view, therefore it was not evaluated for that task, but it is appropriate to report at least one relevant example of related work in this area.

Opinosis [17] is a state-of-the-art system for abstractive summarisation. It performs graph-based summarisation, generating concise abstractive summaries of highly redundant opinions. It uses a word graph data structure to represent the text, whereas Legalo uses a semantic graph. As the authors say: "Opinosis is a shallow abstractive summariser as it uses the original text itself to generate summaries. This is unlike a true abstractive summariser that would need a deeper level of natural language understanding". Legalo is indeed based on FRED [41], which provides such a deeper level of understanding.

In order to be considered an abstractive summariser, Legalo will need to be complemented with more capabilities: to rank discovered relations across an entire text or even multiple texts, to associate them in a way that final users can make sense of, and to evaluate summaries appropriately. Results from both abstractive summarisation (e.g. [49,17,24]) and RDF graph summarisation (e.g. [48,39,5]) can be reused for that purpose.

8. Conclusion and future work

Conclusion. This paper presents a novel approach for Open Knowledge Extraction, and its implementation, called Legalo, for uncovering the semantics of hyperlinks based on a frame-based formal representation of natural language text, and heuristics associated with subgraph patterns. The main novel aspects of the approach are: property label generation, automatic link tagging, graph-based relation extraction, and abstractive summarisation.

The working hypothesis is that hyperlinks (either created by humans or by knowledge extraction tools) provide a pragmatic trace of semantic relations between two entities, and that such semantic relations, their subjects and objects, can be revealed by processing their linguistic traces: the sentences that embed the hyperlinks. Evaluation experiments conducted with the help of a crowdsourcing platform confirm the hypothesis, and show very high performances: the method is able to predict the actual presence of a relation with high precision (average F-measure 0.92), and to generate accurate RDF properties between the hyperlinked entities in single-relation corpora (average F-measure 0.84), in the Wikipedia page link corpus (average F-measure 0.84), as well as in the challenging open domain corpus (average F-measure 0.78). The accuracy remains constant across the crowdsourced evaluation and the comparison to the (crowdsourced) gold standard for the open domain corpus. We also provide alignments to Semantic Web vocabularies with a precision of 0.84.

A demo of the Legalo Web service is available online72, as well as the prototype dedicated to Wikipedia pagelinks73; the binary properties produced in this study can be accessed via a SPARQL endpoint74.
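As a first exploration of that endpoint, one could list the distinct predicates it exposes with a generic query such as the following; the graph layout behind the endpoint is not documented here, so anything more specific would be guesswork:

```sparql
# List (some of) the predicates exposed by the endpoint,
# which should include the legalo: properties produced in this study.
SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
LIMIT 100
```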

Ongoing work. Current work concentrates on designing and testing new heuristics, as required by evidence emerging from experiments and tests (cf. e.g. Section 6), on identifying new ways of aligning the relations generated by Legalo to existing ontologies, and on discovering regularities in the relation taxonomies that are progressively discovered.

Future work. The main research line for the future is to apply Legalo to application tasks. An obvious one is a real abstractive summarisation task, at both single-text and multiple-text level, evaluating the results against state-of-the-art tools. The challenges there include at least: (i) managing multiple (and possibly dynamically evolving) Open Knowledge Extraction graphs, (ii) assessing the relevance of discovered relations, and their dependencies within a same text or across multiple texts, and (iii) generating factoid sequences that make sense to a final user of abstractive summaries. Other applications of Legalo are also envisaged, including question answering and textual entailment.

References

[1] I. Augenstein, D. Maynard, and F. Ciravegna. Relation extraction from the web using distant supervision. In Janowicz et al. [23], pages 26–41.

[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.

[3] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. J. Web Sem., 7(3):154–165, 2009.

[4] J. Bos. Wide-Coverage Semantic Analysis with Boxer. In J. Bos and R. Delmonte, editors, Semantics in Text Processing, pages 277–286. College Publications, 2008.

72 http://wit.istc.cnr.it/stlab-tools/legalo
73 http://wit.istc.cnr.it/stlab-tools/legalo/wikipedia
74 http://isotta.cs.unibo.it:9191/sparql

[5] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[6] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010), volume 5, page 3, 2010.

[7] M. Ciaramita, A. Gangemi, E. Ratsch, J. Šaric, and I. Rojas. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, pages 659–664, San Francisco, CA, USA, 2005. Morgan Kaufmann Publishers Inc.

[8] B. Comrie. Language universals and linguistic typology: Syntax and morphology. University of Chicago Press, 1989.

[9] W. Croft. Syntactic categories and grammatical relations: The cognitive organization of information. University of Chicago Press, 1991.

[10] M. d'Aquin, E. Motta, M. Sabou, S. Angeletou, L. Grindinoc, V. Lopez, and D. Guidi. Towards a new generation of semantic web applications. IEEE Intelligent Systems, 23(3):80–83, 2008.

[11] O. Etzioni, M. Banko, and M. J. Cafarella. Machine reading. In AAAI, volume 6, pages 1517–1519, 2006.

[12] J. Euzenat and P. Shvaiko. Ontology Matching, Second Edition. Springer, 2013.

[13] P. Ferragina and U. Scaiella. Tagme: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA, 2010. ACM.

[14] C. J. Fillmore. Frame semantics, pages 111–137. Hanshin Publishing Co., Seoul, South Korea, 1982.

[15] T. Flati, D. Vannella, T. Pasini, and R. Navigli. Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 945–955, Baltimore, Maryland, 2014. Association for Computational Linguistics.

[16] M. Fossati, C. Giuliano, and S. Tonelli. Outsourcing FrameNet to the crowd. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, pages 742–747. The Association for Computer Linguistics, 2013.

[17] K. Ganesan, C. Zhai, and J. Han. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 340–348. Association for Computational Linguistics, 2010.

[18] A. Gangemi. A comparison of knowledge extraction tools for the semantic web. In The Semantic Web: Semantics and Big Data, pages 351–366. Springer, 2013.

[19] A. Gangemi, A. G. Nuzzolese, V. Presutti, F. Draicchio, A. Musetti, and P. Ciancarini. Automatic typing of DBpedia entities. In International Semantic Web Conference (1), pages 65–81, 2012.

[20] A. Gangemi and V. Presutti. Ontology Design Patterns. In S. Staab and R. Studer, editors, Handbook on Ontologies, 2nd Edition. Springer Verlag, 2009.

[21] C. Havasi, R. Speer, and J. Alonso. ConceptNet: A lexical resource for common sense knowledge. Recent advances in natural language processing V: selected papers from RANLP, 309:269, 2007.

[22] R. Isele and C. Bizer. Active learning of expressive linkage rules using genetic programming. J. Web Sem., 23:2–15, 2013.

[23] K. Janowicz, S. Schlobach, P. Lambrix, and E. Hyvönen, editors. Knowledge Engineering and Knowledge Management - 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings, volume 8876 of Lecture Notes in Computer Science. Springer, 2014.

[24] H. Ji, B. Favre, W.-P. Lin, D. Gillick, D. Hakkani-Tur, and R. Grishman. Open-domain multi-document summarization via information extraction: Challenges and prospects. In Multi-source, Multilingual Information Extraction and Summarization, pages 177–201. Springer, 2013.

[25] H. Kamp. A theory of truth and semantic representation. In J. A. G. Groenendijk, T. M. V. Janssen, and M. B. J. Stokhof, editors, Formal Methods in the Study of Language, volume 1, pages 277–322. Mathematisch Centrum, 1981.

[26] A. Khalili, S. Auer, and A.-C. Ngonga Ngomo. ConTEXT - lightweight text analytics using linked data. In Proc. of the Eleventh Extended Semantic Web Conference (ESWC 2014), pages 628–643, Crete, Greece, 2014. Springer.

[27] V. Lopez, A. Nikolov, M. Sabou, V. Uren, E. Motta, and M. d'Aquin. Scaling up question-answering to linked data. In Knowledge Engineering and Management by the Masses, pages 193–210. Springer, 2010.

[28] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 523–534, 2012.

[29] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 1003–1011, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[30] A. Moro and R. Navigli. Integrating syntactic and semantic analysis into the open information extraction paradigm. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 2148–2154. AAAI Press, 2013.

[31] A. Moro, A. Raganato, and R. Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.

[32] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1135–1145, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[33] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, Mar. 2001.

[34] A. G. Nuzzolese, A. Gangemi, and V. Presutti. Gathering Lexical Linked Data and Knowledge Patterns from FrameNet. In Proc. of the 6th International Conference on Knowledge Capture (K-CAP), pages 41–48, Banff, Alberta, Canada, 2011.


[35] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC 2011), pages 520–536. Springer, 2011.

[36] J. Oosterman, A. Nottamkandath, C. Dijkshoorn, A. Bozzon, G. Houben, and L. Aroyo. Crowdsourcing knowledge-intensive tasks in cultural heritage. In F. Menczer, J. Hendler, W. H. Dutton, M. Strohmaier, C. Cattuto, and E. T. Meyer, editors, ACM Web Science Conference, WebSci '14, Bloomington, IN, USA, June 23-26, 2014, pages 267–268. ACM, 2014.

[37] S. Peroni, A. Gangemi, and F. Vitali. Dealing with markup semantics. In Proceedings of the 7th International Conference on Semantic Systems, pages 111–118. ACM, 2011.

[38] G. Pirró and J. Euzenat. A feature and information theoretic framework for semantic similarity and relatedness. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC'10, pages 615–630, Berlin, Heidelberg, 2010. Springer-Verlag.

[39] V. Presutti, L. Aroyo, R. Adamou, A. Gangemi, and G. Schreiber. Extracting core knowledge from linked data. In COLD 2011, 2011.

[40] V. Presutti, S. Consoli, A. G. Nuzzolese, D. R. Recupero, A. Gangemi, I. Bannour, and H. Zargayouna. Uncovering the semantics of Wikipedia pagelinks. In Janowicz et al. [23], pages 413–428.

[41] V. Presutti, F. Draicchio, and A. Gangemi. Knowledge extraction based on discourse representation theory and linguistic frames. In Knowledge Engineering and Knowledge Management, volume 7603 of Lecture Notes in Computer Science, pages 114–129. Springer, 2012.

[42] D. R. Radev, E. Hovy, and K. McKeown. Introduction to the special issue on summarization. Comput. Linguist., 28(4):399–408, Dec. 2002.

[43] G. Rizzo, R. Troncy, S. Hellmann, and M. Bruemmer. NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In LDOW, 5th Workshop on Linked Data on the Web, Lyon, France, April 2012.

[44] K. K. Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, 2006.

[45] P. Singh et al. The public acquisition of commonsense knowledge. In Proceedings of the AAAI Spring Symposium: Acquiring (and Using) Linguistic (and World) Knowledge for Information Access, 2002.

[46] H. Uszkoreit and F. Xu. From strings to things: sar-graphs - a new type of resource for connecting knowledge and language. In NLP-DBPEDIA@ISWC, 2013.

[47] D. Vannella, D. Jurgens, D. Scarfini, D. Toscani, and R. Navigli. Validating and extending semantic knowledge bases using video games with a purpose. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 1294–1304. The Association for Computer Linguistics, 2014.

[48] X. Zhang, G. Cheng, and Y. Qu. Ontology summarization based on RDF sentence graph. In Proceedings of the 16th International Conference on World Wide Web, pages 707–716. ACM, 2007.

[49] L. Zhou, C.-Y. Lin, D. S. Munteanu, and E. Hovy. ParaEval: Using paraphrases to evaluate summaries automatically. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pages 447–454, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

