Ontology-Driven Relation Extraction by Pattern Discovery


A. Bellandi
Department of Computer Science
University of Pisa
Pisa, Italy
Email: bellandi@di.unipi.it

S. Nasoni, A. Tommasi, C. Zavattari
Metaware S.p.A.
Pisa, Italy
Email: {s.nasoni, a.tommasi, c.zavattari}@metaware.it

Abstract: In this paper we describe an ontology-driven system that performs relation extraction over textual data. The system exploits expert knowledge of the domain, including lexical resources, in the form of an ontology to drive the extraction of patterns from manually annotated texts. These patterns are then applied to identify candidates for relation extraction. Paired with basic, reliable named-entity-level text annotation, this results in the discovery of relations among entities in Italian newspaper articles. We describe the system and measure its performance.

Keywords: Ontology-driven chunker; Relation Extraction; Pattern discovery.

I. INTRODUCTION

Laundering the proceeds of unlawful activities is one of the most serious crimes in financial markets: the reinvestment of the proceeds of crime into lawful activities and the existence of operators and organisations supporting criminals deeply unsettle the market and its efficiency and weaken the economic environment. That is why public authorities, financial institutions, notaries, accountants and lawyers around the world face increasing regulatory pressure to meet anti-money laundering compliance requirements and to check accurately that any individual they are dealing with has not been, directly or indirectly, involved in any illegal activity. Furthermore, it is commonly acknowledged that certain situations present a greater risk of money laundering.

Our tool, called Redada¹ [1], represents an effective application of text analysis techniques to Italian news items in order to support law enforcement and intelligence activities against money laundering and corruption. Redada exploits newspaper articles to identify and locate people and businesses, obtain information about criminal deeds and judicial proceedings, and learn more about the complex relationships between them. It searches, uncovers, and verifies connections among thousands of publicly available Italian news items, and delivers actionable results that help to perform special due diligence and enhanced scrutiny of relationships (any immediate family member or close associate: spouses, children, parents, siblings and other relatives, as well as close business colleagues, personal advisors/consultants and anyone benefiting from closeness) with heightened-risk entities involved in any illegal activity and with domestic politically exposed persons. We presented Redada in [1] from the NLP point of view, in terms of both its architecture and performance results. Here we describe the prior knowledge Redada exploits: we present the general lines of our ontology construction approach, how the ontology drives the relation extraction process, and the semi-automatic creation of the patterns that specify both the syntactic and semantic structure of the sentence.

¹ You can try Redada out at http://www.redada.it after free registration.

The paper is organised as follows. Section 2 describes the domain we are addressing and the ontology creation process we adopted. Section 3 describes the relation extraction process driven by the ontology, with reference to the general Redada architecture, and the semi-automatic pattern generation process. Section 4 discusses the system performance. Finally, Section 5 draws conclusions and outlines future research. The examples given throughout are in English for clarity, although the system works on Italian texts.

II. APPLICATION DOMAIN AND PRIOR KNOWLEDGE

In this section, we describe the application domain of the Redada system and the structure of the knowledge it exploits. Then, we give a very brief survey of the OWL Web Ontology Language and describe our approach to ontology creation.

A. Domain description

The purpose of Redada is to automatically discover relationships between the different search terms or objects associated with money laundering and corruption. Clearly, it is important to know not only that a relationship exists, but also what the relationship between objects is. To achieve these goals, Redada uses a set of language-based technologies that mine text to identify concepts, names of people and companies, places and the relationships among them. In Redada, these objects include people, businesses, crimes, judicial events, job positions and personal associations. Exploring relationships among these different objects helps identify networks of activity, both legal and illegal. For example, if a person is associated with other persons or businesses that are known to be engaged in criminal conduct, then additional investigation of that individual may be warranted. These information discovery processes are supported by prior knowledge organized in formal and shareable knowledge repositories (i.e. ontologies) and terminological resources specifically developed for applications supporting anti-money laundering investigations in Italian news items. That knowledge has been developed by domain experts within the European Union co-funded integrated project MUSING [11] and specialized for the domain and the language at hand within the EU co-funded project Bracco [12]. The application-specific ontology development has collected a considerable set of concepts and relationships, covering crime types, personal relationship types, judicial events and job positions.

2010 Second International Conference on Information, Process, and Knowledge Management
978-0-7695-3956-0/10 $26.00 © 2010 IEEE
DOI 10.1109/eKNOW.2010.17

Figure 1: Example of well-defined model-theoretic semantics.

The crime ontology covers a wide range of predicate offences (7 main types; 39 sub-types, 3 of which are further divided; about 200 items), in line with the definition of serious crime in Council Framework Decision 2001/500/JHA of 26 June 2001 on money laundering. Crime types include: (a) acts as defined in Articles 1 to 4 of Framework Decision 2002/475/JHA; (b) any of the offences defined in Article 3(1)(a) of the 1988 United Nations Convention against Illicit Traffic in Narcotic Drugs and Psychotropic Substances; (c) the activities of criminal organisations as defined in Article 1 of Council Joint Action 98/733/JHA of 21 December 1998 on making it a criminal offence to participate in a criminal organisation in the Member States of the European Union; (d) fraud, at least serious, as defined in Article 1(1) and Article 2 of the Convention on the Protection of the European Communities' Financial Interests; (e) corruption; (f) all offences which are punishable by deprivation of liberty or a detention order for a maximum of more than one year or, as regards those States which have a minimum threshold for offences in their legal system, all offences punishable by deprivation of liberty or a detention order for a minimum of more than six months.

The job positions ontology is a five-tiered schema that combines the upper part of the ISTAT (Italian Institute for Statistics) framework with the classification scheme of executives adopted by the Italian Business Register; it currently features about 1000 items, reflecting influential positions and appointments ranging from political representatives (local or national) to academic professors, professionals and public officers, as well as entrepreneurs and business executives. At the fifth tier, each class is related, where applicable, to the superior position on which it depends, within the public or private institution to which both belong.

The personal relationships ontology accounts for 10 types of relationship (professional, charismatic, fiduciary, parental, paternal, maternal, brotherhood, sentimental, descendants, other), 19/22 sub-types and about 300 items.

The judicial events ontology encompasses about 150 items arranged in four top-level classes: precautionary measures (e.g., arrest); condemnations; judicial inquiries, searches and seizures; lawsuits.

B. Ontologies

The OWL Web Ontology Language is designed for use by applications that need to process the content of information instead of just presenting information to humans. OWL is a family of three ontology languages: OWL-Lite, OWL-DL, and OWL-Full. The first two languages can be considered syntactic variants of the SHIF(D) and SHOIN(D) description logics (DL), respectively, whereas the third language was designed to provide full compatibility with RDF(S). We focus mainly on the first two variants of OWL because OWL-Full has a nonstandard semantics that makes the language undecidable and therefore difficult to implement. OWL comes with several syntaxes, all of which are rather verbose. Hence, in this paper we use the standard DL syntax. For a full description of the DL syntax, please refer to [5]. The main building blocks of DL knowledge bases are concepts (or classes), representing sets of objects; roles (or properties), representing relationships between objects; and individuals, representing specific objects. OWL ontologies consist of two parts: intensional and extensional. The former, consisting of a TBox, contains knowledge about concepts (i.e. classes) and complex relations between them (i.e. roles). The latter, consisting of an ABox, contains knowledge about entities (i.e. individuals) and how they relate to the classes and roles from the intensional part. A knowledge base KB is just a TBox plus an ABox.

The semantics of OWL DL abides by DL standards [2], [3], [4]. An example is shown in Figure 1. An interpretation $I = (\Delta^I, \cdot^I)$ is a tuple where $\Delta^I$, the domain of discourse, is the union of two disjoint sets $\Delta^I_O$ (the object domain) and $\Delta^I_D$ (the data domain), and $\cdot^I$ is the interpretation function that gives meaning to the entities defined in the ontology. $I$ maps each OWL class $C$ to a subset $C^I \subseteq \Delta^I_O$, each object property $P_{obj}$ to a binary relation $P^I_{obj} \subseteq \Delta^I_O \times \Delta^I_O$, each datatype property $P_{data}$ to a binary relation $P^I_{data} \subseteq \Delta^I_O \times \Delta^I_D$, and $r$ is the union of the two disjoint sets $P_{obj}$ and $P_{data}$. The whole definition is in the OWL W3C Recommendation (http://www.w3.org/OWL/). In the following, according to the syntax defined in [4], we show an example describing a TBox fragment of the Redada ontology related to judicial events and crimes:

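$\mathit{accuse} \sqsubseteq \mathit{JudicialEvent}$ (1)

$\mathit{location} \sqsubseteq \mathit{Thing}$ (2)

$\mathit{againstTheState} \sqsubseteq \mathit{Crime}$ (3)

$\mathit{corruption} \sqsubseteq \mathit{Crime}$ (4)

$\mathit{corruptionagainstTheState} \sqsubseteq \mathit{corruption} \sqcap \mathit{againstTheState}$ (5)

$\mathit{accuse} \sqsubseteq \exists \mathit{withCharge}.\mathit{Crime}$ (6)

$\mathit{Crime} \sqsubseteq \exists \mathit{hasLocation}.\mathit{location}$ (7)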

(1), (2), (3), (4) are simple classes; (5) is a class specified in terms of others: it is defined as the intersection of class (3) and class (4). (6) and (7) are examples of object properties. The first binds the class accuse (which belongs to the judicial events taxonomy) to the crime taxonomy, and the second ranges over the location class, representing the place where a crime happens. From the linguistic knowledge point of view, the Redada ontology provides a set of terms for each class; we call them aliases of the relation. All the aliases constitute the ABox, which is interlinked with the intensional knowledge. An example of an ABox related to the previous TBox is the following:

ABox = {judicial corruption: corruptionagainstTheState, convicted of: accuse}
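The TBox and ABox fragments above can be pictured as plain lookup tables. The following is only a minimal sketch, not Redada's actual storage: the dictionary layout and the helper names (`superclasses`, `alias_is_a`, `inherited_properties`) are assumptions introduced here for illustration, and real OWL reasoning is considerably richer than this transitive closure.

```python
# Hypothetical in-memory encoding of the TBox/ABox fragment (illustration only).

# Axioms (1)-(5): class -> set of direct superclasses.
SUBCLASS_OF = {
    "accuse": {"JudicialEvent"},
    "location": {"Thing"},
    "againstTheState": {"Crime"},
    "corruption": {"Crime"},
    # Axiom (5): subclass of the intersection, hence of both operands.
    "corruptionagainstTheState": {"corruption", "againstTheState"},
}

# Axioms (6)-(7): class -> {object property: filler class}.
OBJECT_PROPERTIES = {
    "accuse": {"withCharge": "Crime"},
    "Crime": {"hasLocation": "location"},
}

# ABox: alias (surface expression) -> class.
ALIASES = {
    "judicial corruption": "corruptionagainstTheState",
    "convicted of": "accuse",
}


def superclasses(cls):
    """Return cls together with every class reachable through SUBCLASS_OF."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def alias_is_a(alias, cls):
    """True if the alias is linked to a class subsumed by cls."""
    target = ALIASES.get(alias.lower())
    return target is not None and cls in superclasses(target)


def inherited_properties(cls):
    """Collect the object properties declared on cls or on any ancestor."""
    props = {}
    for ancestor in superclasses(cls):
        props.update(OBJECT_PROPERTIES.get(ancestor, {}))
    return props
```

With this encoding, the alias "judicial corruption" resolves to a crime, and the class it names inherits the hasLocation slot declared on Crime.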

The Redada ontology is composed of 555 classes and includes over 2000 aliases. It is intended to capture the essential conceptual entities and relationships in the knowledge structure about anti-money laundering due diligence on companies and individuals, and the ontology-building approach has been mainly bottom-up. Logical definitions for concepts and relationships are established as above, and each ontology element can be justified from the application perspective by real-life cases and stories.

III. ONTOLOGY-DRIVEN RELATION EXTRACTION PROCESS

Redada is a fully implemented system exploiting prior knowledge and relying on the Freeling suite [6], [7], [8], [9], [10] for the NLP aspect; you can try Redada out at http://www.redada.it after free registration. Each component and its performance are discussed in [1]. Here we describe how the ontology drives the relation extraction process and which part of the system exploits it. For a general view of the system, refer to Figure 2:

• The analyser annotates documents on the basis of text analysis and of the recognition and identification of named entities;

• The Redada ontology drives the chunking process, by means of the expert knowledge coded therein;

• The chunker identifies text snippets that are candidates to represent relations;

• The classifier is applied to the candidates returned by the chunker in order to increase the precision of the extracted relations, filtering irrelevant chunks out;

• The RDF database provides storage for the extracted knowledge, enabling querying processes.

Figure 2: Redada system architecture.

Chunking consists of dividing sentences into non-overlapping phrases. It is often a useful tool when memorizing large amounts of information: by identifying specific blocks of text, information becomes easier to retain and recall. For example, consider the following sentence: Yesterday the Italian head of State took off to Madrid. We expect a noun phrase chunker to find the following spans of related tokens and group them into chunks, here called NP (part-of-speech tags are enclosed in square brackets):

S (

Yesterday[NN]

(NP (the[DT] Italian[NNP] head[NN] of[IN] State[NNP]))

took[VBD]

off[RP]

to[TO]

Madrid[NNP])
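The grouping above can be reproduced with a very small rule-based chunker. This is a sketch under a stated assumption: the single rule used here (a determiner, one or more nouns, optionally followed by a preposition and more nouns) is a hypothetical simplification chosen to fit this one sentence, not the chunker used in Redada.

```python
NOUN_TAGS = {"NN", "NNP"}


def chunk_noun_phrases(tagged):
    """Group tokens into NP chunks using one hypothetical rule:
    DT noun+ (IN noun+)?   Everything else stays outside any chunk."""
    result = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] == "DT":
            j = i + 1
            while j < n and tagged[j][1] in NOUN_TAGS:
                j += 1
            if j > i + 1:  # at least one noun follows the determiner
                # Optionally extend the chunk over "preposition + nouns".
                if j < n and tagged[j][1] == "IN":
                    k = j + 1
                    while k < n and tagged[k][1] in NOUN_TAGS:
                        k += 1
                    if k > j + 1:
                        j = k
                result.append(("NP", tagged[i:j]))
                i = j
                continue
        result.append(tagged[i])
        i += 1
    return result


SENTENCE = [
    ("Yesterday", "NN"), ("the", "DT"), ("Italian", "NNP"), ("head", "NN"),
    ("of", "IN"), ("State", "NNP"), ("took", "VBD"), ("off", "RP"),
    ("to", "TO"), ("Madrid", "NNP"),
]
chunks = chunk_noun_phrases(SENTENCE)
```

Requiring a leading determiner is what keeps "Yesterday" and "Madrid" outside the chunk here; a realistic noun phrase grammar would need more rules.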

In our system, chunking is used to identify text snippets conveying an instance of a relation of interest. For instance, in the sentence Luciano Moggi was convicted on his charges, after a lengthy court case, a good chunk spans from the beginning of the sentence up to the comma, since that is a short, focused text fragment expressing the conviction (one of the relations of interest) of a named entity (Luciano Moggi).


Figure 3: Small sketch of Redada ontology.

Chunkers often operate on annotated texts, and use the annotation to make chunking decisions. Part-of-speech (PoS) annotation, in our case, is not sufficient to determine a good chunking for a sentence. For example, the sentence John has been accused of the homicide of Jack in Rome requires, for a proper chunking, information about the relations of interest, namely the words that trigger the relation and the slots of the relation. Note that a sentence can involve more than one relationship. For example, the sentence John, manager of the Foo company, has been arrested for bankruptcy on June 25th, 2008 involves both a crime and a job position. The ontology provides a way to express reliable identification, with domain-specific coverage, of terms and of the kinds of relations belonging to the domain of interest. This way, we can represent both relations among concepts and flatter properties such as lexicon fragments relating to a concept (e.g., various linguistic expressions equivalent to "convicted"). Figure 3 shows a fragment of the Redada ontology related to the crime, judicial event, and job position hierarchies. A class of the judicial event hierarchy, specifically "accused", is bound to a particular set of classes of the crime hierarchy by means of the "withCharge" object property. The homicide class, for example, is modeled as a crime with three main slots: "killer", "victim", and "location", represented by the hasKiller, hasVictim and hasLocation object properties, respectively (and possibly hasDate). This knowledge fragment helps the chunker identify a semantically correct chunk structure with respect to the knowledge coded by the ontology. To find the chunk structure for a given sentence, the chunker begins with a flat structure in which no token belongs to any chunk. A set of chunking patterns, referencing the ontology, is applied in turn, successively updating the chunk structure. Each pattern is consistent with the ontological description of both the relation slots and the linguistic knowledge of the relations. The general chunking pattern capturing that example sentence is the following:

<PERSON> <*> <JEV> <CRM> <PERSON> <*> <LOCATION>

where <JEV> (judicial event) is, in the example, filled by the relation accused and <CRM> (crime) by the relation homicide. The pattern specifies both the syntactic and semantic structure of the sentence. Note that <CRM>, <JEV>, and each other element of the pattern refer to ontology classes. <*> indicates sentence words belonging to any ontology class. Please refer to the next subsection for details.

As you can see in Figure 3, the withCharge object property specifies what crimes a person can be accused of. From the linguistic knowledge point of view, the Redada ontology provides a set of terms, called aliases as described in Section 2, for each element of a pattern. Aliases for <JEV> = accused include charged with, convicted of; aliases for <CRM> = homicide include murder, killing, manslaughter. Finally, Redada chunks the sentence and recognizes that John and Jack are the killer and the victim, respectively, and that Rome is the city (location class) where the homicide happened. Redada uses patterns constructed in a semi-automatic way. The next subsection describes our approach for generating those patterns.
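To make the alias lookup and slot assignment concrete, here is a minimal sketch of how the example sentence could be matched against the pattern <PERSON> <*> <JEV> <CRM> <PERSON> <*> <LOCATION>. Several things are assumptions made for illustration and are not taken from the paper: named entities are supplied as a ready-made dictionary (standing in for the analyser), tokens that carry no ontology class are simply skipped, <*> is treated as optional filler, and the keys `judicialEvent` and `crime` are hypothetical names (hasKiller, hasVictim and hasLocation are the object properties named in the text).

```python
# Alias lexicon: surface expression -> pattern element (aliases quoted in the text).
ALIASES = {
    "accused": "JEV", "charged with": "JEV", "convicted of": "JEV",
    "homicide": "CRM", "murder": "CRM", "killing": "CRM", "manslaughter": "CRM",
}

PATTERN = ["PERSON", "*", "JEV", "CRM", "PERSON", "*", "LOCATION"]
# One role per non-wildcard pattern element, in order.
ROLES = ["hasKiller", "judicialEvent", "crime", "hasVictim", "hasLocation"]


def annotate(tokens, entities):
    """Label each token (or two-word alias) with an ontology class, or None."""
    labelled = []
    i = 0
    while i < len(tokens):
        for span in (2, 1):  # prefer two-word aliases such as "charged with"
            words = tokens[i:i + span]
            phrase = " ".join(words).lower()
            if len(words) == span and phrase in ALIASES:
                labelled.append((" ".join(words), ALIASES[phrase]))
                i += span
                break
        else:
            # No alias matched: fall back to the named-entity dictionary.
            labelled.append((tokens[i], entities.get(tokens[i])))
            i += 1
    return labelled


def extract(tokens, entities):
    """Return a role -> text mapping if the sentence fits PATTERN, else None."""
    labelled = [(text, cls) for text, cls in annotate(tokens, entities) if cls]
    core = [cls for cls in PATTERN if cls != "*"]
    if [cls for _, cls in labelled] != core:
        return None
    return dict(zip(ROLES, (text for text, _ in labelled)))


ENTITIES = {"John": "PERSON", "Jack": "PERSON", "Rome": "LOCATION"}
relation = extract(
    "John has been accused of the homicide of Jack in Rome".split(), ENTITIES
)
```

Because matching goes through the alias lexicon rather than the literal words, the same pattern also covers a paraphrase such as "John has been charged with the murder of Jack in Rome".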

A. Semi-automatic pattern generation process

The input sentences for generating patterns like the one presented in the previous section have the following form: John/Person, has/VBZ, been/VBN, accused/Judicial event, of/IN, the/DT, homicide/Crime, of/IN, Jack/Person, in/IN, Rome/Location.
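An annotated sentence in this form can be turned into the tag sequence used for pattern discovery with a few lines of parsing. The sketch below assumes the token/tag layout is exactly as printed above (comma-separated items, with the tag after the last slash); the function name `parse_annotated` is hypothetical.

```python
def parse_annotated(sentence):
    """Split 'token/tag, token/tag, ...' into a list of (token, tag) pairs."""
    pairs = []
    for item in sentence.rstrip(".").split(","):
        # rpartition keeps multi-word tags such as "Judicial event" intact.
        token, _, tag = item.strip().rpartition("/")
        pairs.append((token.strip(), tag.strip()))
    return pairs


ANNOTATED = (
    "John/Person, has/VBZ, been/VBN, accused/Judicial event, of/IN, "
    "the/DT, homicide/Crime, of/IN, Jack/Person, in/IN, Rome/Location."
)
pairs = parse_annotated(ANNOTATED)
tags = [tag for _, tag in pairs]
```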

Sentences (or statements) are annotated with both PoS tags and the kinds of relations represented by aliases in the ontology. The annotations of these sentences are the tags on which our pattern discovery process is based. The pattern discovery task looks for frequent patterns of highly discriminating sequential tags extracted from a training set composed of both positive and negative examples of tagged phrases. Patterns are extracted using an information gain approach. First, we introduce some notation through the following definitions.

Definition 1 (sequences and subsequences). Let $stat_i$ and $CHU_i^j$ be the $i$th sequence of tags and the $j$th subsequence of tags of the $i$th statement, respectively. Then, we denote the frequency of positive examples and negative examples referring to each subsequence $CHU_i^j$ as $Freq^+_{(i,j)}$ and $Freq^-_{(i,j)}$, respectively.

Definition 2 ($\subseteq_s$). Let $s_1$ and $s_2$ be sequences. $s_1 \subseteq_s s_2$ if and only if $s_1$ is a subsequence of $s_2$.

The pattern extraction process is the following. For each $CHU_i^j$, the subsequence of tags of every statement in the training set, we calculate $Freq^+_{(i,j)}$ and $Freq^-_{(i,j)}$, and we select the longest ones that best distinguish between the two sets. More formally, $CHU_i^j$ is highly discriminating if the following condition holds:

$$(Freq^+_{(i,j)} - Freq^-_{(i,j)}) > \varepsilon \;\wedge\; \big(\neg\exists w.\,(CHU_i^j \subset_s CHU_i^w) \wedge (Freq^+_{(i,w)} - Freq^-_{(i,w)}) > \varepsilon\big)$$

where $\varepsilon$ is a threshold determined experimentally.
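The selection condition can be illustrated with a short sketch. It rests on assumptions that the paper does not state: subsequences are taken to be contiguous and at least two tags long, the frequency of a subsequence is counted as the number of examples containing it, and the toy training data and the function names are invented for the illustration.

```python
from collections import Counter


def contiguous_subsequences(tags, min_len=2):
    """All distinct contiguous tag subsequences of length >= min_len."""
    n = len(tags)
    return {tuple(tags[a:b]) for a in range(n) for b in range(a + min_len, n + 1)}


def is_proper_subsequence(short, long):
    """True if short occurs contiguously inside the strictly longer long."""
    k = len(short)
    if k >= len(long):
        return False
    return any(long[a:a + k] == short for a in range(len(long) - k + 1))


def discriminating_patterns(positives, negatives, epsilon):
    """Keep subsequences whose Freq+ - Freq- exceeds epsilon and that have no
    longer subsequence of the same statement also exceeding epsilon."""
    freq_pos, freq_neg = Counter(), Counter()
    for tags in positives:
        freq_pos.update(contiguous_subsequences(tags))
    for tags in negatives:
        freq_neg.update(contiguous_subsequences(tags))

    selected = set()
    for tags in positives + negatives:  # every statement in the training set
        candidates = [c for c in contiguous_subsequences(tags)
                      if freq_pos[c] - freq_neg[c] > epsilon]
        for c in candidates:
            if not any(is_proper_subsequence(c, w) for w in candidates):
                selected.add(c)
    return selected


# Toy training set (invented): tag sequences of positive and negative examples.
POSITIVES = [
    ["PERSON", "JEV", "CRM", "PERSON", "LOCATION"],
    ["PERSON", "JEV", "CRM", "PERSON", "LOCATION"],
    ["DT", "PERSON", "JEV", "CRM", "PERSON"],
]
NEGATIVES = [
    ["PERSON", "LOCATION", "DT"],
    ["JEV", "DT", "CRM"],
]
patterns = discriminating_patterns(POSITIVES, NEGATIVES, epsilon=1)
```

On this toy data the full five-tag sequence is retained from the first two statements, while the third statement contributes only its longest sufficiently frequent part, since the variants starting with DT occur just once.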

An example of another very frequent kind of statement in our Italian news items, covered by such a pattern, is "John has been tried for the kidnapping of Mary in London".

IV. SYSTEM PERFORMANCE

Evaluating the performance of a complex system like Redada is no easy task. Ideally, one could compare the RDF graph returned by the system with a similar graph built manually. The problem is that building such a graph manually is very complex, and there would be little agreement even among humans. We have therefore favored a comparison against a baseline approach, outlining the gain in effectiveness. First and foremost, the precision of the overall system corresponds formally to the precision of its last step; precision has been measured against a test set at 80% for the Maximum Entropy classifier. This is very high for a system of this kind, but the picture would be incomplete without an assessment of the recall which, unfortunately, does not correspond to the recall of the last step in the pipeline. Lacking manual assessment over the input corpus, we can simulate an overly "loose" baseline system, one that outputs more than is expected (thus with high recall), against which we can draw an underestimate of our recall figure. The chunker patterns we have used are selective, but every pattern contains at least one relation alias. A system that took any sentence containing named entities and an alias as the expression of a relation would return a superset of our system's output, and have a large recall figure. A reference system like this returns 21010 relations over our corpus, or about 8.5 relations per article on average. This is also the kind of recall one would have when using text-based search based on the available alias lexicon (which is even more extensive than one would usually employ). The number of relations returned by Redada, at about 80% precision, is 8609. This would underestimate Redada's recall figure at over 40%, which we consider a good result, since:

• Redada outputs considerably more structured information, focusing on specific chunks of text and roles;

• the natural redundancy of the corpus (the same events are reported in many places) suggests privileging precision over recall;

• the figure is a gross underestimate of the actual recall value.
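As a worked check of the recall estimate, using only the two counts reported in this section, the ratio of the relations returned by Redada to those returned by the loose baseline is:

$$\frac{8609}{21010} \approx 0.41$$

which is consistent with the figure of over 40% stated above.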

The baseline reference system (which still exploits named entity recognition and the alias lexicon, and as such is not that basic) is bound to low precision by the fact that it is unable (without the chunker) to isolate subparts of sentences, and therefore involves wrong entities when many are present in the sentence. With respect to actual human performance, there are three elements to take into account:

• a human reader is not limited to a fixed lexicon, and can recognize relations or entities through periphrases, synonyms and pronouns;

• a human reader can use common knowledge, or even knowledge drawn from other texts in the corpus, to deduce implicit relations;

• however, a human reader, to do any of the above, would take considerably longer than the system does to process the same amount of data.

V. CONCLUSION

In this paper, we presented a fully implemented NLP tool for the Italian language called Redada, which uses prior knowledge coded in an ontology. We presented the application domain and described the ontology creation process. We described how the ontology drives the pattern-based relation extraction process, and how we generate those patterns in a semi-automatic way. Our first results look very promising, with about 80% precision for the relations extracted. Many improvements are planned, and research is to be conducted at all levels of the system. At the NLP level, we plan to improve named entity recognition by employing more supervised learning and relying less on gazetteers, especially for the company types. We also need to introduce the date type. We plan to add handling of pronouns in the coreference-resolution process, attaching pronouns to the entity they refer to, which should allow the system to discover even more relations. At the ontology level, we plan to exploit rules in order to infer new relations based on domain knowledge and the specific information at hand. This would allow the extraction of relations that are only implied by the text, rather than explicitly stated. Clearly, verifying the correctness of these relations is very challenging, since, unlike for explicitly stated relations, no immediate check would be available in the text snippet responsible for the extraction. Finally, in order to carry on these activities, we need to set up a stable evaluation environment against which to check for improvements and to validate the various components. Efforts will be directed towards formalizing a corpus of articles according to the ontology, so that evaluation can be conducted directly against human performance. The results will still be affected by the subjectivity of the task, but will constitute the best effort in the direction of an absolute performance evaluation.

ACKNOWLEDGMENT

Developments on relation extraction and the definition of relevant relations were supported by the European Commission ISEC Programme with the co-funded Bracco project (JLS/2007/ISEC/431) [12], and by the IST Programme with the co-funded integrated MUSING project (IST-27097) [11].

REFERENCES

[1] A. Bellandi, S. Nasoni, D. Tarini, A. Tommasi, and C. Zavattari. Redada: Mining Knowledge Out Of Italian Business News Items. In Proceedings of the 10th IASTED International Conference on Artificial Intelligence and Applications, 2010.

[2] I. Horrocks, P. F. Patel-Schneider, and F. Van Harmelen. From SHIQ and RDF to OWL: The Making of a Web Ontology Language. Journal of Web Semantics, 1(1):7-26, 2003.

[3] M. Klein, J. Broekstra, D. Fensel, F. Van Harmelen, and I. Horrocks. Ontologies and Schema Languages on the Web. In Dieter Fensel, James Hendler, Henry Lieberman, and Wolfgang Wahlster, editors, Spinning the Semantic Web: Bringing the World Wide Web to its Full Potential. MIT Press, 2003.

[4] I. Horrocks. Implementation and Optimisation Techniques. In Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors, The Description Logic Handbook: Theory, Implementation, and Applications, chapter 9, pages 306-346. Cambridge University Press, 2003.

[5] F. Baader, I. Horrocks, and U. Sattler. Description Logics as Ontology Languages for the Semantic Web. In Dieter Hutter and Werner Stephan, editors, Mechanizing Mathematical Reasoning: Essays in Honor of Jörg Siekmann on the Occasion of His 60th Birthday, number 2605 in Lecture Notes in Artificial Intelligence, pages 228-248. Springer, 2005.

[6] J. Atserias, J. Carmona, I. Castellón, S. Cervell, M. Civit, L. Màrquez, A. Martí, L. Padró, R. Placer, H. Rodríguez, M. Taulé, and J. Turmo. Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In Proceedings of the 1st International Conference on Language Resources and Evaluation, pages 1267-1274, Granada, Spain, 1998.

[7] J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró. Freeling 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 2006.

[8] J. Carmona, S. Cervell, L. Màrquez, A. Martí, L. Padró, R. Placer, H. Rodríguez, M. Taulé, and J. Turmo. An Environment for Morphosyntactic Processing of Unrestricted Spanish Text. In Proceedings of the 1st International Conference on Language Resources and Evaluation, pages 915-922, Granada, Spain, 1998.

[9] X. Carreras and L. Padró. A Flexible Distributed Architecture for Natural Language Analyzers. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Canary Islands, Spain, 2002.

[10] X. Carreras, I. Chao, L. Padró, and M. Padró. Freeling: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, 2004.

[11] MUSING Project website: http://www.musing.eu.

[12] Bracco Project website: http://bracco.metaware.it.
