
Semantic Relations Between Nominals

Morgan & Claypool Publishers

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
www.morganclaypool.com

Series Editor: Graeme Hirst, University of Toronto

About SYNTHESIS

This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com


ISBN: 978-1-60845-979-7


Series ISSN: 1947-4040

Semantic Relations Between Nominals

Vivi Nastase, FBK, Trento
Preslav Nakov, QCRI, Qatar Foundation
Diarmuid Ó Séaghdha, Computer Laboratory, University of Cambridge
Stan Szpakowicz, EECS, University of Ottawa; ICS, Polish Academy of Sciences



Synthesis Lectures on Human Language Technologies

Editor
Graeme Hirst, University of Toronto

Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is on important new techniques, on new applications, and on topics that combine two or more HLT subfields.

Semantic Relations Between Nominals
Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz
2013

Computational Modeling of Narrative
Inderjeet Mani
2012

Natural Language Processing for Historical Texts
Michael Piotrowski
2012

Sentiment Analysis and Opinion Mining
Bing Liu
2012

Discourse Processing
Manfred Stede
2011

Bitext Alignment
Jörg Tiedemann
2011

Linguistic Structure Prediction
Noah A. Smith
2011

Learning to Rank for Information Retrieval and Natural Language Processing
Hang Li
2011

Computational Modeling of Human Language Acquisition
Afra Alishahi
2010

Introduction to Arabic Natural Language Processing
Nizar Y. Habash
2010

Cross-Language Information Retrieval
Jian-Yun Nie
2010

Automated Grammatical Error Detection for Language Learners
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault
2010

Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
2010

Semantic Role Labeling
Martha Palmer, Daniel Gildea, and Nianwen Xue
2010

Spoken Dialogue Systems
Kristiina Jokinen and Michael McTear
2009

Introduction to Chinese Natural Language Processing
Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang
2009

Introduction to Linguistic Annotation and Text Analytics
Graham Wilcock
2009

Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009

Statistical Language Models for Information Retrieval
ChengXiang Zhai
2008

Copyright © 2013 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

Semantic Relations Between Nominals

Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz

www.morganclaypool.com

ISBN: 9781608459797 (paperback)
ISBN: 9781608459803 (ebook)

DOI 10.2200/S00489ED1V01Y201303HLT019

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES

Lecture #19
Series Editor: Graeme Hirst, University of Toronto

Series ISSN
Synthesis Lectures on Human Language Technologies
Print 1947-4040   Electronic 1947-4059

Semantic Relations Between Nominals

Vivi Nastase
FBK, Trento

Preslav Nakov
QCRI, Qatar Foundation

Diarmuid Ó Séaghdha
Computer Laboratory, University of Cambridge

Stan Szpakowicz
EECS, University of Ottawa
ICS, Polish Academy of Sciences

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #19

Morgan & Claypool Publishers


ABSTRACT

People make sense of a text by identifying the semantic relations which connect the entities or concepts described by that text. A system which aspires to human-like performance must also be equipped to identify, and learn from, semantic relations in the texts it processes. Understanding even a simple sentence such as “Opportunity and Curiosity find similar rocks on Mars” requires recognizing relations (rocks are located on Mars, signalled by the word on) and drawing on already known relations (Opportunity and Curiosity are instances of the class of Mars rovers). A language-understanding system should be able to find such relations in documents and progressively build a knowledge base or even an ontology. Resources of this kind assist continuous learning and other advanced language-processing tasks such as text summarization, question answering and machine translation.

The book discusses the recognition in text of semantic relations which capture interactions between base noun phrases. After a brief historical background, we introduce a range of relation inventories of varying granularity, which have been proposed by computational linguists. There is also variation in the scale at which systems operate, from snippets all the way to the whole Web, and in the techniques of recognizing relations in texts, from full supervision through weak or distant supervision to self-supervised or completely unsupervised methods. A discussion of supervised learning covers available datasets, feature sets which describe relation instances, and successful algorithms. An overview of weakly supervised and unsupervised learning zooms in on the acquisition of relations from large corpora with hardly any annotated data. We show how bootstrapping from seed examples or patterns scales up to very large text collections on the Web. We also present machine learning techniques in which data redundancy and variability lead to fast and reliable relation extraction.

KEYWORDS

natural language processing, computational linguistics, lexical semantics, semantic relations, nominals, noun compounds, information extraction

Contents

Preface . . . . . . . . . . xi

1  Introduction . . . . . . . . . . 1
   1.1  Motivation . . . . . . . . . . 1
   1.2  Applications . . . . . . . . . . 3
   1.3  What this book is about . . . . . . . . . . 4
   1.4  What this book is not about . . . . . . . . . . 5

2  Relations between Nominals, Relations between Concepts . . . . . . . . . . 7
   2.1  Historical Overview . . . . . . . . . . 7
   2.2  A Menagerie of Relation Schemata . . . . . . . . . . 10
        2.2.1  Relations between words, relations within noun compounds . . . 11
        2.2.2  Relations in ontologies and knowledge bases . . . 18
        2.2.3  A conclusion . . . 20
   2.3  Dimensions of Variation Across Relations . . . . . . . . . . 21
        2.3.1  Properties of Relations . . . 21
        2.3.2  Properties of relation schemata . . . 23
   2.4  Summary . . . . . . . . . . 24

3  Extracting Semantic Relations with Supervision . . . . . . . . . . 25
   3.1  Data . . . . . . . . . . 25
        3.1.1  MUC and ACE . . . 26
        3.1.2  SemEval-2007/2010 . . . 27
        3.1.3  Datasets for relations in noun-noun compounds . . . 30
        3.1.4  Relations in manually built ontologies . . . 33
        3.1.5  Relations from collaboratively built resources . . . 34
        3.1.6  Domain-specific data . . . 36
   3.2  Features . . . . . . . . . . 37
        3.2.1  Entity features . . . 39
        3.2.2  Relational features . . . 41
   3.3  Learning methods . . . . . . . . . . 43
        3.3.1  Supervised machine learning . . . 44
        3.3.2  Learning algorithms . . . 45
        3.3.3  Determining the semantic class of relation arguments . . . 53
   3.4  Beyond binary relations . . . . . . . . . . 54
   3.5  Practical considerations . . . . . . . . . . 55
   3.6  Summary . . . . . . . . . . 56

4  Extracting Semantic Relations with Little or No Supervision . . . . . . . . . . 59
   4.1  Mining ontologies from machine-readable dictionaries . . . . . . . . . . 60
   4.2  Mining relations with patterns . . . . . . . . . . 62
        4.2.1  Bootstrapping relations from large corpora . . . 63
        4.2.2  Tackling semantic drift . . . 66
   4.3  Distant supervision . . . . . . . . . . 69
   4.4  Unsupervised relation extraction . . . . . . . . . . 71
        4.4.1  Extracting IS-A relations . . . 71
        4.4.2  Emergent relations in open relation extraction . . . 74
        4.4.3  Extreme unsupervised relation extraction . . . 77
   4.5  Self-supervised relation extraction . . . . . . . . . . 79
   4.6  Web-scale relation extraction . . . . . . . . . . 79
        4.6.1  Never-Ending Language Learner . . . 80
        4.6.2  Machine Reading at the University of Washington . . . 80
   4.7  Summary . . . . . . . . . . 81

5  Conclusion . . . . . . . . . . 83

Bibliography . . . . . . . . . . 85

Authors’ Biographies . . . . . . . . . . 103

Index . . . . . . . . . . 105


Preface

Relations and texts

Every non-trivial text describes interactions and relations: between people, between other entities or concepts, between events. What we know about the world consists in large part of similar relations between concepts representing people, other entities, events, and so on. Such knowledge contributes to the understanding of relations which occur in texts. Newly found relations can in turn become part of the knowledge we store.

For an automatic system to grasp a text’s semantic content, it must be able to recognize and reason about relations in texts, possibly by applying and updating previously acquired knowledge. We focus here in particular on semantic relations which describe the interactions among nouns and compact noun phrases, and we present such relations from both a theoretical and a practical perspective. The theoretical exploration shows the historical path which brings us to the current interpretation and view of semantic relations, and the wide range of proposals of relation inventories; such inventories vary according to domain, granularity and suitability for downstream applications.

On the practical side, we investigate the recognition and acquisition of relations from texts. We look at supervised learning methods. We present available datasets, discuss the variety of features which can describe relation instances, and learning algorithms used successfully thus far. The overview of weakly supervised and unsupervised learning looks in detail at problems and solutions related to the acquisition of relations from large corpora with little or no previously annotated data. We show how enduring the bootstrapping algorithm based on seed examples or patterns has proved to be, and how it has been adapted to tackle Web-scale text collections. We also present a few machine learning techniques which can take advantage of data redundancy and variability for fast and reliable relation extraction.

Semantic relations play a fundamental role in ontology-based learning and information extraction from documents. They can also provide valuable information for higher-level language-processing tasks, including summarization, question answering and machine translation.

The audience

This book will appeal to graduate students, researchers and practitioners interested in computational semantics, information extraction and, more generally, modern natural language processing technology. We have tried to make the presentation broadly accessible to anyone with a little background in artificial intelligence. Even so, it helps to have some familiarity with computational linguistics and a modicum of tolerance for mathematical formulae. A basic understanding of machine learning is useful but not strictly necessary.

Organization of the book

A brief chapter 1 explains why we found it worthwhile to write a book about recognizing semantic relations between nominals, what applications those recognized relations can facilitate, what the book does and what it does not discuss.

Chapter 2 offers a brief overview of how relations between nominals have emerged from combined sources—people’s propensity for categorization and the organization of knowledge, and for expressing these (and more) in language. It then continues with examples of lists of relations, and a discussion of the different dimensions along which relations can vary. These differences are reflected in how relation analysis is performed, how relations behave in a corpus, and what kind of evidence or features we can use to describe them.

Chapters 3 and 4 contain an overview of the various methods of obtaining or learning relations between nominals. Among the many ways in which this topic can be structured, we chose to organize our survey according to the data support. Chapter 3 is focused on methods of learning relations which come from (comparatively) small, predefined datasets. Chapter 4 describes methods which work on large collections of unstructured texts or directly on the Web.

A very brief chapter 5 recaps our main observations, draws conclusions about the work on semantic relations between nominals covered in this book, and sums up the lessons learned.

Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz
April 2013

CHAPTER 1

Introduction

1.1 MOTIVATION

The connection is indispensable to the expression of thought. Without the connection, we would not be able to express any continuous thought, and we could only list a succession of images and ideas isolated from each other and without any link between them [Tesnière, 1959].

The connection is indispensable. Every non-trivial text describes a group of entities and the ways in which they interact or interrelate. Identifying these entities and the relations between them is a fundamental step in understanding the text. It is a step which human language users perform rapidly and reliably, using their knowledge of the language and their existing world knowledge about entities and relations. If natural language processing∗ systems are to reach the goal of producing meaningful representations of text, they too must attain the ability to detect entities and extract the relations which hold between them. Furthermore, a language understanding system must use and then update existing repositories of knowledge about entities and the way they interact if it is to adapt to new information just as humans do.

Entity recognition (identifying which tokens correspond to entity mentions) and resolution (identifying the real-world entities or entity classes mentioned) are well-studied problems in NLP, with a voluminous associated literature. In this book, we will generally make the simplifying assumption that these steps have already been completed before the relation-processing stage—our main focus here—begins.

When human readers interpret the relational content of a text, they draw on a spectrum of knowledge acquired from past experience, and on explicit and implicit signals in the text itself. Consider an example:

NASA flew its final three space shuttle missions—one per orbiter remaining in the fleet—earlier this year.

Discovery, one of the space shuttles in NASA’s fleet, bound next year for the Smithsonian’s Steven F. Udvar-Hazy Center in northern Virginia, was retired first in March. Endeavour landed June 1 and is now being prepared for display at the California Science Center in Los Angeles.

∗abbreviated to NLP throughout this book

Atlantis flew the 135th and final shuttle mission, STS-135, last month. It will be exhibited near where it and all the other shuttles launched and most landed, at the Kennedy Space Center Visitor Complex in Florida.1

The first entity mentioned in this text, NASA, must be recognized as referring to the US space agency and not the National Auto Sports Association or the Nasa people of Colombia. As we said, we assume that this can be identified before we have attempted to extract relational content. This recognition may trigger associations with a number of entities, such as space, space shuttle, orbiter, Kennedy Space Center, Discovery, Atlantis, some of which may appear in the text. Such associations—and the specific relations between the trigger concept and the triggered one—may help interpret the text.

The second entity mention refers to the final three space shuttle missions. Space shuttle mission is a compound of three nouns. It can only be fully interpreted by unpacking the semantic relations which hold between the concepts referred to as space, shuttle and mission—we already need relational processing! Roughly stated, space shuttle mission denotes a mission fulfilled (performed? aided?) by a space shuttle. Even if the term space shuttle mission is unfamiliar, we can understand its meaning if we know enough about what space shuttles and missions are, and how they usually interact. Next question: what is a space shuttle? This term may be familiar, or can be looked up in a dictionary. While it is a relatively opaque compound, an unfamiliar reader can make an informed guess that such a thing moves around in space, if only she knows other uses of the word shuttle and sees the context. The parenthetical comment one per orbiter remaining in the fleet can only be interpreted as implying that a space shuttle is a kind of orbiter.

The second sentence explicitly states that the entity referred to as Discovery is an instance of space shuttle. It does so by means of the construction "X, one of Y". This information is very useful if we do not already know that Discovery is a space shuttle. And so on; a full explanation of the relational content of a text is typically much longer than the text itself.
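As an aside from us, not a method described in the book: the "X, one of Y" cue is simple enough to capture with a regular expression. The pattern and function names below are an invented toy sketch, not a robust extractor.

```python
import re

# Toy pattern for appositions like "Discovery, one of the space shuttles ...":
# X is a capitalized token; Y is a short lower-case class phrase after "one of".
IS_A_PATTERN = re.compile(
    r"\b([A-Z][\w-]*),\s+one of (?:the\s+)?([a-z][\w-]*(?:\s+[a-z][\w-]*)?)"
)

def extract_instance_of(text):
    """Return candidate (instance, class) pairs signalled by 'X, one of Y'."""
    return [(m.group(1), m.group(2)) for m in IS_A_PATTERN.finditer(text)]

print(extract_instance_of(
    "Discovery, one of the space shuttles in NASA's fleet, was retired first."
))  # → [('Discovery', 'space shuttles')]
```

A real system would of course need many such patterns plus filtering, since appositions are only one of the ways an IS-A relation surfaces in text.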

Like a human reader, an NLP system must exploit both pre-acquired knowledge about the world and new information given by the text. The former resource is often construed as a knowledge base, and its content may be fixed or may be dynamically updated as more texts are read. Thus, there is a clear interaction between the tasks of knowledge acquisition and text understanding. Interest in modeling these two tasks can be traced back to classical antiquity. It has been the object of computational research throughout the history of Artificial Intelligence (AI), even though the two tasks have often been treated as separate problems. The tight interaction between them has come into clear focus recently, as large-scale automated systems have been deployed to help construct knowledge bases by means of text understanding, and to improve text understanding by exploiting the knowledge bases thus constructed. This book surveys the research landscape and the state of the art in these often intertwined tasks. It also shows how people have tried to design systems which take on these tasks. Many questions arise when one deals with relations between entities. Here are some questions worth considering:

1 www.space.com/12804-nasa-space-shuttle-program-officially-ends.html


• Can one design a parsimonious and general representation of the semantic relations which appear in text?

• What linguistic signals in text can help identify semantic relations?

• Can background knowledge enhance the understanding of relations in text?

• Can understanding of relations help acquire new world knowledge and distinguish relational information of lasting value (Discovery is a space shuttle) from contingent information with a one-off benefit (Endeavour landed on June 1)?

1.2 APPLICATIONS

Swanson [1987] has demonstrated the potential of mining relations from text. By combining relations extracted from various biomedical articles, he discovered previously unknown, but ultimately important, connections between magnesium and migraines. The biomedical literature is growing at a double-exponential pace [Hunter and Cohen, 2006], so it is impossible for a biomedical researcher to keep track of everything that is being published. Yet, there are billions of dollars to be saved from finding out what costly experiments have been done already and what their outcomes were. Automatic relation extraction from text simply has no practical alternatives in the biomedical domain.

The ability to identify semantic relations in text can be useful in many NLP tasks. Let us name some:

• Information Extraction,
• Information Retrieval,
• Text Summarization,
• Machine Translation,
• Question Answering,
• Paraphrasing,
• Recognizing Textual Entailment,
• Thesaurus Construction,
• Semantic Network Construction,
• Word-Sense Disambiguation,
• Language Modelling.

Many of the techniques which we will discuss in this book are already quite mature, and have been used in some of these NLP applications. For example, Nakov’s [2008a] work helps improve Statistical Machine Translation. Suitable paraphrases make explicit the hidden relations between the nouns in a noun compound, and this enables the recognition of different variants of the same phrase. Suppose that the phrase oil price hikes is interpreted as hikes in oil prices and hikes in the prices of oil. If a machine translation system knows that oil price hikes can be translated into Spanish as alzas en los precios del petróleo,2 then it can also use the same fluent Spanish translation for the two additional paraphrases.
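The reuse of one fluent translation across paraphrases can be pictured as a lookup keyed on a normalized form of the compound. The sketch below is entirely ours: the rewrite rules and the one-entry translation table are invented for illustration and tuned to this single example.

```python
# Invented one-entry translation memory, keyed on the compound form.
TRANSLATIONS = {"oil price hikes": "alzas en los precios del petróleo"}

def normalize(paraphrase: str) -> str:
    """Undo 'H in M' / 'H in the M1 of M2' paraphrases back to 'M H'
    compound order (illustrative rules only, not general English)."""
    head, _, mod = paraphrase.partition(" in ")
    if not mod:                       # already in compound order
        return paraphrase
    mod = mod.removeprefix("the ")
    if " of " in mod:                 # 'prices of oil' -> 'oil prices'
        a, b = mod.split(" of ")
        mod = f"{b} {a}"
    return f"{mod} {head}".replace("prices", "price")

def translate(phrase: str):
    """Reuse the single stored translation for any recognized paraphrase."""
    return TRANSLATIONS.get(normalize(phrase))

print(translate("hikes in the prices of oil"))  # → alzas en los precios del petróleo
```

The point of the sketch is the design, not the rules: once every paraphrase maps to one canonical key, a single human-quality translation covers the whole paraphrase family.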

Information retrieval and question answering offer another example. Suppose that we want to ask a search engine what causes cancer. Many causes are possible, so one might want to pose a query like this: “list all X such that X causes cancer”. This problem can be seen as a special kind of a search, called relational search [Cafarella et al., 2006]. It asks for a list of things in a given relation with a given entity. Here are more examples of such queries:

• list all X such that X is part of an automobile engine
• list all X such that X is material for making a submarine’s hull
• list all X such that X is a type of transportation
• list all X such that X is produced from cork trees
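Over a store of (subject, relation, object) triples, such queries reduce to a simple filter. The miniature store below is our own illustration (the facts and relation names are invented), not an actual relational-search system.

```python
# Miniature triple store with a few illustrative facts.
TRIPLES = [
    ("piston", "part-of", "automobile engine"),
    ("crankshaft", "part-of", "automobile engine"),
    ("steel", "material-for", "submarine hull"),
    ("smoking", "causes", "cancer"),
]

def list_all_x(relation: str, entity: str) -> list[str]:
    """Answer 'list all X such that X <relation> <entity>' against the store."""
    return [s for s, r, o in TRIPLES if r == relation and o == entity]

print(list_all_x("part-of", "automobile engine"))  # → ['piston', 'crankshaft']
```

The hard part of relational search is, of course, populating such a store from text at Web scale, which is what the extraction methods surveyed later address.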

Naturally, these two examples are just the tip of an iceberg. There are many more exciting new applications to come.

1.3 WHAT THIS BOOK IS ABOUT

This book talks about relations between entities mentioned in the same sentence, and expressed linguistically as nominals. Relations are the connections we perceive among concepts or entities. A connection may come from general knowledge about the world (Chocolate is-a-kind-of food), or from a text fragment (Chocolate is a psychoactive food).3 When one talks casually about a relation, one refers either to its type, such as meronymy or hyponymy, or to its instance with arguments accompanying the relation name, such as “chocolate contains caffeine.” Throughout the book, we write simply “relation” if the context makes it clear which of the two usages is intended, otherwise we write “relation type” or “relation instance.”
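The type/instance distinction can be made concrete with a small data structure; the class and field names below are our own invention, offered only as a sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInstance:
    """A relation type plus the two nominal arguments it holds between."""
    rel_type: str   # the relation type, e.g. "contains" or "is-a-kind-of"
    arg1: str       # first nominal argument
    arg2: str       # second nominal argument

fact = RelationInstance("contains", "chocolate", "caffeine")
print(fact.rel_type)  # → contains
```

In this view, "contains" alone names a relation type, while the populated `fact` object is one relation instance of that type.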

The term nominal usually refers to a phrase which behaves syntactically like a noun or a noun phrase [Quirk et al., 1985, p. 335]. For our book, we have adopted a narrower definition. A nominal can be a common noun (chocolate, food), a proper noun (Godiva, Belgium), a multi-word proper name (United Nations), a deverbal noun (cultivation, roasting), a deadjectival noun ([the] rich), a base noun phrase built of a head noun with optional premodifiers (processed food, delicious milk chocolate), and (recursively) a sequence of nominals (cacao tree, cacao tree growing conditions).4

The relations themselves can be signalled by a phrase which links the entity mentions in a sentence (Chocolate is a raw or processed food produced from the seed of the tropical Theobroma cacao tree.), or they can be only implied, e.g., when the entity mentions are compressed into a noun compound (cacao tree, cacao tree growing conditions).

2 This is hard to generate word-for-word for the English input because of the need to also generate the connecting prepositions and determiners en los and del.
3 www.chocolate.org
4 We use italics to mark special terms and relations, typewriter font for words and patterns from texts, small caps for concepts and entities, and sans-serif for examples.

Superficially, it seems easier to learn or detect relations when some linguistic clues exist than when the relation is only implied by the adjoining of terms. We will see, however, that, in order to rely on the linguistic expression of relations in texts, one must deal with ambiguity. For example, the word in may indicate a temporal relation (chocolate in the 20th century) or a spatial relation (chocolate in Belgium). Another difficulty is over-specification. Consider, for example, the ornate relation between chocolate and cultures in Chocolate was prized as a health food and a divine gift by the Mayan and Aztec cultures. When there are no surface indicators, the clues about the type of relation will come from knowledge about the entities (cacao tree: cacao are seeds produced by a tree).

A special situation arises when an entity is actually an occurrence (event, activity, state) expressed by a deverbal noun such as cultivation. Relations between a deverbal noun and its modifiers mirror the relations between the underlying verb and its arguments. For example, in the clause the ancient Mayans cultivated chocolate, chocolate is the theme. So, one can also discern a theme in chocolate cultivation. We do not single out such relations for separate discussion, because the methods do not differ significantly from what is required to deal with any other relations and any other types of nominals.5

1.4 WHAT THIS BOOK IS NOT ABOUT

As we said earlier, we will not focus on the issue of entity identification or disambiguation. We make the simplifying assumption that entity identification/disambiguation and relation extraction are separate tasks: a typical pipeline system will first identify, and possibly disambiguate, the entities of interest, and only then will it move on to identifying semantic relations. In evaluation settings such as the ACE or SemEval relation classification tasks, "gold" entity annotations are usually provided as part of the dataset. In some cases, these annotations may also be linked to WordNet or to some other semantic network, mimicking the output of a sense disambiguation step.

Apart from relations within a noun compound and between entities within a sentence, there are other relations between nominals in a text, such as discourse relations. Coreference relations in particular are not included in this review. They usually cross sentence boundaries, their arguments may be complex noun phrases (the girl next door) or pronouns, or they may not be explicitly expressed if ellipsis is at work. In the text below, Angela Merkel and German chancellor are co-referents, and the elided noun meeting is marked with square brackets.

Angela Merkel’s spokesman has insisted that the German chancellor’s first meeting withFrançois Hollande, France’s president-elect, will be a “getting to know you” exercise, andnot “decision making” [meeting].

The book concentrates on relations between nominals: what they are and how we can identify or recognize them. Existing ontologies, knowledge bases and other repositories are not of direct interest. Resources of that kind do provide seed examples or training data for learning or for developing other techniques of extracting relation instances; in such cases, we will name them and discuss their relevant properties. Building such resources may also require interesting methods or techniques; we will then single them out, and present them insofar as they are relevant to semantic relations between nominals.

5 Even so, nominalization has been treated differently in some linguistic theories [Levi, 1978], and in some computational linguistics work [Lapata, 2002].

C H A P T E R 2

Relations between Nominals, Relations between Concepts

Semantic relations describe interactions. Depending on the level at which an interaction is perceived, a semantic relation may connect nominals in a text or concepts in knowledge representation. It seems artificial to separate relations just because of this distinction, especially since the knowledge is ultimately expressed by words, and entities in particular are expressed by nominals. Indeed, we are at a stage in NLP where we exploit large amounts of textual data to identify pairs of entities which interact, in the end gathering and formalizing this information to build large-scale knowledge repositories. Early work on knowledge acquisition relied on contemplating the nature of the world and the objects within it, and structuring this knowledge by certain organizational principles. Work on the study of language and the way it conveys meaning evolved separately until the 20th century. In this chapter, we will briefly review these two threads, and the types of semantic relations each has emphasized. We will then take a closer look at the distinguishing characteristics of semantic relations.

2.1 HISTORICAL OVERVIEW

Philosophical attempts to capture and describe our knowledge about the world go back to ancient times. Aristotle's Organon, a collection assembled posthumously by his students, includes a treatise on Categories which presents criteria for organizing objects [Studtmann, 2008]. This endeavor must inevitably deal with language, given that world knowledge and language are intertwined. Objects in the natural world are put into categories called τὰ λεγόμενα (ta legomena, things which are said) and their organization is based on the class inclusion relation.

For two millennia following Aristotle, contemplation on the nature of knowledge and on the principles of knowledge organization was the domain of philosophers—and an occasional botanist or zoologist who would put living things into a taxonomy. Let us fast-forward over centuries of hot philosophical debate, colored by changing ideas about what concepts are and how they relate to the real world [Margolis and Laurence, 1999]. In the 1970s came a realization that a robust Artificial Intelligence (AI) system needs the same kind of knowledge which humans have. This revelation spurred a concerted effort to capture and represent knowledge in a machine-friendly format, and at that point an intersection with language became inevitable.

We now return to Antiquity to pick up the language analysis thread and follow it to this intersection point. During the second half of the first millennium BCE, scholars of the Indian linguistic tradition (vyākaraṇa) developed a highly refined theory of language, covering what we would now describe as morphology, syntax, and semantics. The seminal document of this tradition is the Aṣṭādhyāyī by Pāṇini, an eight-volume collection of aphoristic rules which describe the process of generating a Sanskrit sentence from what we would call a semantic representation. The latter is primarily conceptualized in terms of kārakas, semantic relations between events and participants, similar to semantic roles. The Aṣṭādhyāyī covers noun-noun compounds comprehensively from the perspective of word formation, but only refers in passing to noun-noun semantics. Subsequent commentators such as Kātyāyana and Patañjali expand on these semantic issues, pointing out for example that compounding is only supported by the presence of a semantic relation between entities [Joshi, 1968].

Much closer to the modern day, early in the 20th century, Ferdinand de Saussure proposed his hugely influential Course in General Linguistics [de Saussure, 1959]. He distinguished between syntagmatic and associative relations, which "correspond to two different forms of mental activity, both indispensable to the workings of language." Syntagmatic relations hold between two or more terms in a sequence in praesentia, in a particular context: "words as used in discourse, strung together one after the other, enter into relations based on the linear character of languages—words must be arranged consecutively in spoken sequence. Combinations based on sequentiality may be called syntagmas." Associative (paradigmatic) relations, on the other hand, come from accumulated experience and hold in absentia: "Outside the context of discourse, words having something in common are associated together in the memory. […] All these words have something or other linking them. This kind of connection is not based on linear sequence. It is a connection in the brain. Such connections are part of that accumulated store which is the form the language takes in an individual's brain." Word associations can be morphological, phonological, grammatical, or semantic. Syntagmatic and associative relations interact in the understanding of text: word associations are summoned for the interpretation of a syntagma. Without the associations, a syntagma would have no meaning of its own. Interestingly, the Course did not propose any list of relations. Harris [1987] observed that frequently occurring instances of syntagmatic relations may become part of our memory, thus becoming paradigmatic. This parallels Gardin [1965], who proposed that instances of paradigmatic relations be derived from accumulated syntagmatic data. Coincidentally, this reflects current thinking on relation extraction from open texts.

Another important development in philosophical thought was the rise of formal semantics at the end of the 19th century. Starting with the work of Gottlob Frege [1879], predicate logic and its extensions [Gamut, 1991] have been the standard analytical toolkit for philosophers of language. Predicate logic is an inherently relational formalism. Predicates take one or more arguments; when modeling language, those which take multiple arguments are typically used to encode semantic relations. A simple logical representation of the sentence Google buys YouTube. might look like this:

buy(Google, YouTube)

Such a representation is still the one most commonly used in presentations of computational models. In an alternative representation, sometimes called neo-Davidsonian after the philosopher Donald Davidson, additional variables represent the event or relation as something that can be explicitly modified and subject to quantification, for example:

∃e InstanceOfBuying(e) ∧ agent(e, Google) ∧ patient(e, YouTube)

or perhaps

∃e InstanceOf(e, Buying) ∧ agent(e, Google) ∧ patient(e, YouTube)
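The practical difference between the two styles can be sketched with plain data structures (an illustrative sketch of ours, not a formalism from the book): reifying the event makes it trivial to attach further modifiers without changing any predicate's arity.

```python
# Flat predicate: buy(Google, YouTube) -- the relation has a fixed arity.
flat = ("buy", "Google", "YouTube")

# Neo-Davidsonian: the event e is reified as an object carrying its roles.
event = {
    "instance_of": "Buying",
    "agent": "Google",
    "patient": "YouTube",
}

# A new modifier is just another role attached to the same event variable;
# the flat tuple, by contrast, would need a new predicate or a wider arity.
event["time"] = "2006"

print(sorted(event))
```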

One of the great minds people could turn to for inspiration for representing knowledge and relations was Charles Sanders Peirce. Among his accomplishments are the existential graphs, which rely on the notion of an existential relation R [Peirce, 1909]: "anything that is R to x (where x is some particular kind of object) is nonexistent in case x is nonexistent. Thus, lovers of women with bright green complexions are nonexistent in case there are no such women."

Relations, looked at from an abstract mathematical point of view, had a dual nature. In logic, they served as predicates; in graphs (later known as semantic networks), they labeled arcs connecting vertices which represented concepts. AI adopted the representation in logic to support knowledge-based agents and inference; the idea of a graph of concepts has been adopted to represent factual knowledge, prevalent in NLP [Russell and Norvig, 2009]. The latter has had a strong effect on the type of relations represented. In graphs, it is quite natural to represent binary relations. Binary relations have become the de facto norm in ontologies built from texts, and are by far the majority of relations targeted for extraction by NLP.
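The graph view of binary relations can be made concrete with a toy semantic network (our sketch; class and relation names are illustrative): each relation instance is a labeled, directed edge between concept vertices, and querying a relation amounts to following edges with a given label.

```python
from collections import defaultdict

class SemanticNetwork:
    """A minimal semantic network: concepts as vertices, binary relations
    as labeled, directed edges."""

    def __init__(self):
        # node -> list of (relation label, target node)
        self.edges = defaultdict(list)

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def related(self, head, relation):
        """All nodes reachable from `head` via one edge labeled `relation`."""
        return [t for r, t in self.edges[head] if r == relation]

net = SemanticNetwork()
net.add("sandwich", "is-a", "snack food")
net.add("bread", "part-of", "sandwich")

print(net.related("sandwich", "is-a"))
```

Note how the binary shape of the edges is what makes the store and the query this simple; n-ary relations would need reified event nodes, as in the neo-Davidsonian representation discussed earlier.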

The advent of the computer brought about the interest in putting this new tool to the kind of tasks which people do, and thus the field of AI began. John McCarthy was the first to describe a complete, though hypothetical, AI program [McCarthy, 1958]. Designed to apply general world knowledge in search of solutions to a problem, the program relied on two separate components, one for knowledge represented as rules and one for reasoning mechanisms. As a logic-based system, it already built upon relational information, but it was not directly concerned with language. Linguistically-oriented AI systems soon followed. Work such as Winograd's groundbreaking interactive English dialogue system [Winograd, 1972] or Charniak's study of understanding children's stories [Charniak, 1972] demonstrated that semantic knowledge about a variety of topics is essential to computational language comprehension. That was a conceptual shift from the "shallow" architecture of primitive conversation systems such as ELIZA [Weizenbaum, 1966] and first-generation machine translation systems. The need for storehouses of background knowledge to support reasoning systems has led in several directions, one of which was the creation of large-scale hand-crafted ontologies such as Cyc [Lenat and Guha, 1990]. A more recent, and perhaps more fruitful, direction was the acquisition of collections of propositional facts about the world via volunteer contributions over the Web; this trend began with OpenMind Common Sense [Singh et al., 2002] and MindPixel,1 and has reached a truly large scale with Freebase.2

1 The project has been dormant since 2005. See en.wikipedia.org/wiki/Mindpixel for history.
2 www.freebase.com

At the cross-roads between knowledge and language, we encounter interconnected systems in which knowledge about words and their various meanings is expressed in terms of their relations to other words. The idea of defining the meaning of a word by its connections to other words is familiar to any user of a dictionary. Spärck Jones [1964] suggested that the kind of lexical relations found in a dictionary could be formalized and learned automatically from text. Around the same time, Quillian [1962] proposed the semantic network, a graph in which meaning is modeled by labeled associations between words. The vertices of the network are concepts onto which one maps the words in a text, and then connections—relations between concepts—are established on edges linking some of the words.

The network-based style of representation has remained very influential, informing large-scale lexical resources such as WordNet [Fellbaum, 1998], a continually growing network now consisting of over 155,000 words (nouns, verbs, adjectives, adverbs) and over 117,000 groups of near-synonyms called synsets.3 WordNet represents around a dozen semantic relations between synsets, including synonymy, antonymy, hypernymy (a sandwich is a kind of snack food), hyponymy (snack food has a sandwich among its kinds), meronymy (bread is part of a sandwich) and holonymy (a sandwich has bread as a part). We will return to WordNet's relations in Section 2.2.2. Lexical resources such as WordNet, with many applications in modern NLP, will reappear throughout this book.

The early work on manual knowledge acquisition quickly made it apparent that automating the process is a necessity. Luckily, work on text analysis has revealed that much of the knowledge we wish to extract is captured in texts. As a result, work on automatically building knowledge bases from text collections took off with the pioneering work of Hearst [1992]. The beginning was marked by focusing on the relations which are the backbone of ontologies, is-a and part-of [Berland and Charniak, 1999], and then the bootstrapping approaches developed for these relations were applied to other relations; see for example [Patwardhan and Riloff, 2007, Ravichandran and Hovy, 2002]. One can then observe that having specific targets for relation extraction may lead to the omission of a wealth of information in texts. This has led to open information extraction. Within this paradigm, the starting assumption states how a relation may be expressed, for example, as patterns over parts of speech [Fader et al., 2011], paths in a syntactic parse tree [Ciaramita et al., 2005], or sequences of high-frequency words [Davidov and Rappoport, 2008b]. Next, all matching instances are extracted as candidate relation instances. The downside of such methods is the high variability in relation expressions. Mapping these onto a set of "canonical" relation expressions is ongoing work.
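Pattern-based is-a extraction in the style of Hearst [1992] can be illustrated with a single lexico-syntactic pattern ("NP such as NP, NP and NP") implemented as a regular expression. This is a minimal sketch of ours, not the original implementation: the pattern set, the function name, and the naive single-word noun matching are all simplifying assumptions, and a real system would use several patterns plus a tagger or parser to delimit noun phrases.

```python
import re

# One Hearst-style pattern: "<hypernym> such as <enumeration of hyponyms>".
SUCH_AS = re.compile(r"(\w+)\s+such as\s+((?:\w+,\s*)*\w+(?:\s+and\s+\w+)?)")

def hearst_such_as(text):
    """Return candidate (hyponym, hypernym) pairs for the is-a relation."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the enumeration "chocolate, coffee and tea" into items.
        for item in re.split(r",\s*|\s+and\s+", match.group(2)):
            pairs.append((item, hypernym))
    return pairs

print(hearst_such_as("They export goods such as chocolate, coffee and tea."))
```

Each extracted pair is only a candidate relation instance; as the paragraph above notes, filtering and mapping the matches onto canonical relations is where the real difficulty lies.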

3 See wordnet.princeton.edu, in particular the WordNet 3.0 statistics at wordnet.princeton.edu/wordnet/man/wnstats.7WN.html.

2.2 A MENAGERIE OF RELATION SCHEMATA

The two threads of parallel work, on the organization of knowledge and on texts, have led to two perspectives on relations. A relation manifests itself in text at the word level, and arises from the particular context in which it appears. In ontologies and other taxonomies, we find relations between concepts, considered (believed) to be true in view of the current state of our collective understanding of the world. In the next two sections, we briefly review these two perspectives. We look at how relations between nominals have attained prominence in NLP research, and at the different ways in which the two perspectives have been put into practice. On the ontology side, we look at relations in a few of the knowledge repositories which have been or are being used in NLP, and at the issues semantic relations pose when used in ontologies.

2.2.1 RELATIONS BETWEEN WORDS, RELATIONS WITHIN NOUN COMPOUNDS

Standard lexical-semantic literature discusses semantic relations at great length. We refer the interested reader to the latest comprehensive monograph [Geeraerts, 2010] and to citations therein to all the classic positions. We will zoom in on the task of determining, or building, a list of relations with coverage wide enough for text analysis—complete coverage, if at all possible.

A noteworthy attempt to build a list of semantic relations is due to Casagrande and Hale [1967]. They asked native speakers of an exotic language to give definitions for a predetermined list of words. These definitions were then analyzed into declarative sentences which express simple facts, and used to determine the relations expressed in each such statement. The result was a list of 13 relations, not only between nominals, which is shown in Table 2.1.

Table 2.1: Casagrande and Hale's [1967] relations

Relation         Example
attributive      toad - small
contingency      lightning - rain
function         ear - hearing
spatial          tongue - mouth
operational      shirt - wear
comparison       wolf - coyote
exemplification  circular - wheel
class inclusion  bee - insect
synonymy         thousand - ten hundred
antonymy         low - high
provenience      milk - cow
grading          Monday - Sunday
circularity      X is defined as X

Chaffin and Herrmann [1984] presented an exercise in the analysis of relations themselves, and their distinguishing properties. They explored human perception of similarities and differences between relations via an exercise in grouping instances of 31 semantic relations. The results of the experiment showed that the subjects perceived five classes of semantic relations, shown in Table 2.2. Instances of these five classes can be distinguished by three properties: contrasting/non-contrasting, logical/pragmatic, and inclusion/non-inclusion.

Table 2.2: Chaffin and Herrmann's [1984] relations

Relation         Example
contrasts        night - day
similars         car - auto
class inclusion  vehicle - car
part-whole       airplane - wing
case relations   agent, instrument

Much of the debate on the correct representation of semantic relations has played out with regard to characterizing the interpretation of noun compounds—sequences of two or more nouns which function as a single noun, e.g., space shuttle and space shuttle mission. Compounding is a frequent and productive process in English:4 any text will contain numerous compounds and many of these will be infrequent. Semantic interest in the special case of two-word noun compounds, or noun-noun compounds, is due not just to their ubiquity but also to the fact that they can encode a great deal of relational information. For example, a taxi driver is 'a driver who drives a taxi', while an embassy driver is 'a driver who is employed by/drives for an embassy', and an embassy building is 'a building which houses, or belongs to, an embassy'.

4 Compounding is also a feature of many other languages; see Bauer [2001] for a comprehensive cross-linguistic overview.

The main issues of representation which arise in the study of compounds reflect broader questions relevant to any attempt at formalizing semantic relations in general. Noun compounds can therefore be viewed as an informative case study or microcosm. There is a voluminous literature on the semantics of compounds, from the perspective of both linguistics and NLP.5 In linguistics, the primary aim is to find the most comprehensively explanatory representation. In NLP, it is to select the most useful representation for a particular application: this should have the right tradeoff between generality and precision to be computationally tractable and to give informative output to downstream systems; the two perspectives are complementary.

One important question is whether the relational semantics of compounding can be explained by a concise listing of possible semantic relations or whether the set of distinguishable relations is in practice unboundedly large. In linguistics, the former assumption has led to the compilation of relation inventories starting with early descriptive work [Grimm, 1826, Jespersen, 1942, Noreen, 1904] and continuing through to the age of generative linguistics [Levi, 1978, Li, 1971, Warren, 1978]. For example, Warren [1978] proposed an inventory of relations informed by a comprehensive corpus study using the Brown Corpus [Kucera and Francis, 1967]. The inventory consists of six major semantic relations, each of them further subdivided according to a hierarchy of up to four levels. The major relations are shown in Table 2.3. As an example of further division, the relation Time—a direct child of the major relation Location—is specialized into relations like Time-Animate Entity (weekend guests), Time-Concrete, Inanimate Entity (Sunday paper) and Time-Abstract Entity (fall colors).

Levi [1978] proposed a set of relations (for theory-internal reasons called "recoverable deletable predicates" or RDPs), which she claims underlie all compositional non-nominalized compounds in English. We show them in Table 2.4. The Role column shows the modifier's function in the corresponding paraphrasing relative clause: when the modifier is the subject of that clause, the RDP is marked with the index 2.

5 See www.cl.cam.ac.uk/~do242/Resources/compound_bibliography.html for a long list.


Table 2.3: Warren’s [1978] major semantic relations

Relation ExamplePossession family estate

Location water polo

Purpose water bucket

Activity-Actor crime syndicate

Resemblance cherry bomb

Constitute clay bird

Table 2.4: Levi’s [1978] relations

RDP Example Role Traditional nameCAUSE1 tear gas object causativeCAUSE2 drug deaths subject causativeHAVE1 apple cake object possessive/dativeHAVE2 lemon peel subject possessive/dativeMAKE1 silkworm object productive/compositeMAKE2 snowball subject productive/compositeUSE steam iron object instrumentalBE soldier ant object essive/appositionalIN field mouse object locativeFOR horse doctor object purposive/benefactiveFROM olive oil object source/ablativeABOUT price war object topic

In Levi's theory, nominalizations such as taxi driver are accounted for by a separate procedure because they are assumed to be derived from a different kind of deep representation. For those who are not committed to a transformational account of syntax and semantics, this separation is unnecessary and only leads to spurious distinctions (horse doctor would be labeled for but horse healer would have another relation label, agent). Levi deems the degree of ambiguity afforded by 12 relations to be sufficiently restricted for a hearer to identify the relation intended by a speaker by recourse to lexical or encyclopaedic knowledge, while still allowing for the semantic flexibility of compounding.

Levi's relations have influenced some further proposals of relation inventories. One example is the work of Ó Séaghdha and Copestake [2007], who started from Levi's set of relations and followed a set of principles based on empirical and theoretical considerations:

i. the inventory of relations should have good coverage;
ii. relations should be disjunct, and should describe a coherent concept;
iii. the class distribution should not be overly skewed or sparse;
iv. the concepts underlying the relations should generalize to other linguistic phenomena;
v. the guidelines should make the annotation process as simple as possible;
vi. the categories should provide useful semantic information.

The result was an inventory of eight relations. Four of Levi's relations (about, be, have, in) were kept, and for was replaced with agent and inst (instrument). Ó Séaghdha and Copestake introduced rel for compounds encoding non-specific relations and lex for compounds which are idiomatic.

In opposition to the comprehensive view, Zimmer [1971] pointed to the great variety of English compounds and concluded that it may be simpler to categorize the semantic relations which cannot be encoded in compounds rather than those which can. Downing [1977] cited compounds such as plate length ("what your hair is when it drags in your food") in order to argue that "The existence of numerous novel compounds like these guarantees the futility of any attempt to enumerate an absolute and finite class of compounding relationships." A complementary argument holds that simple relations chosen from a discrete set are insufficient to capture the richness of relational meaning and that the semantics of word combinations arise from the interaction between necessarily complex representations of events and entities. This view was given a detailed treatment in Coulson's [2001] work on frame semantics.

While this debate has arisen from theoretical linguistic concerns, the tension between parsimony and expressiveness in semantic representation is also a fundamental issue for computational linguists. The inventory approach has been popular in NLP because it is computationally suited to both rule-based and statistical classification methods. Su [1969], to our knowledge the first researcher to report on noun compound interpretation from a computational perspective, described 24 semantic categories to be used for producing paraphrase analyses of compounds. These categories contain many relations familiar from linguistically motivated inventories: Use, Possessor, Spatial Location, Cause, and so on. Other inventories proposed for noun compound analysis include those of Leonard [1984], Vanderwende [1994], Girju et al. [2005] and Ó Séaghdha [2008]. A large inventory, later used by a number of researchers, appeared in [Nastase and Szpakowicz, 2003]: 30 relations were grouped into 5 categories—see Table 2.5.

The example inventories presented here should make it clear that these alternative accounts have much in common. They all have categories for locative relations, for possessive relations, for purposive relations, and so on.

Tratz and Hovy [2010] proposed a new inventory of 43 relations in 10 categories, developed through an iterative crowd-sourcing process to find a scheme which maximizes agreement between annotators. The relations are shown in Tables 2.6-2.7. Tratz and Hovy performed a meta-analysis of the most notable previous proposals, which has shown that they all cover essentially the same semantic space, though they do differ in the exact way of partitioning that space.

Every representational framework considered in this section thus far has assumed that semantic relations are abstract constructs which correspond to logical predicates rather than lexical items. Another approach to meaning holds that semantic relations can be expressed by paraphrases. The relation in weather report can be attributed to the abstract predicate named about or topic; or the same relation can be described by saying that a weather report is "a report about the weather" or "a


Table 2.5: Nastase and Szpakowicz's [2003] relations

Relations of Causality  Examples
cause                   flu virus
effect                  exam anxiety
purpose                 concert hall
detraction              headache pill

Participant relations   Examples
Agent                   student protest
Beneficiary             student discount
Instrument              laser printer
Object                  metal separator
Object_Property         sunken ship
Part                    printer tray
Possessor               national debt
Property                blue book
Product                 plum tree
Source                  olive oil
Stative                 sleeping dog
Whole                   daisy chain

Relations of Quality    Examples
Container               film music
Content                 apple cake
Equative                player coach
Manner                  stylish writing
Material                brick house
Measure                 expensive book
Topic                   weather report
Type                    oak tree

Relations of Spatiality Examples
Direction               outgoing mail
Location                home town
Location_at             desert storm
Location_from           foreign capital

Relations of Temporality Examples
Frequency               daily experience
Time_at                 morning exercise
Time_through            six-hour meeting


Table 2.6: The relations from Tratz and Hovy [2010] (part I). The Approximate Mappings column shows the mapping of each proposed relation to a relation from Barker and Szpakowicz [1998] (B), Girju et al. [2005] (G), Levi [1978] (L), Nastase and Szpakowicz [2003] (N), Vanderwende [1994] (V) and Warren [1978] (W).

Causal Group
communicator of communication | court order | BGN:Agent, L:Acta+Producta, V:Subj
performer of act/activity | police abuse | BGN:Agent, L:Acta+Producta, V:Subj
creator/provider/cause of | ad revenue | BGV:Cause(d-by), L:Cause2, N:Effect

Purpose/Activity Group
perform/engage in | cooking pot | BGV:Purpose, L:For, N:Purpose, W:Activity,Purpose
create/provide/sell | nicotine patch | BV:Purpose, BG:Result, G:Make-Produce, GNV:Cause(s), L:Cause1,Make1,For, N:Product, W:Activity,Purpose
obtain/access/seek | shrimp boat | BGNV:Purpose, L:For, W:Activity,Purpose
modify/access/change | eye surgery | BGNV:Purpose, L:For, W:Activity,Purpose
mitigate/oppose/destroy | flak jacket | BGV:Purpose, L:For, N:Detraction, W:Activity,Purpose
organize/supervise/authority | ethics board | BGNV:Purpose/Topic, L:For/Abouta, W:Activity
propel | water gun | BGNV:Purpose, L:For, W:Activity,Purpose
protect/conserve | screen saver | BGNV:Purpose, L:For, W:Activity,Purpose
transport/transfer/trade | freight train | BGNV:Purpose, L:For, W:Activity,Purpose
traverse/visit | tree traversal | BGNV:Purpose, L:For, W:Activity,Purpose

Ownership, Experience, Employment and Use
possessor + owned/possessed | family estate | BGNVW:Possess∗, L:Have2
experiencer + cognition/mental | voter concern | BNVW:Possess∗, G:Experiencer, L:Have2
employer + employee/volunteer | team doctor | BGNVW:Possess∗, L:For/Have2, BGN:Beneficiary
consumer + consumed | cat food | BGNVW:Purpose, L:For, BGN:Beneficiary
user/recipient + used/received | voter guide | BNVW:Purpose, G:Recipient, L:For, BGN:Beneficiary
owned/possessed + possession | store owner | G:Possession, L:Have1, W:Belonging-Possessor
experience + experiencer | fire victim | G:Experiencer, L:Have1
thing consumed + consumer | fruit fly | W:Obj-SingleBeing
thing/means used + user | faith healer | BNV:Instrument, G:Means,Instrument, L:Use, W:MotivePower-Obj

Temporal Group
time (span) + X | night work | BNV:Time(At), G:Temporal, L:Inc, W:Time-Obj
X + time (span) | birth date | G:Temporal, W:Obj-Time

Location and Whole+Part/Member of
location/geographic scope of X | hillside home | BGV:Locat(ion/ive), L:Ina,Fromb, B:Source, N:Location(At/From), W:Place-Obj,PlaceOfOrigin
whole + part/member of | robot arm | B:Possess, G:Part-Whole, L:Have2, N:Part, V:Whole-Part, W:Obj-Part,Group-Member


Table 2.7: The relations from Tratz and Hovy [2010] (part II). The Approximate Mappings column shows the mapping of each proposed relation to a relation from Barker and Szpakowicz [1998] (B), Girju et al. [2005] (G), Levi [1978] (L), Nastase and Szpakowicz [2003] (N), Vanderwende [1994] (V) and Warren [1978] (W).

Composition and Containment Group
substance/material/ingredient + whole | plastic bag | BNVW:Material∗, GN:Source, L:Froma, L:Have1, L:Make2b, N:Content
part/member + collection/config/series | truck convoy | L:Make2ac, N:Whole, V:Part-Whole, W:Parts-Whole
X + spatial container/location/bounds | shoe box | B:Content,Located, L:For, L:Have1, N:Location, W:Obj-Place

Topic Group
topic of communication/imagery/info | travel story | BGNV:Topic, L:Abouta, W:SubjectMatter, G:Depiction
topic of plan/deal/arrangement/rules | loan terms | BGNV:Topic, L:Abouta, W:SubjectMatter
topic of observation/study/evaluation | job survey | BGNV:Topic, L:Abouta, W:SubjectMatter
topic of cognition/emotion | jazz fan | BGNV:Topic, L:Abouta, W:SubjectMatter
topic of expert | policy wonk | BGNV:Topic, L:Abouta, W:SubjectMatter
topic of situation | oil glut | BGNV:Topic, L:Abouta
topic of event/process | lava flow | G:Theme, V:Subj

Attribute Group
topic/thing + attrib | street name | BNV:Possess∗, G:Property, L:Have2, W:Obj-Quality
topic/thing + attrib value charac of | earth tone |

Attributive and Coreferential
coreferential | fighter plane | BV:Equative, G:Type,IS-A, L:Bebcd, N:Type,Equality, W:Copula
partial attribute transfer | skeleton crew | W:Resemblance, G:Type
measure + whole | hour meeting | G:Measure, N:TimeThrough,Measure, W:Size-Whole

Other
highly lexicalized/fixed pair | pig iron |
other | contact lens |

report forecasting the weather.” Lauer [1995] proposed a widely cited analysis of noun compounds as paraphrases, casting the task of interpreting compounds as that of choosing a prepositional paraphrase from the following set of precisely eight prepositions: of, for, in, at, on, from, with, about. For example, olive oil is analyzed as oil from olives, night flight as a flight at night, and odor spray as a spray for odors. From a computational point of view, paraphrasing is attractive because a predictive model can be built by identifying noun-preposition co-occurrences in a corpus or even the Web [Lapata and Keller, 2004].
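The co-occurrence model just described can be sketched as follows. The preposition counts below are invented for illustration; in a real system they would be estimated from a corpus or from Web hit counts, as in Lapata and Keller [2004].

```python
# Sketch of Lauer-style compound interpretation: pick the preposition p
# that maximizes a co-occurrence score count(modifier, head, p).
# All counts here are hypothetical, not from any real corpus.
PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]

counts = {
    ("olive", "oil", "from"): 220, ("olive", "oil", "of"): 60,
    ("night", "flight", "at"): 180, ("night", "flight", "in"): 40,
    ("odor", "spray", "for"): 95, ("odor", "spray", "with"): 12,
}

def paraphrase(modifier, head):
    """Return the highest-scoring prepositional paraphrase 'head P modifier'."""
    best = max(PREPOSITIONS, key=lambda p: counts.get((modifier, head, p), 0))
    return f"{head} {best} {modifier}"

print(paraphrase("olive", "oil"))   # oil from olive
```

Note that the model only ranks the eight prepositions; compounds with no prepositional paraphrase fall outside it, as discussed below.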

On the other hand, the lexical nature of Lauer’s relations has disadvantages. Prepositions are themselves polysemous, and the assignment of a prepositional paraphrase to a compound does


not unambiguously identify the compound’s meaning. In other words, once we have identified a compound as, say, an of-compound, we still need to ask what kind of of we are dealing with. The paraphrases school of music, theory of computation and bell of (the) church do not describe the same kind of semantic relation. Furthermore, the assignment of different categories does not necessarily entail a difference in semantic relations. The categories in, on and at share a significant overlap, and the distinction between prayer in (the) morning, prayer at night, and prayer on (a) feast day has more to do with shallow lexical association than with any interesting semantic issue. Another problem is that many noun-noun compounds cannot be paraphrased using prepositions (woman driver, taxi driver) and are excluded from the model, while others admit only unintuitive paraphrases: should honey bee really be analyzed as bee for honey?

Departing from the assumption that a semantic relation can be characterized by a handful of phrases, Nakov [2008b] and Butnariu et al. [2010] consider a relation to be expressed by any combination of verbs and prepositions which occur in texts: olive oil can now be interpreted as, e.g., oil that is extracted from olives or oil that is squeezed from olives. Such paraphrases, a lot more informative than Lauer’s oil from olives or Levi’s from(oil, olives), come closer to the richness demanded by Downing’s [1977] linguistic arguments. A semantic relation is represented as a distribution over multiple paraphrases, and this allows comparisons. Two compounds may be similar in some ways (olive oil and sea salt both match the paraphrase N1 is extracted from N2) and different in others (salt is not squeezed from the sea).
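Representing a relation as a distribution over paraphrases makes comparison a vector operation. A minimal sketch, with invented paraphrase counts standing in for corpus frequencies:

```python
# Sketch: each compound is a vector of verb+preposition paraphrase counts;
# compounds are compared via cosine similarity. Counts are hypothetical.
import math

olive_oil = {"is extracted from": 12, "is squeezed from": 9, "comes from": 5}
sea_salt  = {"is extracted from": 10, "comes from": 7, "is found in": 4}

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

sim = cosine(olive_oil, sea_salt)
print(round(sim, 3))   # 0.763
```

The two compounds come out fairly similar because they share the "is extracted from" and "comes from" paraphrases, while the paraphrases they do not share ("is squeezed from", "is found in") pull the score below 1.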

2.2.2 RELATIONS IN ONTOLOGIES AND KNOWLEDGE BASES
Developing an ontology or a knowledge base requires choosing which entities and relations between them to represent. Both these choices depend on the domain of the knowledge to be captured, and we will see further on that there is much variety. There is, however, some consensus: the backbone of an ontology is the is-a relation, and part-of is desirable as well. The consensus may break over granularity: even is-a and part-of relations can be further refined.

An instance of the is-a relation usually links a more specific with a more general concept. From the point of view of formalizing knowledge, there is a distinction between linking two generic concepts (chocolate is-a food), and linking a concept instance and its superordinate concept (Toblerone is-a chocolate). The first formalizes the class inclusion, and the second models the class membership relation. Such a distinction was added to WordNet’s hypernym-hyponym hierarchy in the form of the instance hypernymy relation [Hristea and Miller, 2006]. Further distinctions can be made. Wierzbicka [1984] refined is-a into five subrelations. The two most interesting here are the is-a-kind-of relation, which she calls taxonomic (chicken - bird), and the is-used-as-a-kind-of, called functional (adornment - decoration).
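The inclusion/membership distinction can be made concrete with two separate link types, where class inclusion is transitive and membership is the single hop from an instance into the class hierarchy. A minimal sketch (the toy hierarchy is illustrative, not from any resource):

```python
# Sketch: class inclusion (is-a between classes) vs. class membership
# (instance-of, linking an instance to its class). Toy data.
subclass_of = {"chocolate": "food", "food": "substance"}   # class inclusion
instance_of = {"Toblerone": "chocolate"}                   # class membership

def classes_of(instance):
    """All classes an instance belongs to: one membership hop, then
    the transitive closure of class inclusion."""
    result = []
    cls = instance_of.get(instance)
    while cls is not None:
        result.append(cls)
        cls = subclass_of.get(cls)
    return result

print(classes_of("Toblerone"))   # ['chocolate', 'food', 'substance']
```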

Meronymic (part-of) relations also can, and in certain situations should, be refined. Winston et al. [1987] made a convincing case for six types of meronymic relations, listed along with examples in Table 2.8. These distinctions are motivated by apparent contradictions in the transitivity of meronymic relations, and are explained by changes in three properties: the functionality of the relation between the part and the whole, the separability between the part and the whole, and whether the part and whole are homeomerous (have like/similar parts). For example:

1. Simpson’s arm is part of Simpson(’s body).
2. Simpson is part of the Philosophy Department.
3. *Simpson’s arm is part of the Philosophy Department.

Table 2.8: Winston et al.’s [1987] fine-grained part-of relations

Relation | Example
component-integral object | pedal - bike
member-collection | ship - fleet
portion-mass | slice - pie
stuff-object | steel - car
feature-activity | paying - shopping
place-area | Everglades - Florida

Transitivity fails in this case, because the relation in the first sentence is an instance of component-object, while in the second sentence we have an instance of member-collection. Gerstl and Pribbenow’s [1995] inventory proposal contains three meronymic relations: quantity-mass, element-collection, and component-complex. They have also been adopted in WordNet [Fellbaum, 1998], where meronymy has three subtypes: member, substance and part meronyms (and there are the matching holonym counterparts). A representative selection of WordNet’s relations appears in Table 2.9.

Nutter [1989] presented an overview of lexical-semantic relations from 15 studies on the topic. The result of this analysis is a rich five-level hierarchy of over 100 relations, comprising semantic, morphological, syntactic and factive relations. The relations come from work on English and Russian, and most of them were found to be language-independent. Nutter notes that domains are likely to require specific relations, and so there is no comprehensive list of relations.

Cyc [Lenat and Guha, 1990, Matuszek et al., 2006] is a project aiming at the large-scale acquisition of common-sense knowledge. Cyc features 15,000 predicates, organized hierarchically. There is a distinction between class inclusion genls: (#$genls #$Tree-ThePlant #$Plant) and class membership is-a: (#$isa #$BillClinton #$UnitedStatesPresident); there are causal and meronymic relations, as well as many “encyclopedic” relations such as, for example, capitalCity (#$capitalCity #$France #$Paris), basedInRegion, biologicalMotherOf. The resource also includes rules, relations between relations (e.g., implies), definitions of relations, and relations between both entities and general concepts. Our example of the is-a relation shows a downside of the manual acquisition of encyclopedic knowledge. Not only must new data be added, but existing data must be checked for consistency, because some relations have an expiration date.

Collaborative approaches to knowledge-sharing have given rise to other sources of encyclopedic relations, with the added bonus that such relations emerge from the interaction of multiple users,


Table 2.9: Selected relations among nouns in WordNet 3.0 (its Unix-based implementation). In the examples, we note sense numbers of polysemous nouns. The bottom part of the table shows nouns related to other parts of speech

Relation | Example
Synonym | day (Sense 2) / time
Antonym | day (Sense 4) / night
Hypernym | berry (Sense 2) / fruit
Hyponym | fruit (Sense 1) / berry
Member-of holonym | Germany / NATO
Has-member meronym | Germany / Sorbian
Part-of holonym | Germany / Europe
Has-part meronym | Germany / Mannheim
Substance-of holonym | wood (Sense 1) / lumber
Has-substance meronym | lumber (Sense 1) / wood
Domain - TOPIC | line (Sense 7) / military
Domain - USAGE | line (Sense 21) / channel
Domain member - TOPIC | ship / porthole
Attribute | speed (Sense 2) / fast
Derived form | speed (Sense 2) / quick
Derived form | speed (Sense 2) / accelerate

and so they reflect a form of collective knowledge and organizational principles. One very frequently used source of collaboratively obtained relations is Wikipedia.6 Within structured sections of articles called infoboxes, the users include attributes of the entity to which the article corresponds, and assign the corresponding values, which often are entities themselves. This in effect produces instances of relations between entities. Freebase,7 where users explicitly add facts, is (from our point of view) a large repository of relation instances.

2.2.3 A CONCLUSION
A review of the literature has shown that almost every new attempt to analyze relations between nominals leads to a new list of relations. We also observe that a necessary and sufficient list of relations to describe the connection between nominals does not exist [Downing, 1977, Jespersen, 1942]. This observation applies both to relations between nominals in texts and to relations between concepts in an ontology. Murphy [2003] reviews several properties of semantic relations, among them uncountability, which basically marks relations as an open class. This stance has been embraced by researchers who work on relation extraction from open texts, where relation markers are verbs or

6 www.wikipedia.org
7 www.freebase.com


verb phrases, a clearly open-ended set. The missing step, for now, is the mapping of such relation expressions onto proper relation types. Resources built collaboratively on the Web offer a middle ground. In particular, Wikipedia and Freebase contain numerous instances of relations considered salient by the (human) contributors.

2.3 DIMENSIONS OF VARIATION ACROSS RELATIONS
While there is no consensus on a list of semantic relations which would be necessary and sufficient for analyzing texts (in any domain) or for building ontologies, a review of related literature reveals certain shared properties of relations. Such properties, briefly presented in this section, can be used to differentiate relations and can be relevant to computational methods of work with semantic relations.

2.3.1 PROPERTIES OF RELATIONS
Ontological and idiosyncratic relations
An idiosyncratic relation is one whose application is highly sensitive to the context. On the other hand, an ontological relation comes up practically the same in numerous contexts. While apple is-a fruit regardless of the context (and even without considering any text at all), the content-container relation between apple and basket will hold only in specific contexts. This parallels the distinction between paradigmatic and syntagmatic relations first noted in de Saussure’s [1959] Course in General Linguistics, and adopted in other studies of semantic relations [Khoo and Na, 2006, Murphy, 2003]. Paradigmatic relations cover a wide variety of associations between words, including morphological and phonetic, whereas syntagmatic relations cover words which occur close in a text.

Our notion of ontological relations captures just that: semantic relations between nominals appropriate for insertion into an ontology. Idiosyncratic relations hold between nominals which appear together in texts and which the reader perceives as related. This distinction based on the context sensitivity of relations is compelling. Had there been a fool-proof test, we would have structured the remainder of the book along this dimension. This would have enabled a systematic and well-grounded automation of induction of knowledge bases and ontologies from texts. As it is, however, we must be practical.

Automated acquisition of knowledge in general seems to split rather neatly into three major forms of machine learning (unsupervised, semi-supervised and supervised) but there is no clear mapping from methods to the learning of specific types of semantic relations. The same relation can be acquired by any of these methods (it is true, for example, of the is-a relation) or indeed by a combination of methods. Features which characterize relations, such as context patterns, may come from unsupervised corpus analysis, and go into more or less supervised models of individual relations. Idiosyncratic relations are inherently context-dependent, just as ontological relations are not, but even so we cannot distinguish them easily only by data support. The arguments of relations of both kinds (e.g., is-a and cause-effect, or content-container) can be characterised by features acquired from large corpora. We do expect, however, that idiosyncratic relations will benefit from including


features which represent the specific context of the instance. These considerations explain the way in which we chose to organize the material. Chapter 3 focuses on supervised learning, applicable both to ontological and idiosyncratic relations. Data representation consists of features which describe the entities, their interaction and the context. Chapter 4 shows methods which require little or no supervision, and work with large amounts of data with hardly any manual annotations; relation analysis in this framework relies mostly on redundancy and consistency across data, so the methods presented here work better for ontological relations.

Binary and n-ary relations
We focus on binary relations, as is currently quite common in analyzing the semantic interaction of entities in text, but this is not the only possibility. The semantic relations for verbs, which take between one and several arguments, may be better represented by frames. A frame captures all the required semantic arguments. For example, a selling event can be seen as invoking a frame which covers relations between the buyer, the seller, the object bought and the price paid. Such a frame-based perspective can make sense for nominals and their modifiers, especially if the nominal is verbal or deverbal (e.g., selling or purchase). There may be other situations which require n-ary relations between nominals.

Targeted and emergent relations
One of the considerations in comprehensive knowledge acquisition from texts is whether the inventory of relations to be extracted is open-ended. In such cases, the extraction process will first assume how relations may be expressed, for example, as patterns over parts of speech:

(V | V (N | Adj | Adv | Pron | Det)* PP).

This allows the extraction of relations such as invented, is located in or made a deal with. All instances which fit these pre-specified patterns and their arguments are collected and proposed for inclusion in the knowledge base. Further processing may cluster together similar relation expressions, but relations in this situation remain largely anonymous, and are identified either by a set of phrases (which match the specified patterns) or by examples.
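The pattern above can be approximated with a standard regular expression by mapping each POS tag to a single character. A minimal sketch; the tagged sentence and the simplified tagset are assumed given, not produced by a real tagger:

```python
import re

# Sketch: match the relation-phrase pattern (V | V (N|Adj|Adv|Pron|Det)* PP)
# over a POS-tagged sentence. Each tag maps to one character so an
# ordinary regex can scan the tag string; V is a verb, P a preposition.
CHAR = {"V": "V", "N": "X", "Adj": "X", "Adv": "X",
        "Pron": "X", "Det": "X", "PP": "P"}

def relation_phrases(tagged):
    chars = "".join(CHAR.get(tag, "O") for _, tag in tagged)
    # Try the longer alternative first, mirroring the pattern's alternation.
    return [" ".join(tok for tok, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"VX*P|V", chars)]

tagged = [("Microsoft", "N"), ("made", "V"), ("a", "Det"),
          ("deal", "N"), ("with", "PP"), ("Yahoo", "N")]
print(relation_phrases(tagged))   # ['made a deal with']
```

The extracted phrase ("made a deal with") then names an anonymous relation whose arguments are the surrounding nominals.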

This contrasts with situations when the list of relations to be extracted is given in advance. The systems either learn how to distinguish instances of the requested relations from all others, or bootstrap to incrementally enlarge a given set of seed examples.

First-order and higher-order relations
We have thus far only considered first-order relations, whose arguments can only be entities. Higher-order relations can take relations as arguments. For example, John’s belief that apples are fruits might be notated like this: believes(John, is-a(apple, fruit)). Higher-order phenomena can be expressed quite elegantly in a notation inspired by Peircean logic, namely conceptual graphs [Sowa, 1984],8 in which the target of a relation can be another relation. Higher-order semantic relations have been relatively

8 conceptualgraphs.org/


less studied in NLP, though they play a role in datasets for biomedical event extraction [Kim et al., 2009] and in semantic parsing [Liang et al., 2011, Lu et al., 2008].

Consider, for example, this fragment of a biomedical text: “In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40 cytoplasmic domain.” It contains three events, E1, E2, and E3, with one, three and two arguments, respectively. Here E3 is a higher-order event which takes E1 and E2 as arguments:

• E1: phosphorylation(Theme:TRAF2),
• E2: binding(Theme1:TRAF2, Theme2:CD40, Site:cytoplasmic domain),
• E3: negative_regulation(Theme:E2, Cause:E1).
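Such nested events fit naturally into a recursive data structure in which an argument slot may hold either an entity or another event. A minimal sketch (the Event class and depth helper are illustrative, not from any event-extraction toolkit):

```python
# Sketch: representing higher-order (nested) events; an argument value
# is either an entity string or another Event.
from dataclasses import dataclass, field

@dataclass
class Event:
    trigger: str
    args: dict = field(default_factory=dict)   # role -> entity or Event

e1 = Event("phosphorylation", {"Theme": "TRAF2"})
e2 = Event("binding", {"Theme1": "TRAF2", "Theme2": "CD40",
                       "Site": "cytoplasmic domain"})
e3 = Event("negative_regulation", {"Theme": e2, "Cause": e1})

def depth(ev):
    """Nesting depth: 1 for first-order events, more for higher-order ones."""
    nested = [depth(v) for v in ev.args.values() if isinstance(v, Event)]
    return 1 + max(nested, default=0)

print(depth(e1), depth(e3))   # 1 2
```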

General and domain-specific relations
Some relations are likely to be useful in processing all kinds of text or in representing knowledge in any domain. Examples of such general relations are location, possession, causation, is-a, or part-of. Other relations are only relevant to a specific text genre or to a narrow domain. A system for interpreting the language of biomedical documents may need to understand gene/protein events [Kim et al., 2009], relations between diseases and treatments [Rosario and Hearst, 2004], or other relations such as Person afflicted and Defect [Rosario and Hearst, 2001]. On the other hand, all these very specific relations will be of very little use when processing reports about, say, sports fixtures.

2.3.2 PROPERTIES OF RELATION SCHEMATA
Coarser- and finer-grained schemata
Depending on the application, it may make sense to partition the space of relations into a small set of coarse-grained relations or a larger set of fine-grained relations. In the extreme, one can argue that every interaction between entities is a distinct relation with unique properties. This is not practical, however, because it does not allow useful generalizations to be made from data. At the other extreme, a scheme consisting of one universal relation is equally useless. Most inventories proposed in the literature contain fewer than 50 relations, though this is anything but a magic number. Unrestricted paraphrasing approaches can be seen as promoting an arbitrarily fine-grained scheme.

Hierarchical and flat schemata
A compromise which shares the benefits of coarse-grained and fine-grained schemes is to adopt a schema with a hierarchical organization, in which more numerous but narrower relations are grouped into coarser “super-relations.” For example, Nastase and Szpakowicz’s [2003] relation set has a two-level structure. Each of the 30 relations belongs to one of five categories. Warren’s [1978] schema (Section 2.2.1) goes as far as four levels: Possessor-Legal Belonging is a subrelation of Possessor-Belonging, itself a subrelation of Whole-Part under the top-level relation Possession.
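A two-level schema amounts to a mapping from fine-grained labels to super-relations, which lets a system back off to the coarser level when data is sparse. A minimal sketch; the label assignments here are illustrative rather than a faithful copy of any published schema:

```python
# Sketch: backing off from fine-grained relation labels to coarse
# super-relations in a two-level hierarchical schema. Toy mapping.
SUPER = {
    "cause": "causality", "effect": "causality", "purpose": "causality",
    "part": "participant", "possessor": "participant",
    "time-at": "temporality", "frequency": "temporality",
}

def coarsen(label):
    """Map a fine-grained label to its super-relation; unknowns to 'other'."""
    return SUPER.get(label, "other")

print(coarsen("purpose"))   # causality
```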


Open and closed schemata
The categorization into open and closed schemata reflects the distinction between emergent and targeted relations. In Chapter 4, we will review open information extraction (IE) methods, aiming for comprehensive relation extraction. They assume that the relations are an open class, and the list of relations represented in the analyzed corpus will be revealed during processing. Such an assumption fits the unsupervised, or self-supervised, learning framework of open IE systems. Closed relation schemata allow for a deeper analysis and definition of specific relations, and their supervised learning from annotated data.

2.4 SUMMARY
This chapter has briefly reviewed the parallel threads of semantic relations in the representation of knowledge and in texts, and how they have recently come together in the field of information extraction. More detailed analysis of semantic relations can be found in [Murphy, 2003] and [Khoo and Na, 2006]. For a textbook overview of lexical-semantic representation for NLP, we refer the reader to [Jurafsky and Martin, 2009], and for knowledge representation and usage in AI, we suggest [Russell and Norvig, 2009].

We have discussed several characteristics of semantic relations and semantic relation schemata. One important observation is that relations are an open class and, like concepts, can be organized into hierarchies. Different relation inventories and different levels of generality are appropriate in different contexts and domains. What remains constant across all these differences is the nature of relations. Some are ontological, and so capture/represent a long-lasting, and largely context-independent, connection; some are idiosyncratic, and hold within a specific context. Today, when a mass of freely available text is both a tempting source of information for the automatic acquisition of knowledge and the target for “regular” text analysis, it is important to understand that there is a difference between the two types of relations. Such understanding can help us acquire better and cleaner resources.


CHAPTER 3

Extracting Semantic Relations with Supervision

In this chapter, we focus on the supervised learning of semantic relations. Supervised learning is a general machine learning methodology in which a predictive model is “trained” by generalizing from examples of data that are provided along with their true label. In the context of relation classification, this requires the provision of text that has been annotated with the relations it expresses. Supervised learning can perform very well, but there are specific difficulties for this paradigm to overcome. First, annotated data must be available, which can be a serious bottleneck. Second, learning good models requires the design and acquisition of a representative set of features. We will describe in this chapter some of the available datasets, and we will pay particular attention to how the relation arguments and the surrounding context are described. Depending on how shallow or deep the features describing a relation are, such approaches may or may not be useful in open information extraction, where, due to the large amount of data to be processed, fast (and therefore shallow) methods are used. The techniques presented in this chapter are more appropriate for the semantic analysis of individual texts, where we can afford to compute interesting features and possibly apply models for different relation types to a given test instance in order to determine the relation that holds. Chapter 4 will review methods that are more suitable to open information extraction or to processing large text collections.
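The train-then-predict cycle can be sketched in a few lines. This is a stand-in for the feature-based learners discussed in this chapter, not any specific published system; the feature names and training examples are invented, and the "classifier" simply scores labels by feature-label co-occurrence counts:

```python
# Sketch (stdlib only): a minimal feature-based relation classifier.
# Training examples are (feature set, label) pairs; prediction sums
# per-feature label counts and returns the best-scoring label.
from collections import Counter, defaultdict

def train(examples):
    weights = defaultdict(Counter)          # feature -> Counter of labels
    for feats, label in examples:
        for f in feats:
            weights[f][label] += 1
    return weights

def predict(weights, feats):
    scores = Counter()
    for f in feats:
        scores.update(weights.get(f, Counter()))
    return scores.most_common(1)[0][0] if scores else "None"

train_data = [
    ({"word_between:in", "e1_type:Content"}, "Content-Container"),
    ({"word_between:causes"}, "Cause-Effect"),
    ({"word_between:in"}, "Content-Container"),
]
model = train(train_data)
print(predict(model, {"word_between:in"}))   # Content-Container
```

Real systems replace the counting with a discriminative learner and use much richer features (lexical, syntactic, semantic), but the data flow is the same.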

3.1 DATA

The task considered here is this: the system is presented with a text (maybe as short as a sentence) and must decide whether the text contains instances of one or more semantic relations from a pre-specified fixed inventory. In the full relation-classification task, the system must also identify which nominals fill the argument roles of each detected relation. This experimental design requires an evaluation dataset which has been manually annotated with correct answers. Systems based on supervised statistical methods also require a separate annotated dataset for training. Depending on the intended application of the system, the task can be set up in a variety of ways.

Several datasets, of different sizes and covering different relations, have been made available for supervised learning of semantic relations. Below we describe some representative examples, ranging from small-scale to large-scale, and from general-purpose to domain-specific.


3.1.1 MUC AND ACE
The first large-scale datasets for relation classification were produced as part of the Message Understanding Conferences (MUC).1 See [Grishman and Sundheim, 1996] for an overview of MUC. The MUC conferences were mostly devoted to filling scenario templates automatically, but MUC-7 introduced a relation extraction task. It is a distinguishing feature of this task that it focuses on relations which hold between specific classes of entities, typically covering entity types common in news stories. In some cases, these are restricted to named entities, which can be referred to by proper nouns, numbers or time expressions. For example, MUC considered three kinds of entities (Person, Organization, and Location) as well as dates, times, percentages and currency amounts. MUC-7 featured a very limited set of three relations, each of them mapping an organization into an entity: Employee_of, Product_of, and Location_of.2 That was sufficient in a task in which relation extraction was only a step in a more general process.

MUC was followed by a series of Automatic Content Extraction (ACE) shared tasks.3 For example, ACE 2008 featured two languages (English and Arabic) and five kinds of named entities: Facility (FAC), Geo-Political Entity (GPE), Location (LOC), Organization (ORG), and Person (PER); see [ACE, 2008] for more detail. Moreover, the ACE 2008 annotation considered six types of relations between entities of these five types, and each of these relation types had one or more subtypes; the full list is shown in Table 3.1.

The definition for each relation type specified the entity types permitted as its arguments. For example, the first argument of an instance of Employment must be a Person, and the second argument must be either an Organization or a Geo-Political Entity. The MUC-7 and ACE relation classification datasets consisted of collections of documents in which each markable instance of a relation is labeled with its type and subtype. The guidelines for the annotation procedure were detailed. In general, an entity is markable if it belongs to one of the specified entity classes. A relation R is markable if it holds between two markable entities mentioned in the same sentence, the sentence contains explicit evidence for R, and R is in one of the specified relation classes. Figure 3.1 shows example annotations.

The MUC-7 and ACE datasets were constructed so they can be used for a multi-stage information extraction evaluation consisting of entity detection and relation detection. Researchers who only focus on the relation classification task often rely on the gold-standard entity annotations. The task then becomes that of assigning a relation type/subtype label to each pair of annotated entities appearing within the same sentence. For many entity pairs, no semantic relation holds, so a None label is also available.4
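Cast this way, every ordered pair of entity mentions in a sentence becomes a classification instance, with None as the default label. A minimal sketch, assuming gold-standard entity annotations (the toy sentence and gold labels are illustrative):

```python
# Sketch: generating ACE-style candidate instances, one per ordered pair
# of entity mentions, with "None" when no relation is annotated.
from itertools import permutations

entities = [("He", "PER"), ("NBC Entertainment", "ORG")]
gold = {("He", "NBC Entertainment"): "Employment"}

def candidate_pairs(entities, gold):
    for (m1, _), (m2, _) in permutations(entities, 2):
        yield m1, m2, gold.get((m1, m2), "None")

for m1, m2, label in candidate_pairs(entities, gold):
    print(m1, "/", m2, "->", label)
```

With realistic sentences, the None instances vastly outnumber the positive ones, which is one reason class imbalance matters for supervised relation classification.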

1 www.itl.nist.gov/iaui/894.02/related_projects/muc/
2 www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/overview.html
3 www.itl.nist.gov/iad/mig/tests/ace/
4 We did not have access to these datasets, so we cannot give detailed statistics on the number of instances for each relation type they include.


Table 3.1: The inventory of relation types and subtypes used by ACE-2008

Relation Type | Subtypes
Physical | Located, Near
Part-Whole | Geographical, Subsidiary
Personal-Social | Business, Family, Lasting-Personal
Organization-Affiliation | Employment, Ownership, Founder, Student-Alum, Sports-Affiliation, Investor-Shareholder, Membership
Agent-Artifact | User-Owner-Inventor-Manufacturer
General Affiliation | Citizen-Resident-Religion-Ethnicity, Organization-Location-Origin

Employment(Person, Organization):
<PER>He</PER> had previously worked at <ORG>NBC Entertainment</ORG>.

Near(Person, Facility):
<PER>Muslim youths</PER> recently staged a half dozen rallies in front of <FAC>the embassy</FAC>.

Citizen-Resident-Religion-Ethnicity(Person, Geo-political entity):
Some <GPE>Missouri</GPE> <PER>voters</PER>…

Figure 3.1: Example annotations for three ACE-2008 relations.

3.1.2 SEMEVAL-2007/2010
The 2007 and 2010 SemEval workshops on semantic evaluation included semantic relation classification tasks. These tasks differed from the ACE approach in that they dealt with very general relations, not restricted to specific classes of entities. Those relations can be expected to have relevance for semantic processing in various text genres and domains. Furthermore, the datasets for these tasks are publicly available, with no restrictions on use. As well as attracting large numbers of


participants to the shared tasks, the datasets have been used by other researchers in a number of subsequent studies [Davidov and Rappoport, 2008b, Nakov and Hearst, 2008, Nakov and Kozareva, 2011, Ó Séaghdha and Copestake, 2008, Socher et al., 2012].

SemEval-2007 Task 4 [Girju et al., 2009] provided a dataset in seven parts, each corresponding to one relation from the list in Table 3.2. Each part has training and test sections, and can be treated as defining a binary classification task. The motivating application for this setup is that relation instances can be collected via a search engine by submitting simple query patterns (e.g., “* in *” for the Content-Container relation). On the other hand, the resulting collection will contain many false positive examples which represent “near misses” (e.g., a stitch in time). A supervised classifier could be used to distinguish true instances from such near misses.

Table 3.2: The seven relations considered in SemEval-2007 Task 4, with positive examples and statistics about the absolute and relative distribution of positive examples for each relation in the training/testing datasets

Relation (example) | Training: positive, size | Test: positive, size
Cause-Effect (laugh [Cause] wrinkles [Effect]) | 52.1%, 140 | 51.3%, 80
Instrument-Agency (laser [Instrument] printer [Agency]) | 50.7%, 140 | 48.7%, 78
Product-Producer (honey [Product] bee [Producer]) | 60.7%, 140 | 66.7%, 93
Origin-Entity (message [Entity] from outer-space [Origin]) | 38.6%, 140 | 44.4%, 81
Theme-Tool (news [Theme] conference [Tool]) | 41.4%, 140 | 40.8%, 71
Part-Whole (the door [Part] of the car [Whole]) | 46.4%, 140 | 36.1%, 72
Content-Container (the apples [Content] in the basket [Container]) | 46.4%, 140 | 51.4%, 74
Average | 48.0%, 140 | 48.5%, 78

The format of the dataset is exemplified in Figure 3.2, which shows a positive and a negative instance of the Content-Container relation.

The arguments of the candidate Content-Container instance are given to the system, marked up with the <e1>. . .</e1> and <e2>. . .</e2> tags. The annotation also includes manually disambiguated WordNet senses for each argument, and the query string used to retrieve the sentence from the Web.


"Among the contents of the <e1>vessel</e1> were a set of carpenter’s<e2>tools</e2>, several large storage jars, ceramic utensils, ropes andremnants of food, as well as a heavy load of ballast stones."

WordNet(e1) = "vessel%1:06:00::",WordNet(e2) = "tool%1:06:00::",Content-Container(e2, e1) = "true",Query = "contents of the * were a"

"<e1>Batteries</e1> stored in <e2>contact</e2> with one another can generate heat and hydrogen gas."

WordNet(e1) = "battery%1:06:00::", WordNet(e2) = "contact%1:26:00::",
Content-Container(e1, e2) = "false",
Query = "batteries stored in"

Figure 3.2: Annotated sentences showing a positive and a negative instance of the semantic relation Content-Container in SemEval-2007 Task 4.

The system must predict a label “true” or “false” for each data item, that is, make a binary classification judgment.
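The tagged format above is regular enough to read with a few regular expressions. The sketch below extracts the two arguments, the candidate relation, the argument order and the gold label from one data item; the exact field layout varies across the released files, so the regexes are illustrative rather than an exact parser for the distributed data.

```python
import re

def parse_instance(sentence, annotation):
    """Extract the tagged arguments and the relation label from one
    SemEval-2007 Task 4 style data item (format as in Figure 3.2)."""
    e1 = re.search(r"<e1>(.*?)</e1>", sentence).group(1)
    e2 = re.search(r"<e2>(.*?)</e2>", sentence).group(1)
    # The label line looks like: Content-Container(e2, e1) = "true"
    m = re.search(r'([\w-]+)\((e[12]),\s*(e[12])\)\s*=\s*"(true|false)"',
                  annotation)
    relation, first_arg, second_arg, label = m.groups()
    return {"e1": e1, "e2": e2, "relation": relation,
            "args": (first_arg, second_arg), "label": label == "true"}

sentence = ("Among the contents of the <e1>vessel</e1> were a set of "
            "carpenter's <e2>tools</e2>, several large storage jars.")
annotation = ('Content-Container(e2, e1) = "true", '
              'Query = "contents of the * were a"')
print(parse_instance(sentence, annotation))
```

Note that the argument order recorded in the label (here e2, e1) matters: the container is the first argument of Content-Container even though it is tagged second in the sentence.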

The follow-up Task 8 at SemEval-2010 [Hendrickx et al., 2010] changed the experimental design in a few ways.

1. Instead of supplying a separate dataset and posing a separate binary classification task for each relation, there was a single multi-class dataset. Systems were to choose among all relations in the inventory to classify an instance, and also had the option of choosing no relation (None).

2. The candidate arguments were still given, but systems had to decide the ordering of these entities with respect to the predicted relation’s argument slots.

3. WordNet senses and query strings were no longer supplied.

The dataset created for Task 8 was also much larger (over 10,000 annotated sentences); the set of relations was larger as well. The relation inventory is listed in Table 3.3. Figure 3.3 shows an example data item.

One challenging aspect of the SemEval-2010 Task 8 dataset is that the relation inventory contains two groups of very close relations. The first group consists of Component-Whole and Member-Collection, both special cases of Part-Whole. The second group includes Content-Container, Entity-Origin and Entity-Destination; they are distinguished by considering whether the situation described in the corresponding sentence is static or dynamic, as shown in the examples in Figure 3.4.


30 3. EXTRACTING SEMANTIC RELATIONS WITH SUPERVISION

Table 3.3: The 9+1 relations considered in SemEval-2010 Task 8, with positive examples and statistics about the absolute and relative distribution of positive examples for each relation in the training/testing datasets

Relation (example)                                 Training         Test
                                                   positive  size   positive  size
Cause-Effect                                       12.5%     1003   12.1%     328
  radiation [Cause] cancer [Effect]
Instrument-Agency                                  6.3%      504    5.7%      156
  phone [Instrument] operator [Agency]
Product-Producer                                   9.0%      717    8.5%      231
  suits [Product] factory [Producer]
Content-Container                                  6.8%      540    7.1%      192
  wine [Content] is in the bottle [Container]
Entity-Origin                                      9.0%      716    9.5%      258
  letters [Entity] from the city [Origin]
Entity-Destination                                 10.6%     845    10.8%     292
  boy [Entity] went to bed [Destination]
Component-Whole                                    11.8%     941    11.5%     312
  kitchen [Component] apartment [Whole]
Member-Collection                                  8.6%      690    8.6%      233
  tree [Member] forest [Collection]
Message-Topic                                      7.9%      634    9.6%      261
  lecture [Message] on semantics [Topic]
Other                                              17.6%     1410   16.7%     454
  people filled with joy
Total                                                        8000             2717

The <e1>collision</e1> resulted in two more <e2>crashes</e2> in the intersection, including a Central Concrete truck that was about to turn left onto College Ave.

Relation = Cause-Effect(e1, e2)

Figure 3.3: An instance of the semantic relation Cause-Effect in SemEval-2010 Task 8.

3.1.3 DATASETS FOR RELATIONS IN NOUN-NOUN COMPOUNDS

Here we present some other datasets for relation extraction, which focus on the special case of noun-noun compounds; some of them do not provide sentential context.


Entity-Origin(e1, e2)

He removed the <e1>apples</e1> from the <e2>basket</e2> and put them on the table.

Content-Container(e1, e2)

When I entered the room, the <e1>apples</e1> were put in the <e2>basket</e2>.

Entity-Destination(e1, e2)

Then, the <e1>apples</e1> were put in the <e2>basket</e2> once again.

Figure 3.4: Annotated sentences showing instances of the close relations Entity-Origin, Content-Container and Entity-Destination from SemEval-2010 Task 8.

Nastase and Szpakowicz [2003] made available a set of 600 base noun phrases (noun-modifier pairs) with two annotations of different granularity levels. One annotation used 35 relations,⁵ while the other used five more general relations.⁶ The 35 relations, together with examples, were presented in Chapter 2, Table 2.5. This dataset contains not only noun-noun pairs, but also noun-adjective pairs. These base noun phrases cannot be fully interpreted out of context, but they are annotated with part-of-speech and WordNet 1.7 senses. Statistics for the dataset at the coarse 5-class level and examples of annotated instances appear in Table 3.4, in a Prolog-style format (predicate and arguments).⁷

Table 3.4: Statistics for Nastase and Szpakowicz’s [2003] noun-modifier dataset at the coarse 5-class level, and examples of annotated instances at the fine-grained level. Abbreviations used: nmr = noun-modifier relation, eff = effect, freq = frequency, inst = instrument, lfr = locationFrom, cont = container

Relation      # class examples   Example
Causality     86                 rel(nmr, wrinkle, [n, 1], laugh, [n, 2], eff)
Participant   260                rel(nmr, observation, [n, 1], radar, [n, 1], inst)
Temporality   52                 rel(nmr, workout, [n, 1], regular, [a, 1], freq)
Spatiality    56                 rel(nmr, material, [n, 1], cosmic, [a, 1], lfr)
Quality       146                rel(nmr, album, [n, 2], photo, [n, 1], cont)

5 www.eecs.uottawa.ca/~vnastase/Data/nmr_relations.data
6 www.eecs.uottawa.ca/~vnastase/Data/nmr_relations_5class.data
7 The format of an entry is as follows: rel(nmr, Head, [POS_Head, WNSense_Head], Modifier, [POS_Modifier, WNSense_Modifier], Relation). (The WordNet 1.7 sense corresponds to that particular word and part of speech.)
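Since the entries are syntactically regular, a Prolog-style line such as those in Table 3.4 can be decomposed with a single regular expression. A minimal sketch, assuming entries exactly in the rel(nmr, Head, [POS, WNSense], Modifier, [POS, WNSense], Relation) shape described in the footnote:

```python
import re

# One entry: rel(nmr, Head, [POS, WNSense], Modifier, [POS, WNSense], Relation)
ENTRY = re.compile(
    r"rel\(nmr,\s*(\w+),\s*\[(\w+),\s*(\d+)\],"
    r"\s*(\w+),\s*\[(\w+),\s*(\d+)\],\s*(\w+)\)"
)

def parse_entry(line):
    """Parse one Prolog-style entry of the noun-modifier dataset."""
    head, h_pos, h_sense, mod, m_pos, m_sense, rel = ENTRY.match(line).groups()
    return {"head": (head, h_pos, int(h_sense)),
            "modifier": (mod, m_pos, int(m_sense)),
            "relation": rel}

print(parse_entry("rel(nmr, wrinkle, [n, 1], laugh, [n, 2], eff)"))
```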


Table 3.5: Statistics for Kim and Baldwin’s [2005] noun compounds dataset (N1 = modifier, N2 = head)

Relation      Definition                  Example               # test / # training
agent         N2 is performed by N1       student protest       10 / 5
beneficiary   N1 benefits from N2         student price         10 / 7
cause         N1 causes N2                exam anxiety          54 / 74
container     N1 contains N2              printer tray          13 / 19
content       N1 is contained in N2       paper tray            40 / 34
destination   N1 is destination of N2     exit route            2 / 2
equative      N1 is also head             payer coach           9 / 17
instrument    N1 is used in N2            electron microscope   6 / 11
located       N1 is located at N2         home town             12 / 16
location      N1 is the location of N2    lab printer           29 / 24
material      N2 is made of N1            gingerbread man       12 / 15
object        N1 is acted on by N2        engine repair         88 / 88
possessor     N1 has N2                   student loan          32 / 22
product       N1 is a product of N2       automobile factory    27 / 32
property      N2 is like N1               elephant seal         76 / 85
purpose       N2 is meant for N1          concert hall          160 / 160
result        N1 is the result of N2      cold virus            7 / 8
source        N1 is the source of N2      chest pain            86 / 99
time          N1 is the time of N2        winter semester       26 / 19
topic         N2 is concerned with N1     safety standard       465 / 446

Girju et al. [2004] annotated 4,504 noun-noun compounds, each with one of 22 relations, in sentences extracted from the Wall Street Journal and the glosses of eXtended WordNet (XWN). The nouns were also annotated with WordNet senses.⁸

Kim and Baldwin [2005] gathered 2,169 noun-noun compounds (split into 1,088 for training and 1,081 for testing) from the Wall Street Journal section of the Penn Treebank. Those were only common nouns, and were not part of longer compounds. The annotations used the 20 relations presented in Table 3.5. Unlike other datasets, where annotations are disjoint, Kim and Baldwin [2005] allowed multiple annotations: 94 compounds in the training dataset and 81 in the test dataset were annotated with more than one semantic relation.

Ó Séaghdha and Copestake [2007] collected 1,568 noun compounds from the British National Corpus⁹ and annotated them with one of the 8 relations presented in Section 2.2.1, derived from Levi’s list of recoverable deletable predicates, plus the relation unknown. Table 3.6 shows the data statistics.

8 The dataset itself appears to be missing; the project is noted on the Web at 〈apfel.ai.uiuc.edu/semantic_relations.html〉.

9 www.natcorp.ox.ac.uk/


Table 3.6: Statistics for Ó Séaghdha and Copestake’s [2007] noun compounds dataset

Relation   Example              Distribution
be         steel knife          191
have       street name          199
in         forest hut           308
inst       rice cooker          266
actor      honey bee            236
about      fairy tale           243
rel        camera gear          81
lex        home secretary       35
unknown    similarity crystal   9

Tratz and Hovy [2010] collected and annotated 17,509 noun compounds consisting of common nouns from a large corpus and the Wall Street Journal. Details of the relations used are presented in Table 2.6. The most frequent relations in this dataset are perform/engage in (13.24%), create/provide/sell (8.94%), topic of communication/imagery/info (8.37%), location/geographic scope of (4.99%), and organize/supervise/authority (4.82%).

3.1.4 RELATIONS IN MANUALLY BUILT ONTOLOGIES

The early attempts to build ontologies for use in NLP and AI relied on manual contributions of a small number of experts. Such reliance limits the size and range of the resource, and may affect its availability. Even so, ontologies offer a good deal of useful information, including many relation instances for various relation types. WordNet has been the most popular resource in NLP, surpassing all other resources by a large margin; however, there are others, with more varied relations, e.g., Cyc.¹⁰

Section 2.2.2 briefly presented the relations in WordNet and Cyc.

Manually built resources tend to contain data of high quality, and usually in larger quantity than data annotated specifically for relation-learning purposes, but often outside a textual context. On the other hand, learning from such good data would be restricted to the relations present in the ontology. WordNet’s equivalent of the is-a relation, hypernymy/hyponymy, is often used as initial seeds or as training data for relation extraction.

WordNet and Cyc are based on different principles, and thus organize their taxonomies differently. WordNet was developed as a network of synsets (groups of near-synonymous words), linked by several fundamental lexical-semantic relations. Cyc aims to organize common-sense knowledge; thus it is the concepts, and the way in which they are defined, that inform the construction of the ontology. In particular, Cyc’s concepts, defined intensionally, lead to a “property-driven” organization. Cyc’s top thing collection (concept) is partitioned into temporal thing vs. atemporal thing, spatial thing vs. aspatial thing, intangible vs. partially tangible, and set or collection vs. individual. WordNet’s top node, entity, has subclasses physical entity, abstraction, thing; abstraction is subdivided into attribute, psychological feature, group, relation, and so on. Somewhat optimistically, each ontology may offer its particular advantages to the task of learning semantic relations between nominals: generalization can run along different dimensions, and so exemplify the refinement of is-a relations discussed by Wierzbicka [1984].

10 Cyc and its public-domain version OpenCyc are very rarely used in NLP.

3.1.5 RELATIONS FROM COLLABORATIVELY BUILT RESOURCES

The infeasibility of building ontologies or data repositories on a very large scale has sometimes been referred to as the data acquisition bottleneck. Once the Web began to be regarded as a platform for facilitating online collaborative projects, this became much less of an issue. Knowledge sharing, in a more or less structured format, has led to the construction of large-scale knowledge sources. Two in particular, Wikipedia and Freebase, contain data of much interest to the analysis of semantic relations. Although they have not yet been used in the type of learning described in this chapter, but rather in semi-supervised settings, we include them in this section of the book, and reconsider their particular characteristics in Chapter 4.

Wikipedia infoboxes

Wikipedia, the online encyclopedia, contains a wealth of information. Some of that information has a fixed structure and thus is relatively easy to obtain. For the analysis of semantic relations, we find the information captured in infoboxes especially interesting and useful. As can be seen from the example in Figure 3.5, an infobox contains attributes of the entity described in the surrounding Wikipedia article, and the values of those attributes, some of which are entities themselves. We can regard the attributes as relations between the entity described by the article (that entity’s name is the article’s title) and those referenced in the infobox.

The attributes and their values included in the infobox are selected by the contributors. To make the content more consistent, infobox templates have been instituted. The contributors are encouraged to use them, which is meant to ensure (up to a point) that the attributes are consistent across multiple articles. For example, the birth date attribute is called birth_date across all articles using an infobox template referring to a person. There is still variation, but one can expect that in time schemata will converge.

There are approximately 195 infobox templates for persons alone, which share some attributes (e.g., birth_date, birth_place, nationality) but not all. For example, the motorcycle rider infobox contains attributes such as current_team and bike_number, specific to this template. Not every attribute value corresponds to a Wikipedia entry (customarily considered a concept). Some values are simply data, such as population size. The remaining “true” relations still constitute a massive dataset. A 2007 version of DBpedia [Auer et al., 2007] contained 15 million relation instances extracted from English Wikipedia infoboxes. Because the datasets are so large, in the discussion of specific projects we will note the use of subsets, often focused on a small set of specific relations. We will show examples of such work in Chapter 4.
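To give a flavor of how such relation instances are harvested, the sketch below pulls attribute-value pairs out of a raw infobox template using regular expressions. The template and field names in the sample are invented for illustration, and real infoboxes contain nested templates and markup that this simplification ignores:

```python
import re

def infobox_attributes(wikitext):
    """Extract attribute/value pairs from an infobox template body.
    A simplification: real infoboxes nest templates, which a robust
    parser must handle."""
    attrs = {}
    for m in re.finditer(r"^\s*\|\s*([\w_]+)\s*=\s*(.+)$", wikitext, re.M):
        key, value = m.group(1), m.group(2).strip()
        # Strip [[wiki links]], keeping the link target as a candidate entity.
        value = re.sub(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", r"\1", value)
        attrs[key] = value
    return attrs

# Hypothetical infobox wikitext, loosely modeled on Figure 3.5.
sample = """{{Infobox character
| creator    = [[George Lucas]]
| portrayer  = [[Harrison Ford]]
| first      = ''[[Raiders of the Lost Ark]]'' (1981)
}}"""
print(infobox_attributes(sample))
```

Each extracted pair can then be read as a relation instance between the article’s entity and the attribute value, e.g., creator(Indiana Jones, George Lucas).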

[Figure: a portion of the Wikipedia infobox for the character Indiana Jones (Henry Walton “Indiana” Jones, Jr.), with attributes such as Created by (George Lucas, Steven Spielberg), Portrayed by (Harrison Ford in the films; Corey Carrier, Sean Patrick Flanery and George Hall in the TV series; voice actors in the video games), and First appearance (Raiders of the Lost Ark, 1981).]

Figure 3.5: Sample of the infobox for the article Dr. Henry “Indiana” Jones

The fact that these relation instances are so easily available, and in many languages, makes them very attractive to the NLP community. Because they appear in an article which also includes the infobox information in textual form, the link between the two can be established, and then the two sources of information can be used together. This kind of data is used successfully in seeding open information extraction. We will explore this further, especially in Chapter 4.

The various types of information and relations in Wikipedia have been the source of several other machine-readable repositories. We briefly describe some of them below.

DBpedia [Auer et al., 2007] converts Wikipedia’s relational database tables and the structured information in infoboxes into structured knowledge, and connects it to a variety of resources, including Freebase, WordNet and OpenCyc. Some of DBpedia’s three million entities are connected to a manually-built ontology (based on an analysis of infobox templates) consisting of 170 classes.

YAGO [Weikum et al., 2009] extracts article and category links from Wikipedia, as well as facts which represent relations of 100 types, from categories which provide relational information (1975 births), or categories which start with Countries in . . ., Rivers of . . ., and so on. Those categories provide instances of such relations as bornInYear, locatedIn, politicianOf and so on. The core YAGO contains over a hundred million facts.

WikiNet [Nastase and Strube, 2013] transforms Wikipedia into a multilingual concept network by exploiting category names, structural links and infoboxes. The network uses approximately 450 relations, which arise out of infoboxes or such category names as, e.g., created by from the category


Characters created by Joss Whedon, or member from the category Queen (band) members. It contains approximately 49 million relation instances.

Freebase

Freebase is a large collaborative online knowledge base, consisting of facts gathered from various sources, including contributions over the Web. It consists of topics of various types connected by properties. Properties are close to what we call here relations, and they are often expressed by a verb or a verb phrase, but also by relational nouns. According to the Freebase wiki,¹¹

    Properties of a topic define a HAS A relationship between the topic and the value of the property, e.g., Paris {topic} has a population {property} of 2153600 {value}. In Freebase the value of a property can also be another topic, e.g., Apocalypse Now {topic} has a director {property} Francis Ford Coppola {value}.

As of August 2012, Freebase contains more than 418 million facts, connecting almost 24 million topics with more than 8,000 relation types (that is to say, properties).

3.1.6 DOMAIN-SPECIFIC DATA

One of the important applications of NLP technology has been the development of methods and tools for extracting information from academic scientific literature. In particular, publications in the field of biomedicine are appearing at a rate impossible for human readers to keep up with. Researchers in this field have found great promise in deploying text mining tools to automate or semi-automate tasks such as curating databases of existing knowledge [Cohen, 2010, Zweigenbaum et al., 2007].

Table 3.7: Semantic relations between treatments and diseases, proposed by Rosario and Hearst [2004]

Relation         # examples   Example
Cure             810          Intravenous immune globulin for recurrent spontaneous abortion
Only DISEASE     616          Social ties and susceptibility to the common cold
Only TREATMENT   166          Flucticasone propionate is safe in recommended doses
Prevent          63           Statins for prevention of stroke
Vague            36           Phenylbutazone and leukemia
Side Effect      29           Malignant mesodermal mixed tumor of the uterus following irradiation
No Cure          4            Evidence for double resistance to permethrin and malathion in head lice
Irrelevant       1771         Patients were followed up for 6 months

11 wiki.freebase.com/wiki/Property, August 2012


A popular task, that of identifying protein-protein interactions (PPIs) in the abstracts or the full text of biomedical publications, is conceptually similar to the standard relation classification task, and is generally amenable to similar approaches [Hachey et al., 2011]. A variety of PPI classification datasets have been constructed for different task designs; see [Pyysalo et al., 2008] for an overview. In some cases, it is necessary to classify interaction types, while in others the task is simply to make a binary decision: is there an interaction? For example, the following sentence from the PPI corpus of Nédellec [2005] expresses interactions between SigK and ykvP and between GerE and ykvP, but none between SigK and GerE:

    Both SigK and GerE were essential for ykvP expression, and this gene was transcribed from T5 of sporulation.

In the popular BioCreative¹² PPI evaluation [Krallinger et al., 2008], the goal is to extract pairs of interacting proteins, where the context to be considered for detecting PPIs is as large as a complete scientific article. Only a binary yes/no decision is required for this dataset.

The setup of the PPI task follows the general trend in protein-protein datasets (such as MINT and BIND)¹³ not to provide the type of protein-protein interaction. Still, some biomedical datasets do give such information. For example, the manually-curated HIV-1 Human Protein Interaction Database uses an inventory of 65 possible protein-protein interactions, including relations like Binds, Inhibits, and Upregulates. The ten most frequent among these relations were used for relation extraction from text [Rosario and Hearst, 2005].

Of considerable interest in the biomedical domain have also been relations between genes and diseases, relations between treatments and diseases, and relations which deal with subcellular locations. Table 3.7 shows the statistics of a dataset of 3,495 relation instances of 8 subtypes of the more general Treatment-for-Disease relation, from [Rosario and Hearst, 2004].

There have been some attempts to design and use larger and more general relation inventories. For example, Rosario and Hearst [2001] annotated a set of 1,551 biomedical noun-noun compounds with 18 relations; the data statistics are shown in Table 3.8. Their initial relation inventory consisted of 38 relations, but the remaining 20 relations had too few instances to be useful in supervised learning.¹⁴

3.2 FEATURES

The job of selecting features to represent relation instances adequately is an essential element of the task of learning semantic relations between nominals. In a typical incarnation, a relation instance appears in a sentence annotated with the two nominals marked, and we need a mapping from such sentences onto feature vectors. A larger context, perhaps a paragraph, may be called for, or an instance may just be the pair of nominals to be identified in texts automatically. In any event, a feature

12 biocreative.sourceforge.net/
13 mint.bio.uniroma2.it/mint/download.do and bind.ca/
14 The datasets can be found at biotext.berkeley.edu/data.html, along with other useful resources.


Table 3.8: Semantic relations holding between the nouns in biomedical noun-noun compounds, proposed by Rosario and Hearst [2001]

Relation                      # examples   Example
Subtype                       393          headaches migraine
Activity/Physical_process     59           virus reproduction
Produce_genetically           47           polyomavirus genome
Cause                         116          heat shock
Characteristic                33           drug toxicity
Defect                        52           hormone deficiency
Person_afflicted              55           AIDS patient
Attribute_of_clinical_study   77           headache parameter
Procedure                     60           genotype diagnosis
Frequency/time_of             25           influenza season
Measure_of                    54           relief rate
Instrument                    121          laser irradiation
Object                        30           bowel transplantation
Purpose                       61           headache drugs
Topic                         38           headache questionnaire
Location                      145          brain artery
Material                      28           aloe gel
Defect_in_location            157          lung abscess

vector must contain enough discriminative information to make the relation classification problem at hand feasible. As always in supervised learning, the learning algorithm builds (trains) a model from annotated examples. The model’s predictive power is then tested on new, previously unseen instances. Simply put, the process establishes similarities between relation instances, or between a relation instance and a relation model, as described by the features.

Many factors can affect the similarity between instances of semantic relations. Figure 3.6 illustrates two extreme cases. Here, Barack Obama and François Hollande are similar since they are both presidents (also, politicians), while USA and France are similar because they are both countries (also, presidential republics). Given the similarity of the corresponding relation arguments, it is reasonable to guess that Barack Obama–USA and François Hollande–France should be instances of the same relation (namely President_of). This is very different from the case of feline–tiger and vehicle–motorcycle, where there is hardly any similarity between feline and vehicle or between tiger and motorcycle (although one could argue for some highly metaphorical similarity between the latter two). Even so, they share the same relation, hypernymy (or is-a), because a tiger is a kind of feline and a motorcycle is a kind of vehicle.

More generally, let us consider two pairs of relation arguments, (A, B) and (C, D). There are two kinds of reasons why such pairs may be instances of the same relation. First, they may have high


[Figure: schematic contrast of two kinds of similarity between argument pairs (A, B) and (C, D). Entity similarity: Barack Obama–USA and François Hollande–France, both instances of president_of. Relation similarity: feline–tiger and vehicle–motorcycle, both instances of is_a.]

Figure 3.6: Different types of similarities between instances of the same relation.

attributional similarity [Turney, 2006a]: arguments A and C are similar, and so are B and D. For example, in the ACE 2008 data, the arguments of a relation must belong to pre-specified semantic classes (e.g., Person, Organization). This already guarantees some degree of similarity, even if not enough to assign the relation based on the argument type alone. Second, the two instances may have high relational similarity [Turney, 2006a]. That happens when the relation between A and B parallels the relation between C and D, perhaps because they are used in similar contexts. At the same time, A and C have nothing or very little in common, and the same goes for B and D. Naturally, both relational and attributional similarity can contribute to the similarity of two relation instances. We will explore features modeling both types of similarity.

3.2.1 ENTITY FEATURES

Entity features capture some representation of the semantics of the arguments of a relation instance. Many semantic representations are possible, and this is reflected in the various proposals of feature sets.

Basic entity features include the string value of each candidate argument, as well as the individual words making up each of the arguments, possibly lemmatized or stemmed. In many cases, such features are informative enough for a good relation assignment, but unfortunately they also tend to be sparse.

Background entity features augment the basic features with syntactic or semantic information, possibly from a task-specific inventory (such as the ACE entity types) or from a general lexical resource such as WordNet. They can also be learned in the form of automatically acquired word


clusters. Using such clusters helps combat data sparseness by incorporating knowledge about the typical values of the arguments of the target semantic relation. Even if a test-time candidate argument has not been seen in the training dataset, a classifier would still be able to make an informed prediction if it knows that the argument is semantically similar to seen arguments for the same argument slot of the target semantic relation. Word clusters can be induced from word co-occurrences in a large text corpus, and they easily allow tailoring to different domains and levels of granularity. For example, the Brown clustering algorithm [Brown et al., 1992] assigns each word in a training corpus to a position in a hierarchy of words, so that any two words can be compared by checking whether they match at the first level of the hierarchy, or at the second level, and so on. Using Brown cluster features has been shown to help in many NLP tasks, including ACE-style relation extraction [Chan and Roth, 2010, Sun et al., 2011]. There are many other clustering algorithms which could also produce features, including Clustering by Committee [Pantel and Lin, 2002] and Latent Dirichlet Allocation [Blei et al., 2003].
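Because Brown clustering places words at the leaves of a binary hierarchy, each word can be encoded as a bit string, and cluster membership at different levels of the hierarchy corresponds to bit-string prefixes of different lengths. The sketch below illustrates this standard prefix-feature trick; the bit strings are invented for illustration, as in practice they come from running Brown clustering on a large corpus:

```python
# Hypothetical Brown-cluster bit strings; a real table is produced by
# running Brown clustering on a large corpus.
BROWN_PATHS = {
    "cat":    "0010110100",
    "dog":    "0010110101",
    "tiger":  "0010110110",
    "paris":  "1101001000",
    "london": "1101001001",
}

def cluster_features(word, prefix_lengths=(4, 6, 10)):
    """Emit one feature per bit-string prefix; shorter prefixes give
    coarser, better-populated clusters."""
    path = BROWN_PATHS.get(word)
    if path is None:
        return []
    return [f"brown[:{k}]={path[:k]}" for k in prefix_lengths]

# cat and dog share the coarse-grained cluster features,
# but not the full 10-bit path.
print(cluster_features("cat"))
print(cluster_features("dog"))
```

A classifier that has seen dog as an argument can thus generalize to cat through the shared coarse prefixes, even if cat itself never occurred in training.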

A complementary alternative to clusters learned from co-occurrence data is a direct representation of co-occurrences in the feature space. This allows the supervised relation classification system to identify which aspects of co-occurrence behavior are most relevant. Ó Séaghdha and Copestake [2008] observed that features based on coordinations (and, or) can yield state-of-the-art performance on SemEval-2007 Task 4, relation classification. The idea is that similar nouns tend to coordinate with other similar nouns; e.g., both dog and cat will often appear in coordinating constructions with other animals, especially pets, and hence they will look similar in the feature space.

More generally, a distributional representation can capture the meaning of a word by aggregating all its interactions as found in a large collection of texts. Such descriptions can be either unstructured or grouped by grammatical relations. Table 3.9 shows an example of a grammatically-informed distributional representation for the noun paper, obtained from the British National Corpus,¹⁵ parsed with the RASP system [Briscoe et al., 2006]. The representation includes semantically associated nouns such as pen, ink and cardboard obtained through coordination (and/or). We can see what a paper can do, e.g., propose and say; what one can do with a paper, e.g., read, publish; some typical adjectival modifiers such as white and recycled; some noun modifiers such as toilet and consultation; nouns connected to the target via various prepositions, e.g., paper on environment, paper for meeting, paper with a title, and so on. Note that this representation mixes different senses of paper, so for example ink refers to the medium for writing, while propose refers to writing/publication/document.
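A distributional profile such as the one in Table 3.9 can be built by counting (syntactic relation, co-occurring word) pairs from parser output, after which two nouns can be compared with any vector similarity measure. A toy sketch, with a handful of hand-made triples standing in for real parsed corpus data:

```python
from collections import Counter
from math import sqrt

def profile(triples, target):
    """Aggregate (word, syntactic_relation, co-occurring_word) triples
    into a distributional vector for the target word."""
    vec = Counter()
    for word, rel, other in triples:
        if word == target:
            vec[(rel, other)] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * \
           sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy triples standing in for parser output over a large corpus.
triples = [
    ("paper", "object_of", "publish"), ("paper", "object_of", "read"),
    ("paper", "coordination", "pen"),
    ("article", "object_of", "publish"), ("article", "object_of", "read"),
    ("bag", "modified_by_n", "paper"),
]
print(cosine(profile(triples, "paper"), profile(triples, "article")))
```

With realistic counts, weighting schemes such as pointwise mutual information are usually applied to the raw frequencies before computing similarity.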

A relational semantic representation, based on relations in a semantic network or in a formal ontology, is another way of finding semantic features which describe a relation’s arguments. An important advantage of using semantic networks is that they are typically organized around word senses rather than words. So, in a semantic network there would be one entry for bank as a “river bank,” and another for “bank as a financial institution” (and possibly other senses, e.g., “bank as a building”). This is in contrast with distributional representations, which tend to ignore distinctions between different word senses, assuming that such distinctions can be handled by the learning

15 http://www.natcorp.ox.ac.uk/


Table 3.9: Distributional profile of the noun paper in the British National Corpus

Word      Syntactic relation   Co-occurring words
paper-n   coordination         pen-n:69, pencil-n:51, paper-n:32, glass-n:22, ink-n:20, …
          subject_of           say-v:86, make-v:39, propose-v:39, describe-v:31, set-v:30, …
          object_of            publish-v:147, read-v:129, use-v:78, write-v:62, take-v:43, …
          modified_by_adj      white-j:923, local-j:159, green-j:63, brown-j:56, non-stick-j:71, …
          modified_by_n        consultation-n:117, government-n:94, discussion-n:84, tissue-n:71, blotting-n:59, …
          modifies_n           bag-n:150, money-n:44, cup-n:37, mill-n:36, work-n:34, …
          pp_with              number-n:6, address-n:3, title-n:2, note-n:2, word-n:2, …
          pp_on                reform-n:13, future-n:13, policy-n:10, environment-n:9, subject-n:8, …
          …

algorithm when necessary. In order to access this type of representation, however, one must be able to link the terms which express a relation’s arguments with the appropriate nodes in the network; that is to say, one needs word-sense disambiguation (WSD)¹⁶ to pick the contextually correct sense for each argument. Provided that WSD has been performed, or that the representation will merge and use all possible senses, the related concepts and their relations from the resource, as well as a concept’s ancestors, can be extracted as entity features. Features of this kind have been used in [Girju et al., 2004, Nastase and Szpakowicz, 2003, Nastase et al., 2006] for general concepts, and in [Rosario and Hearst, 2001] for domain-specific concepts.
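Once arguments are linked to nodes in a semantic network, ancestor concepts can be emitted directly as entity features. The sketch below uses a tiny hand-built is-a hierarchy in place of WordNet; a real system would first disambiguate each word to a synset and then follow hypernym links in the resource:

```python
# Toy is-a hierarchy standing in for WordNet's hypernym structure.
ISA = {
    "tiger": "feline", "feline": "carnivore", "carnivore": "mammal",
    "mammal": "animal", "animal": "entity",
    "motorcycle": "vehicle", "vehicle": "artifact", "artifact": "entity",
}

def ancestor_features(concept):
    """Walk up the hierarchy, emitting one feature per ancestor."""
    features = []
    while concept in ISA:
        concept = ISA[concept]
        features.append(f"isa={concept}")
    return features

print(ancestor_features("tiger"))
```

Shared high-level ancestors (mammal, animal) let a classifier generalize from tiger to other animals it has never seen in an argument slot, at the price of requiring correct sense linking.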

3.2.2 RELATIONAL FEATURES

Relational features characterize the relation directly, e.g., by modeling the text which serves as the context for instances. In principle, one could consider the entire document where the candidate arguments appear, but such a broad context is likely to include much irrelevant and spurious information and thus mislead the classifier. Therefore, it is usually assumed that only part of the context, e.g., in the vicinity of the candidate argument mentions, is relevant to the task of deciding whether a given semantic relation holds between them. One popular option is to take the words between the two arguments, and perhaps also the words from a fixed window on either side of the arguments, as

16 See [Ide and Véronis, 1998] for an early survey, and [Navigli, 2009] for an up-to-date overview of WSD.

Page 56: Semantic Relations Between Nominals

42 3. EXTRACTING SEMANTIC RELATIONS WITH SUPERVISION

a basic representation of context. Popular alternative representations include a dependency path from one mention to the other; a subgraph of the syntactic graph produced by a dependency parser from the context sentence, or the whole dependency graph; or the smallest dominating subtree in a constituent parse tree. Figure 3.7 illustrates some of these commonly used context representations.

[Figure 3.7 shows four representations of the context Google bought YouTube in 2006: the bag of words {2006, bought, Google, in, YouTube}; the word sequence (Google, bought, YouTube, in, 2006); a dependency path and full dependency graph with edges nsubj, dobj, prep and pobj; and a constituent parse tree.]

Figure 3.7: Context representations for the sentence Google bought YouTube in 2006.

Extracting relational features from the context typically involves enumerating substructures of the context representation. The exact nature of these substructures depends on the representation chosen: a bag of words naturally decomposes into words, a word sequence decomposes into n-grams, a parse tree decomposes into subtrees, and so on. For example, from the dependency parse tree of the sentence Google bought YouTube in 2006 (Figure 3.7), a system could extract a set of dependency triples {(bought →nsubj Google), (bought →dobj YouTube), (in →pobj 2006), (bought →prep in)}, or more complex paths such as (Google ←nsubj bought →dobj YouTube), or even the entire dependency tree, though larger structures will tend to be sparse and hence less useful in learning.
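The feature extraction just described can be sketched in a few lines. This is an illustrative implementation, not the one used by the systems cited above: the parse of the example sentence is hard-coded as (head, label, dependent) triples, where in practice a dependency parser would supply them.

```python
from collections import deque

# Dependency parse of "Google bought YouTube in 2006", hard-coded for
# illustration as (head, label, dependent) triples.
TRIPLES = [
    ("bought", "nsubj", "Google"),
    ("bought", "dobj", "YouTube"),
    ("bought", "prep", "in"),
    ("in", "pobj", "2006"),
]

def triple_features(triples):
    """Each dependency triple becomes one relational feature."""
    return {f"{h}->{lab}->{d}" for h, lab, d in triples}

def dependency_path(triples, src, dst):
    """Shortest path between two argument mentions, ignoring edge direction."""
    adj = {}
    for h, lab, d in triples:
        adj.setdefault(h, []).append((d, f"->{lab}->"))
        adj.setdefault(d, []).append((h, f"<-{lab}<-"))
    queue, seen = deque([(src, [src])]), {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return "".join(path)
        for nxt, edge in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [edge, nxt]))
    return None

print(sorted(triple_features(TRIPLES)))
print(dependency_path(TRIPLES, "Google", "YouTube"))
```

The path returned for the pair (Google, YouTube) is the (Google ←nsubj bought →dobj YouTube) feature mentioned above.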


The number of subtrees or subgraphs can be exponential in the depth of the structure, so it may be necessary to restrict the features to substructures up to a certain maximal size, or to count only substructures which are complete in some way. It may also be beneficial to consider multiple representations, to exploit the complementary strengths of syntax-based and adjacency-based features or of different styles of parsing. One can even adopt a hybrid representation, with syntax and sequential order both exhibited in a shared graph, and hope to capture discriminative features which combine syntactic and sequential information [Jiang and Zhai, 2007].

Background relational features attempt to encode knowledge about how entities typically interact in texts beyond the immediate context. Again, when one applies clustering, the feature space is expanded so that the classifier can leverage training examples involving similar but non-identical contexts [Wang et al., 2011]. Assume, for example, that CEO and company are often mentioned together in contexts which indicate an employed-by relation. A new context containing CEO and company is also likely to describe this relation, even when the context is rather vague (e.g., the company's CEO).

Given a pair of nominals (N1, N2), such features can be extracted by querying a Web search engine with suitable patterns. For example, Nakov and Hearst [2007] used exact-phrase queries like "N1 that * N2", "N2 that * N1", "N1 * N2" and "N2 * N1", and then extracted the verbs, prepositions and coordinating conjunctions which connected N1 and N2. Table 3.10 shows an example: V stands for verb (possibly +preposition and/or +particle), P for preposition and C for coordinating conjunction; 1 → 2 means committee precedes the feature and member follows it; 2 → 1 means member precedes the feature and committee follows it. Such features can also be extracted from the Google Web 1T Corpus, without using a search engine [Butnariu and Veale, 2008].
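The core of this idea, finding what links the two nominals and in which direction, can be sketched over a handful of made-up snippets standing in for Web search results. The snippets, the gap limit and the feature format are all illustrative choices, not details of the cited systems.

```python
import re

# Toy stand-ins for Web snippets returned for queries about the pair
# (committee, member).
SNIPPETS = [
    "the committee includes a member from each province",
    "a member who serves on the committee",
    "the committee of which he is a member",
]

def linking_features(snippets, n1, n2, max_gap=5):
    """Collect the word material between the two nominals, with direction:
    '1->2' means n1 precedes the feature and n2 follows it."""
    gap = r"((?:\w+\s+){0,%d}?)" % max_gap
    features = []
    for text in snippets:
        for first, second, direction in ((n1, n2, "1->2"), (n2, n1, "2->1")):
            m = re.search(rf"\b{first}\b\s+{gap}\b{second}\b", text)
            if m and m.group(1).strip():
                features.append((m.group(1).strip(), direction))
    return features

print(linking_features(SNIPPETS, "committee", "member"))
```

A real system would then keep only the verbs, prepositions and conjunctions from the extracted material, using a part-of-speech tagger.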

Background relational features do not have to be restricted to any part of speech: just about any pattern is fine, even patterns containing placeholders. For example, Turney [2006b] mines from the Web patterns like "Y * causes X" for Cause (e.g., cold virus) and "Y in * early X" for Temporal (e.g., morning frost).

Finally, background relational features can be distributional: a target relation is represented

using a vector with coordinates corresponding to fixed patterns. For example, Turney and Littman [2005] characterize the relationship between two words as a vector with coordinates corresponding to the Web frequencies of 128 fixed phrases like "X for Y" and "Y for X" instantiated from a fixed set of 64 joining terms such as for, such as, not the, is *.17 These vectors can be used for classification directly, or can be further mapped to vectors of lower dimensionality, e.g., using Singular Value Decomposition [Turney, 2006a].
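The construction of such a pattern-frequency vector can be sketched as follows. The joining terms and the frequency table are toy stand-ins for the 64 terms and Web hit counts of the original work; each term yields two coordinates, one per argument order.

```python
# Toy subset of joining terms; Turney and Littman [2005] used 64 of them,
# giving a 128-dimensional vector.
JOINING_TERMS = ["for", "such as", "in the", "of"]

def pattern_vector(freq, x, y):
    """One coordinate per instantiated phrase: 'X term Y' and 'Y term X'."""
    phrases = [f"{x} {t} {y}" for t in JOINING_TERMS] + \
              [f"{y} {t} {x}" for t in JOINING_TERMS]
    return [freq.get(p, 0) for p in phrases]

# Hypothetical phrase frequencies, standing in for Web hit counts.
FREQ = {"traffic in the city": 420, "city of traffic": 3, "traffic for city": 7}
vec = pattern_vector(FREQ, "traffic", "city")
print(vec)  # 2 * len(JOINING_TERMS) coordinates
```

The resulting vectors could then be compared directly, or reduced with SVD as described above.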

3.3 LEARNING METHODS

This section discusses supervised methods of relation classification. Semi-supervised, unsupervised and self-learning methods will be described in Chapter 4. We first refresh the reader's memory on

17 Note that the features in Table 3.10 are also distributional, but their dimensionality is not predetermined.


Table 3.10: Paraphrasing verbs, prepositions and coordinating conjunctions linking committee and member, extracted from the Web [Nakov and Hearst, 2007]

Freq.  Pattern          POS  Direction
2205   of               P    2 → 1
1923   be               V    1 → 2
 771   include          V    1 → 2
 382   serve on         V    2 → 1
 189   chair            V    2 → 1
 189   have             V    1 → 2
 169   consist of       V    1 → 2
 148   comprise         V    1 → 2
 106   sit on           V    2 → 1
  81   be chaired by    V    1 → 2
  78   appoint          V    1 → 2
  77   on               P    2 → 1
  66   and              C    1 → 2
  66   be elected       V    1 → 2
  58   replace          V    1 → 2
  48   lead             V    2 → 1
  47   be intended for  V    1 → 2
  45   join             V    2 → 1
 ...   ...              ...  ...

general supervised machine learning and then talk in more detail about the learning methods most successfully used for learning semantic relations.

3.3.1 SUPERVISED MACHINE LEARNING

A very robust and generally applicable approach to the task of semantic relation classification is to use supervised classifiers. In effect, such classifiers learn from labeled training data how to weigh up the evidence for a given target relation mention by evaluating the contributions of multiple features.

Given training data (X, Y) in the form of a set of d-dimensional feature vectors xi = ⟨xi1, xi2, …, xid⟩ paired with binary labels yi, a supervised classification model will learn a decision function f which predicts a label for a feature vector x′. In principle, the decision function f can have an arbitrary mathematical form, but in practice each learning algorithm imposes a particular


functional form. One standard form is that of a linear decision function:

f(x′) = ∑_{j=1}^{d} w_j x′_j = ⟨w, x′⟩    (3.1)

The classification model is parameterized by a vector of real-valued weights w ∈ ℝ^d, which is learned from the training data. As Equation 3.1 shows, the linear decision function is equivalent to taking the inner product (dot product) of the weight vector and the feature vector x′, and its value is a real number. When the task is to predict a binary label y′ ∈ {−1, 1}, we can apply the sign function to f(x′). This is equivalent to checking on which side of the hyperplane defined by w the point x′ lies. Where the label set is larger, a standard approach is to decompose the problem into multiple binary classification tasks [Hsu and Lin, 2002]. Popular examples of linear classifiers include the perceptron and the linear-kernel Support Vector Machine (SVM). A related form of classification function is the log-linear model, which directly estimates the probability that a datapoint x′ has a label y′:

p(y′ | x′) = exp(∑_{j=1}^{d} w_{y′j} x′_j) / ∑_{y′′ ∈ Y} exp(∑_{j=1}^{d} w_{y′′j} x′_j)    (3.2)

Log-linear classifiers are also known as logistic regression, maximum entropy or MaxEnt classifiers [Manning and Schütze, 1999]. Many other kinds of classifiers exist, including Naïve Bayes, decision trees and k-nearest-neighbours [Witten et al., 2011], but discussing the whole variety is beyond the scope of this book. In our experience, SVMs and logistic regression models are reliably good choices for most practitioners.
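The two classifier forms above can be sketched in a few lines: the linear decision function of Equation 3.1 thresholded by the sign function, and the log-linear probability of Equation 3.2. The weight values are invented for illustration, not learned from data.

```python
import math

def f_linear(w, x):
    """f(x') = <w, x'> (Equation 3.1)."""
    return sum(wj * xj for wj, xj in zip(w, x))

def predict(w, x):
    """Binary label: which side of the hyperplane defined by w the point lies on."""
    return 1 if f_linear(w, x) >= 0 else -1

def loglinear_prob(weights, x, label):
    """p(y'|x') as a softmax over per-label linear scores (Equation 3.2)."""
    scores = {y: f_linear(w, x) for y, w in weights.items()}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[label]) / z

w = [0.5, -1.0, 2.0]                       # hypothetical learned weights
print(f_linear(w, [1.0, 1.0, 1.0]))        # 1.5
print(predict(w, [1.0, 2.0, 0.0]))         # -1

weights = {"employed-by": [1.2, 0.3], "no-relation": [-0.4, 0.9]}
print(round(loglinear_prob(weights, [1.0, 1.0], "employed-by"), 3))
```

Note that the log-linear model keeps one weight vector per label, matching the w_{y′j} indexing in Equation 3.2.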

3.3.2 LEARNING ALGORITHMS

Early work on relation extraction from text was based on manually crafted rules, which could often achieve very high precision but suffered from low recall. A classic rule-based system at MUC was FASTUS [Hobbs et al., 1997]. It achieved high speed by means of cascaded, non-deterministic finite-state transducers. Such systems are beyond the scope of this book. From now on, we will focus on machine learning.

Supervised machine learning automatically constructs models to capture the essence of a phenomenon. The construction relies on the analysis of (usually many) instances of this phenomenon, represented by salient and relevant features. Computers outperform people by a huge margin when it comes to processing large amounts of data very quickly, so many regularities and correlations between the features and the targeted phenomenon can be found. Small sets of manually crafted rules can thus give way to automatically built models which are often more complex and more comprehensive. The user who wants to attack a problem using machine learning must find a good description of the problem instances by features, and choose an appropriate learning algorithm. We have already seen how the recognition and acquisition of semantic relations can be described by various entity features


or relational features. Such features represent data in a manner agreeable to many learning techniques, but depending on the characteristics of the features themselves and of the targeted phenomenon, some methods may be more appropriate or useful than others. In this section, we discuss in more detail two classes of learning methodologies: kernel methods and sequential labelling methods. We explain which characteristics make these methods particularly appropriate for the task of classifying semantic relations.

Classification with kernels

In a typical relational classification problem, we will want to combine heterogeneous information sources and also to extract information from structures such as strings and parse trees. For example, relational features (Section 3.2.2) capture the interaction between two entities by extracting a syntactic graph from a parse tree or a sequence of words from a sentence. In both cases, the representation has internal structure which would be ignored if we simply treated it as a single atomic feature. In particular, two syntactic representations of context can look quite different while sharing some substructure and some elements of meaning. At the same time, we may also wish to combine entity features (Section 3.2.1) with relational information. Kernel methods are a class of machine learning methods which elegantly handle both concerns, and they have been applied in many systems which learn semantic relations.

The fundamental idea in kernel methods is that the similarity of two instances can be computed in a high-dimensional feature space without enumerating the dimensions of that space. Consider again the case of learning to classify semantic relations using parse trees as the input representation. We can choose a feature mapping which expands a tree into a large set of subtrees. The calculation of the distance between two instances with different parse trees will then compare not only the trees but also the subtrees, thus uncovering "hidden" shared structure. Dynamic programming can help avoid the prohibitive computational expense of a naïve expansion. We will now outline how this can be incorporated in the learning method.

The similarity between instances is computed by a function known as the kernel function, or simply kernel. Not just any function k can be a kernel: it must be a positive semi-definite function, that is, it must correspond to an inner product in some vector space:

k(x, x′) = 〈φ(x), φ(x′)〉 (3.3)

The embedding function φ maps data items from the input space X to a feature space F in which the inner product is applied. X and F may be the same, in which case learning is similar to the standard feature-vector method. X and F may also be qualitatively different, as when X is not a vector space or F has infinite dimensionality. The feature space may capture interactions between features, as in the polynomial kernel k(x, x′) = (xᵀx′ + a)^d, where the dimensions of F correspond to degree-d polynomials of the features in X. A comprehensive overview of the variety of kernels and kernel-based algorithms can be found in [Shawe-Taylor and Cristianini, 2004].
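The correspondence between the kernel and its explicit feature map can be checked numerically for the degree-2 polynomial kernel. A valid feature map for (⟨x, x′⟩ + a)² contains all pairwise products x_i x_j, the original features scaled by √(2a), and the constant a; the sketch below (illustrative, not from the book) verifies that the kernel equals a dot product in that space.

```python
import itertools
import math

def poly2_kernel(x, xp, a=1.0):
    """Degree-2 polynomial kernel (<x, x'> + a)^2."""
    return (sum(xi * xj for xi, xj in zip(x, xp)) + a) ** 2

def poly2_features(x, a=1.0):
    """Explicit feature map phi(x) whose inner product equals the kernel:
    all ordered pairwise products, sqrt(2a)-scaled features, constant a."""
    pairs = [xi * xj for xi, xj in itertools.product(x, repeat=2)]
    return pairs + [math.sqrt(2 * a) * xi for xi in x] + [a]

x, xp = [1.0, 2.0], [3.0, -1.0]
lhs = poly2_kernel(x, xp)
rhs = sum(fi * fj for fi, fj in zip(poly2_features(x), poly2_features(xp)))
print(lhs, rhs)  # equal: the kernel is an inner product in feature space
```

For d features, the explicit map already has d² + d + 1 dimensions at degree 2; the kernel computes the same value with a single dot product, which is exactly the point of the "kernel trick" discussed next.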

It is a significant advantage of the kernel approach that it may be more practical to compute the kernel function k than the large (or even infinite) feature mappings φ(x). For example, the Gaussian


kernel k(x, x′) = exp(−γ ||x − x′||²) corresponds to an inner product in an infinite-dimensional feature space, with a straightforward calculation. This is sometimes known as the "kernel trick." A second advantage is that different inner product geometries may be swapped in without requiring a new learning algorithm. For example, improved performance can be attained on many NLP tasks by using kernels derived from well-established similarity measures, such as the Jensen-Shannon divergence, instead of the Euclidean dot product [Ó Séaghdha and Copestake, 2008]. Finally, the class of valid kernel functions is closed under addition and pointwise multiplication,18 making it easy to combine different information sources in a principled way.

The most popular classification algorithm for use with kernel functions is the Support Vector Machine (SVM) [Cortes and Vapnik, 1995]. SVM classification with a kernel k replaces the linear classification (Equation 3.1) with

f(x′) = ∑_{x∈T} α_x k(x, x′)    (3.4)

where T is the set of training examples and each training example x is associated with a weight α_x in training. Under the linear kernel embedding φ(x) = x, the relationship between these weights and the per-feature weights w learned by a standard linear classifier can be seen from the following:

f(x′) = ∑_{x∈T} α_x k(x, x′)
      = ∑_{x∈T} α_x ⟨x, x′⟩
      = ∑_{x∈T} α_x ∑_{j=1}^{d} x_j x′_j
      = ∑_{j=1}^{d} x′_j ∑_{x∈T} α_x x_j    (3.5)

Hence the corresponding feature weights are:

w_j = ∑_{x∈T} α_x x_j    (3.6)

For kernels with more complex feature mappings, e.g., the Gaussian kernel, the relationship is more opaque, but the intuition holds: the example weights learned by an SVM determine a set of feature weights in some feature space even when we do not represent that space explicitly. The optimization task defined by the SVM problem selects the set of weights which maximizes the margin: roughly speaking, the separating distance between the members of each class and the decision boundary. Maximizing the margin amounts to finding a distinction between classes such that the members of each class are as far away as possible from the border between them.
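The equivalence in Equations 3.5 and 3.6 can be verified directly for the linear kernel. The per-example weights below are invented for illustration, not produced by actual SVM training.

```python
def dot(u, v):
    """Plain inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical training examples paired with per-example weights alpha_x.
train = [([1.0, 0.0], 0.5), ([0.0, 2.0], -0.25), ([1.0, 1.0], 1.0)]

def f_dual(x_new):
    """Dual form of the decision function (Equation 3.4, linear kernel)."""
    return sum(alpha * dot(x, x_new) for x, alpha in train)

# Recover the explicit feature weights w_j = sum_x alpha_x x_j (Equation 3.6).
d = len(train[0][0])
w = [sum(alpha * x[j] for x, alpha in train) for j in range(d)]

x_new = [2.0, 3.0]
print(f_dual(x_new), dot(w, x_new))  # the two forms agree
```

With a non-linear kernel the recovered w would live in the implicit feature space and could not be written out this directly, which is the point made above.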

18 In pointwise multiplication of matrices, also known as the Hadamard product, (A • B)_{i,j} = A_{i,j} · B_{i,j}.


Kernel methods and SVM classification have become very popular tools for relation classification, in particular as a way of integrating information about the context of a candidate relation mention. As in feature-based classification, one can select from a variety of context representations. The choice of representation affects the nature of the kernel function used to map instances into the implicit feature space. The core intuition shared by all approaches is that two contexts are considered similar (or dissimilar) if they share many (or few) subparts. This intuition is formalized in the framework of convolution kernels [Haussler, 1999]:

k_R(x, x′) = ∑_{x_i ∈ R⁻¹(x)} ∑_{x′_j ∈ R⁻¹(x′)} k_0(x_i, x′_j)    (3.7)

where the function R⁻¹ decomposes each structure belonging to the set X into the multiset of its substructures, and k_0 is a function on substructures. The simplest choice of k_0 is a function which takes value 1 when two substructures are identical and 0 otherwise; this leads to a kernel k_R which counts the number of shared substructures. It is also possible to define a k_0 which gives fractional credit for partial matches. This definition is rather abstract, because it applies to many different classes of structures. In relation classification, the structures of interest are typically representations of context; as long as they are decomposable into substructures (e.g., trees decompose into subtrees), the convolution kernel method works.

The literature contains proposals for convolution kernels on representations of linguistic structures such as strings [Cancedda et al., 2003], dependency paths [Bunescu and Mooney, 2005], shallow parse trees [Zelenko et al., 2003], constituent parse trees [Collins and Duffy, 2001], dependency parse trees [Moschitti, 2006], and directed acyclic graphs [Suzuki et al., 2003]. Most of these proposals have also been applied to relation classification, particularly in the context of the ACE evaluation. Comparative studies [Nguyen et al., 2009] suggest that different representations can have different strengths and that combining them via kernel addition can improve performance. In the remainder of this section, we will work through treatments of string and constituent tree kernels, which have been the most popular.

String kernels for relation classification represent context as ordered sequences of words. Formally, a string is a finite sequence s = (s_1, …, s_{|s|}) of symbols from an alphabet Σ. Σ^l denotes the set of all strings of length l, and Σ* is the set of all strings, also called the language. A subsequence u of s is specified by a list of integer indices i = (i_1, …, i_{|u|}) such that 1 ≤ i_1 < … < i_{|u|} ≤ |s|.19 The length of u is calculated with respect to s, as len(i) = i_{|u|} − i_1 + 1, regardless of possible gaps in u. For example, if s is the string cut the bread with the knife, the index list i = (1, 4) identifies the subsequence s[i] = (cut, with) with len(i) = 4.

An expressive kernel on strings, the gap-weighted subsequence kernel [Lodhi et al., 2002], is defined as follows:

k_l(s, s′) = ∑_{u ∈ Σ^l} ∑_{i,j : s[i] = u = s′[j]} λ^{len(i)+len(j)}    (3.8)

19 This formulation allows subsequences to be discontinuous.


where the positive integer l sets the length of the subsequences to be counted, and λ is a decay parameter between 0 and 1. The smaller the value of λ, the more the contribution of a discontinuous subsequence is reduced. This kernel induces a feature embedding φ_l : Σ* → ℝ^{|Σ|^l} which maps a string s onto a vector of non-negative counts. The counts are not generally integers unless λ = 0 or λ = 1. Lodhi et al.'s [2002] efficient dynamic programming algorithm does not require that the feature vector φ_l(s) of all subsequence counts be represented explicitly. While k_l is parameterized to count subsequences of a specified length l, it is straightforward to combine the contributions of multiple subsequence lengths through kernel addition.

Tree kernels apply the convolution kernel principle to the output of syntactic parsers [Collins and Duffy, 2001]. They define similarity between trees in terms of the number of shared subtrees:

k(T, T′) = ∑_{S ∈ Subtrees(T)} ∑_{S′ ∈ Subtrees(T′)} k_0(S, S′)    (3.9)

This definition leaves many degrees of freedom [Nguyen et al., 2009]: whether dependency or constituent trees are used; whether the full parse of the sentence is used, or just the parse of the part of the sentence containing the candidate arguments; whether k_0 allows partial matches, e.g., by allowing mismatches between lexical items while requiring their parts of speech to match. The assumption of tree structure allows for efficient computation of tree kernels by dynamic programming; see [Collins and Duffy, 2001, Moschitti, 2006] for more detail.

As may be apparent to the reader, the motivation for using convolution kernels is similar to that for extracting context features in a standard feature engineering scenario: we decompose a context into smaller parts in order to learn a more generalizable model. The tree kernel computes similarity in a feature space where each dimension indexes a subtree, while the syntax-based context features described in Section 3.2.2 can also correspond to subtrees. Given appropriate feature specifications, one can obtain the same results with a tree kernel and with an explicitly represented subtree feature space used with a linear-kernel (dot product) SVM. Summation of kernel functions corresponds to concatenating the corresponding feature spaces [Joachims et al., 2001]. The kernel representation has an advantage in terms of memory efficiency, but the explicit feature vector representation can be more flexible. Ó Séaghdha and Copestake [2009] use alternative inner product similarities in the space defined by a string kernel feature mapping, while Pighin and Moschitti [2010] use an explicit representation of a tree kernel feature space to do feature selection.

A final note on practical implementation: convolution kernels should be normalized, so that larger structures are not assigned higher similarity values simply because they contain more substructures. For example, consider the strings s = the dog runs and the dog jumps and t = the dog runs. s shares more length-2 subsequences with t (three) than t does with itself (two), so k_2(s, t) > k_2(t, t), but s also contains many subsequences not shared with t. The standard approach to normalization is through the following operation:

k̂(s, t) = k(s, t) / (√k(s, s) √k(t, t))    (3.10)


This is equivalent to normalizing the substructure count vectors φ(s) and φ(t) to have unit length, or to taking the cosine between the count vectors rather than a dot product. As a result, the normalized kernel has a maximal value of 1, attained when s and t are identical.
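The normalization of Equation 3.10 can be sketched with a simple contiguous-bigram count kernel standing in for a full convolution kernel. With this toy kernel the s/t example above happens to reproduce the three-versus-two pattern, and normalization removes the size bias.

```python
import math
from collections import Counter

def bigram_kernel(s, t):
    """Count shared contiguous word bigrams: the dot product of the two
    bigram count vectors (a stand-in for any convolution kernel)."""
    cs, ct = Counter(zip(s, s[1:])), Counter(zip(t, t[1:]))
    return sum(cs[b] * ct[b] for b in cs)

def normalized(k, s, t):
    """Equation 3.10: cosine-normalize the kernel value."""
    return k(s, t) / math.sqrt(k(s, s) * k(t, t))

s = "the dog runs and the dog jumps".split()
t = "the dog runs".split()
print(bigram_kernel(s, t), bigram_kernel(t, t))  # 3 2: raw values favour s
print(normalized(bigram_kernel, t, t))           # 1.0: self-similarity
```

After normalization, no string can be more similar to t than t itself, which is the property the raw counts lacked.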

Sequential labelling methods

Let us zoom out for a moment and consider the architecture of a real-world relation extraction system. So far, we have been focusing on the task of relation extraction, assuming that the arguments are provided. This is unrealistic for a real-world relation extraction system, which should be able to (1) identify the relation arguments, and possibly assign them some semantic type, e.g., Person, Organization, Location, Protein, Disease, and then (2) connect some of these arguments in a relation of some type, e.g., President-of, Born-in, Cause, Side-Effect.

These tasks are typically executed sequentially: the system first finds potential arguments and then tries to connect them with some relations. That is why evaluation exercises such as MUC, ACE, and BioNLP, in addition to relation extraction sub-tasks, have also had special sub-tasks on named-entity recognition (NER), since the relations they were interested in take named entities as arguments.

Argument identification typically uses a sequence model such as a Hidden Markov Model (HMM), a Maximum Entropy Markov Model (MEMM) or a Conditional Random Field (CRF), which are very powerful for segmenting and labelling sequence data [Bikel et al., 1999, Lafferty et al., 2001]. This is handy for arguments which can have different lengths and for which the order of appearance might be important. CRFs, the most sophisticated of the three families of models, have now become almost a standard for the special task of named-entity recognition [McCallum and Li, 2003], and thus have been much used for NER at evaluations such as MUC, ACE, CoNLL, BioNLP and BioCreAtIvE.

The identification of named entity mentions in text is usually done by a sequence tagger. Each token is labeled with a BIO tag [Ramshaw and Marcus, 1995], which indicates whether it begins (B), is inside (I), or is outside (O) a named-entity mention. It is further necessary to indicate the type of named entity, e.g., person (PER), organization (ORG), location (LOC). This yields tags such as B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, and O.20 Here is an example annotation of a sentence containing three named entities, two persons and one location:

George  W.     Bush   ,  son  of  the  Republican  president  George  H.     W.
B-PER   I-PER  I-PER  O  O    O   O    O           O          B-PER   I-PER  I-PER

Bush   ,  was  born  in  New    Haven  ,      Connecticut  .
I-PER  O  O    O     O   B-LOC  I-LOC  I-LOC  I-LOC        O
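Recovering the entity spans from such a tag sequence can be sketched as follows; the implementation is illustrative, not from a particular system.

```python
def bio_to_entities(tokens, tags):
    """Decode (type, surface string) entity spans from a BIO tag sequence."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)             # continue the open entity
        else:                                    # O tag or inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ("George W. Bush , son of the Republican president George H. W. "
          "Bush , was born in New Haven , Connecticut .").split()
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O",
        "B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "O", "O",
        "B-LOC", "I-LOC", "I-LOC", "I-LOC", "O"]
print(bio_to_entities(tokens, tags))
```

Running this on the example sentence yields the two PER mentions and the one LOC mention annotated above.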

Extracting the named entities from a BIO annotation is straightforward. In a dynamic sequence model such as a CRF, the labels would correspond to hidden states, while the observations would be individual words. The words and the labels are not used directly, but through feature functions defined over the following:

20 [Ratinov and Roth, 2009] discuss some variations and alternatives of the BIO tags.


a. words: individual words, previous/following two words, word substrings (prefixes, suffixes of various lengths), capitalization, digit patterns, manual lexicons (e.g., of days, months, honorifics, stopwords, lists of known countries, cities, companies, and so on);

b. labels: individual labels, previous/following two labels;

c. combinations of words and labels.
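The word-level features in item (a) can be sketched for a single token; the feature names and the particular substring lengths below are illustrative choices, and lexicon lookups (days, months, honorifics, and so on) would be added in the same style.

```python
import re

def token_features(word):
    """A few word-level features of the kinds listed in item (a)."""
    return {
        "word": word.lower(),
        "prefix3": word[:3].lower(),                  # word substring
        "suffix3": word[-3:].lower(),                 # word substring
        "is_capitalized": word[:1].isupper(),         # capitalization
        "has_digit": bool(re.search(r"\d", word)),    # digit pattern
        "digit_shape": re.sub(r"\d", "9", word),      # e.g., 2006 -> 9999
    }

print(token_features("Bush"))
print(token_features("2006"))
```

In a CRF, such per-token features would be combined with the current and neighbouring labels, per items (b) and (c).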

Sequence models can be useful not only for argument identification, but also for relation extraction. In fact, in some special cases, relation extraction can be reduced to sequence labelling. For example, biographical texts such as encyclopedia articles discuss one principal entity, and many of the other entities are related to that entity; we have already seen this in Section 3.1.5. Therefore, an important task when analyzing such texts is to predict for each mentioned entity its relationship, if any, with the principal entity [Culotta et al., 2006]. This task is easier than general relation extraction. There is no need to enumerate all pairs of entities in the document, since relations are binary and one of the arguments is already fixed. There is also no need for two passes through the data, one to identify the entities and their types, and another to assign the relation to the principal entity; this can be solved in one step as a labelling problem. Here is an example annotation of the sentence about George W. Bush, assuming that he is the principal entity:

George    W.        Bush      ,  son  of  the  Republican  president  George    H.        W.
B-Target  I-Target  I-Target  O  O    O   O    B-Party     B-Job      B-Father  I-Father  I-Father

Bush      ,  was  born  in  New           Haven         ,             Connecticut   .
I-Father  O  O    O     O   B-BirthPlace  I-BirthPlace  I-BirthPlace  I-BirthPlace  O

Note that in this example the relations Party, Job and Father come sequentially, one after the other, in that order. Such sequences are common and likely to occur elsewhere; this gives a sequence model an advantage over simple non-sequence classifiers.

Another example where relation extraction can be treated as a sequence labelling task comes from the biomedical domain. In EntrezGene,21 many genes are functionally described using GeneRIF (Gene Reference Into Function) phrases.22 Since the target gene is known, it is possible to use CRFs to extract various relations which involve that gene, e.g., gene-disease interactions [Bundschus et al., 2008]. For example, the following GeneRIF for COX-2 contains interactions between the target gene and two diseases: (1) endometrial adenocarcinoma and (2) ovarian serous cystadenocarcinoma.

COX-2 expression is significantly more common in endometrial adenocarcinoma and ovarian serous cystadenocarcinoma, but not in cervical squamous carcinoma, compared with normal tissue.

Note, however, that most relation extraction problems do not have one of the relation arguments fixed, and thus cannot be easily treated as sequence labelling tasks. While one can still apply

21 www.ncbi.nlm.nih.gov/gene
22 www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html


[Figure 3.8 depicts a hidden Relation node connected to a sequence of hidden Role nodes, each of which is in turn connected to observed feature nodes f1, f2, …, fn.]

Figure 3.8: Dynamic graphical model for relation extraction. Adapted from [Rosario and Hearst, 2004].

sequence models to general relation extraction, this typically involves a more complex structure; we will now give two examples of such more complex dynamic models.

Ray and Craven [2001] use an HMM which incorporates syntactic information, subsequently encoded in complex hidden states which pair up a syntactic category and an argument type, e.g., <NP, LOCATION>. Moreover, the observations for the HMM are not individual words but sets of words spanned by the syntactic phrase, e.g., cervical squamous carcinoma. Finally, there are separate HMM models for positive and negative examples: at test time, each sentence is scored against both models, and the one which assigns the higher probability is selected.

HMMs and CRFs belong to a larger family of probabilistic models called graphical models. In such models, conditional dependencies between variables are represented as a graph. Rosario and Hearst [2004] use a more complex, dynamic graphical model: in addition to all hidden nodes, it has an extra hidden node for the relation, as shown in Figure 3.8. Similarly to the HMM model of Ray and Craven [2001], the hidden nodes in the graphical model do not represent labels over individual words, but rather "roles" which can span sequences of multiple words. For example, the roles would be DISEASE, TREATMENT and NULL for relations between treatments and diseases. The hidden node Relation represents the relationship in the sentence, e.g., Cures, Does-not-cure, Prevents, Has-side-effect, No-relation; it is assumed that a sentence contains no more than one instance of a relation of interest. There are links from the relation node to the role nodes, and optionally to the first word or to all words spanning these roles (words are decomposed as features, e.g., prefixes, suffixes, capitalization, digit patterns, and so on). Figure 3.8 shows the fully connected version of the model, which has also been found effective for protein-protein interaction extraction [Rosario and Hearst, 2005]. Just like the HMM model of Ray and Craven [2001], this dynamic graphical model allows for the simultaneous discovery of both the relation and its arguments in one step. It is, however, more general and allows for multi-way relation classification, while the model of Ray and Craven [2001] is limited to binary yes/no decisions about whether a relation is present.


3.3.3 DETERMINING THE SEMANTIC CLASS OF RELATION ARGUMENTS

Section 3.2.1 briefly discussed the mapping of relation arguments onto an ontology and their description by features from an ontology. The motivation is to find an expression of a relation's arguments in terms of general concepts, in a manner similar to the ACE relation definitions. The best level of generalization can be determined in various ways. If a description is based on ancestors in an ontology, then any learning algorithm can automatically determine the best generalization level [Nastase and Szpakowicz, 2003]. The arguments of the collected instances can also be clustered [Girju et al., 2004]. The degree of generalization can be determined manually or automatically, as we will now show in more detail.

The Descent or Ascent of Hierarchy

The descent of hierarchy was proposed by Rosario and Hearst [2002]. They were interested in the implicit relation between the two nouns in a noun-noun compound; for example, a kitchen knife is a knife used in the kitchen. Rosario and Hearst were motivated by the assumption that head-modifier relations reflect the qualia structure [Pustejovsky, 1995] associated with the head noun. Under this interpretation, the meaning of the head determines what can be done to such a thing, what it is made of, what it is a part of, and so on. For example, a knife can be in the following relations:

• Used-in: kitchen knife, hunting knife
• Made-of: steel knife, plastic knife
• Instrument-for: carving knife
• Used-on: meat knife, putty knife
• Used-by: chef's knife, butcher's knife

Some relations are specific to narrow noun classes, while others, more general, apply to wider classes. Building on this idea, Rosario and Hearst [2002] proposed a semi-supervised characterization of the relation between the nouns in a bioscience noun-noun compound, based on the semantic category in a lexical hierarchy to which each of the nouns belongs. They extracted all noun-noun compounds from a very large corpus of one million MEDLINE abstracts, and made observations about which pairs of semantic categories the nouns tend to belong to. Based on these observations, they manually labeled the relations for a small subset of the compounds, thus avoiding the need to decide on a particular set of relations in advance.

The work was based on the MeSH lexical hierarchy,23 where each concept is assigned a unique identifier (e.g., Eye is D005123) and one or more descriptor codes corresponding to particular positions in the hierarchy.24 Rosario and Hearst manually selected different levels of generalization, and then tested them on new data, reporting about 90% accuracy. For example, all compounds

23 Medical Subject Heading, www.nlm.nih.gov/mesh
24 Those who require detailed examples need only consider the path A (Anatomy), A01 (Body Regions), A01.456 (Head), A01.456.505 (Face), A01.456.505.420 (Eye). By the way, the word is ambiguous, so it has another code: A09 (Sense Organs), A09.371 (Eye).


in which the first noun is in the A01 sub-hierarchy (Body Regions) and the second noun in A07 (Cardiovascular System) are hypothesized to express the same relation.25

Rosario and Hearst further empirically found that the majority of the observed noun-noun compounds fall within a limited number of semantic category pairs corresponding to the top levels in the lexical hierarchy, e.g., A01-A07. Most of the remaining ones require descending one or two levels down the hierarchy for at least one of the nouns in order to arrive at the appropriate level of generalization of the relation. For example, the relation is not homogeneous when the modifier is in A01 (Body Regions) and the head is in M01 (Persons). It depends on whether the person is a patient, a physician, a donor and so on. Making the distinction between different kinds of persons requires specialization for the head: descending one level down the M01 hierarchy.

A01-M01.643 (Patients): abdomen patients, ankle inpatient, eye outpatient
A01-M01.526 (Occupational Groups): chest physician, eye nurse, eye physician
A01-M01.898 (Donors): eye donor, skin donor
A01-M01.150 (Disabled Persons): arm amputees, knee amputees
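The category-pair generalization can be sketched in a few lines of Python. The noun-to-code table and the per-category cut depths below are invented for illustration (they are not Rosario and Hearst's actual data), but MeSH descriptor codes really are dot-separated paths such as A01.456.505:

```python
# Sketch of the descent of hierarchy: a noun-noun compound is mapped onto a
# pair of MeSH categories, cutting the M01 (Persons) branch one level deeper,
# as in the A01-M01.643 examples above. The tables are illustrative only.
NOUN_TO_CODE = {
    "abdomen": "A01.047", "eye": "A01.456.505.420", "skin": "A01.835",
    "patients": "M01.643", "physician": "M01.526.485", "donor": "M01.898",
}

# Number of code segments to keep per top-level category (the "cut").
CUT_DEPTH = {"A01": 1, "M01": 2}

def generalize(code):
    """Truncate a descriptor code at the cut depth chosen for its category."""
    segments = code.split(".")
    return ".".join(segments[:CUT_DEPTH.get(segments[0], 1)])

def category_pair(modifier, head):
    """Map a noun-noun compound onto a generalized category pair."""
    return (generalize(NOUN_TO_CODE[modifier]),
            generalize(NOUN_TO_CODE[head]))

print(category_pair("abdomen", "patients"))  # ('A01', 'M01.643')
print(category_pair("eye", "physician"))     # ('A01', 'M01.526')
```

All compounds mapped onto the same category pair are then hypothesized to share one relation.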

The idea of the descent of hierarchy is appealing and the accuracy is very high (about 90%), but there are limitations. First, the classification is not fully automated; human annotators decide where to cut the hierarchy. Second, the coverage is limited by the lexical hierarchy, most likely to narrow domains. Third, problems are caused by lexical and relational ambiguities. Finally, the approach does not propose explicit names for the assigned relations.

Iterative semantic specialization [Girju et al., 2003] is similar but automated. It was applied to the part-whole relation. The training data, annotated with WordNet sense information and containing both positive and negative examples, are first generalized going from each positive/negative example up the WordNet hierarchy. Next, these generalizations are iteratively specialized whenever necessary, to make sure that they are not ambiguous with respect to the semantic relation assigned. A set of rules is then produced from these examples by supervised learning.

Semantic scattering [Moldovan et al., 2004] implements an idea along the same lines. It relies on the training data to determine a boundary (essentially, a cut) in a hierarchy—in particular, WordNet's hypernym-hyponym hierarchy—such that the sense combinations in a noun compound which can be mapped onto this boundary are unambiguous with respect to the relation within the compound. Sense combinations found above this boundary may convey different relation types.

3.4 BEYOND BINARY RELATIONS

We have implicitly assumed in this chapter that we are interested in binary semantic relations. However, there are many interesting relations with more than two arguments. For example, one can define a Purchase relation with four arguments (Purchaser, Purchased Entity, Price, Seller) or a

25Again, the reader who wants detailed examples will enjoy this list: limb vein, scalp arteries, shoulder artery, forearm arteries, fingercapillary, heel capillary, leg veins, eyelid capillary, ankle artery, hand vein, forearm microcirculation, forearm veins, limb arteries, thighvein, foot vein.


Local_Sports_Team relation with three arguments (Team, City, Sport). The methods discussed in the preceding sections apply as well to n-ary relations. One can train supervised models to classify a candidate argument set of n entities in a given context as either positive or negative instances. Even so, the move to relations of higher arities causes some complications.

• It is not trivial to implement feature extraction methods based on the words between entity mentions, or the dependency path between mentions, or the least common subtree.

• A corpus will often contain "partial mentions" where some but not all arguments of the relation appear. For example, consider the sentences Sparks Ltd. bought 500 tons of steel from Steel Ltd. where the Price argument is missing and Steel Ltd. bought 500 tons of coal where the Price and Seller are missing. Such examples cannot sensibly be presented in training as positive or negative examples. It is possible to ignore all such partial mentions, but this limits the applicability of the trained system. One can also train separate models for every possible combination of arguments (one four-argument model, three three-argument models, six two-argument models), but this goes against our intuition that we are dealing in each case with the same relation.

One possible alternative approach is to decompose every instance with n entities into n(n−1)/2 binary instances, allowing both full and partial mentions to be treated the same way. For example, McDonald et al. [2005] train a single classifier to predict whether two entities are related to each other, and then rely on a combination of strong argument typing and graph-based global optimization to compose the binary predictions into an n-ary prediction.
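The decomposition into binary instances can be sketched as follows; this is a minimal illustration, with the argument strings taken from the partial-mention example above:

```python
from itertools import combinations

# Decomposing one n-argument relation mention into n(n-1)/2 binary instances,
# in the spirit of McDonald et al. [2005]; the mention below is illustrative.
def decompose(entities):
    """Return all unordered pairs of arguments from one relation mention."""
    return list(combinations(entities, 2))

mention = ["Sparks Ltd.", "500 tons of steel", "Steel Ltd."]  # Price missing
print(decompose(mention))
# [('Sparks Ltd.', '500 tons of steel'), ('Sparks Ltd.', 'Steel Ltd.'),
#  ('500 tons of steel', 'Steel Ltd.')]
```

A partial mention with three arguments simply yields three binary instances instead of the full six, so it can still contribute training evidence.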

The problem of identifying n-ary relations also arises in the case of semantic role labelling. See [Palmer et al., 2010] for a monograph on this topic.26

3.5 PRACTICAL CONSIDERATIONS

In this chapter, we have attempted a survey of the methods of relation classification and the datasets developed for training and evaluating classification systems. How can a practitioner choose among the available options? We cannot write out a recipe which will prove optimal in all circumstances, because a system's performance depends on the nature of the data and of the relations to be classified. We can offer some very general advice.

• Favour reliably high-performing learning algorithms such as SVM, logistic regression or CRF, the latter if the task makes sense as a sequence labelling problem.

• Entity features and relational features are almost always useful.

• The value of background features will tend to vary according to the degree of "world knowledge" required to disambiguate the relation instances. For noun compound classification, the immediate context may not be useful while background knowledge plays a very important role.

26 Download available at www.morganclaypool.com/doi/abs/10.2200/S00239ED1V01Y200912HLT006.


It is similarly difficult to come up with generalized advice about the expected level of performance. That will be affected by a number of factors which interact in complex ways: the number and nature of the relations to be classified, the distribution of those relations in data, the source of the data for training and testing, the annotation procedure for training data, the amount of training data available, and so on. A conservative conclusion, well-supported by the published literature, is that state-of-the-art systems can perform well above a random or majority-class baseline.

At SemEval-2007 Task 4, the winning system's [Beamer et al., 2007] F-score of 0.724 and 76.3% accuracy came about from a range of features which rely on manually engineered resources such as WordNet. More recently, researchers have come close to this performance level using only corpus data to generate features [Davidov and Rappoport, 2008b, Nakov and Kozareva, 2011, Ó Séaghdha and Copestake, 2008]. These figures, however, can only be properly understood if we note how the Task 4 datasets have been created: the distribution of positive and negative examples is roughly even and the negative examples are "near-misses" (Section 3.1.2). Also, some relations are relatively easy to recognize, while others are harder. For example, at SemEval-2007 Task 4, the best accuracy achieved by a participating system for Content-Container was 82.4% (baseline: 51.4%), while for Origin-Entity the best accuracy was only 72.8% (baseline 55.6%)—see Table 5 in [Girju et al., 2007].27

At SemEval-2010 Task 8, the best-performing system [Rink and Harabagiu, 2010] achieved an F-score of 0.822 and 77.9% accuracy using a combination of pre-existing resources; again, subsequent research has come close with purely corpus-based methods [Socher et al., 2012]. In this task, the datasets were not nearly as balanced as in Task 4 three years earlier.

In ACE, the other most prominent relation classification evaluation exercise, there is a very different kind of data. The examples to be annotated consist of full documents rather than single sentences extracted from Web query results, and relations are only annotated between specific classes of named entities (Section 3.1.1). State-of-the-art performance on the ACE datasets gives F-scores in the low-to-mid 70s [Jiang and Zhai, 2007, Zhou et al., 2007, 2009]. Lest it seem tempting to conclude that F-scores over 0.7 are guaranteed, we note Zhou et al.'s [2009] observation: moving from a set of fewer than ten ACE relation types to a set of more than twenty relation subtypes annotated on the same data has led to a decrease of about 20%.

3.6 SUMMARY

This chapter presented an overview of supervised learning as applied to semantic relations between nominals. We have discussed a few of the available datasets, ranging from small, e.g., [Nastase and Szpakowicz, 2003], to very large, e.g., all the Wikipedia infoboxes; from few relations, e.g., [Rosario and Hearst, 2004], to very many, e.g., the Wikipedia infoboxes (again) and Freebase; from general-purpose, e.g., WordNet, to those specific to narrow domains,

27 We will not try to explain why semantic relations differ in that important practical respect. Let us name the likely factors but leave their interpretation to the reader's intuition: idiosyncrasy, degree of homogeneity, context dependency, need for world knowledge, permanence versus episodicity, and so on.


e.g., [Rosario and Hearst, 2004]. We then refreshed the reader's memory about supervised machine learning, and briefly presented some of the most frequently applied methods: kernel methods, sequence models and hierarchy-based methods. The chapter concluded with a brief discussion of the difficulties when the learning problem is further complicated by applying it to relations with more than two arguments.

Our collective experience has shown that supervised learning methods yield good results. Supervised learning, however, relies on annotated data, so it is limited to the semantic relations present in the annotation, and maybe to additional information such as WordNet sense, links to an ontology or textual context. This is a disadvantage. The fact that we are limited to relations represented in the training data precludes the recognition of unexpected relations in a text. Changes in the training data—perhaps through the addition of new relations—dictate the derivation of new models. In the next chapter, we will investigate methods which are better tailored for open-ended relation inventories, or for very large textual data. It is also a drawback when information about word senses or entity types is required: manual annotation is impractically labor-intensive, while automatic labelling is usually unacceptably error-prone.

On the positive side, supervised relation learning, by virtue of working in a controlled and information-rich environment, can help determine the characteristics of entities and relations that are useful for relation identification. Such lessons may give directions for future work. For example, the fact that sense information is very useful may motivate work towards including (possibly coarser) sense information in open relation extraction on a large scale.


CHAPTER 4

Extracting Semantic Relations with Little or No Supervision

Resources such as the World Wide Web and large collections of news texts or scientific articles contain massive amounts of data, much more than would be feasible to manually encode if we wished to use it for training a supervised model. Thus, it is very tempting to mine such data in an unsupervised fashion and to extract various types of relational knowledge, including:

• Taxonomic knowledge, e.g., what kinds of animals exist?
• Ontological knowledge, e.g., which cities are located in the United Kingdom?
• Event knowledge, e.g., which companies have bought which other companies?

Manually created knowledge bases are inevitably incomplete, even those created by a long-term and large-scale effort.1 Automatic population of a database or knowledge base can at least assist the human builders, or even better, autonomously learn from already encoded data how to find and extract more data.

There is another difficulty: the amount of data available to scientists in any specific field may be overwhelming. Without automation in the form of knowledge discovery, connections between entities or phenomena may go unnoticed. Automatic knowledge discovery methods have gained popularity particularly in medicine, after Swanson's [1987] pioneering work. Swanson discovered previously unknown, but ultimately important connections between magnesium and migraines by combining relations extracted from various biomedical articles. The technique of connecting pieces of knowledge found in various sources, previously thought to be unrelated, has become known as Swanson linking. For example, one publication may show evidence that illness A is caused by chemical B, and another can report that drug C reduces the amount of chemical B in the body. Linking the two establishes a connection between illness A and drug C.
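Swanson linking amounts to a join on the shared middle term. A minimal sketch, with invented relation instances standing in for facts extracted from the literature:

```python
# Toy Swanson linking: connect illness A to drug C whenever both mention the
# same chemical B. The relation instances below are invented for illustration.
causes  = {("migraine", "vascular spasm")}   # illness A is caused by chemical B
reduces = {("magnesium", "vascular spasm")}  # drug C reduces chemical B

def swanson_links(causes, reduces):
    """Join the two relation sets on their shared second argument."""
    links = set()
    for illness, chemical in causes:
        for drug, target in reduces:
            if chemical == target:
                links.add((illness, drug))
    return links

print(swanson_links(causes, reduces))  # {('migraine', 'magnesium')}
```

The interesting pairs are exactly those never stated together in any single source: each link is inferred from two independently extracted relation instances.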

Texts may contain a lot of information irrelevant to one's specific purpose or need. Part of the challenge is to develop precise methods of sifting through text to find relevant portions and reliable indicators which support the extraction of salient relation instances. It may not be feasible to apply in practice a model learned from training data by a supervised learning method of the kind described in Chapter 3. That can happen because the relation of interest has not been present in the training data, and it may not be cost-effective to derive deep features required by the model (e.g., sense information,

1 Consider Cyc (www.opencyc.org) and Freebase (www.freebase.com).


dependency relations). The Web-scale data on which open relation extraction operates make such knowledge-rich methods impractical.

One historically important mode of semantic relation acquisition is to deploy hand-crafted patterns, with the assumption that any text matching one of these patterns should describe a positive instance of the target relation. Such fixed lists of patterns can achieve high precision, but are inherently limited in recall because relation instances can be expressed in a large variety of ways. Coverage may also be limited because it is impractical, or even impossible, to devise high-precision patterns for all relations which might be of interest.

Relation extraction from open texts began with hyponym/hypernym (is-a) and meronym/holonym (part-of) relations, the backbone of any taxonomy or ontology. This line of research was pioneered by Hearst [1992], and has since been gradually expanded both in terms of relations targeted and of the scale of texts processed. The Web scale is now within reach in such projects as Never-Ending Language Learner2 and Machine Reading.3

4.1 MINING ONTOLOGIES FROM MACHINE-READABLE DICTIONARIES

The acquisition of ontological relations is a step in formalizing human knowledge in order to enable its embedding in NLP and, more generally, AI applications. In the 1970s and 1980s, when such work was first seriously contemplated, it was natural to turn for help to repositories of human knowledge. What was then available in machine-readable format were unilingual dictionaries, in particular Merriam-Webster and Longman's Dictionary of Contemporary English. The main focus was hyponymy/hypernymy—subclass/superclass relations. We refer the reader interested in this early line of research to [Ahlswede and Evens, 1988, Alshawi, 1987, Amsler, 1981, Chodorow et al., 1985, Ide et al., 1992, Klavans et al., 1992].

Machine-readable dictionaries as an information source for relation mining offer some clear advantages. Dictionaries contain short and focused definitions of concepts, they use somewhat standard language, and they tend to rely on a limited vocabulary for the definitions. All these characteristics simplify the mining task significantly. Still, there are difficulties. Consider for example the subset of interrelated senses of the following nouns: CLASS, GROUP, TYPE, KIND, SET, DIVISION, CATEGORY, SPECIES, INDIVIDUAL, GROUPING, PART and SECTION. The Merriam-Webster dictionary gives the following definitions, reproduced from [Amsler, 1981]:

• GROUP 1.0A—a number of individuals related by a common factor (as physical association, community of interests, or blood)

• CLASS 1.1A—a group of the same general status or nature
• TYPE 1.4A—a class, kind, or group set apart by common characteristics
• KIND 1.2A—a group united by common traits or interests

2 rtw.ml.cmu.edu/rtw/
3 www.cs.washington.edu/node/3535/


• KIND 1.2B—CATEGORY
• CATEGORY .0A—a division used in classification
• CATEGORY .0B—CLASS, GROUP, KIND
• DIVISION .2A—one of the parts, sections, or groupings into which a whole is divided
• *GROUPING <== W7—a set of objects combined in a group
• SET 3.5A—a group of persons or things of the same kind or having a common characteristic usu. classed together
• SORT 1.1A—a group of persons or things that have similar characteristics
• SORT 1.1B—CLASS
• SPECIES .1A—SORT, KIND
• SPECIES .1B—a taxonomic group comprising closely related organisms potentially able to breed with one another

One evident trouble is the circularity in definitions, shown graphically in Figure 4.1. It may cause problems for many applications which assume a tree-like, or at least acyclic, taxonomic structure.
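Detecting such circularity amounts to finding cycles in the directed graph of "defined-in-terms-of" edges. A minimal depth-first-search sketch, with edges abridged from the definitions above:

```python
# Find one definitional cycle reachable from a starting concept, if any.
# The edge list is a simplified reading of the dictionary excerpt above.
def find_cycle(graph, start, path=()):
    """Depth-first search; returns the first cycle found as a tuple, else None."""
    if start in path:
        return path[path.index(start):] + (start,)
    for neighbor in graph.get(start, []):
        cycle = find_cycle(graph, neighbor, path + (start,))
        if cycle:
            return cycle
    return None

defined_via = {
    "GROUP": ["INDIVIDUAL"],
    "CLASS": ["GROUP"],
    "KIND": ["GROUP", "CATEGORY"],
    "CATEGORY": ["CLASS", "GROUP", "KIND"],  # CATEGORY .0B -> CLASS, GROUP, KIND
}
print(find_cycle(defined_via, "KIND"))  # ('KIND', 'CATEGORY', 'KIND')
```

An application that needs an acyclic taxonomy would have to break every such cycle, for example by dropping or redirecting one of its edges.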

Another difficult element of the extraction of taxonomic relations from dictionary definitions is the identification of the terms of interest. Chodorow et al. [1985] defined the keyword (the term of interest) as the head of a phrase, but this heuristic is not always appropriate. For example, in the definition of SET, group is the keyword in the noun phrase group of persons, whereas in the definition of GROUP 1.0, the actual keyword in the syntactically parallel noun phrase number of individuals should be individuals.

It is a challenge to build an ontology from text, even if the vocabulary and grammar are controlled as they are in dictionary definitions. An ontology interconnects concepts (represented as strings), while a text contains words (also strings, but interpreted linguistically). Adding a relation extracted from text to an ontology implies mapping the relation's arguments onto ontology concepts and adding the corresponding edge. Mapping words in a text onto concepts in a hierarchy may require dealing with ambiguity, including polysemy and homography.4 Information extraction often ignores such matters, because it remains at the text level: it extracts isolated relation instances with textual arguments.

Last but not least, dictionaries clearly have limited coverage. General dictionaries are shallow, specialized dictionaries are narrow—either way, what they cover is small compared to what is potentially out there in open texts, and small compared to what has come to be expected of "real-world" NLP applications. Still, work on extracting knowledge from dictionaries, in particular comparing syntactic parsing and pattern matching [Ahlswede and Evens, 1988], has shown that patterns have big potential for relation extraction. They are computationally light compared to parsing and thus can be easily applied to very large text collections.

4 To state the obvious: polysemy = same word, different but related senses (e.g., bank as a building and as an institution); homography = same word, different unrelated senses (e.g., bank as a building and as sloping ground along the edge of a river).


Figure 4.1: The GROUP concept and its relationship to other concepts as extracted from a machine-readable dictionary. Adapted from Amsler [1981].

4.2 MINING RELATIONS WITH PATTERNS

The fundamental concept underpinning most relation extraction systems is that of a relation pattern: an expression which, when matched against a text fragment, identifies a relation instance. Such patterns can have many forms, incorporating lexical items, wildcards, parts of speech, syntactic relations, and flexible rules such as those available in regular expressions. Hearst [1992] introduced a seminal inventory of patterns for extracting instances of the taxonomic is-a relation:

1. NP such as {NP, }∗ {(or|and)} NP—e.g., ... bow lute, such as Bambara ndang ...
→ (bow lute, Bambara ndang)

2. such NP as {NP, }∗ {(or|and)} NP—e.g., ... works by such authors as Herrick, Goldsmith, and Shakespeare
→ (authors, Herrick); (authors, Goldsmith); (authors, Shakespeare)


3. NP {, NP}∗{, } (or|and) other NP—e.g., ... temples, treasuries, and other important civic buildings ...
→ (important civic buildings, temples); (important civic buildings, treasuries)

4. NP{, } (including|especially) {NP, }∗ (or|and) NP—e.g., ... most European countries, especially France, England and Spain ...
→ (European countries, France); (European countries, England); (European countries, Spain)
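Pattern 1 can be approximated with a regular expression over plain strings. This is a toy sketch: real implementations match over POS-tagged or chunked text, and the regex below handles only simple comma/and/or lists:

```python
import re

# Hearst's pattern 1, "NP such as {NP, }* {(or|and)} NP", as a crude regex.
SUCH_AS = re.compile(r"^(.+?) such as (.+)$")

def extract_isa(phrase):
    """Return (hypernym, hyponym) pairs from an 'X such as Y, Z and W' phrase."""
    match = SUCH_AS.match(phrase)
    if not match:
        return []
    hypernym = match.group(1)
    hyponyms = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", match.group(2))
    return [(hypernym, h) for h in hyponyms if h]

print(extract_isa("European countries such as France, England and Spain"))
# [('European countries', 'France'), ('European countries', 'England'),
#  ('European countries', 'Spain')]
```

Even this crude version illustrates why such patterns are high-precision but low-recall: it fires only on one fixed lexical cue and misses every other phrasing of the same relation.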

These lexico-syntactic patterns are frequent across text genres and have been shown to yield very high precision. This means that it is fairly safe to assume that any single match for a pattern is sufficient evidence for extracting a relation instance. Yet, there exist many other ways to express is-a relations which will not be captured by this limited set of patterns, so the recall is low.

While it is unclear whether such high-precision rules can be devised for all relations of interest, matching lexico-syntactic patterns has been attempted for many other relations, e.g., Berland and Charniak [1999] proposed patterns for identifying instances of the part-of relation. Similarly, early work on automatically identifying protein-protein interactions in biomedical journal articles has focused on searching for words from a predefined list which describe the noteworthy interactions [Blaschke et al., 1999, Pustejovsky et al., 2002]: a system for identifying instances of the Inhibition relation might search for patterns like N1 inhibits N2, N2 is inhibited by N1 and inhibition of N2 by N1, while a system for detecting instances of the Activation relation might look for occurrences of specific verbs such as activate and of specific nominalizations such as activation.

4.2.1 BOOTSTRAPPING RELATIONS FROM LARGE CORPORA

Hearst's [1992] pattern-based search for relation instances was originally made once over a corpus that was small by today's standards. It was later noted that the set of initial patterns can itself be further enlarged by using the extracted instances, and this new set in turn can lead to extracting more instances, and so on. This iterative learning framework has come to be known as bootstrapping. Experimenting with this (seemingly simple) procedure has led to a few negative observations. In particular, noise in data can "infiltrate" the algorithm and steer it in a wrong direction. This phenomenon is called semantic drift. Thus, systems which operate on large amounts of unstructured text must implement methods to avoid semantic drift.

We will now describe bootstrapping in more detail, beginning with the basic form, and then we will discuss enhancements designed to counter semantic drift while still allowing for high recall.

Bootstrapping is a simple, iterative method. It begins with a small set of seed patterns, to which more patterns based on new matches found in the data are added in each iteration. This procedure was developed to overcome the weakness of extracting relations by matching hand-crafted patterns: one cannot presume to enumerate all the ways in which a relation might be expressed. That is why the system is allowed to discover new instances, and through them new patterns which express the relation of interest. This type of processing was pioneered by Hearst [1992], and its efficiency has been demonstrated empirically, e.g., by [Agichtein and Gravano, 2000, Brin, 1998].


Algorithm 1 Bootstrapping with pattern and relation scoring for relation extraction. One but not both of the sets of seed patterns and seed relation instances may be empty. Variations of the algorithm include performing only one iteration (N = 1), and updating only the set of relation instances extracted but not the patterns used (skipping step 13).

Require:
    P—a set of seed patterns
    R—a set of seed relation instances
    C—a corpus
    N—maximum number of iterations
Ensure:
    R—a set of relation instances

 1: for i = 1..N do
 2:     P′ = {}
 3:     R′ = {}
 4:     for p ∈ P do
 5:         match p in C
 6:         add matched pairs: R′ = {(np_i, np_j)} ∪ R′
 7:     end for
 8:     R = R ∪ Top_k(rankInstances(R′))
 9:     for (np_i, np_j) ∈ R do
10:         match np_i (.*) np_j in C
11:         add matched pattern (.*) to P′
12:     end for
13:     P = P ∪ Top_k(rankPatterns(P′))
14: end for
15: return R
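Algorithm 1 can be rendered as a minimal runnable sketch. The toy corpus, the template notation with X and Y slots, and ranking by raw match frequency are all simplifying assumptions standing in for real pattern matching and for rankInstances/rankPatterns:

```python
import re
from collections import Counter

def bootstrap(seed_patterns, corpus, iterations=2, top_k=2):
    """Toy version of Algorithm 1: patterns are templates like 'X such as Y';
    frequency-based ranking stands in for rankInstances and rankPatterns."""
    patterns, instances = set(seed_patterns), set()
    for _ in range(iterations):
        # Steps 4-8: match every pattern against the corpus, rank the
        # harvested (X, Y) pairs, keep the top k.
        candidate_instances = Counter()
        for template in patterns:
            regex = template.replace("X", r"(\w+)").replace("Y", r"(\w+)")
            for sentence in corpus:
                for x, y in re.findall(regex, sentence):
                    candidate_instances[(x, y)] += 1
        instances |= {pair for pair, _ in candidate_instances.most_common(top_k)}
        # Steps 9-13: for each known instance, harvest the text between its
        # arguments as a candidate pattern, rank, keep the top k.
        candidate_patterns = Counter()
        for x, y in instances:
            for sentence in corpus:
                middle_re = re.escape(x) + r" (\w+(?: \w+)*) " + re.escape(y)
                for middle in re.findall(middle_re, sentence):
                    candidate_patterns["X " + middle + " Y"] += 1
        patterns |= {p for p, _ in candidate_patterns.most_common(top_k)}
    return instances, patterns

corpus = [
    "animals such as cats are common",
    "vehicles such as cars are loud",
    "cats are a kind of animal",
]
instances, patterns = bootstrap({"X such as Y"}, corpus)
print(sorted(instances))  # [('animals', 'cats'), ('vehicles', 'cars')]
```

On this toy corpus the seed pattern already harvests both instances; on a larger corpus, the middle-context step would contribute genuinely new patterns, along with the noise that motivates the scoring discussed below.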

Hearst [1992] ran her patterns only once on Grolier's American Academic Encyclopedia. This yielded instances of high quality, but there were merely 152 of them, despite the fairly large size of this encyclopedia: 8.6 million tokens. To increase recall, patterns and relations can be extracted iteratively using bootstrapping as shown in Algorithm 1.

The system is initialized with a small set of "seeds" in the form of high-precision patterns P and/or known instances R of the relation to be learned. No other supervision is required. For example, a system for extracting is-a relation instances could be initialized with the set of Hearst patterns or a set of seed examples such as cat–animal, car–vehicle and banana–fruit. In the most typical scenario, the system will expand both the pattern set and the instance set: new patterns are collected by identifying candidates which corroborate and do not contradict the content of the set of known instances, while new instances are collected by matching the set of patterns against the corpus.


To avoid adding noisy relation instances and patterns to the initial seed sets, extracted candidate instances (R′) and patterns (P′) are scored and ranked. The top k, or those with a score above a given threshold, are selected, added to P and R and used in the next iteration. Given enough text, this procedure can in principle be repeated indefinitely [Carlson et al., 2010a]. For a fixed corpus, however, it will naturally reach a point where no new patterns or instances can be harvested, while repeated iterations are not practical for a very large text collection. The solution is then to set an upper limit on the number of iterations.

Patterns produced by bootstrapping may comprise lexical items, syntactic categories and other easily identifiable textual items such as HTML markup [Brin, 1998]. Where entity categories such as Location and Organization are given in the data annotation or can be predicted automatically, they can be used to constrain the semantic types of a relation's arguments. For example, the SNOWBALL system [Agichtein and Gravano, 2000] learns patterns such as the following:

〈organization〉's headquarters in 〈location〉
〈location〉-based 〈organization〉
〈organization〉, 〈location〉
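Matching such typed patterns can be sketched as below. The hard-coded entity-type lookup is a toy stand-in for a named-entity tagger, and the example strings are invented:

```python
# SNOWBALL-style typed patterns: each pattern constrains the middle context
# and the entity types of its two slots. Lookup table and text are invented.
ENTITY_TYPE = {"Microsoft": "ORG", "Redmond": "LOC",
               "Exxon": "ORG", "Irving": "LOC"}

PATTERNS = [
    ("ORG", ", ", "LOC"),       # <organization>, <location>
    ("LOC", "-based ", "ORG"),  # <location>-based <organization>
]

def extract(sentence):
    """Return (organization, location) pairs licensed by a typed pattern."""
    pairs = set()
    for e1, t1 in ENTITY_TYPE.items():
        for e2, t2 in ENTITY_TYPE.items():
            for left_type, middle, right_type in PATTERNS:
                if (t1, t2) == (left_type, right_type) and e1 + middle + e2 in sentence:
                    org, loc = (e1, e2) if t1 == "ORG" else (e2, e1)
                    pairs.add((org, loc))
    return pairs

text = "Irving-based Exxon reported earnings; Microsoft, Redmond, said profits rose"
print(sorted(extract(text)))  # [('Exxon', 'Irving'), ('Microsoft', 'Redmond')]
```

The type constraints are what keep a weak context like a bare comma usable: without them, the second pattern above would match almost any apposition.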

It has been shown [Greenwood and Stevenson, 2006, Stevenson and Greenwood, 2005] that syntactic patterns in the form of dependency triples can improve the quality of bootstrapping for relations like Management_Succession. Example triples for that relation include Company+appoint+Person, Company+elect+Person, Company+promote+Person, and Company+name+Person. On the other hand, syntactic representations require time-consuming parsing of the corpus, something not quite practical for the very large corpora one typically wishes to use these days.

In principle, one could try using bootstrapping to learn any semantic relation. In practice, the framework works best for ontological relations, which satisfy the assumption that the relation's arguments always interact in the same manner, regardless of the context. An idiosyncratic relation would tend to find contradictory evidence for a given set of arguments: one sports report might state that "Manchester United defeated Chelsea," while the same newspaper might state six months later "Chelsea defeated Manchester United," referring to a different match.

A comparative study of different relation types has shown that bootstrapping works better for specific relations such as birthdate, and has trouble with more generic or more heterogeneous relations such as is-a or part-whole [Ravichandran and Hovy, 2002]. This also depends on the granularity of the relations targeted. If we want to extract the six types of part-whole relations proposed by Winston et al. [1987]—namely Component-Integral_Object, Member-Collection, Portion-Mass, Stuff-Object, Feature-Activity, and Place-Area—instead of just one coarse-grained relation, patterns will not be helpful: they will be shared across these six relations. As these six relations' names suggest, the difference between them comes mainly from the type of entities they connect.

Semantic Relations Between Nominals
4. EXTRACTING SEMANTIC RELATIONS WITH LITTLE OR NO SUPERVISION

4.2.2 TACKLING SEMANTIC DRIFT

One of the problems with bootstrapping is that not only does it iteratively increase the seed set, but it also increases the noise which gets into the system. Such semantic drift [Curran et al., 2007] can be illustrated with the extraction from a corpus of instances of the notion of City—essentially a kind of is-a relation extraction. While some of the patterns (such as mayor of X) are good, other patterns frequently encountered with cities (lives in X) appear often with other types of locations, such as countries, states or continents.

Seeds: London, Paris, New York → Patterns: mayor of X, lives in X, … → Added examples: California, Europe, …

To counter semantic drift, the patterns and relation instances considered for addition to the expanding pattern set and relation instance set are first scored. Only the top scorers are eventually added, as shown in Algorithm 1. We now review some of the successfully applied scoring functions.
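The scored bootstrapping loop can be sketched schematically as follows (Algorithm 1 itself is not reproduced here; the function arguments and names are illustrative, and the pattern-induction, matching and scoring functions are supplied by the caller):

```python
# A schematic sketch of scored bootstrapping: at each iteration, only the
# top-scoring candidate patterns and instances are retained, which is the
# mechanism that counters semantic drift.
def bootstrap(corpus, seeds, induce_patterns, match_instances,
              score_pattern, score_instance, n_iter=5, top_k=10):
    instances, patterns = set(seeds), set()
    for _ in range(n_iter):
        # 1. induce candidate patterns from the contexts of known instances
        candidates = induce_patterns(corpus, instances) - patterns
        # 2. keep only the top-scoring patterns
        patterns |= set(sorted(candidates, key=score_pattern,
                               reverse=True)[:top_k])
        # 3. extract new candidate instances with the retained patterns
        new = match_instances(corpus, patterns) - instances
        # 4. keep only the top-scoring new instances
        instances |= set(sorted(new, key=score_instance,
                                reverse=True)[:top_k])
    return instances, patterns
```

Any of the scoring functions reviewed in this section can be plugged in as `score_pattern` and `score_instance`.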

One popular pattern-scoring function is the specificity of patterns:

specificity(p) = −log(Pr(X ∈ M_D(p)))    (4.1)

where p is the pattern to be scored, M_D(p) is the set of tuples matching the pattern in the document set D, and X is a random variable uniformly distributed over the domain of the tuples of the target relation [Curran et al., 2007].

The idea is that patterns which match many contexts are probably too general and thus should be avoided. In practice, the specificity of a pattern is often measured by its length [Brin, 1998], assuming that the longer the pattern, the more specific it is. Patterns of higher specificity are preferred since they are likely to have higher precision and are thus less likely to cause semantic drift.
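Equation 4.1 is straightforward to compute once Pr(X ∈ M_D(p)) is estimated; in this sketch it is estimated as the fraction of the relation's domain tuples that the pattern matches (the counts are illustrative):

```python
import math

# Specificity (Eq. 4.1): the negative log-probability that a uniformly
# drawn tuple from the relation's domain is matched by pattern p.
def specificity(n_matched, n_domain):
    return -math.log(n_matched / n_domain)

# A pattern matching 5 of 1,000 domain tuples is more specific (and thus
# preferred) than one matching 500 of them:
print(specificity(5, 1000) > specificity(500, 1000))  # True
```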

Agichtein and Gravano [2000] proposed a precision-based measure, the confidence of patterns, which is defined over the number of positive and the number of negative matches of a pattern p:

Conf(p) = p.positive / (p.positive + p.negative)    (4.2)

where p.positive and p.negative are the numbers of positive and negative matches for pattern p. Agichtein and Gravano [2000] further defined the confidence of relation instances:

Conf(i) = 1 − ∏_{k=0}^{|P_i|} (1 − Conf(p_k) · Match(c_k, p_k))    (4.3)

where P_i = {p_k} is the set of patterns which can extract relation instance i, and Match(c_k, p_k) is the matching score for pattern p_k in context c_k for i.
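Equations 4.2 and 4.3 transcribe directly into code (the variable names are illustrative):

```python
# Eq. 4.2: pattern confidence as precision over its matches.
def conf_pattern(positive, negative):
    return positive / (positive + negative)

# Eq. 4.3: instance confidence is the probability that not all of the
# patterns extracting it are wrong.
def conf_instance(pattern_scores):
    """pattern_scores: list of (Conf(p_k), Match(c_k, p_k)) pairs for the
    patterns that extract instance i."""
    prob_all_wrong = 1.0
    for conf_p, match in pattern_scores:
        prob_all_wrong *= (1.0 - conf_p * match)
    return 1.0 - prob_all_wrong

# An instance found by two fairly reliable patterns gets high confidence:
print(round(conf_instance([(0.9, 1.0), (0.8, 1.0)]), 3))  # 0.98
```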


Pantel and Pennacchiotti [2006] defined the confidence of relation patterns based on the reliability of patterns and relations, defined recursively:

r_π(p) = ( ∑_{i∈I} [pmi(i, p) / max_{i,p} pmi(i, p)] · r_ι(i) ) / |I|    (4.4)

r_ι(i) = ( ∑_{p∈P} [pmi(i, p) / max_{i,p} pmi(i, p)] · r_π(p) ) / |P|    (4.5)

where r_π(p) is the reliability of pattern p, and r_ι(i) is the reliability of relation instance i; r_ι(i) = 1 for instances supplied initially as seeds. max_{i,p} pmi(i, p) is the maximum pointwise mutual information between all patterns and all instances, where the pointwise mutual information pmi(i, p) between pattern p and a binary relation instance i = (x, y) is defined as follows:

pmi(i, p) = log( |x, p, y| / (|x, ∗, y| · |∗, p, ∗|) )    (4.6)

where |x, p, y| is the number of times pattern p can be instantiated with relation instance i = (x, y), |x, ∗, y| is the number of times x and y co-occur, and |∗, p, ∗| is the number of times pattern p can be instantiated in the corpus.

Based on these reliability measures, the confidence of relation instances can be computed as follows:

S(i) = ∑_{p∈P_R} [S_p(i) · r_π(p)] / T    (4.7)

where S_p(i) = pmi(i, p), and P_R is the set of reliable patterns extracted up until the previous iteration. Instances whose score is higher than a given threshold are added to the set of extracted relations.
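The recursive reliability computation of Equations 4.4–4.5 can be sketched on toy data. We assume here that the pmi values of Equation 4.6 have already been computed from corpus counts; the fixed number of update rounds and the names are illustrative.

```python
# A toy sketch of the recursive pattern/instance reliability computation.
def reliabilities(pmi, seeds, n_rounds=3):
    """pmi: dict mapping (instance, pattern) pairs to pmi values."""
    instances = {i for i, _ in pmi}
    patterns = {p for _, p in pmi}
    max_pmi = max(pmi.values())
    r_inst = {i: (1.0 if i in seeds else 0.0) for i in instances}
    r_pat = {p: 0.0 for p in patterns}
    for _ in range(n_rounds):
        for p in patterns:              # Eq. 4.4
            r_pat[p] = sum(pmi.get((i, p), 0.0) / max_pmi * r_inst[i]
                           for i in instances) / len(instances)
        for i in instances:             # Eq. 4.5; seeds keep reliability 1
            if i not in seeds:
                r_inst[i] = sum(pmi.get((i, p), 0.0) / max_pmi * r_pat[p]
                                for p in patterns) / len(patterns)
    return r_pat, r_inst
```

A pattern that co-occurs strongly with the seed instances ends up with a higher reliability than one supported only by unconfirmed instances.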

Pasca et al. [2006b] relied on word similarity, computed using a distributional representation of a word's meaning based on collocations extracted from a corpus, to score candidate entities or relations.

Mutual bootstrapping, or mutual exclusion, extracts multiple categories of entities (or relation types) simultaneously [Riloff and Jones, 1999, Yangarber, 2003]. Categories are assumed to be disjoint, so patterns can be scored to favor those which extract instances of one category only:

score(pattern_i) = (F_i / N_i) · log2(F_i)    (4.8)

where F_i is the number of instances among those extracted with pattern_i which belong to the desired category, and N_i is the total number of instances extracted with this pattern [Riloff and Jones, 1999]. In turn, instances extracted with high-scoring patterns are more highly scored, and will be retained for further iterations.
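Equation 4.8 transcribes directly (the guard for F_i = 0 is our addition, since log2(0) is undefined):

```python
import math

# Eq. 4.8: the RlogF pattern score of Riloff and Jones [1999].
def pattern_score(f_i, n_i):
    """f_i: extracted instances belonging to the target category;
    n_i: all instances extracted by the pattern."""
    if f_i == 0:
        return 0.0  # log2(0) is undefined; such a pattern is useless anyway
    return (f_i / n_i) * math.log2(f_i)

# A pattern extracting 8 in-category instances out of 10 outscores one
# extracting 2 out of 2: the score balances precision against coverage.
print(pattern_score(8, 10) > pattern_score(2, 2))  # True
```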


Argument type checking [Pasca et al., 2006a, Rosenfeld and Feldman, 2007] matches the entity type of a relation's argument with the expected type in order to filter erroneous candidates. Classifying a potential relation's arguments into fine-grained entity classes is error-prone and computationally not efficient for large-scale data. That is why the type-checking is done via a measure of word/term similarity based on collocations extracted from corpora [Pasca et al., 2006a], or by matching against coarse-grained classes for provided relation schemata, and against the given seeds [Rosenfeld and Feldman, 2007].

Coupled training [Carlson et al., 2010b] combines different types of constraints to filter the list of candidate extractions. This is based on the observation that semi-supervised learning of relations in context (together with other relations), rather than in isolation, yields better results, because constraints can lead to mutual disambiguation and can be used to filter out bad candidates. This situation is illustrated in Figure 4.2. By learning relation extractors for multiple relations together, the system can take advantage of constraints on categories (e.g., person and sport are mutually exclusive), or on argument types (e.g., Blue Devils is a team).

Figure 4.2: Example of coupled relation learning, contrasting (A) a difficult semi-supervised learning problem, in which relations such as playsForTeam(a,t) and playsSport(a,s) are learned in isolation from a sentence like "Krzyzewski coaches the Blue Devils.", with (B) an easier one, in which categories (coach, athlete, team, sport) and relations (e.g., coachesTeam(c,t)) are learned jointly. Adapted from [Carlson et al., 2010b]

In particular, Carlson et al. [2010b] combine mutual exclusion, argument type checking and a form of co-training,⁵ using structured and semi-structured text features, e.g., HTML tags. This is the approach which underlies the Never-Ending Language Learner (NELL).⁶ The starting point is a small, manually built ontology containing categories, a few instances for each, and a few instances of relations between them. It then uses subcomponents which make uncorrelated errors, and learns multiple types of inter-related knowledge. It learns predicates from texts, and it learns to infer new relations from the learned ones. To learn predicates, it uses the coupled semi-supervised methods mentioned above, which leverage constraints between the predicates to be learned.

⁵ Co-training is a semi-supervised learning algorithm: two models are trained separately on the same initial dataset, represented by two sets of conditionally independent features; the models iteratively enlarge each other's training sets by adding instances with high-confidence predictions produced by one model to the other model's training data [Blum and Mitchell, 1998].
⁶ rtw.ml.cmu.edu/rtw/overview

4.3 DISTANT SUPERVISION

Bootstrapping relies on a small set of manually provided seed instances or patterns for the relation to be extracted. However, many relation examples are already available in resources such as WordNet, Cyc, Wikipedia infoboxes or Freebase (see Section 3.1.5). The relations in WordNet and Freebase in particular are "stand-alone," outside of any specific context. They can be used as large seed sets to find contexts in which they appear. Those contexts can then be used to train classifiers for each or all of the target relations. This is known as distant supervision [Craven and Kumlien, 1999]. Snow et al. [2005] used the is-a relations in WordNet. Mintz et al. [2009] applied this method to 102 relations in Freebase, and 17,000 seed instances. They mapped those onto Wikipedia article texts, which comprised 1.2 million articles in 2009, to extract 1.8 million instances, connecting 940,000 entities. As in most work in open information extraction, the assumption is that all co-occurrences of a pair of entities in a text express the same relation between them. Contexts from all co-occurrences found in text are aggregated, and then used as features for learning.
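The core labeling step of distant supervision can be sketched as follows; the seed instances and sentences below are invented for illustration. Every sentence containing both arguments of a known relation instance is taken as a (possibly noisy) positive training example for that relation.

```python
# A minimal sketch of distant-supervision labeling via substring matching;
# a real system would match entity mentions, not raw substrings.
seed_instances = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Mountain View"): "headquartered_in",
}

def label_sentences(sentences):
    examples = []
    for sent in sentences:
        for (arg1, arg2), relation in seed_instances.items():
            if arg1 in sent and arg2 in sent:
                examples.append((relation, arg1, arg2, sent))
    return examples

sents = ["Barack Obama was born in Honolulu, Hawaii.",
         "Google moved its campus within Mountain View."]
print([ex[0] for ex in label_sentences(sents)])
# ['born_in', 'headquartered_in']
```

The contexts of all such matches for a pair are then aggregated into a single feature vector for the classifier, per the assumption stated above.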

Wikipedia provides a unique environment for learning extraction patterns. As discussed briefly in Section 3.1.5, a Wikipedia infobox contains a summary of information, in the form of attribute-value pairs, where the attribute is in fact a relation, with one implicit argument being the concept denoted by the article within which it appears. The infobox, however, is not isolated: it is attached to a specific article. The information summarized in the infobox appears also in the associated article. Matching the information in the infobox with sentences in the article gives examples of relation occurrence in text, from which patterns can be learned. An example is shown in Figure 4.3.

Using Wikipedia allows one to build a corpus of sentences which contain instances of specific relations. Other sentences from the articles can provide negative examples (sentences without instances of the target relation). Having such a corpus facilitates a two-step learning procedure: (i) learn to identify sentences which contain a relation instance, and (ii) learn relation extractors from sentences that contain relation instances. In principle, any of the supervised learning methods described in Chapter 3 could be used in combination with an argument identification stage. One popular approach is to use Conditional Random Fields [Banko and Etzioni, 2008, Wu and Weld, 2007], as we have seen in Section 3.3.2. They are particularly appropriate for this problem, because they combine the strengths of sequence modeling with the advantages of supervised learning, to find in the given sentence the arguments (of varying length) of a targeted relation [Lafferty et al., 2001]. Considering the relations in the infobox and the associated article presented in Figure 4.3, a system would need to learn to recognize that the sentence "Jones is most famously played by Harrison Ford and has also been portrayed by River Phoenix …" contains (at least) an instance of the portrayed by relation, and to then extract the instances (Jones, portrayed by, Harrison Ford) and (Jones, portrayed by, River Phoenix).
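The infobox-to-text matching step can be sketched as follows; the infobox fragment and sentences paraphrase the Figure 4.3 example, and the matching here is simple substring search (a real system would match more robustly):

```python
# Pair each infobox attribute with the article sentences that mention one
# of its values; these pairs become training sentences for that attribute.
infobox = {"Portrayed by": ["Harrison Ford", "River Phoenix"],
           "Created by": ["George Lucas", "Steven Spielberg"]}
article = ["Jones is most famously played by Harrison Ford and has also "
           "been portrayed by River Phoenix.",
           "George Lucas and Steven Spielberg created the character."]

def training_sentences(infobox, article):
    return {attribute: [s for s in article if any(v in s for v in values)]
            for attribute, values in infobox.items()}

matches = training_sentences(infobox, article)
print(len(matches["Portrayed by"]), len(matches["Created by"]))  # 1 1
```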


Figure 4.3: Example of a Wikipedia article (on the fictional character Indiana Jones), showing that the information in the infobox (e.g., Created by: George Lucas, Steven Spielberg; Portrayed by: Harrison Ford, River Phoenix) can be found within the article's text.

Riedel et al. [2010] refined relation learning by distant supervision. They dispensed with the assumption that a pair of terms will have the same relation regardless of the context in which they appear. Instead, they assumed that, from all available contexts, at least one expresses the target relation. They used a factor graph to model the decision whether two entities identified in a context are related, and whether the relation between them is indeed the target one. S and D are the observed variables which model the source and the destination of a relation, a hidden variable Y models the relation between a source and a destination, and another hidden variable Z_i takes a true value if the relation captured in instance i (between source S_i and destination D_i) matches a relation Y. The aim is to compute the joint probability p(Z = z, Y = y|x), where x is the sequence of observations (the text containing relation instances). They used a subset of entity pairs from Freebase, mapped onto a text corpus extracted from the New York Times.


4.4 UNSUPERVISED RELATION EXTRACTION

Bootstrapping relies on a small set of seed instances or patterns iteratively increased in multiple passes over a corpus. Unfortunately, in open information extraction, where the data support for relation extraction is the Web, or a comparatively large corpus, multiple passes over the data are computationally neither desired nor even feasible. Moreover, when extracting all relations which could be encountered, one simply cannot specify seed examples for each of these relations. In such cases, relation extraction proceeds without a pre-specified list of relations, seeds, or patterns.

4.4.1 EXTRACTING IS-A RELATIONS

If we have a homogeneous cluster of closely related terms—such as Apple, Google, IBM, Oracle, Sun Microsystems… or cat, mouse, dog…—we can automatically induce is-a relations if we can name the cluster. Such a name might be company for Apple… and animal for cat… We can assume that the is-a relation holds between the name of the cluster and each of its members: is-a(Apple, company), is-a(Google, company)… and is-a(cat, animal), is-a(mouse, animal)…

This is what Pantel and Ravichandran [2004] did. They first clustered nouns using co-occurrence information in the method proposed by Pantel and Lin [2002]. The cluster, considered to represent a semantic class, is labeled with the hypernym of the elements in the cluster. For example, company would (ideally) be predicted for the automatically generated cluster of semantically related words:

Apple, Google, IBM, Oracle, Sun Microsystems…

Finding this hypernym is a three-step process:

i. represent the elements in the cluster using their grammatical collocates and score these collocates;⁷

ii. define a measure of representativeness for elements of the clusters, and extract the most representative elements which unambiguously describe the cluster;

iii. extract the name of the class using the grammatical relations (e.g., apposition, nominal subjects, patterns like "such as", and so on) which have been determined empirically to be good predictors of such information.

For example, the top four relations they found were the following:

• Apposition (N:appo:N), e.g., …Oracle, a company known for its progressive employment policies …

• Nominal subject (-N:subj:N),⁸ e.g., …Apple was a hot young company, with Steve Jobs in charge …

⁷ Grammatical collocates represent terms which appear together in a sentence and are related by a grammatical relation such as subject, object or noun modifier.

⁸ The minus indicates that the right-hand side of the relation is the head.


• Such as (-N:such as:N), e.g., …companies such as IBM must be weary …

• Like (-N:like:N), e.g., …companies like Sun Microsystems do not shy away from such challenges …

After determining the name of the class, is-a relations are added between each element of the class and the class name.

It is also possible to extract instances of is-a relations without clustering. For example, Kozareva et al. [2008] proposed an algorithm for learning hyponyms/hypernyms for a target hyponym/hypernym. Their algorithm instantiates as an exact-phrase query a doubly-anchored pattern (DAP) of the following general form:

"sem-class such as term1 and term2"

where sem-class is a semantic class, and term1 and term2 are members of this class.

The DAP is related to the following widely-used lexico-syntactic pattern introduced by Hearst [1992], which we have already seen in Section 4.2:

NP0 such as {NP1, NP2, …, (and | or)} NPn

Unlike that pattern, however, DAP posits exactly two arguments after such as and makes the coordinating conjunction and obligatory.

The DAP can be varied. For example, using the form

"sem-class such as term1 and *"

one can extract from the position of the asterisk the semantic class members that are semantically related to term1, potential co-hyponyms of term1.

Similarly, the form

"* such as term1 and term2"

enables the extraction of semantic classes common to the two terms, that is, common hypernyms of term1 and term2.

Finally, the two-placeholder form

"* such as term1 and *"

allows for the simultaneous acquisition of pairs of hypernyms and co-hyponyms of term1.

The power of DAP is in its use of two anchoring points, which act as disambiguators, thus yielding more accurate hyponym-hypernym pairs. For example, given the term jaguar, the Hearst-style singly-anchored pattern

“* such as jaguar”

would yield hypernyms such as cars and animals, which mix two senses of jaguar. In contrast, the doubly-anchored pattern

“* such as jaguar and *”

could prevent such sense-mixing by learning hypernym–hyponym–hyponym triples:

cats–jaguar–puma

Page 87: Semantic Relations Between Nominals

4.4. UNSUPERVISED RELATION EXTRACTION 73

predators–jaguar–leopard
cars–jaguar–ferrari
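Constructing the DAP variants above amounts to filling (or leaving open) the three slots of the general form; the helper name below is illustrative, and each returned string would be issued as an exact-phrase web query:

```python
# Build a doubly-anchored pattern, using '*' for any unfilled slot.
def dap(sem_class=None, term1=None, term2=None):
    return '"{} such as {} and {}"'.format(sem_class or "*",
                                           term1 or "*",
                                           term2 or "*")

print(dap("animals", "lion"))             # "animals such as lion and *"
print(dap(term1="jaguar", term2="puma"))  # "* such as jaguar and puma"
print(dap(term1="jaguar"))                # "* such as jaguar and *"
```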

Kozareva and Hovy [2010] further extended the method, making it capable of building entire taxonomies in a bootstrapping fashion, starting with a root concept and a basic-level concept or an instance. Figure 4.4 shows an example of how the process starts given the root concept animal and the instance lion. First, other basic-level concepts are learned, such as tiger, puma, deer and donkey, using the DAP "animals such as lion and *". Then, hypernyms for these basic-level concepts are learned using a DAP with a placeholder for the superclass, e.g., "* such as lion and tiger"; this yields intermediate categories like vertebrate, chordate, feline, and mammal. In order to tackle semantic drift and to focus on the most reliable instances, thus minimizing the number of Web queries, a graph algorithm is used to rank the unexplored nodes at each iteration, based on their popularity (the ability of a class to be discovered by other class instances; in-degree in graph terms) and productivity (the ability of an instance to discover new instances; out-degree in graph terms).

Figure 4.4: Taxonomy induction from scratch using DAPs, starting from the root concept animal and the instance lion. Adapted from [Kozareva and Hovy, 2010].

Finally (not shown in Figure 4.4), special patterns like "X are Y like", "X such as Y", "X including Y", "X like Y" and "such X as Y" are used to determine the mutual position of the harvested hypernyms with respect to one another. For example, here is how the first pattern can be used to determine whether vertebrate is a more general or a more specific term compared to chordate. Check which of the two queries yields more Web page hits: "vertebrates are chordates that" or "chordates are vertebrates that"; if the former query returns more hits, chordates would be assumed to be the more general term. This is applied to all pairs of intermediate nodes, yielding a graph in which the longest path is to be found so that unnecessary transitive links can be removed; this yields chains like: animal → chordate → vertebrate → mammal → feline → lion. These chains are finally merged to yield a taxonomy, as shown in Figure 4.5.

Figure 4.5: Part of the taxonomy for animal, induced using DAPs. Adapted from [Kozareva and Hovy, 2010].
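The hit-count test for ordering two harvested hypernyms can be sketched as follows; `hit_count` stands in for a web search engine's result count, and the counts below are invented for illustration:

```python
# term_a subsumes term_b if the query '<term_b> are <term_a> that' yields
# more hits than the reverse query.
def more_general(term_a, term_b, hit_count):
    return (hit_count('"{} are {} that"'.format(term_b, term_a)) >
            hit_count('"{} are {} that"'.format(term_a, term_b)))

fake_hits = {'"vertebrates are chordates that"': 120,
             '"chordates are vertebrates that"': 3}
lookup = lambda q: fake_hits.get(q, 0)
print(more_general("chordates", "vertebrates", lookup))  # True
print(more_general("vertebrates", "chordates", lookup))  # False
```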

4.4.2 EMERGENT RELATIONS IN OPEN RELATION EXTRACTION

In Web-scale information extraction, giving the system an a priori set of relations would preclude the discovery of relations which may exist but have not been pre-specified. In such a situation, it is necessary for the system to be able to identify novel relations in text. A natural choice is to use verbs and prepositions as relation holders, but they can be ambiguous. Different verbs may express the same relation (e.g., create and produce), and some verbs may not express relations at all (e.g., "It rains," "I do").


There is a substantial general difficulty with relation extraction: the same relation can be expressed in many ways. For example, the phrases a shot against the flu and a shot to prevent the flu arguably express the same relation between shot and flu. The same relation can also be expressed implicitly by the noun compound flu shot, where there is no linguistic material to connect the relation arguments. This abundance of expressions is even more of a problem in open information extraction. It is necessary to recognize when different phrases and expressions represent the same relation and when they refer to different relations. This is often treated as a clustering problem: cluster phrases so that there is one relation per cluster and one cluster per relation.

A key element of such clustering is the definition of an appropriate function to characterize the similarity between two relation expressions. In the Resolver system [Yates and Etzioni, 2009], part of the Machine Reading project, two information sources are used in defining the function: (1) the similarity of relation instances as character strings, that is to say, edit distance, and (2) the similarity of all contexts where these instances appear, that is to say, distributional similarity.

String similarity model   One possible way to decide whether two phrases s1 and s2 express the same relation R is to compare them as character strings, and then to define a suitable score or probability based on string similarity. This gives rise to the string similarity model:

Pr(R|sim(s1, s2)) = (α · sim(s1, s2) + 1) / (α + β)    (4.9)

In this formula, α and β help smooth the probability, giving more or less weight to the similarity measure. Overall, the probability ranges from 1/β for α = 0 to sim(s1, s2) for very large values of α. Yates and Etzioni recommended α = 20 and β = 5, even though the choice of α and β makes little difference for Resolver. The similarity score sim(s1, s2) is calculated using an appropriate string similarity function. The Levenshtein string edit distance was found very appropriate for this task [Cohen et al., 2003].
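A sketch of the string similarity model in code, with the recommended α and β. Normalizing the Levenshtein distance into a [0, 1] similarity by the longer string's length is our illustrative choice; Resolver's exact similarity function may differ.

```python
# Standard dynamic-programming Levenshtein edit distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sim(s1, s2):
    """Edit distance normalized to a [0, 1] similarity."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))

def prob_same_relation(s1, s2, alpha=20, beta=5):
    return (alpha * sim(s1, s2) + 1) / (alpha + beta)  # Eq. 4.9

# Near-identical relation strings get a higher probability:
print(prob_same_relation("is headquartered in", "is headquartered at") >
      prob_same_relation("is headquartered in", "was born in"))  # True
```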

Distributional similarity model   Another way to decide whether phrases s1 and s2 express the same relation R may be to work with distributional similarity. Distributional representations frequently appear in calculations of word similarity; for example, Pasca [2007] considered it in entity extraction. For relation similarity, Lin and Pantel's [2001] system called Discovery of Inference Rules from Text (DIRT) learns paraphrases over paths in dependency parse trees. It compares the distribution of the arguments of one relation expression to those of another relation expression, both represented as dependency triples.⁹ The working assumption is that if two triples tend to link similar sets of words, then the semantics of the corresponding relation expressions should be similar as well. For example, here are the top 20 paraphrases DIRT finds for X solves Y:

Y is solved by X, X resolves Y, X finds a solution to Y, X tries to solve Y, X deals with Y, Y is resolved by X, X addresses Y, X seeks a solution to Y, X does something about Y, X solution to Y, Y is resolved in X, Y is solved through X, X rectifies Y, X copes with Y, X overcomes Y, X eases Y, X tackles Y, X alleviates Y, X corrects Y, X is a solution to Y, X makes worse Y, X irons out Y

⁹ (r, w1, w2) for the path argument–syntactic relation–argument. Arguments are nominals, relations are verbs.


The similarity measure adopted in DIRT [Lin, 2001] is based on the mutual information between two relation expressions r1 and r2, e.g., X solves Y and X finds a solution to Y. It takes into account the extracted instances, represented as a set D of triples of the form (rel, arg1, arg2), e.g., (solves, the student, the problem) and (finds a solution to, the government, the issue), and the similarities between the corresponding arguments:

s_MI(r1, r2) = √( sim2(r1, r2) × sim3(r1, r2) )    (4.10)

where sim2(r1, r2) measures the similarity between the second elements of the triples and sim3(r1, r2) between the third elements. The argument similarity measures are calculated from the extracted triples D:

sim_x(r1, r2) = [ ∑_{a ∈ proj_x(D_{1=r1}) ∩ proj_x(D_{1=r2})} ( mi_{1,x}(r1, a) + mi_{1,x}(r2, a) ) ] / [ ∑_{a ∈ proj_x(D_{1=r1})} mi_{1,x}(r1, a) + ∑_{a ∈ proj_x(D_{1=r2})} mi_{1,x}(r2, a) ]    (4.11)

where D_{x=s1,y=s2} is the subset of D with s1 and s2 at positions x and y, respectively, and proj_x(D) = {s | (d1, d2, d3) ∈ D, d_x = s} is the projection of D onto dimension x. Finally, mi_{1,x}(r, a) is the mutual information between the argument at position x and relation r:

mi_{1,x}(r, a) = log( (|D_{1=r,x=a}| × |D|) / (|D_{1=r}| × |D_{x=a}|) )    (4.12)
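Equations 4.10–4.12 can be transcribed over a toy set of extracted triples of the form (relation, arg1, arg2); the triples below are invented for illustration.

```python
import math

D = [("solves", "student", "problem"),
     ("solves", "government", "crisis"),
     ("resolves", "student", "problem"),
     ("resolves", "government", "crisis"),
     ("eats", "cat", "mouse")]

def mi(r, a, x):
    """Eq. 4.12: mutual information of relation r and filler a at slot x."""
    n_ra = sum(1 for d in D if d[0] == r and d[x] == a)
    n_r = sum(1 for d in D if d[0] == r)
    n_a = sum(1 for d in D if d[x] == a)
    return math.log(n_ra * len(D) / (n_r * n_a))

def sim_x(r1, r2, x):
    """Eq. 4.11: similarity of the arguments of r1 and r2 at slot x."""
    args1 = {d[x] for d in D if d[0] == r1}
    args2 = {d[x] for d in D if d[0] == r2}
    num = sum(mi(r1, a, x) + mi(r2, a, x) for a in args1 & args2)
    den = (sum(mi(r1, a, x) for a in args1)
           + sum(mi(r2, a, x) for a in args2))
    return num / den if den else 0.0

def s_mi(r1, r2):
    """Eq. 4.10: geometric mean of the two slot similarities."""
    return math.sqrt(sim_x(r1, r2, 1) * sim_x(r1, r2, 2))

print(s_mi("solves", "resolves") > s_mi("solves", "eats"))  # True
```

Two relation expressions that link the same sets of arguments (solves, resolves) come out far more similar than ones with disjoint arguments (solves, eats).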

Extracted shared property model   A third method of deciding whether two relation expressions s1 and s2 represent the same relation R is to use the similarity of the properties with which they appear in a corpus. For example, if both (lacks, Mars, ozone layer) and (lacks, Red Planet, ozone layer) were observed, then Mars and Red Planet would share a context consisting of lacks and ozone layer. In the Extracted Shared Property (ESP) model [Yates and Etzioni, 2007], such a shared context is called a property of Mars and Red Planet.

ESP models the extraction of relation expressions like (lacks, Mars, ozone layer) as a generative process, inspired by the URNS model [Downey et al., 2005], where a number of properties P_i are associated with each string s_i. These properties are written on balls, which are placed in an urn. Given two strings s_i and s_j, there are two separate urns, containing P_i and P_j balls, respectively. Then, a set D_i of n_i = |D_i| relation expressions which contain s_i is generated by selecting n_i balls from the first urn, and a set D_j of n_j = |D_j| relation expressions which contain s_j is generated by selecting n_j balls from the second urn. Let k = |D_i ∩ D_j| be the number of observed shared properties in the samples D_i and D_j. Let also S_{i,j} be the number of potential shared properties between the two urns.

The ESP model assumes that if s_i and s_j corefer, then S_{i,j} should be equal to P_min = min(P_i, P_j), and if they do not, S_{i,j} should be strictly smaller than P_min. Unfortunately, the actual number of potential shared properties S_{i,j} between the two urns is unknown; we only have access to D_i, D_j, P_i, and P_j (and we can calculate k based on them). Thus, ESP estimates the probability that two relation strings s_i and s_j corefer:

Pr(R^t_{i,j} | D_i, D_j, P_i, P_j) = Pr(k | n_i, n_j, P_i, P_j, S_{i,j} = P_min) / ∑_{k ≤ S_{i,j} ≤ P_min} Pr(k | n_i, n_j, P_i, P_j, S_{i,j})    (4.13)

where

Pr(k | n_i, n_j, P_i, P_j, S_{i,j}) = Count(k, n_i, n_j | P_i, P_j, S_{i,j}) / [ (P_i choose n_i) · (P_j choose n_j) ]    (4.14)

In the last equation above, the denominator shows the number of different ways in which we can sample n_i balls from the first urn and n_j balls from the second urn. The numerator shows the number of ways in which the sampling can be done so that the samples share exactly k attributes:

Count(k, n_i, n_j | P_i, P_j, S_{i,j}) = (S_{i,j} choose k) · ∑_{r,s ≥ 0} (S_{i,j} − k choose r + s) · (r + s choose r) · (P_i − S_{i,j} choose n_i − (k + r)) · (P_j − S_{i,j} choose n_j − (k + s))    (4.15)
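Equations 4.13–4.15 transcribe directly; the helper `c()` guards the binomial coefficient against out-of-range arguments (those terms are zero), and the numbers in the example are invented for illustration.

```python
from math import comb

def c(n, k):
    """Binomial coefficient, zero outside the range 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0

def count(k, n_i, n_j, P_i, P_j, S):
    """Eq. 4.15: ways to sample so the two samples share exactly k properties."""
    total = sum(c(S - k, r + s) * c(r + s, r)
                * c(P_i - S, n_i - (k + r)) * c(P_j - S, n_j - (k + s))
                for r in range(n_i + 1) for s in range(n_j + 1))
    return c(S, k) * total

def pr_k(k, n_i, n_j, P_i, P_j, S):
    """Eq. 4.14: probability of observing k shared properties given S."""
    return count(k, n_i, n_j, P_i, P_j, S) / (c(P_i, n_i) * c(P_j, n_j))

def pr_corefer(k, n_i, n_j, P_i, P_j):
    """Eq. 4.13: probability that the two strings corefer."""
    p_min = min(P_i, P_j)
    return (pr_k(k, n_i, n_j, P_i, P_j, p_min) /
            sum(pr_k(k, n_i, n_j, P_i, P_j, S) for S in range(k, p_min + 1)))

# Samples sharing both of two observed properties support coreference more
# strongly than samples sharing only one:
print(pr_corefer(2, 2, 2, 4, 4) > pr_corefer(1, 2, 2, 4, 4))  # True
```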

4.4.3 EXTREME UNSUPERVISED RELATION EXTRACTION

Davidov and Rappoport [2008b] aim for comprehensive extraction of entities and relations. To achieve this, they map onto texts a very generic pattern:

Prefix CW1 Infix CW2 Postfix

where Prefix, Infix and Postfix consist of only high-frequency words (HFWs)—words or punctuation with a frequency above a given lower bound—while CWi are content words—with a frequency between given bounds. The Prefix, Infix and Postfix make up the context of the relation between the content words, and all such contexts are gathered together for each pair of CWs. These clusters are filtered and then combined based on shared contexts, and the final clustering represents different relations and their possible expressions in texts. Examples of clusters show patterns with which we are already familiar:

label              patterns
(pets, dogs)       { such X as Y, X such as Y, Y and other X }
(phone, charger)   { buy Y accessory for X!, shipping Y for X, Y is available for X, Y are available for X, Y are available for X systems, Y for X }

The clusters are labeled with the five highest-ranking instances according to a HITS score. The HITS measure quantifies the affinity of a word pair (w1, w2) to a cluster C. It considers how many of the patterns which can be used to obtain (w1, w2) can be used to extract multiple word pairs (P_core), or just the current pair (P_unconf):

HIT S(C, (w1, w2)) = |{p,(w1,w2) appears in p∈Pcore}||Pcore| +

α|{p,(w1,w2) appears in p∈Punconf }|

|Punconf |(4.16)
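A minimal sketch of the HITS score in Eq. 4.16, under the assumption that each pattern is represented by the set of word pairs it extracts; the function name and the default value of alpha are illustrative:

```python
def hits_score(pair, p_core, p_unconf, alpha=0.1):
    """Affinity of a word pair to a cluster (Eq. 4.16): the fraction of the
    cluster's core patterns that extract the pair, plus an alpha-weighted
    fraction of its unconfident, pair-specific patterns. Each pattern is
    given as the set of word pairs it extracts (an assumed representation)."""
    core = sum(pair in extracted for extracted in p_core) / len(p_core)
    unconf = sum(pair in extracted for extracted in p_unconf) / len(p_unconf)
    return core + alpha * unconf
```

A pair extracted by every core pattern scores 1 before the unconfident-pattern bonus; lowering alpha reduces the influence of patterns seen with only one pair.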

The performance of this algorithm is controlled by several parameters: the frequency boundaries that determine HFWs and CWs, the window size within which the pattern is applied (a narrower window means more precision but lower recall, while a wider one admits more specific and possibly incorrect relations), the minimal overlap for cluster merging, and α, which controls how strongly very specialized patterns are penalized.

An advantage of this approach is that it is language-independent: Davidov and Rappoport [2008a] apply it to English and Russian. A further benefit, demonstrated by Davidov and Rappoport [2008b], is that the clusters can be used as background relational features (in the sense discussed in Section 3.2.2) for supervised relation classification.

This extreme case of relation extraction from open texts relies on the assumption that when two entities co-occur, they always interact or are connected in the same manner. For ontological relations this assumption is relatively safe, but context matters. A person, for example, may be connected to a location by such different relations as being born, working or traveling there.

The assumption was applied to solving SAT analogy questions, collected by Turney and Littman [2005], with performance higher than the average obtained by human subjects. In the verbal analogy questions on SAT exams, one must estimate relational similarity in order to choose, from a set of 5-6 given word pairs, the pair which most resembles the test pair. Two examples are shown in Table 4.1, where the subject must decide whether, e.g., the relation between ostrich and bird is more similar to the relation between lion and cat, or rather between primate and monkey.

Table 4.1: SAT verbal analogy: example questions. The correct answer is in italics, and the distractors are in plain text.

ostrich:bird              palatable:toothsome
(a) lion:cat              (a) rancid:fragrant
(b) goose:flock           (b) chewy:textured
(c) ewe:sheep             (c) coarse:rough
(d) cub:bear              (d) solitude:company
(e) primate:monkey        (e) no choice



4.5 SELF-SUPERVISED RELATION EXTRACTION

While deeper processing (e.g., syntactic parsing) is altogether infeasible on a Web scale, it is possible on a smaller text collection. If it is done, it can facilitate the self-learning of relation extractors. In the Machine Reading project at the University of Washington, the relation extraction process begins with parsing a small corpus. The parsing results are then used to extract and annotate relation instances based on heuristics related to the connecting path between entity mentions. These automatically annotated examples are applied to the training of relation extractors. We note that those are generic extractors: they extract relation instances not guided by or assigned to any particular relation type. The features used are shallow (lexical and part-of-speech information from the immediate context and the connecting path between the two entities, similar to those in bootstrapping), so they are easily obtained and applied to Web-scale text collections.

Extracting entity-linking phrases as relations can lead to relations which are not coherent (do not have a meaningful interpretation, e.g., The Mark 14 was central to the torpedo scandal of the fleet. → was central torpedo), uninformative (e.g., …is the author of … → is) or too specific (e.g., is offering only modest greenhouse gas reductions targets at). Such unwanted effects can be partly remedied by imposing syntactic, positional and argument frequency constraints on the extracted relations [Fader et al., 2011].

Adverse effects can also be avoided by targeting relations with specific properties. Lin et al. [2010] work with functional relations, in which exactly one value of the second argument is possible for a given value of the first argument. Consider, for example, the relation birthplace between persons and locations: everyone has precisely one birthplace. It may be too difficult to check all pairs of arguments, however, even if one managed to circumvent argument ambiguity and other likely restrictions on relation instances.
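The functionality property lends itself to a simple consistency check over extracted pairs. The sketch below is our illustration, not Lin et al.'s method: it flags first arguments that appear with more than one second-argument value.

```python
from collections import defaultdict

def functionality_violations(extractions):
    """For a functional relation such as birthplace, each first argument
    admits exactly one second-argument value, so any first argument
    extracted with several distinct values signals noise (or argument
    ambiguity). Returns the offending arguments and their value sets."""
    values = defaultdict(set)
    for arg1, arg2 in extractions:
        values[arg1].add(arg2)
    return {arg1: vals for arg1, vals in values.items() if len(vals) > 1}
```

In practice, as the text notes, such a check must contend with argument ambiguity: two mentions of the same name may denote different people.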

Downey et al. [2005, 2010] rely on the redundancy of information and the “KnowItAll hypothesis”: extractions drawn more frequently from distinct sentences in a corpus are more likely to be correct. When working on the Web scale, however, it is difficult to test whether a particular example satisfies this hypothesis, because estimating frequency is not straightforward. The same relation can be expressed in different ways, and frequency is (possibly) not absolute but rather relative. For example, in October 2012 a search for "Elvis killed JFK" yielded 19,300 hits, a count high enough to suggest a correct extraction until it is compared with the search results for "Oswald killed JFK" (39,200 hits). Extracting the correct information, then, depends not only on redundancy, but also on corroborating the outcome of different extraction heuristics. The scoring depends on measures similar to those in the extracted shared property model, as described in Section 4.4.2.
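The point about relative rather than absolute frequency can be illustrated with a toy normalization over competing extractions (the numbers follow the example above; this is not the authors' scoring model):

```python
def relative_support(hit_counts):
    """Toy illustration: normalize each competing extraction's count by the
    total, so a candidate fact is judged against its competitors rather
    than by its raw count in isolation."""
    total = sum(hit_counts.values())
    return {fact: count / total for fact, count in hit_counts.items()}

scores = relative_support({"Elvis killed JFK": 19300,
                           "Oswald killed JFK": 39200})
```

Here the Elvis variant, despite its high raw count, receives only about a third of the total support.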

4.6 WEB-SCALE RELATION EXTRACTION

The approaches reviewed until this point exploit or adjust a particular aspect of relation extraction: some focus on named entities, some use patterns relying on specific methods for filtering candidates, and so on. With the field being open, and each approach free to focus on different sets of relations,



training and test data, direct comparison between these approaches is very complicated, if not impossible. Also, an intrinsic evaluation of the extracted knowledge may not be desirable: relation extraction is not an end task; its purpose is to build resources to be used by other NLP and AI applications. Instead of making a forced and ultimately unreliable comparison, we choose to present in more detail two large-scale knowledge acquisition approaches which harvest relations from the Web. Because we have already reviewed some of the methods used, we will give here an overview, and show in particular how relations and relation extraction are learned by these systems.

4.6.1 NEVER-ENDING LANGUAGE LEARNER

Never-Ending Language Learner (NELL)10 is a continuing knowledge acquisition system at Carnegie Mellon University. Its starting point is a seed ontology, consisting of approximately 600 categories and relations, each with a small set of seed examples (10-20). NELL aims to enrich this ontology with new concepts, new concept instances, new instances of the existing relations and novel relations. It uses a corpus of 500 million Web pages, which it queries and analyzes daily. Manual intervention is minimal (a few minutes a day), and serves to remove some of the erroneous facts retrieved by the system.

NELL implements a form of bootstrapping that relies on coupled learning to avoid semantic drift (see Section 4.2.2). The intuition is that learning an isolated relation is harder and more error-prone than learning multiple relations simultaneously, which can provide additional constraints and thus keep the system focused. Coupled learning allows the system to add new instances to relations that are already in the ontology, by filtering candidates extracted from its corpus of Web pages. In order to add new relations to the existing ontology, NELL relies on clustering techniques: it first finds pairs of known category instances from its ontology which co-occur in texts, which it then co-clusters together with the phrases (candidate relation expressions) that connect these pairs where they are encountered.
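The coupling intuition can be sketched as a type-constraint filter over candidate instances. All names below are illustrative, and this is a deliberate simplification, not NELL's actual architecture:

```python
def coupled_filter(candidates, relation_argtypes, entity_categories, exclusive):
    """Keep a candidate pair for a relation only if both arguments belong
    to the relation's declared argument categories and to no category
    declared mutually exclusive with them. candidates maps relation names
    to lists of argument pairs; relation_argtypes gives each relation's
    (type1, type2); entity_categories maps entities to category sets;
    exclusive maps a category to the categories it excludes."""
    kept = {}
    for relation, pairs in candidates.items():
        t1, t2 = relation_argtypes[relation]
        kept[relation] = [
            (a, b) for a, b in pairs
            if t1 in entity_categories.get(a, set())
            and t2 in entity_categories.get(b, set())
            and not (entity_categories.get(a, set()) & exclusive.get(t1, set()))
            and not (entity_categories.get(b, set()) & exclusive.get(t2, set()))
        ]
    return kept
```

The benefit of coupling is visible even in this toy form: an erroneous candidate such as (Mozart, piano) for bornIn is rejected because piano is not a city, without any relation-specific filtering.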

In September 2012, NELL’s knowledge base contained approximately 15 million confidence-scored relations (beliefs), 1.4 million of which had high confidence scores, estimated at 85% precision [Mohamed et al., 2011].

4.6.2 MACHINE READING AT THE UNIVERSITY OF WASHINGTON

The AI group at the University of Washington is developing a machine reading system, which has an open information extraction component.11 Apart from extracting and clustering entities from the Web, the system also aims to extract relations. The continuous work on this large-scale project shows an interesting evolution of relation extraction approaches. The initial open IE system was KnowItAll [Etzioni et al., 2005], which implemented a form of bootstrapping using Hearst patterns. It then evolved into TextRunner [Banko et al., 2007], a self-supervised system which builds models for specific relations from a small corpus, which are then applied to candidate extractions from a

10 rtw.ml.cmu.edu/rtw/
11 ai.cs.washington.edu/projects/open-information-extraction



larger corpus. Kylin [Wu and Weld, 2007] and WPE [Hoffmann et al., 2010] bootstrap starting from relations in Wikipedia infoboxes combined with their associated articles to learn extractors for these relations. WOE [Wu and Weld, 2010] extends Kylin to open information extraction, and employs either part-of-speech or dependency patterns for relation extraction. These patterns define the possible expression of a relation as a phrase consisting of at least one verb and/or a preposition. ReVerb [Fader et al., 2011] introduces lexical and syntactic constraints on potential relation expressions that allow the system to obtain cleaner extractions, with fewer uninformative or incoherent expressions. OLLIE [Mausam et al., 2012] improves over WOE by allowing more flexible patterns that include nouns, such as Bill Gates, co-founder of Microsoft ..., which WOE would have missed. Patterns are gathered and generalized (e.g., from words to parts of speech), which boosts recall. While previous work has also recognized the fact that some relations have dependencies (they hold only for some period of time, or are contingent upon external conditions), it does not represent this additional information. OLLIE confronts these issues and adds a ClausalModifier field to the relation tuple when a dependent clause is found in the sentence that modifies the main assertion from which the relation was extracted.

In conclusion, this evolving line of research, embedded into a large-scale system, shows the robustness and endurance of bootstrapping, as well as interesting methods for addressing its weaknesses.

4.7 SUMMARY

In this chapter, we have reviewed methods which require little or no supervision, aiming for relation extraction from large text collections or from the Web. Bootstrapping is still a popular method. Its basic iterative matching and extraction process has been enhanced with measures which enable the filtering of candidate patterns and instances to avoid noise and semantic drift. Another trend is the development of large-scale knowledge bases or structured knowledge repositories through collaborative endeavors over the Web. It has facilitated the development of distant and semi-supervised approaches, which combine the strengths of supervised and bootstrapping methods to process more relations with better results. These methods are part of more traditional information extraction, where the relations to be learned are given to the system as seed pairs or patterns.

Open relation extraction aims for comprehensive relation extraction, often in combination with entity identification. Extreme approaches consider very generic patterns based on word frequencies, which are iteratively clustered to counter the variation in lexicalization of relation expressions. The adoption of variants of bootstrapping in large-scale knowledge acquisition systems like NELL and Machine Reading at the University of Washington shows the endurance of this approach. Both in NELL and in the Machine Reading project we have seen interesting methods developed to counter weaknesses of the basic bootstrapping algorithm: NELL employs coupled training, using to its advantage the mutual constraints that relations sharing an argument impose on each other, to obtain cleaner extractions. In Machine Reading, we find familiar patterns, evolved from automatically discovered sequences to patterns over parts of speech with different restrictions.



Information extraction, and its subfield of relation extraction, are very active research areas. We have shown some of the most popular and successful approaches, but there is much more in the literature.



CHAPTER 5

Conclusion

We have fashioned our book to tell the reader where semantic relations between nominals come from, how one can work with them and what they are good for. We have shown how relation instances are recognized in large text corpora, and how relations can be learned. These are pieces of a much larger puzzle: NLP finds knowledge in a lot of text and gets the deeper meaning of a little text. A glimpse of history in Chapter 2 suggests that knowledge representation and acquisition have become intertwined in our field with semantic analysis, and that this interrelationship can be quite helpful.

Manual construction of knowledge bases is a process both costly and inherently limited in scope, though it is accurate insofar as the people who do it do not make mistakes. Automated knowledge acquisition or extraction from text is an obvious alternative, given that most of what we know is written down (and now put on the Web). The practically unlimited amount of data available today for such work is a boon and a challenge. The ample redundancy of Web-scale data can vastly increase the robustness of knowledge extraction systems, but it introduces a great deal of inaccurate information. It is very difficult to brave the tide of raw text. Chapter 4 has reviewed methods of identifying relation instances in such texts while dealing with noise and harnessing redundancy.

Learning relations sits at the other end of the spectrum: there is a list of pre-specified relations whose manifestations in texts we model so that new instances can be automatically tagged. This type of processing requires annotated data in addition to a properly justified relation inventory. Chapter 3 discusses the various features useful for learning relations between nominals, and the algorithms usually favoured in such learning.

Rather predictably, there is a gray area between acquisition of relations from large raw texts and learning from smaller annotated texts. This is where Wikipedia “steps in.” It is large and largely free-form, yet structured enough to allow semi-supervised knowledge acquisition to thrive. While the scale is not nearly as massive as the Web (or however much of it search engines let us access), desirable relations and their instances are present in Wikipedia to a degree which makes supervised relation learning possible. Wikipedia offers an intriguingly productive connection between pure text and the more conceptual data in infoboxes.

The study of semantic relations in texts continues at both ends of the spectrum and all the way throughout it. Neither statistical nor symbolic processing can be said to dominate, so it is encouraging that there are attempts to marry them. A case in point is the automated acquisition of ontologies (something evidently symbolic) using statistical techniques. Such combined methods are, as far as we can see today, a recipe for future successes in work on semantic relations between nominals.



Bibliography

ACE. Automatic Content Extraction 2008 Evaluation Plan: Assessment of Detection and Recognition of Entities and Relations Within and Across Documents, 2008. http://www.itl.nist.gov/iad/mig/tests/ace/2008/doc/ace08-evalplan.v1.2d.pdf. 26

Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94, New York, NY, USA, 2000. Association for Computing Machinery. DOI: 10.1145/336597.336644. 63, 65, 66

Thomas Ahlswede and Martha Evens. Parsing vs. text processing in the analysis of dictionary definitions. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, N.Y., 7-10 June 1988, pages 217–224, 1988. DOI: 10.3115/982023.982050. 60, 61

Hiyan Alshawi. Processing dictionary definitions with phrasal pattern hierarchies. American Journal of Computational Linguistics, 13(3):195–202, 1987. 60

Robert Amsler. A taxonomy for English nouns and verbs. In Proceedings of the 19th Annual Meeting of the Association for Computational Linguistics, Stanford University, Stanford, California, 29 June - 1 July 1981, pages 133–138, 1981. DOI: 10.3115/981923.981959. 60, 62

Sören Auer, Christian Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a Web of open data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, Busan, Korea, November 11-15, 2007, pages 722–735, 2007. DOI: 10.1007/978-3-540-76298-0_52. 34, 35

Michele Banko and Oren Etzioni. The tradeoffs between open and traditional relation extraction. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, Ohio, 15-20 June 2008, pages 28–36, 2008. 69

Michele Banko, Michael Cafarella, Stephen Sonderland, Matt Broadhead, and Oren Etzioni. Open information extraction from the Web. In Proceedings of the 22nd Conference on the Advancement of Artificial Intelligence, Vancouver, B.C., Canada, 22-26 July 2007, pages 2670–2676, 2007. DOI: 10.1145/1409360.1409378. 80

Ken Barker and Stan Szpakowicz. Semi-automatic recognition of noun modifier relationships. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montréal, 1998, pages 96–102, 1998. DOI: 10.3115/980845.980862. 16, 17



Laurie Bauer. Compounding. In Martin Haspelmath, editor, Language Typology and Language Universals. Mouton de Gruyter, The Hague, 2001. 11

Brandon Beamer, Suma Bhat, Brant Chee, Andrew Fister, Alla Rozovskaya, and Roxana Girju. UIUC: a knowledge-rich approach to identifying semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-1), Prague, Czech Republic, 23-24 June 2007, pages 386–389, 2007. DOI: 10.1016/j.ipm.2009.09.002. 56

Matthew Berland and Eugene Charniak. Finding parts in very large corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Md., 20-26 June 1999, pages 57–64, 1999. DOI: 10.3115/1034678.1034697. 10, 63

Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. An algorithm that learns what’s in a name. Machine Learning, 34(1-3):211–231, February 1999. ISSN 0885-6125. URL http://dx.doi.org/10.1023/A:1007558221122. DOI: 10.1023/A:1007558221122. 50

Christian Blaschke, Miguel A. Andrade, Christos Ouzounis, and Alfonso Valencia. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99), Heidelberg, Germany, 1999. 63

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. 40

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92–100, 1998. DOI: 10.1145/279943.279962. 68

Sergey Brin. Extracting patterns and relations from the World Wide Web. In Selected papers from the International Workshop on the World Wide Web and Databases, WebDB ’98, pages 172–183, London, UK, 1998. Springer-Verlag. DOI: 10.1007/s11280-007-0022-0. 63, 65, 66

Ted Briscoe, John Carroll, and Rebecca Watson. The second release of the RASP system. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006, 2006. DOI: 10.3115/1225403.1225423. 40

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, T. J. Watson, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992. 40

Justus Buchler, editor. The philosophy of Peirce: selected writings. Harcourt, Brace & Co., 1940. 97



Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, and Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics, 9(1):207, 2008. ISSN 1471-2105. URL http://www.biomedcentral.com/1471-2105/9/207. DOI: 10.1186/1471-2105-9-207. 51

Razvan Bunescu and Raymond J. Mooney. A shortest path dependency kernel for relation extraction. In Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP-05), Vancouver, Canada, 2005. DOI: 10.3115/1220575.1220666. 48

Cristina Butnariu and Tony Veale. A concept-centered approach to noun-compound interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 81–88, Manchester, UK, 2008. 43

Cristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. SemEval-2 Task 9: The interpretation of noun compounds using paraphrasing verbs and prepositions. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2), Uppsala, Sweden, 11-16 July 2010, pages 39–44, 2010. 18

Michael Cafarella, Michele Banko, and Oren Etzioni. Relational Web search. Technical Report 2006-04-02, University of Washington, Department of Computer Science and Engineering, 2006. 4

Nicola Cancedda, Eric Gaussier, Cyril Goutte, and Jean-Michel Renders. Word-sequence kernels. Journal of Machine Learning Research, 3:1059–1082, 2003. URL http://jmlr.csail.mit.edu/papers/v3/cancedda03a.html. 48

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for Never-Ending Language Learning. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010), 2010a. 65

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), 2010b. DOI: 10.1145/1718487.1718501. 68

Joseph B. Casagrande and Kenneth Hale. Semantic relationships in Papago folk-definition. In Dell H. Hymes and William E. Bittle, editors, Studies in southwestern ethnolinguistics, pages 165–193. Mouton, The Hague and Paris, 1967. 11

Roger Chaffin and Douglas J. Herrmann. The similarity and diversity of semantic relations. Memory & Cognition, 12(2):134–141, 1984. DOI: 10.3758/BF03198427. 11, 12



Yee Seng Chan and Dan Roth. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING-10), Beijing, China, 2010. 40

Eugene Charniak. Toward a model of children’s story comprehension. Technical Report AITR-266 (hdl.handle.net/1721.1/6892), Massachusetts Institute of Technology, 1972. 9

Martin S. Chodorow, Roy Byrd, and George Heidorn. Extracting semantic hierarchies from a large on-line dictionary. In Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Ill., 8-12 July 1985, pages 299–304, 1985. DOI: 10.3115/981210.981247. 60, 61

Massimiliano Ciaramita, Aldo Gangemi, Esther Ratsch, Jasmin Šaric, and Isabel Rojas. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, 30 July - 5 August 2005, 2005. 10

Kevin Cohen. Handbook of Natural Language Processing, chapter 27. BioNLP: Biomedical Text Mining. CRC Press, Taylor and Francis Group, Boca Raton, FL, 2nd edition, 2010. 36

William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, Acapulco, Mexico, August 9-10, 2003, pages 73–78, 2003. 75

Michael Collins and Nigel Duffy. Convolution kernels for natural language. In Proceedings of the 15th Conference on Neural Information Processing Systems (NIPS-01), Vancouver, Canada, 2001. URL http://books.nips.cc/papers/files/nips14/AA58.pdf. 48, 49

Corinna Cortes and Vladimir Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995. DOI: 10.1023/A:1022627411411. 47

Seana Coulson. Semantic Leaps: Frame-Shifting and Conceptual Blending in Meaning Construction. Cambridge University Press, Cambridge, 2001. DOI: 10.1017/CBO9780511551352. 14

M. Craven and J. Kumlien. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77–86, 1999. 69

Aron Culotta, Andrew McCallum, and Jonathan Betz. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL ’06, pages 296–303, 2006. URL http://dx.doi.org/10.3115/1220835.1220873. DOI: 10.3115/1220835.1220873. 51



James R. Curran, Tara Murphy, and Bernhard Scholz. Minimising semantic drift with Mutual Exclusion Bootstrapping. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, Melbourne, Australia, 19-21 September 2007, pages 172–180, 2007. 66

Dmitry Davidov and Ari Rappoport. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In Proceedings of the 46th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (ACL-08:HLT), Columbus, OH, 2008a. 78

Dmitry Davidov and Ari Rappoport. Classification of semantic relationships between nominals using pattern clusters. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, Ohio, 15-20 June 2008, pages 227–235, 2008b. 10, 28, 56, 77, 78

Ferdinand de Saussure. Course in General Linguistics. Philosophical Library, New York, 1959. Edited by Charles Bally and Albert Sechehaye. Translated from the French by Wade Baskin. 8, 21

Doug Downey, Oren Etzioni, and Stephen Soderland. A probabilistic model of redundancy in information extraction. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, 30 July - 5 August 2005, pages 1034–1041, 2005. 76, 79

Doug Downey, Oren Etzioni, and Stephen Soderland. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artificial Intelligence, 174(11):726–748, 2010. DOI: 10.1016/j.artint.2010.04.024. 79

Pamela Downing. On the creation and use of English noun compounds. Language, 53(4):810–842, 1977. DOI: 10.2307/412913. 14, 18, 20

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91–134, 2005. ISSN 0004-3702. DOI: 10.1016/j.artint.2005.03.001. 80

Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 26-29 July 2011, pages 1535–1545, 2011. 10, 79, 81

Christiane Fellbaum, editor. WordNet – An Electronic Lexical Database. MIT Press, 1998. 10, 19

Gottlob Frege. Begriffsschrift. Louis Nebert, Halle, 1879. 8

L. T. F. Gamut. Logic, Language, and Meaning, Volume 1: Introduction to Logic. University of Chicago Press, Chicago, IL, 1991. 8



Jean Claude Gardin. SYNTOL. Graduate School of Library Service, Rutgers, the State University (Rutgers Series on Systems for the Intellectual Organization of Information, Susan Artandi, Ed.), New Brunswick, New Jersey, 1965. 8

Dirk Geeraerts. Theories of Lexical Semantics. Oxford University Press, 2010. 11

Peter Gerstl and Simone Pribbenow. Midwinters, end games, and body parts: a classification of part-whole relations. International Journal of Human-Computer Studies, 43:865–889, 1995. DOI: 10.1006/ijhc.1995.1079. 19

Roxana Girju, Adriana Badulescu, and Dan Moldovan. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, 27 May - 1 June 2003, 2003. DOI: 10.3115/1073445.1073456. 54

Roxana Girju, Ana-Maria Giuglea, Marian Olteanu, Ovidiu Fortu, Orest Bolohan, and Dan Moldovan. Support Vector Machines applied to the classification of semantic relations in nominalized noun phrases. In Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics, pages 68–75. Association for Computational Linguistics, 2004. DOI: 10.3115/1596431.1596441. 31, 41, 53

Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. On the semantics of noun compounds. Computer Speech and Language, 19:479–496, 2005. DOI: 10.1016/j.csl.2005.02.006. 14, 16, 17

Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. SemEval-2007 Task 04: Classification of semantic relations between nominals. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 13–18, Prague, Czech Republic, 2007. Association for Computational Linguistics. DOI: 10.3115/1621474.1621477. 56

Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. Classification of semantic relations between nominals. Language Resources and Evaluation, 43(2):105–121, 2009. DOI: 10.1007/s10579-009-9083-2. 28

Mark A. Greenwood and Mark Stevenson. Improving semi-supervised acquisition of relation extraction patterns. In Proceedings of the ACL-06 Workshop on Information Extraction Beyond the Document, Sydney, Australia, 2006. DOI: 10.3115/1641408.1641412. 65

Jakob Grimm. Deutsche Grammatik, Theil 2. Dieterich, Göttingen, 1826. 12

Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: a brief history. In Proceedings of the 16th Conference on Computational Linguistics, COLING ’96, volume 1, pages 466–471, 1996. DOI: 10.3115/992628.992709. 26



Ben Hachey, Claire Grover, and Richard Tobin. Datasets for generic relation extraction. Journal ofNatural Language Engineering, 18(1):21–59, 2011. 37

Roy Harris. Reading Saussure: A Critical Commentary on the Cours le Linquistique Generale. OpenCourt, La Salle, Ill., 1987. 8

David Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,Computer Science Department, University of California at Santa Cruz, 1999. 48

Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 15thInternational Conference on Computational Linguistics, Nantes, France, 23-28 August 1992, pages539–545, 1992. DOI: 10.3115/992133.992154. 10, 60, 62, 63, 72

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2), Uppsala, Sweden, 11-16 July 2010, pages 33–38, 2010. 29

Jerry R. Hobbs, Douglas E. Appelt, John Bear, David J. Israel, Megumi Kameyama, Mark E. Stickel, and Mabry Tyson. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 383–406. The MIT Press, 1997. URL http://arxiv.org/abs/cmp-lg/9705013. 45

Raphael Hoffmann, Congle Zhang, and Daniel Weld. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11-16 July 2010, pages 286–295, 2010. 81

Florentina Hristea and George Miller. WordNet nouns: classes and instances. Computational Linguistics, 32(1):1–3, 2006. DOI: 10.1162/coli.2006.32.1.1. 18

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines.IEEE Transactions on Neural Networks, 13(2):415–425, 2002. DOI: 10.1109/72.991427. 45

Lawrence Hunter and K. Bretonnel Cohen. Biomedical language processing perspective: What's beyond PubMed? Molecular Cell, 21:589–594, March 2006. DOI: 10.1016/j.molcel.2006.02.012. 3

Nancy Ide and Jean Véronis. Introduction to the special issue on word-sense disambiguation: the state of the art. Computational Linguistics, 24(1):2–40, 1998. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972719.972721. 41


Nancy Ide, Jean Véronis, Susan Warwick-Armstrong, and Nicoletta Calzolari. Principles for encoding machine-readable dictionaries. In Fifth Euralex International Congress, pages 239–246, University of Tampere, Finland, 1992. 60

Otto Jespersen. A Modern English Grammar on Historical Principles, Part VI: Morphology. Ejnar Munksgaard, Copenhagen, 1942. 12, 20

Jing Jiang and ChengXiang Zhai. A systematic exploration of the feature space for relation extraction. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), Rochester, NY, 2007. 43, 56

Thorsten Joachims, Nello Cristianini, and John Shawe-Taylor. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML-01), Williamstown, MA, 2001. 49

Shivram Dattatray Joshi. Patañjali's Vyākaraṇa-Mahābhāṣya: Samarthāhnika (P 2.1.1). Edited with Translation and Explanatory Notes. University of Poona Press, Poona, 1968. 8

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition, 2009. 24

Christopher S. G. Khoo and Jin-Cheon Na. Semantic relations in information science. Annual Review of Information Science and Technology, 40(1):157–228, 2006. DOI: 10.1002/aris.1440400112. 21, 24

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. Overview of BioNLP'09 shared task on event extraction. In Proceedings of BioNLP'09, workshop at NAACL-HLT 2009, Boulder, Colorado, 2009. Association for Computational Linguistics. 23

Su Nam Kim and Timothy Baldwin. Automatic interpretation of noun compounds using WordNet::Similarity. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, South Korea, 11-13 October 2005, pages 945–956, 2005. DOI: 10.1007/11562214_82. 32

Judith L. Klavans, Martin S. Chodorow, and Nina Wacholder. Building a knowledge base from parsed definitions. In George Heidorn, Karen Jensen, and Steve Richardson, editors, Natural Language Processing: The PLNLP Approach. Kluwer, New York, NY, USA, 1992. 60

Zornitsa Kozareva and Eduard Hovy. A semi-supervised method to learn and construct taxonomies using the Web. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, Mass., 9-11 October 2010, pages 1110–1118, 2010. 73, 74


Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. Semantic class learning from the Web with hyponym pattern linkage graphs. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics ACL-08: HLT, pages 1048–1056, 2008. 72

Martin Krallinger, Florian Leitner, Carlos Rodriguez-Penagos, and Alfonso Valencia. Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biology, 9(Suppl 2):S4, 2008. DOI: 10.1186/gb-2008-9-s2-s4. 37

Henry Kucera and Winthrop Nelson Francis. Computational Analysis of Present-Day AmericanEnglish. Brown University Press, Providence, RI, 1967. 12

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Williams College, Massachusetts, 27-30 June 2001, pages 282–289, 2001. 50, 69

Maria Lapata. The disambiguation of nominalizations. Computational Linguistics, 28(3):357–388,2002. DOI: 10.1162/089120102760276018. 5

Mirella Lapata and Frank Keller. The Web as a baseline: Evaluating the performance of unsupervised Web-based models for a range of NLP tasks. In Proceedings of the 2004 Human Language Technology Conference and the Meeting of the North American Chapter of the Association for Computational Linguistics, pages 121–128, Boston, 2004. 17

Mark Lauer. Designing Statistical Language Learners: Experiments on Noun Compounds. PhD thesis,Department of Computing, Macquarie University, Australia, December 1995. 17, 18

Douglas B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems: Representation andInference in the CYC Project. Addison-Wesley, Reading, Mass., 1990. 9, 19

Rosemary Leonard. The Interpretation of English Noun Sequences on the Computer. North-Holland,Amsterdam, 1984. 14

Judith N. Levi. The Syntax and Semantics of Complex Nominals. Academic Press, New York, 1978. 5, 12, 13, 16, 17, 18

Charles N. Li. Semantics and the structure of compounds in Chinese. PhD thesis, University ofCalifornia, Berkeley, 1971. 12

Percy Liang, Michael I. Jordan, and Dan Klein. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA, 19-24 June 2011, 2011. 23

Dekang Lin. LaTaT: Language and Text Analysis Tools. In Proceedings of the 1st International Conference on Human Language Technology Research, San Diego, Cal., 18-21 March 2001, pages 222–227, 2001. DOI: 10.3115/1072133.1072198. 76


Dekang Lin and Patrick Pantel. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360, 2001. DOI: 10.1017/S1351324901002765. 75

Thomas Lin, Mausam, and Oren Etzioni. Identifying functional relations in Web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, Mass., 9-11 October 2010, pages 1266–1276, 2010. 79

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. URL http://www.jmlr.org/papers/v2/lodhi02a.html. 48, 49

Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 783–792, 2008. DOI: 10.3115/1613715.1613815. 23

Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing.MIT Press, Cambridge, 1999. 45

Eric Margolis and Stephen Laurence, editors. Concepts: Core Readings. A Bradford Book, 1999. 7

Cynthia Matuszek, John Cabral, Michael Witbrock, and John DeOliveira. An introduction to the syntax and content of Cyc. In Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49, 2006. 19

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. Open language learning for information extraction. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing, Jeju Island, Korea, 12-14 July 2012, pages 523–534, 2012. 81

Andrew McCallum and Wei Li. Early results for Named Entity Recognition with Conditional Random Fields, feature induction and Web-enhanced lexicons. In Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4, CONLL '03, pages 188–191, 2003. DOI: 10.3115/1119176.1119206. 50

John McCarthy. Programs with common sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes, 1958. 9

Ryan McDonald, Fernando Pereira, Seth Kulik, Scott Winters, Yang Jin, and Pete White. Simple algorithms for complex relation extraction with applications to biomedical IE. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI, 2005. DOI: 10.3115/1219840.1219901. 55


Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 1003–1011, 2009. ISBN 978-1-932432-46-6. URL http://dl.acm.org/citation.cfm?id=1690219.1690287. 69

Thahir Mohamed, Estevam Hruschka Jr., and Tom Mitchell. Discovering relations between noun categories. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 26-29 July 2011, pages 1447–1455, 2011. 80

Dan Moldovan, Adriana Badulescu, Marta Tatu, Daniel Antohe, and Roxana Girju. Models for the semantic classification of noun phrases. In Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics, pages 60–67. Association for Computational Linguistics, 2004. DOI: 10.3115/1596431.1596440. 54

Alessandro Moschitti. Efficient convolution kernels for dependency and constituent syntactic trees. In Proceedings of the 17th European Conference on Machine Learning (ECML-06), 2006. URL http://dit.unitn.it/~moschitt/articles/ECML2006.pdf. DOI: 10.1007/11871842_32. 48, 49

Lynne M. Murphy. Semantic Relations and the Lexicon. Cambridge University Press, Cambridge,UK, 2003. DOI: 10.1017/CBO9780511486494. 20, 21, 24

Preslav Nakov. Improved Statistical Machine Translation using monolingual paraphrases. In Proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece, 21-25 July 2008, pages 338–342, 2008a. DOI: 10.3233/978-1-58603-891-5-338. 3

Preslav Nakov. Noun compound interpretation using paraphrasing verbs: feasibility study. In Danail Dochev, Marco Pistore, and Paolo Traverso, editors, Artificial Intelligence: Methodology, Systems, and Applications, 13th International Conference, AIMSA 2008, Varna, Bulgaria, September 4-6, 2008, volume 5253 of Lecture Notes in Computer Science, pages 103–117. Springer, 2008b. ISBN 978-3-540-85775-4. DOI: 10.1007/978-3-540-85776-1_10. 18

Preslav Nakov and Marti Hearst. UCB: System Description for SemEval Task #4. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-07), Prague, Czech Republic, 2007. DOI: 10.3115/1621474.1621554. 43, 44

Preslav Nakov and Marti Hearst. Solving relational similarity problems using the Web as a corpus.In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: HumanLanguage Technologies, Columbus, Ohio, 15-20 June 2008, pages 452–460, 2008. 28

Preslav Nakov and Zornitsa Kozareva. Combining relational and attributional similarity for semantic relation classification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, Hissar, Bulgaria, 12-14 September 2011, pages 323–330, 2011. 28, 56


Vivi Nastase and Michael Strube. Transforming Wikipedia into a large-scale multilingual concept network. Artificial Intelligence, 194:62–85, 2013. DOI: 10.1016/j.artint.2012.06.008. 35

Vivi Nastase and Stan Szpakowicz. Exploring noun-modifier semantic relations. In Proceedings of the 5th International Workshop on Computational Semantics, Tilburg, The Netherlands, 15-17 January 2003, pages 285–301, 2003. 14, 15, 16, 17, 23, 30, 31, 41, 53, 56

Vivi Nastase, Jelber Sayyad-Shirabad, Marina Sokolova, and Stan Szpakowicz. Learning noun-modifier semantic relations with corpus-based and WordNet-based features. In Proceedings of the21st National Conference on Artificial Intelligence, Boston, Mass., 16-20 July 2006, pages 781–787,2006. 41

Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69, 2009. ISSN 0360-0300. DOI: 10.1145/1459352.1459355. 41

Claire Nédellec. Learning Language in Logic – genic interaction extraction challenge. In Proceedings of the ICML-05 Workshop on Learning Language in Logic (LLL-05), Bonn, Germany, 2005. 37

Truc-Vien T. Nguyen, Alessandro Moschitti, and Giuseppe Riccardi. Convolution kernels on constituent, dependency and sequential structures for relation extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6-7 August 2009, 2009. 48, 49

Adolf Noreen. Vårt Språk, volume 5. C. W. K. Gleerups Förlag, Lund, 1904. 12

J. Terry Nutter. A lexical relation hierarchy. Technical Report TR-89-06 (eprints.cs.vt.edu/archive/00000143/01/TR-89-06.pdf), Virginia Polytechnic Institute and State University, Department of Computer Science, 1989. 19

Diarmuid Ó Séaghdha. Learning compound noun semantics. PhD thesis, University of Cambridge,2008. 14

Diarmuid Ó Séaghdha and Ann Copestake. Co-occurrence contexts for noun compound interpretation. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 57–64. Association for Computational Linguistics, 2007. DOI: 10.3115/1613704.1613712. 13, 14, 32, 33

Diarmuid Ó Séaghdha and Ann Copestake. Semantic classification with distributional kernels.In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08),Manchester, UK, 2008. DOI: 10.3115/1599081.1599163. 28, 40, 47, 56

Diarmuid Ó Séaghdha and Ann Copestake. Using lexical and relational similarity to classify semantic relations. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece, 2009. DOI: 10.3115/1609067.1609136. 49

Marius Pasca. Organizing and searching the World-Wide Web of facts – step two: harnessing the wisdom of the crowds. In 16th International World Wide Web Conference, May 8-12, 2007, Banff, Canada, pages 101–110, 2007. DOI: 10.1145/1242572.1242587. 75

Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, and Alpa Jain. Names and similarities on the Web: Fact extraction in the fast lane. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006, pages 809–816, 2006a. DOI: 10.3115/1220175.1220277. 68

Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, and Alpa Jain. Organizing and searching the World-Wide Web of facts – step one: the one-million fact extraction challenge. In Proceedings of the 21st National Conference on Artificial Intelligence, Boston, Mass., 16-20 July 2006, pages 1400–1405, 2006b. 67

Martha Palmer, Daniel Gildea, and Nianwen Xue. Semantic Role Labeling. Synthesis Lectures onHuman Language Technologies. Morgan & Claypool, 2010. 55

Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proceedings of the 8th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 23-26 July 2002, pages 613–619, 2002. DOI: 10.1145/775047.775138. 40, 71

Patrick Pantel and Marco Pennacchiotti. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006, pages 113–120, 2006. DOI: 10.3115/1220175.1220190. 66

Patrick Pantel and Deepak Ravichandran. Automatically labeling semantic classes. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Mass., 2-7 May 2004, pages 321–328, 2004. 71

Siddharth Patwardhan and Ellen Riloff. Effective information extraction with semantic affinity patterns and relevant regions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28-30 June 2007, pages 717–727, 2007. 10

Charles Sanders Peirce. Existential graphs. Unpublished manuscript reprinted in [Buchler, 1940],1909. 9

Daniele Pighin and Alessandro Moschitti. On reverse feature engineering of syntactic tree kernels. In Proceedings of the 14th Conference on Computational Natural Language Learning (CONLL-10), Uppsala, Sweden, 2010. 49


James Pustejovsky. The Generative Lexicon. MIT Press, Cambridge, MA, 1995. 53

James Pustejovsky, José M. Castaño, Jason Zhang, M. Kotecki, and B. Cochran. Robust relational parsing over biomedical literature: Extracting inhibit relations. In Proceedings of the 7th Pacific Symposium on Biocomputing (PSB-02), Lihue, Hawaii, 2002. 63

Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, and Tapio Salakoski. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl 3):S6, 2008. DOI: 10.1186/1471-2105-9-S3-S6. 37

M. Ross Quillian. A revised design for an understanding machine. Mechanical Translation, 7:17–29, 1962. 10

Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. A Comprehensive Grammar of the English Language. Longman, London and New York, 1985. 4

Lance Ramshaw and Mitch Marcus. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora. ACL, 1995. 50

Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL '09, pages 147–155, 2009. ISBN 978-1-932432-29-9. DOI: 10.3115/1596374.1596399. 50

Deepak Ravichandran and Eduard Hovy. Learning surface text patterns for a Question Answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Penn., 7-12 July 2002, pages 41–47, 2002. DOI: 10.3115/1073083.1073092. 10, 65

Soumya Ray and Mark Craven. Representing sentence structure in hidden Markov models for information extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence – Volume 2, IJCAI '01, pages 1273–1279, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-812-5, 978-1-558-60812-2. URL http://dl.acm.org/citation.cfm?id=1642194.1642264. 52

Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '10), volume 6232 of Lecture Notes in Computer Science, pages 148–163. Springer Berlin / Heidelberg, 2010. DOI: 10.1007/978-3-642-15939-8_10. 69

Ellen Riloff and Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence, Orlando, Fla., 18-22 July 1999, 1999. 67


Bryan Rink and Sanda Harabagiu. UTD: Classifying Semantic Relations by Combining Lexical and Semantic Resources. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 256–259, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1057. 56

Barbara Rosario and Marti Hearst. Classifying the semantic relations in noun compounds via a domain-specific lexical hierarchy. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, Pittsburgh, Penn., 3-4 June 2001, pages 82–90, 2001. 23, 37, 38, 41

Barbara Rosario and Marti Hearst. The descent of hierarchy, and selection in relational semantics. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia,Penn., 7-12 July 2002, pages 247–254, 2002. DOI: 10.3115/1073083.1073125. 53, 54

Barbara Rosario and Marti Hearst. Multi-way relation classification: Application to protein-protein interactions. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 732–739, Vancouver, British Columbia, Canada, October 2005. DOI: 10.3115/1220575.1220667. 37, 52

Barbara Rosario and Marti A. Hearst. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21-26 July 2004, pages 430–437, 2004. DOI: 10.3115/1218955.1219010. 23, 36, 37, 52, 56, 57

Benjamin Rosenfeld and Ronen Feldman. Using corpus statistics on entities to improve semi-supervised relation extraction from the Web. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 23-30 June 2007, pages 600–607, 2007. 68

Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, New Jersey,3rd edition, 2009. 9, 24

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge UniversityPress, Cambridge, UK, 2004. DOI: 10.1017/CBO9780511809682. 46

Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. Open Mind Common Sense: Knowledge acquisition from the general public. In Proceedings of the First International Conference on Ontologies, Databases, and Applications of Semantics for Large Scale Information Systems, Irvine, California, 2002. DOI: 10.1007/3-540-36124-3_77. 9

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of the 19th Conference on Neural Information Processing Systems (NIPS-05), pages 1297–1304, 2005. 69

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing, Jeju, Korea, 2012. 28, 56


John F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley,1984. 22

Karen Spärck Jones. Synonymy and Semantic Classification. PhD thesis, University of Cambridge,1964. 10

Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. InProceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05),Ann Arbor, MI, 2005. DOI: 10.3115/1219840.1219887. 65

Paul Studtmann. Aristotle's Categories. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. The Metaphysics Research Lab, Center for the Study of Language and Information, Stanford University, 2008. URL http://plato.stanford.edu/archives/fall2008/entries/aristotle-categories/. 7

Stanley Y. W. Su. A Semantic Theory Based Upon Interactive Meaning. Computer Sciences Technical Report #68, University of Wisconsin, 1969. 14

Ang Sun, Ralph Grishman, and Satoshi Sekine. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-11), Portland, OR, 2011. 40

Jun Suzuki, Tsutomu Hirao, Yutaka Sasaki, and Eisaku Maeda. Hierarchical directed acyclic graph kernel: Methods for structured natural language data. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), Sapporo, Japan, 2003. DOI: 10.3115/1075096.1075101. 48

Don R. Swanson. Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38(4):228–233, January 1987. ISSN 1097-4571. DOI: 10.1002/(SICI)1097-4571(198707)38:4%3C228::AID-ASI2%3E3.0.CO;2-G. 3, 59

Lucien Tesnière. Éléments de syntaxe structurale. C. Klincksieck, Paris, 1959. 1

Stephen Tratz and Eduard Hovy. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 678–687, Uppsala, Sweden, 2010. 14, 16, 17

Peter Turney. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, 2006a. 39, 43


Peter Turney. Expressing implicit semantic relations without supervision. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006, pages 313–320, 2006b. URL http://www.aclweb.org/anthology/P/P06/P06-1040. DOI: 10.3115/1220175.1220215. 43

Peter Turney and Michael Littman. Corpus-based learning of analogies and semantic relations.Machine Learning, 60(1-3):251–278, 2005. DOI: 10.1007/s10994-005-0913-1. 43, 78

Lucy Vanderwende. Algorithm for the automatic interpretation of noun sequences. In Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, Japan, 5-9 August 1994, pages 782–788, 1994. DOI: 10.3115/991250.991272. 14, 16, 17

Chang Wang, James Fan, Aditya Kalyanpur, and David Gondek. Relation extraction with relation topics. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11), Edinburgh, UK, 2011. 43

Beatrice Warren. Semantic Patterns of Noun-Noun Compounds. Universitatis Gothoburgensis, Göte-borg, 1978. 12, 13, 16, 17, 23

Gerhard Weikum, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek. Database and information-retrieval methods for knowledge discovery. Communications of the ACM, 52(4):56–64, 2009. DOI: 10.1145/1498765.1498784. 35

Joseph Weizenbaum. ELIZA – a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966. DOI: 10.1145/365153.365168. 9

Anna Wierzbicka. Apples are not a "kind of fruit": the semantics of human categorization. AmericanEthnologist, 11:313–328, 1984. DOI: 10.1525/ae.1984.11.2.02a00060. 18, 34

Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972.DOI: 10.1016/0010-0285(72)90002-3. 9

Morton E. Winston, Roger Chaffin, and Douglas Herrmann. A taxonomy of part-whole relations. Cognitive Science, 11(4):417–444, 1987. ISSN 1551-6709. DOI: 10.1207/s15516709cog1104_2. 18, 19, 65

Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools andTechniques. Morgan Kaufmann, 3rd edition, 2011. 45

Fei Wu and Daniel S. Weld. Autonomously Semantifying Wikipedia. In Proceedings of the ACM16th Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, 6-9November 2007, pages 41–50, 2007. DOI: 10.1145/1321440.1321449. 69, 81


Fei Wu and Daniel S. Weld. Open information extraction using Wikipedia. In Proceedings of the48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11-16 July2010, pages 118–127, 2010. 81

Roman Yangarber. Counter-training in discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7-12 July 2003, pages 343–350, 2003. DOI: 10.3115/1075096.1075140. 67

Alexander Yates and Oren Etzioni. Unsupervised resolution of objects and relations on the Web. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, N.Y., 22-27 April 2007, pages 121–130, 2007. 76

Alexander Yates and Oren Etzioni. Unsupervised methods for determining object and relationsynonyms on the Web. Journal of Artificial Intelligence Research, 34:255–296, March 2009.DOI: 10.1613/jair.2772. 75

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel methods for relation extraction.Journal of Machine Learning Research, 3:1083–1106, 2003. 48

Guo Dong Zhou, Min Zhang, Dong Hong Ji, and Qiao Ming Zhu. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-07), pages 728–736, Prague, Czech Republic, 2007. 56

Guo Dong Zhou, Long Hua Qian, and Qiao Ming Zhu. Label propagation via bootstrapped support vectors for semantic relation extraction between named entities. Computer Speech and Language, 23(4):464–478, 2009. DOI: 10.1016/j.csl.2009.03.001. 56

Karl E. Zimmer. Some general observations about nominal compounds. Stanford Working Papers on Linguistic Universals, 5:C1–C21, 1971. 14

Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B. Cohen. Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics, 8(5):358–375, 2007. DOI: 10.1093/bib/bbm045. 36


Authors’ bios

VIVI NASTASE
While writing this book, Vivi Nastase¹ was a visiting professor at the University of Heidelberg. She is now a researcher at the Fondazione Bruno Kessler in Trento, working mainly on lexical semantics, semantic relations, knowledge acquisition and language evolution. She holds a Ph.D. from the University of Ottawa, Canada.

PRESLAV NAKOV
Preslav Nakov,² a Research Scientist at the Qatar Computing Research Institute, part of Qatar Foundation, holds a Ph.D. from the University of California, Berkeley. His research interests include computational linguistics and NLP, machine translation, lexical semantics, Web as a corpus and biomedical text processing.

DIARMUID Ó SÉAGHDHA
Diarmuid Ó Séaghdha,³ a postdoctoral Research Associate at the University of Cambridge Computer Laboratory, holds a Ph.D. from the University of Cambridge. His research interests include lexical and relational semantics, machine learning for NLP (probabilistic models and kernel methods), scientific text mining and social media analysis.

STAN SZPAKOWICZ
Stan Szpakowicz,⁴ a professor of Computer Science at the University of Ottawa, holds a Ph.D. from Warsaw University. He has been active in NLP for 44 years. His recent interests include lexical resources, semantic relations, emotion analysis and text summarization.

1 Human Language Technologies Research Unit, Fondazione Bruno Kessler, Trento, Italy. [email protected]

2 Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar. [email protected]

3 Computer Laboratory, University of Cambridge, Cambridge, UK. [email protected]

4 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland. [email protected]



Index

antonymy: 10
argument type checking: 68
Aristotle: 7
associative relation: 8
attributional similarity: 39
Automatic Content Extraction (ACE): 5, 26, 27, 39, 40, 48, 50, 53

bootstrapping: 63–67, 69, 71, 73, 79–81

classifier: 28, 40, 41, 43–45, 47, 51, 55, 69
cluster, clustering: 22, 40, 43, 53, 71, 72, 75, 77, 78, 80, 81
co-hyponymy: 72
co-training: 68
Conditional Random Field (CRF): 50–52, 69
constituency parsing: 42, 48, 49
context representation: 42, 48
coupled learning: 68, 69, 80, 81
Cyc: 9, 19, 33, 59, 69

DBpedia: 34, 35
decision function: 44, 45
dependency: 42, 48, 52, 55, 60, 65, 75, 81
dependency parsing: 42, 48, 49, 75
distant supervision: 69, 70, 81
distributional representation: 40, 43, 67, 75
distributional similarity: 75
doubly-anchored pattern: 72

emergent relation: 24
entity feature: 39, 41, 45, 46
eXtended WordNet: 32

feature vector: 37, 38, 44, 45, 49
Freebase: 20, 21, 34–36, 56, 59, 69, 70
Frege, Gottlob: 8

Google Web 1T Corpus: 43
graphical model: 52

Hearst pattern: 64, 80
Hidden Markov Model (HMM): 50, 52
holonymy: 10, 19, 60
hypernymy: 10, 18, 33, 38, 54, 60, 71–73
hyponymy: 4, 33, 54, 60, 72

idiosyncratic relation: 21, 22, 24, 65
information extraction: 24, 26, 61, 74, 81, 82
is-a relation: 10, 18, 19, 21, 23, 33, 34, 38, 60, 62–66, 69, 71, 72

kernel function: 46–50
kernel method: 46, 48, 49, 57
knowledge base: 2, 5, 10, 18, 21, 22, 36, 59, 80, 81, 83

learning algorithm: 38, 41, 44–47, 53, 83
lexical hierarchy: 53, 54
lexical resource: 10, 39
lexical-semantic relation: 10, 19, 33
lexico-syntactic pattern: 63, 72

machine learning: 21, 25, 45
machine-readable dictionary: 60, 61
Maximum Entropy Markov Model (MEMM): 50
meronymy: 4, 10, 18, 19, 60
Message Understanding Conference (MUC): 26, 45, 50

named entity: 26, 50, 79
named entity recognition (NER): 50
near-synonymy: 10, 33
Never-Ending Language Learner (NELL): 68, 80, 81
nominal: xii, 4–7, 11, 20–22, 25, 34, 37, 43, 56, 75, 83
noun compound: 2–5, 11, 12, 14, 17, 18, 32, 33, 54, 75
noun-noun compound: 8, 12, 18, 30–32, 37, 38, 53, 54

ontological relation: 21, 22, 24, 60, 65, 78
ontology: 11, 18, 20, 21, 33–35, 40, 53, 57, 60, 61, 68, 80, 83
open information extraction: 10, 24, 25, 35, 69, 71, 75, 80, 81
OpenCyc: 33, 35

Pāṇini: 8
paraphrase: 3, 4, 12, 14, 17, 18, 23, 75
parse tree: 10, 42, 46, 48, 49, 75
part-of relation: 10, 18, 23, 60, 63
part-whole relation: 65
pattern: 4, 10, 21, 22, 28, 43, 60–67, 69, 71–73, 77–79, 81
pattern confidence: 66, 67
pattern reliability: 67
pattern specificity: 66
Peirce, Charles Sanders: 9, 22
pointwise mutual information (PMI): 67
predicate logic: 8

recoverable deletable predicate: 12, 32
relation classification: 5, 25–28, 37, 38, 40, 43, 44, 48, 52, 78
relation extraction: 3, 5, 9, 10, 20, 22, 24, 26, 30, 33, 37, 40, 45, 50–52, 57, 59–62, 64, 66, 71, 75–82
relation inventory: 12–14, 19, 22–25, 29, 37, 39, 57, 83
relational feature: 41–43, 46, 78
relational similarity: 39
representation in logic: 8, 9, 22

Saussure, Ferdinand de: 8
self-supervised learning: 24, 80
semantic class: 39, 71, 72
semantic drift: 63, 66, 73, 80, 81
semantic network: 5, 9, 10, 33, 35, 40, 41
semantic representation: 8–11, 13, 14, 24, 39–43, 67, 75
semantic role labelling: 55
SemEval: 5, 27–29, 40
semi-supervised learning: 21, 34, 53, 68, 69, 81, 83
sequence model: 50–52, 57, 69
sequential labelling: 46, 51
singly-anchored pattern: 72
string kernel: 48, 49
string similarity: 75
supervised learning: 21, 22, 24, 25, 28, 37, 38, 40, 43–45, 54–57, 59, 69, 78, 81, 83
Support Vector Machine (SVM): 45, 47–49
synonymy: 10
syntagmatic relation: 8, 21

targeted relation: 24
tree kernel: 48, 49

unsupervised learning: 21, 24, 43, 59

WikiNet: 35
Wikipedia: 20, 21, 34, 35, 69, 83
Wikipedia infobox: 20, 34, 35, 56, 69, 81, 83
WordNet: 5, 10, 18–20, 28, 29, 31–35, 39, 54, 56, 57, 69

YAGO: 35

