
Mining Semantically Related Terms from Biomedical Literature

GORAN NENADIC and SOPHIA ANANIADOU

University of Manchester and National Centre for Text Mining

Discovering links and relationships is one of the main challenges in biomedical research, as scientists are interested in uncovering entities that have similar functions, take part in the same processes, or are coregulated. This article discusses the extraction of such semantically related entities (represented by domain terms) from biomedical literature. The method combines various text-based aspects, such as lexical, syntactic, and contextual similarities between terms. Lexical similarities are based on the level of sharing of word constituents. Syntactic similarities rely on expressions (such as term enumerations and conjunctions) in which a sequence of terms appears as a single syntactic unit. Finally, contextual similarities are based on automatic discovery of relevant contexts shared among terms. The approach is evaluated using the Genia resources, and the results of experiments are presented. Lexical and syntactic links have shown high precision and low recall, while contextual similarities have resulted in significantly higher recall with moderate precision. By combining the three metrics, we achieved F measures of 68% for semantically related terms and 37% for highly related entities.

Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Text analysis; H.3.1 [Content Analysis and Indexing]: Linguistic processing; J.3 [Life and Medical Sciences]: Biology and genetics

General Terms: Algorithms, Documentation, Languages

Additional Key Words and Phrases: biomedical literature, contextual patterns, term similarities, text mining

1. INTRODUCTION

Dynamic progress in biology, molecular biology, and biomedicine has resulted in a huge body of knowledge that is represented by various concepts, entities, events, processes, functions, and relationships among them. Such knowledge can be represented by semantic networks that link related entities with

This research was partially supported by the UK BBSRC grant “Mining Term Associations from Literature to Support Knowledge Discovery in Biology” (BB/C007360/1).

Authors’ addresses: G. Nenadic, School of Informatics, University of Manchester, Manchester M60 1QD, UK; email: [email protected]. S. Ananiadou, School of Informatics, University of Manchester, P.O. Box 88, Sackville Street, Manchester M60 1QD; email: Sophia. [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2006 ACM 1530-0226/06/0300-0022 $5.00

ACM Transactions on Asian Language Information Processing, Vol. 5, No. 1, March 2006, Pages 22–43.


specific and/or general relations, or by classification, clustering, or ontology-based annotations. In order to allow biologists to efficiently acquire, analyze, and use such information, domain resources (e.g., genomic databases) are being continuously adapted to integrate new knowledge as it becomes available. While manual update (known as curation) involves labor-intensive work on producing summarized information [Blake et al. 2003], automatic and semiautomatic methods rely on mining either experimental data (e.g., assigning the Gene Ontology annotations using homology searching [Camon et al. 2003]) or the literature to suggest relationships and associations among bioentities [Shatkay and Feldman 2003]. The availability of vast textual resources has spurred huge interest in designing text-mining methods that can help scientists in locating, collecting, and extracting relevant knowledge represented in the literature. Traditionally, however, biomedical text-mining applications mainly focus on document retrieval and restricted information extraction, typically without linking and combining information that spans across documents.

In this article, we focus on the extraction of semantically related biomedical entities by combining information collected from multiple documents and multiple sources. The approach is terminology-driven, as terminology represents a means to communicate knowledge among scientists, in particular when it is expressed in the literature. Terms represent domain concepts (such as entities, processes, and functions) and are particularly relevant in the biomedical domain, which is terminologically extremely dynamic, dense, and variable. Term identification in text is still far from being completely solved in the biomedical domain (cf. [Friedman et al. 2001a; Ananiadou and Nenadic 2006; Harkema et al. 2004; Krauthammer and Nenadic 2004; Hirschman et al. 2005]). However, discovering and establishing links and associations among entities is one of the main challenges in biomedical knowledge acquisition [Camon et al. 2005].

Biomedical entities are related in many ways: they have functional, structural, causal, hyponymous, or other links [cf. Skuce and Meyer 1991; Stapley et al. 2002]. Relationships include diverse types of general (such as generalization, specialization, and meronymy) and domain-specific relations (such as binding, phosphorylation, and inhibition). For example, the term NF-kappaB is a hyponym of the term transcription factor, while the binding relationship links amino acid and amino acid receptor, as well as CREP and CREP-binding protein; further examples of diverse relationships are the colocation of proteins ScNFU1 and Nfs1p in yeast mitochondria [Leon et al. 2003], or the structural and functional similarities between proteins red blood cell protein 4.1 and synapsin I [Krebs et al. 1987]. The aim of this study is to mine such (pairs of) terms that are (potentially) semantically linked. We do not aim at identifying the type of the relationship(s) that exists among them, but rather at discovering links regardless of the type of the relationship (such terms are considered as semantically related).

The article is organized as follows. In Section 2, we overview related approaches to the extraction of term relationships from text. Section 3 introduces the term similarity measures that are used to establish links among entities, while Section 4 presents experiments and evaluation, which are further discussed in Section 5. Finally, Section 6 concludes the article.


2. RELATED WORK

Several methods have been suggested for the extraction of relationships from literature [for overviews see Mack and Hehenberger 2002; Shatkay and Feldman 2003]. The most straightforward approach for establishing term links is to measure lexical similarities among the words that constitute terms [Bourigault and Jacquemin 1999; Yeganova et al. 2004]. However, as naming conventions do not necessarily systematically reflect any particular functional property or relatedness between biological entities (in particular when abbreviations or ad hoc names are used), term relationships are typically extracted using context analysis of occurrences of terms within and across corpora. Contexts, however, may be selected in a number of ways: as an entire abstract or document [Stapley and Benoit 2000], as a sentence [Grefenstette 1994], or as a chunk (e.g., a verb complementation phrase [Spasic et al. 2003]). Ding et al. [2002] investigated the effectiveness of these contexts (based on a term-term cooccurrence measure): they reported that larger units naturally provided better recall, while smaller units (e.g., phrases) typically delivered significantly better precision. With respect to overall effectiveness (i.e., F measure), sentences were significantly better than phrases.

Identifying features within selected contexts that are informative for identification of relatedness among terms proved to be challenging. Although, as Blake and Pratt [2001] have indicated, “few researchers would claim that a word representation is optimal”, many approaches rely on simple words [e.g., Stapley and Benoit 2000; Stapley et al. 2002; Ding et al. 2002]. Orthographic [e.g., Collier et al. 1999] and morphological features [e.g., Hatzivassiloglou et al. 2001; Kazama et al. 2002] are also used, as well as grammatical roles [e.g., objects or subjects; Grefenstette 1994] and shallow-syntactic information [Yakushiji et al. 2001]. Some approaches rely on terminological features (e.g., Blake and Pratt [2001] rely on the distributions of controlled keywords, while Nenadic et al. [2002] use dynamically recognised terms).

Various methods have been applied to extract relationships from text. Reported rule-based approaches range from those based on predefined lexical patterns [Blaschke et al. 1999; Ng and Wong 1999] and templates [Maynard and Ananiadou 1999; Pustejovsky et al. 2002], to parsing of documents using domain-specific grammars [Friedman et al. 2001b; Yakushiji et al. 2001; Gaizauskas et al. 2003]. Various statistical approaches, mainly based on mutual information and cooccurrence frequency counts, were used to associate terms that are not explicitly linked in text [Andrade and Valencia 1997; Stapley and Benoit 2000; Raychaudhuri et al. 2002; Ding et al. 2002; Nenadic et al. 2002]. Similarly, machine-learning approaches have been widely used to learn lexical contexts expressing a given relationship [Craven and Kumlien 1999; Marcotte et al. 2001; Stapley et al. 2002; Donaldson et al. 2003; Nenadic et al. 2003b; Spasic and Ananiadou 2005].

As a rule, many of these approaches extract a specific, predefined type of relationship (e.g., binding and activation), and rarely do so by combining information from more than one text segment. If relationship-specific rules are handcrafted, this significantly prolongs the construction of a knowledge-mining


system, reduces its adaptability, and makes it impossible to extract term relationships that do not correspond to the predefined patterns and templates. On the other hand, cooccurrences and statistical distributions within larger text units (such as documents) may not reveal significant links for some types of relationships. For example, many studies reported that 40% of cooccurrence-based relationships in the domain of biomedicine were biologically meaningless [cf. Jenssen et al. 2001; Tao and Leibel 2002].

For these reasons, in the following section, we introduce a hybrid method for the identification of semantically related entities, which is based on lexical, syntactic, and contextual similarities between the terms in question.

3. LEXICAL, SYNTACTIC, AND CONTEXTUAL TERM SIMILARITIES

Our approach to mining semantically related terms from literature follows the ideas that use more complex features rather than simple cooccurrence statistics. We also combine rule-based and statistical methods, and extract information from sentences rather than from entire abstracts/documents. Finally, we merge information mined from multiple sources. The method incorporates three text-based aspects, namely, lexical, syntactic, and contextual similarities between terms. In the following subsections, we briefly describe each of them [see also Nenadic et al. 2004a].

3.1 Lexical Similarities

We generalized the lexical approaches mentioned earlier [Bourigault and Jacquemin 1999; Yeganova et al. 2004]. We consider constituents (head and modifiers) shared by terms as a basis for measuring their lexical similarity. The rationale behind the approach involves the following hypothesis: a term derived by modifying another term may indicate further concept specialization (e.g., orphan nuclear receptor is a kind of receptor), or some functional relationship (e.g., CREP-binding protein is linked to CREP through the binding relationship; this type of link has been studied in detail in Ogren et al. [2004] for the Gene Ontology). In particular, terms sharing a terminological head(1) are assumed to be (in)direct hyponyms of the same term (e.g., progesterone receptor and oestrogen receptor are both receptors). More generally, when a term is nested inside another term, we assume that the terms in question are somehow semantically related.

For each term, we define its lexical profile containing its terminological head and all of its substrings (see Table I for examples). We then use a weighted Dice-like coefficient to compare the lexical profiles of two terms. We give more credit to pairs that share longer nested constituents, with an additional weight given to the similarity if the two terms have common heads. More precisely, lexical similarity (LS) between two terms is defined as:

LS(t1, t2) = |P(h1) ∩ P(h2)| / (|P(h1)| + |P(h2)|) + |P(t1) ∩ P(t2)| / (|P(t1)| + |P(t2)|)    (1)

(1) The notion of terminological head refers to the element that awards termhood to the whole term [cf. Ananiadou 1994].


Table I. Examples of Lexical Profiles

Term                       Lexical Profile, P(term)
nuclear receptor           {nuclear, receptor, nuclear receptor}
orphan receptor            {orphan, receptor, orphan receptor}
orphan nuclear receptor    {orphan, nuclear, receptor, orphan nuclear, nuclear receptor, orphan nuclear receptor}

Table II. Examples of Lexical Similarities Between Terms

Term1 Term2 LS (term1, term2)

nuclear receptor orphan nuclear receptor 0.83

orphan receptor nuclear orphan receptor 0.83

orphan nuclear receptor nuclear orphan receptor 0.75

nuclear receptor nuclear orphan receptor 0.72

orphan receptor orphan nuclear receptor 0.72

nuclear receptor orphan receptor 0.67

nuclear receptor nuclear translocator 0.17

orphan nuclear receptor nuclear translocator 0.11

orphan receptor nuclear translocator 0.00

where h1 and h2 are the terminological heads of terms t1 and t2, respectively, and P(s) refers to the set of all nonempty subsequences of s. Examples of lexical similarities are provided in Table II.
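The computation behind Eq. (1) and Table II can be sketched as follows. This is an illustrative rendering, not the authors' implementation: it assumes the terminological head is the last word of the term (which holds for the English noun phrases in Table I) and treats P(s) as the set of contiguous word subsequences.

```python
def profile(term):
    # P(term): all nonempty contiguous word subsequences of the term
    words = term.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

def lexical_similarity(t1, t2):
    # Eq. (1): head-profile overlap plus full-profile overlap.
    # Assumption: the terminological head is the last word of the term.
    h1, h2 = profile(t1.split()[-1]), profile(t2.split()[-1])
    p1, p2 = profile(t1), profile(t2)
    return (len(h1 & h2) / (len(h1) + len(h2))
            + len(p1 & p2) / (len(p1) + len(p2)))

print(round(lexical_similarity("nuclear receptor", "orphan nuclear receptor"), 2))
# -> 0.83, as in Table II
```

Under these assumptions the sketch reproduces the Table II values, e.g., 0.83 for nuclear receptor vs. orphan nuclear receptor and 0.17 for nuclear receptor vs. nuclear translocator.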

This lexical metric is obviously useful for comparing multiword terms, but is rather limited when it comes to ad hoc names (since they may have arbitrary constituents) or single-word terms. Also, lexical similarities can capture only restricted types of links (typically specialization/generalisation relationships), although, in some cases, domain-specific relationships can be lexically expressed (e.g., CREP-binding protein is concerned with binding). Finally, assessing lexical similarities depends on the ability to neutralise some lexical variations (e.g., inflection and simple structural variations) and to expand acronyms; a method that we have used for this is described in Section 4.2.

3.2 Syntactic Similarities

It is widely accepted that specific syntactic expressions may indicate functional similarities among terms [Hearst 1992]. For instance, when encountered in text, an enumeration of terms (e.g., steroid receptors, such as estrogen receptor, glucocorticoid receptor, and progesterone receptor), term coordination(2) (e.g., adrenal glands and gonads), or a conjunction of terms (e.g., estrogen receptor and progesterone receptor) typically indicates that the terms involved are highly related and functionally similar. In these cases, a sequence of terms appears as a single syntactic unit, and, thus, the terms involved are used within the same context (in the same sentence) and in combination with the same verb and/or preposition.

(2) In this article, we assume that, in a term coordination, a lexical constituent common to two or more terms is shared (appears only once), while their distinct lexical parts are enumerated and coordinated; term conjunctions involve terms that are lexically represented as separate units (no constituents are shared among them; i.e., common constituents are represented as belonging to each of them).


Table III. Examples of Term Enumeration and Coordination Patterns (a)

<TERM> ([(](such as)|like|(e.g.,[,]))<TERM> (,<TERM>)* [[,] <&> <TERM>] [)]

<TERM> (,<TERM>)* [,] <&> other <TERM>

<TERM> [,] (including | especially) <TERM> (,<TERM>)* [[,]<&><TERM>]

both <TERM> and <TERM>

either <TERM> or <TERM>

neither <TERM> nor <TERM>

(a) The standard regular-expression notation is used: [] denotes optional elements; * denotes repetition; | denotes alternatives; and <&> denotes a coordination conjunction.

For the extraction of syntactic similarities (SS), we have generalized the approach of Hearst [1992] and defined a set of lexical patterns(3) (see Table III). They are applied as filters in order to retrieve sets of terms appearing in enumerations, coordinations, and conjunctions. Note that the corresponding patterns are typically ambiguous, as they may retrieve both coordinated terms and conjunctions of terms [Nenadic et al. 2004c]. Still, in either case, the retrieved terms are highly related.
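As a rough sketch of how such patterns can serve as filters, the snippet below implements a simplified version of the first pattern in Table III with standard regular expressions. The <TERM> tags stand in for the output of a prior term recognizer (footnote 3); the sentence is an invented example, and the pattern is a simplification, not the authors' exact grammar.

```python
import re

TERM = r"<TERM>(.+?)</TERM>"
# Simplified form of: <TERM> (such as | like | e.g.,) <TERM> (,<TERM>)* [<&> <TERM>]
ENUM = re.compile(
    TERM
    + r"\s*,?\s*(?:such as|like|e\.g\.\s*,?)\s*"
    + r"((?:" + TERM + r"\s*(?:,|and|or)?\s*)+)"
)

sentence = ("<TERM>steroid receptors</TERM> , such as <TERM>estrogen receptor</TERM> , "
            "<TERM>glucocorticoid receptor</TERM> and <TERM>progesterone receptor</TERM>")

m = ENUM.search(sentence)
hypernym = m.group(1)                    # the term being enumerated
members = re.findall(TERM, m.group(2))   # terms retrieved as highly related
print(hypernym, members)
```

All terms retrieved by one match are then treated as pairwise syntactically similar, regardless of whether the construction was an enumeration, coordination, or conjunction.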

Syntactic similarity typically captures taxonomic relationships (hyponyms, siblings, etc.). For more generic types of links between terms, comparison of other textual contexts in which terms appear individually is, therefore, necessary.

3.3 Contextual Similarities

Contextual similarities rely on the comparison of contexts in which terms tend to appear. We do not use the bag-of-words approach or full parsing to model contexts, but rather aim at describing (in a more generic way) a context in which a given term occurred. For example, by analyzing the following set of textual contexts:

. . . receptor is bound to these DNA sequences . . .

. . . estrogen receptor bound to DNA . . .

. . . RXRs bound to respective DNA elements in vitro. . .

. . . TR when bound to DNA . . .

one can note that the terms receptor, estrogen receptor, RXRs, and TR appear in a context that can be “roughly” described using the following (right) contextual “pattern”: TERM VERB:bind to CLASS:dna. Sharing this context may indicate functional similarity among these terms (i.e., these terms may have similar functions).

In order to generate generic contextual “descriptions,” we need to discard less informative contextual constituents (such as determiners, adverbs, linking phrases, and auxiliary verbs), and neutralize lexical variability. On the other hand, context elements with high information and domain-specific content (e.g., terms and terminological verbs) can, in addition, be lemmatized (and “instantiated”) to highlight their importance for establishing links.

We represent the context of an individual term occurrence by a generic regular expression called a context pattern (CP). A CP contains only relevant

(3) These patterns assume that term occurrences have been previously identified in text.


elements, their part-of-speech and syntactic tags, terminological and additional ontological (or class) information (when available), and lemmatized significant contextual elements.

When mining CPs from text, one of the challenges is to deal with CPs of various lengths, in particular with “nested” contexts. For example, the following contexts are nested in the context TERM VERB:bind to CLASS:dna PREP:in CLASS:location:

TERM VERB:bind to CLASS:dna PREP:in

TERM VERB:bind to CLASS:dna

TERM VERB:bind to

In our approach, we generate and consider all possible nested patterns when comparing contexts.

Another problem is to distinguish contexts that are relevant for the domain, as terms also appear in contexts that are not relevant for establishing their properties.(4) Our approach aims at automatic identification of relevant CPs by providing a weighting mechanism called CP value.(5) The CP-value measure assigns importance weights as follows: the weights of CPs that do not appear as nested elsewhere are proportional to their frequency (in a given corpus) and length (sharing two longer CPs is more important than sharing two shorter ones). If a CP also appears as nested, we take into account both the number of times it appears as a maximal one and the number of times it appears as nested. More precisely, the CP value of a pattern p is defined as

CP-value(p) = log2|p| · f(p),                                          if p is not a nested CP
CP-value(p) = log2|p| · ( f(p) − (1/|Tp|) · Σ_{q ∈ Tp} f(q) ),         if p is a nested CP        (2)

where f(p) is the absolute frequency of pattern p in the corpus, |p| is its length (as the number of constituents), Tp is the set of all CPs that contain p, and, consequently, |Tp| is the frequency of its occurrence within other CPs. The CPs whose CP values are within a certain interval can be deemed important: CPs with very high CP values are, as a rule, general contextual patterns, while CPs with low CP values may be irrelevant for comparisons (they typically have low frequencies). Table IV presents some examples of CPs. Note that the mined CPs are not information extraction patterns; they are rather used as an approximation of the contexts in which terms appear.
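A toy sketch of the CP-value weighting in Eq. (2). The patterns and frequencies below are invented for illustration; a pattern's length is its number of constituents, and nesting is checked by contiguous-subsequence containment (our reading of "nested", not necessarily the authors' exact implementation).

```python
import math

# Invented CPs with illustrative corpus frequencies
freq = {
    ("TERM", "V:bind", "to"): 10,
    ("TERM", "V:bind", "to", "CLASS:dna"): 6,
    ("TERM", "V:bind", "to", "CLASS:dna", "PREP:in"): 2,
}

def containers(p):
    # T_p: all CPs that contain p as a contiguous subsequence
    return [q for q in freq
            if len(q) > len(p)
            and any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))]

def cp_value(p):
    t_p = containers(p)
    if not t_p:                        # p is not nested in any other CP
        return math.log2(len(p)) * freq[p]
    avg = sum(freq[q] for q in t_p) / len(t_p)
    return math.log2(len(p)) * (freq[p] - avg)

print(round(cp_value(("TERM", "V:bind", "to", "CLASS:dna")), 2))
# -> 8.0, i.e. log2(4) * (6 - 2)
```

The log2|p| factor rewards longer shared patterns, while the subtracted average discounts occurrences that are better explained by the longer CPs containing p.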

Finally, each term is associated with a set of the most characteristic patterns in which it occurs (we extract left and right patterns separately). Such patterns represent the contextual profile of a term. We treat CPs, i.e., contextual profiles, as terms’ features, and use a Dice-like coefficient to assess contextual similarity (CS) between terms t1 and t2 as follows:

CS(t1, t2) = 2 · ( |CL1 ∩ CL2| + |CR1 ∩ CR2| ) / ( |CL1| + |CL2| + |CR1| + |CR2| )    (3)

(4) Apart from infrequent patterns, consider, for example, a frequently used but noninformative and nondiscriminative context We report on TERM. . .
(5) This measure is analogous to the C-value termhood measure [Frantzi et al. 2000].


Table IV. Examples of (Left) CPs (a)

Context Pattern                                      CP value    “Type”
PREP:of NP                                           232.65      general patterns
PREP:in NP PREP:of NP                                126.47
. . .
PREP:with V:interact NP                              12.32       domain-specific patterns
TERM PREP:on NP PREP:of                              10.17
TERM V:regulate NP PREP:by V:bind NP PREP:to         4.64
NP V:mediate NP PREP:through NP PREP:of              4.27
NP PREP:of NP V:separate PREP:by TERM V:mediate      4.00
TERM PREP:of NP V:inhibit PREP:by                    4.00
. . .
TERM PREP:at NP V:induce                             2.00        low-freq. patterns
TERM NP V:activate PREP:as                           2.00
. . .

(a) Prepositions and the most frequent verbs are instantiated.

Table V. Examples of Contextual Patterns Shared Between NF kappa B and Transcription Factor (a)

Left CPs                                                     Frequency
TERM V:inhibit NP PREP:of TERM                               4
NP V:bind TERM                                               3
TERM V:affect NP PREP:of                                     3
NP V:induce                                                  2
TERM PREP:of TERM PREP:of NP PREP:between TERM               2
PREP:by NP PREP:in NP V:involve NP PREP:between TERM         2
NP V:mediate PREP:by TERM                                    2
NP V:stimulate PREP:with TERM V:inhibit NP PREP:of TERM      2

Right CPs                                                    Frequency
V:activate PREP:by TERM                                      8
V:bind PREP:to NP PREP:in TERM                               6
V:involve PREP:in NP PREP:of NP                              4
TERM V:control NP PREP:of TERM                               2
NP V:associate PREP:with TERM                                2
TERM V:inhibit TERM PREP:of TERM                             2
V:contribute PREP:to NP PREP:of TERM                         2
V:allow NP PREP:of TERM                                      2

(a) In these patterns, verbs and prepositions are instantiated. The CS calculated for these two terms from their contexts extracted from the Genia corpus is 0.56 (note that NF kappa B is a hyponym of transcription factor).

where CL1, CR1, CL2, and CR2 are the sets of left and right CPs associated with terms t1 and t2, respectively. Table V shows examples of contextual patterns shared between two terms.
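The contextual-similarity computation reduces to set operations over left/right contextual profiles. A minimal sketch, with invented pattern sets standing in for mined CPs; the doubled numerator follows the standard Dice-coefficient convention, which is consistent with the CS of 0.56 reported for the Table V pair:

```python
def contextual_similarity(cl1, cr1, cl2, cr2):
    # Dice-like CS: shared left and right CPs, doubled, over the
    # total size of both contextual profiles
    shared = len(cl1 & cl2) + len(cr1 & cr2)
    total = len(cl1) + len(cl2) + len(cr1) + len(cr2)
    return 2 * shared / total if total else 0.0

# Invented contextual profiles (short strings stand in for full CPs)
cl1 = {"NP V:bind TERM", "NP V:induce", "TERM V:affect NP PREP:of"}
cr1 = {"V:activate PREP:by TERM", "V:allow NP PREP:of TERM"}
cl2 = {"NP V:bind TERM", "NP V:induce"}
cr2 = {"V:activate PREP:by TERM"}

print(contextual_similarity(cl1, cr1, cl2, cr2))
# -> 0.75: (2 * (2 + 1)) / (3 + 2 + 2 + 1)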

4. EXPERIMENTS AND EVALUATION

Here we report on an experiment with extracting semantically related terms from the Genia corpus [Kim et al. 2003]. First, we present our evaluation methodology and briefly overview the implementation of the methods (discussed in the previous section) used to extract related terms. Evaluation and comparisons are presented in Sections 4.3–4.5 and 4.6–4.7, respectively; further discussions are given in Section 5.


Table VI. Examples of Term Distances (Based on the Genia Ontology) and Their Contextual, Lexical, and Syntactic Similarities (as Extracted from the Genia Corpus)

Term1 Term2 Distance CS LS SS

human monocyte monocyte 0 0.48 0.75 1.00

B cell T cell 0 0.51 0.67 1.00

B cell line T cell line 0 0.35 0.75 1.00

HeLa cell Jurkat cell 0 0.43 0.67 1.00

HeLa cell Jurkat T cell 0 0.46 0.61 1.00

p50 p65 0 0.51 0.00 1.00

cell survival cell cycle progression 0 0.00 0.11 1.00

tumor necrosis factor tumor necrosis factor α 0 0.37 0.83 0.00

RAI therapy surgery 0 0.00 0.00 1.00

adult T cell primary T cell 0 0.00 0.75 0.00

NF kappa B transcription factor 1 0.56 0.00 0.00

4.1 Evaluation Environment and Methodology

The testing Genia corpus contains 2000 abstracts with manually marked part-of-speech (POS) information. Furthermore, each occurrence of more than 30,000 different terms is tagged and annotated with a corresponding class from the Genia ontology. The ontology contains around 50 hierarchically organized classes, but only leaves (35 nodes) are used for annotations. These annotations were used to evaluate term links mined from the corpus.

It is obvious that defining and applying a consistent and meaningful evaluation approach for assessing relatedness among terms is a huge challenge [Camon et al. 2005]. In our approach, relatedness between two terms was estimated via the distance between the corresponding annotations in the Genia ontology (assigned to the terms in question). More precisely, the following method was used: the distance between two terms is calculated as the mean of the sum of distances (the number of edges) of their respective classes from the nearest common ancestor in the Genia ontology. Thus, if two terms share an annotation, then their distance is 0, while if they belong to sibling classes (i.e., have an immediate common ancestor), then their distance is (1+1)/2 = 1. The maximal distance between two classes (and, consequently, associated terms) in the Genia ontology is 10.

In the experiments reported here, we assume that terms with distances that are less than or equal to 1 (i.e., terms that belong either to the same class or to sibling classes) are highly related, while terms within the distance of 3 are considered as related. For example, the distance between cell survival and cell cycle progression is 0 (highly related terms), between NF kappa B and transcription factor is 1 (highly related), and between CRE and CRE binding protein is 2.5 (related terms). On the other hand, for the purpose of this study, terms with distances above 3 were deemed weakly or nonrelated. Examples of term distances and the values of their similarities mined from the Genia resources are given in Table VI.

We analyzed links between terms from a controlled set of 1749 terms with frequencies of occurrence above 5 in the Genia corpus. Table VII shows the distribution of distances in the controlled set: 15.07% of term pairs contain

ACM Transactions on Asian Language Information Processing, Vol. 5, No. 1, March 2006.


Mining Semantically Related Terms from Biomedical Literature • 31

Table VII. Distribution of Term Distances in the Controlled Set

Distance         0       ≤1      ≤3      >3
Term pairs (%)   15.07   22.32   57.54   42.46

terms that belong to the same class, with 22.32% of pairs with term distances ≤1, and 57.54% of them with distances ≤ 3.

In order to evaluate the approach, we measured the distances (induced by the Genia ontology) among terms that have been suggested as related by their similarities mined from the corpus. These distances were used to assess the effectiveness of the approach. More precisely, for the suggested set of related term pairs, we calculated separate precision/recall values with respect to distances ≤ 0, 1, and 3 as follows. Precision with respect to distance d is the number of extracted pairs whose distance is less than or equal to d in the Genia ontology, over the number of all extracted pairs. Recall with respect to distance d is the number of extracted pairs whose distance is less than or equal to d, over the total number of pairs from the controlled set whose distance is less than or equal to d in the Genia ontology. F measure with respect to distance d was calculated by taking into account the corresponding precision/recall values. We evaluated individual types of term similarities with various thresholds (i.e., when similarities were above a certain value), as well as a combination of the measures (see Section 4.6). Before presenting the results, we briefly describe the implementation of the methods presented in Section 3.
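The evaluation protocol above can be expressed directly in code. This is a sketch under our own naming assumptions; `dist` stands for a precomputed map from each term pair to its Genia-ontology distance.

```python
def evaluate(extracted_pairs, all_pairs, dist, d):
    """Precision, recall, and F measure with respect to a distance
    threshold d: an extracted pair counts as correct if its ontology
    distance is <= d; recall is relative to all controlled-set pairs
    within that distance."""
    correct = sum(1 for p in extracted_pairs if dist[p] <= d)
    relevant = sum(1 for p in all_pairs if dist[p] <= d)
    precision = correct / len(extracted_pairs) if extracted_pairs else 0.0
    recall = correct / relevant if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```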

4.2 Implementation of the Methods

Lexical similarities were calculated for each pair of the controlled terms using Eq. (1). As indicated in Section 3.1, the method for assessing lexical similarities depends on the neutralization of term variations (including expansion of acronyms) and the accurate identification of terminological heads. The latter was mainly dealt with by heuristics: in English, the head is typically the rightmost noun, but, in some cases, biomedical terms appear in a prepositional form (e.g., level of transcription) or end with a postmodifying adjective (e.g., human immunodeficiency virus type 1 or tumor necrosis factor alpha). If the heads are not correctly recognized (as level, virus, and factor, respectively, for the above examples), then the corresponding similarities will not be consistent. Therefore, some neutralization of variation (such as prepositional and inflectional) is necessary to help in this process. For example, the Genia term nuclear factor of activated T cells needs to be normalized (i.e., transformed) into activated T cell nuclear factor prior to any comparisons. In order to neutralize inflectional and simple structural variations, we used the method described in Nenadic et al. [2004b]. This method essentially generates singular terms, resolves acronyms, and transforms (by inversion) each term containing a preposition into an equivalent form.
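The inversion step can be illustrated with a toy normalizer. This is a simplified sketch of the idea only; the actual method of Nenadic et al. [2004b] also singularizes terms and resolves acronyms, both of which are omitted here.

```python
def normalize_term(term):
    """Invert a term containing a preposition so that the head noun
    moves to the rightmost position, e.g. 'level of transcription'
    -> 'transcription level'. Terms without a preposition are
    returned unchanged; singularization is not handled."""
    for prep in (" of ", " in ", " for "):
        if prep in term:
            head, modifier = term.split(prep, 1)
            return f"{modifier} {head}"
    return term
```

Applied to the Genia example, this sketch yields activated T cells nuclear factor; the additional singularization step would produce the form cited in the text.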

The extraction of syntactic and contextual similarities needs corpus preprocessing, but the Genia corpus has already been tagged with the necessary information. Term occurrences have been marked in the corpus and we have, in


addition, cross-linked some types of terminological variants⁶ (such as acronyms and terms containing prepositions) with the respective terms (for details, see also Nenadic et al. [2004b]), in order to improve coverage of the mined links between them.

Syntactic similarities were extracted using the patterns presented in Table III, which have been applied on a version of the Genia corpus where only terms and coordination conjunctions have been tagged. We applied the patterns separately for enumerations and term conjunctions. As term coordinations were disambiguated and marked in the Genia corpus, there were no ambiguities when processing term coordinations and term conjunctions. Pairwise similarities were calculated as follows: if two terms appeared in any expression described by the patterns, the syntactic measure for the pair was set to 1, and to 0 otherwise. Thus, when calculating syntactic similarity, we did not discriminate among different relationships among terms (represented by various patterns), but instead considered terms appearing in the same syntactic role in the same sentence as (highly) related.
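The resulting binary measure is straightforward to compute once the coordination expressions have been identified. This is a sketch; the representation of each enumeration/conjunction as a list of its member terms is our own assumption.

```python
from itertools import combinations

def syntactic_similarities(coordination_expressions):
    """Binary syntactic similarity: 1 for every pair of terms that
    appears together in at least one (already disambiguated)
    enumeration or conjunction expression; pairs never seen together
    are simply absent (i.e., similarity 0)."""
    ss = {}
    for terms in coordination_expressions:
        for t1, t2 in combinations(sorted(set(terms)), 2):
            ss[(t1, t2)] = 1
    return ss
```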

Contextual similarities were extracted from the POS-tagged version of the Genia corpus. We first collected concordances for all controlled terms. For each term occurrence, the maximal⁷ left and right contexts were extracted (without crossing the sentence boundary) and normalized. Contexts containing "stop-list" constituents (e.g., report on, result in) were removed. From the remaining contexts, we first discarded noninformative constituents, namely adjectives (that are not part of terms), adverbs, and determiners, as well as so-called linking "expressions" (e.g., however, moreover). Adjacent nouns/noun phrases were replaced by appropriate regular expressions. At the same time, constituents deemed relevant were instantiated. For the experiments reported here, we instantiated terms and verbs found in contexts, as we have previously shown that they were useful for mining relationships [Nenadic et al. 2003b; Spasic et al. 2003]. Finally, nested contexts were generated by "trimming" the left/right side until the contexts of the minimal length were reached.
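Of the steps above, the final one — generating nested contexts — can be sketched as follows. This is our own minimal rendering, not the original code; the length bounds of 2 and 10 follow the footnote, and only left-side trimming is shown, right-side trimming being symmetric.

```python
def nested_contexts(context_tokens, min_len=2, max_len=10):
    """Generate nested context patterns by trimming a (normalized,
    tokenized) context from the left until the minimal length is
    reached. Returns the patterns as tuples, longest first."""
    ctx = list(context_tokens)[:max_len]
    patterns = []
    while len(ctx) >= min_len:
        patterns.append(tuple(ctx))
        ctx = ctx[1:]  # trim one constituent from the left
    return patterns
```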

Once the CPs were extracted, we calculated their CP values in order to estimate their importance. The top 5% of the ranked patterns were discarded as too general, while the lower CP-value threshold was chosen empirically after several experiments. Each controlled term was then associated with the set of remaining CPs in which it appeared, and pairwise similarities were calculated using formula (3).

4.3 Evaluation of Lexical Similarities

The results achieved with lexical similarities generally show low coverage and high precision. More precisely, by using lexical similarities, we were able to extract only 5% of the total term pairs that are highly related (distance ≤ 1) in the controlled set. For more semantically distant terms (distances ≤ 3), recall was as low as 2.8%. On the other hand, lexical links were fairly accurate. Figure 1

⁶Note that terminological variants are not marked or linked in the Genia corpus.
⁷The minimal and maximal lengths of CPs were chosen empirically; in the experiments reported in this article, these lengths were set to 2 and 10, respectively.


Fig. 1. Precision of lexically related term pairs with regard to their distances (various threshold values for LS are given on the X axis).

shows the precision for various thresholds used to cut off pairs with lower LS. For example, if we consider all lexical similarities (the threshold set to 0), then in 42% of the cases the involved terms belonged either to the same class or to sibling classes (distance ≤ 1), while as many as two thirds of term pairs with LS above 0.25 had terms belonging to the same class, as did all terms with LS > 0.90. However, for such values, the number of involved terms and term pairs fell dramatically: for example, a precision of 99% (with respect to terms with distances ≤ 3) was achieved at a recall point of only 1%.

4.4 Evaluation of Syntactic Similarities

Similar results were obtained for syntactic similarities. In 71% of the cases, terms occurring in an enumeration expression belonged to the same class (precision for distance 0); in 86% of the cases, terms found in term enumerations were members of either the same class or sibling classes (distance ≤ 1); and virtually all were within distance ≤ 3. In the case of term conjunctions, the results were slightly less accurate (the corresponding values were 66, 76, and 98%, respectively). The average distance among terms appearing in conjunction expressions was also double the average distance of terms appearing in enumerations. This suggests that term enumerations express stronger similarity than term conjunctions. However, enumerations were eight times less frequent than conjunctions.

With respect to coverage, the results were expectedly disappointing: only 0.25% of term pairs that belonged to the same class had syntactic similarities extracted from the Genia corpus. Of course, the size of the testing corpus is a limiting factor here; a larger corpus may reveal better coverage for this measure.

4.5 Evaluation of Contextual Similarities

As opposed to LS and SS, contextual similarities resulted in significantly higher coverage and modest precision. When no threshold for CS values was used, 15% of contextually related terms belonged to the same class (precision for distance 0) and 58% of them were within distances ≤ 3 (related terms). However, in the


Fig. 2. Precision of contextually related term pairs with regard to their distances (various threshold values for CS are given on the X axis).

latter case, the contextual measure covered more than 80% of related term pairs (compared to 2.8% covered by lexical similarity). When the threshold value was set to one-half of the maximal CS value in the controlled set, almost one-half of contextually related pairs contained terms either from the same or from sibling classes, while in 78% of the cases distances were ≤ 3 (see Figure 2). The results for contextual similarities were, to some extent, comparable to those achieved for lexical similarities in terms of precision when extracted term sets with a comparable number of elements were considered.

Further experiments with contextual similarities have demonstrated that terms belonging to the same or sibling classes have a higher degree of contextual relatedness than terms belonging to different classes. More precisely, for the controlled set of terms and associated classes, we calculated that the average CS for terms that belonged to the same class was 0.15 (microaverage⁸) and 0.17 (macroaverage), compared to 0.12 for across-class pairs.
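The two averages can be computed as follows. This is a sketch; the `cs_by_class` mapping from each class to the CS values of its within-class term pairs is our own representation, not the authors' data structure.

```python
def micro_macro_average(cs_by_class):
    """Micro- vs. macroaveraged within-class contextual similarity:
    microaveraging pools all within-class pairs, while macroaveraging
    takes the mean of the per-class means [cf. Yang 1997]."""
    pooled = [v for values in cs_by_class.values() for v in values]
    micro = sum(pooled) / len(pooled)
    macro = sum(sum(vs) / len(vs) for vs in cs_by_class.values()) / len(cs_by_class)
    return micro, macro
```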

As an additional test, we further examined the quality and consistency of contextual similarities by comparing the values of CS among variants that denoted the same terms (in this case, we used the original Genia corpus without linking variants). Typically, terminological variants were mutually highly contextually similar and, in the majority of cases, even the most similar⁹ (see Table VIII for examples). This demonstrates that CS can be used as a consistent indicator of relatedness between terms.

4.6 Combining Similarities

In order to make use of all information mined from the literature, and in an attempt to improve the accuracy and coverage of the mined relationships, we experimented with a linear combination of the similarity measures. We calculated

⁸In microaveraging, precision is averaged over the number of terms, while macroaveraging gives the mean precision for each class [Yang 1997].
⁹For example, HIV and human immunodeficiency virus were mutually the most contextually similar terms.


Table VIII. Examples of Contextual Similarity Among Terminological Variantsᵃ

Term1                                Term2                                  CS
open reading frame                   open reading frames                    0.53
transcription activation             transcriptional activation             0.39
HIV infection                        AIDS                                   0.34
HIV 1                                AIDS                                   0.25
HIV                                  human immunodeficiency virus           0.23
HIV                                  HIV 1                                  0.19
human immunodeficiency virus         human immunodeficiency virus type 1    0.18
human T cell leukemia virus type 1   human T cell leukemia virus type I     0.18

ᵃThe maximal CS in the Genia corpus was 0.60.

a hybrid term similarity measure (called the CLS similarity) as follows:

CLS(t1, t2) = α · CS(t1, t2) + β · LS(t1, t2) + γ · SS(t1, t2)    (4)

where α + β + γ = 1. The choice of the weights in Eq. (4) is not a trivial problem, and an automatic learning method can be used to suggest an optimal solution [Spasic et al. 2002]. For the experiments performed here (see the following section), the best performance in terms of F measure was achieved for α = 0.2, β = 0.7, and γ = 0.1.
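The combined measure is then a one-liner in code (a sketch; the default weights are the best-performing values reported above):

```python
def cls_similarity(cs, ls, ss, alpha=0.2, beta=0.7, gamma=0.1):
    """Hybrid CLS similarity: a convex combination of the contextual
    (CS), lexical (LS), and syntactic (SS) similarities of a term pair."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9  # weights must sum to 1
    return alpha * cs + beta * ls + gamma * ss
```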

4.7 Comparisons

Tables IX–XI show detailed comparisons among the different similarity metrics¹⁰ with respect to distances 0, 1, and 3. In all cases, contextual similarities have much wider coverage, and their F scores are significantly higher than those for lexical and syntactic similarities. Precision-wise, syntactic similarities are the most accurate, but at extremely low recall. The results have also shown a slightly improved F measure for CLS compared to any measure on its own. By combining the various types of metrics, we achieved F measures of 68% for related terms (distances ≤ 3) and 37% for highly related entities (distances ≤ 1).

In order to compare the methods presented in this article with other approaches, we further analyzed term associations based on standard term co-occurrences within sentences and abstracts (using the same controlled set). We adopted the following approach: two terms were deemed semantically related if they co-occurred in the same sentence/abstract. The recall results show that, when using term co-occurrences within sentences and abstracts in the Genia corpus, we were able to extract only 1.5 and 4.5% of all pairs with distances ≤ 3, respectively. As for precision, as one can expect, within-sentence co-occurrences were more accurate: in 30% of the cases, terms co-occurring within a sentence had distances ≤ 1, compared to 24% for within-abstract co-occurrences. If we consider terms with distances ≤ 3, then similarities based on term co-occurrences have a precision of 60% for within-sentence co-occurrences and 57% for abstract-based co-occurrences. Thus, in more than 40% of the cases, co-occurring term pairs have distances > 3 (these outcomes confirm the results presented previously in Jenssen et al. [2001] and Tao and Leibel [2002]). In most cases, the performance of co-occurrence-based similarities is, to some

¹⁰In all cases, the thresholds were set to 0.


Table IX. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distance = 0)

Similarity Measure           Recall   Precision   F Measure
LS                           0.05     0.36        0.09
SS                           0.0025   0.65        0.005
CS                           0.83     0.15        0.25
CLS                          0.83     0.16        0.27
co-occurrence in sentences   0.02     0.20        0.04
co-occurrence in abstracts   0.05     0.16        0.08

Table X. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distances ≤ 1)

Similarity Measure           Recall   Precision   F Measure
LS                           0.04     0.42        0.07
SS                           0.002    0.76        0.004
CS                           0.83     0.23        0.36
CLS                          0.83     0.24        0.37
co-occurrence in sentences   0.02     0.30        0.04
co-occurrence in abstracts   0.05     0.24        0.08

Table XI. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distances ≤ 3)

Similarity Measure           Recall   Precision   F Measure
LS                           0.03     0.78        0.06
SS                           0.0009   0.93        0.002
CS                           0.82     0.57        0.67
CLS                          0.82     0.58        0.68
co-occurrence in sentences   0.015    0.60        0.04
co-occurrence in abstracts   0.045    0.57        0.09

extent, comparable to lexical relatedness and outperforms syntactic similarities, while the performance of contextual similarities is significantly better than co-occurrence-based relationships. More precisely, while the precisions of the co-occurrence-based approach and contextual similarities are comparable, the recall of the latter is considerably better.
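The co-occurrence baseline used in this comparison amounts to the following. This is a sketch; representing each unit as the list of terms recognized in one sentence or abstract is our own assumption about the data layout.

```python
from itertools import combinations

def cooccurrence_pairs(units):
    """Baseline relatedness: two terms are deemed related if they
    co-occur in the same unit (sentence or abstract)."""
    pairs = set()
    for terms in units:
        pairs.update(combinations(sorted(set(terms)), 2))
    return pairs
```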

5. DISCUSSION

The presented methods for mining semantically related terms are based on either internal lexical similarities or external aspects of term occurrences in documents (co-occurrences, syntactic, and contextual similarities). The internal aspect makes use of naming associations that have been "built" into terms, while the external aspects rely on various levels of similarity in their usage. While co-occurrences rely on simple within-sentence or within-document distributions, syntactic similarities capture appearances in specific expressions (i.e., phrases), and contextual similarities indicate overall resemblances of the contexts in which terms appear. In general, our results suggest that phrases (used for syntactic similarities) were relatively more accurate than the other approaches, but extremely sparse. Term co-occurrences within sentences and documents might not reveal many term links, as they are typically confined to single


sentences/documents, while contextual similarities integrate information from various sources and, consequently, improve recall.

The results presented show that only 5% of the highly related terms (terms with distances ≤ 1) are lexically linked. In terms of recall, LS is more effective for highly than for weakly related terms. Lexical similarities are also accurate in predicting links among terms that have high values for LS. To further examine the consistency of the measure, we analyzed within-class lexical similarities for a subset of the Genia classes induced by the controlled set of terms, and also the mean across-class LS. Apart from the protein class, the average LS among term pairs belonging to the same class (microaverage 0.43) was greater than the average LS for the whole term collection (0.27), which, in turn, was greater than the average LS among terms belonging to different classes (0.18). The only class with a mean LS lower than the average for the whole collection was the protein class. This confirms that protein names do exhibit a higher degree of lexical variability than names of other biological classes (as indicated in Fukuda et al. [1998] and Tanabe and Wilbur [2002]).

Relationships that rely on term co-occurrences in enumeration and conjunction expressions provide a similarity measure with the highest precision, but with extremely low recall, as terms do not frequently appear in such expressions relative to their number of occurrences (in particular, in smaller corpora). As indicated above, the size of the corpus is an important factor, and a larger corpus may reveal better recall. Analogously, bigger corpora may improve the performance of co-occurrence-based relationships.

The contextual similarity presented here is, to some extent, a generalization of the work of Hindle [1990] and Grefenstette [1994], who used only subject–verb, object–verb, and adjective–noun relationships as a basis for establishing word similarities. On the other hand, we use more general patterns that describe various contexts, not necessarily predicate–argument structures. Further, our approach generalizes the contextual clustering approach presented in Maynard and Ananiadou [1999], which was based on a set of manually predefined semantic frames that were semiautomatically tuned by corpus processing.

Contextual similarities have shown significantly higher recall compared to the other measures. Although one can argue that these results could be biased to some extent, as we considered more frequently occurring terms, additional experiments have shown that similar results were obtained for more and less frequently occurring terms. The experiments have also demonstrated the consistency of CS, as terms belonging to the same or sibling classes have a higher degree of contextual similarity than terms belonging to different classes. In addition, tests on contextual correspondence among equivalent term variants have shown that the CS performance is coherent: in many cases, terminological variants (i.e., synonyms) are mutually the most contextually similar counterparts.

Since it is based on capturing unrestricted recurrent contextual patterns, CS can reveal not only named and known relationships (as is the case with LS and SS), but also some latent links among terms, which can be beneficial for mining new or unknown relationships among entities. For example, CS revealed


links (such as the link between breast cancer and cell proliferation, or between 9-cis-RA and signal transduction pathways [Heyman et al. 1992]) that were not captured by the other two approaches (lexical and syntactic). In our controlled set, only 2% of related term pairs could have been discovered by each of the three metrics. By combining the single similarities into the CLS similarity, we used all information mined for a pair of terms.

Note that the results obtained on the controlled set may be biased, to some extent, as our evaluation methodology is mainly based on the hyponymy relationships represented through the Genia ontology, which are, in some cases, consistently reflected via lexical modifications of terms. This suggests that the precision of LS might be lower if another evaluation environment were used. Also, syntactic similarity could have benefited more from the chosen methodology. On the other hand, some relationships recognized by CS were considered false positives, as they are not suitably represented in the Genia ontology. For example, although suggested as contextually related, the pair glucocorticoid receptor and glucocorticoid receptor function was considered a false positive, as their distance was above 3. Thus, the precision of CS might be higher if another evaluation scheme were used.

Further improvements can be made to the measures presented. For example, contextual similarity can be enhanced by incorporating additional weights and statistical and distributional properties for comparing term contextual profiles. For example, if two terms appear exclusively and/or frequently in a certain context, then this fact is more important than an "incidental" sharing of a context pattern. Another challenge is to handle modality and negation appearing in a given context. Further, links based on various syntactic similarities among terms (represented by different patterns) can be weighted (e.g., enumerations seem to be more accurate than conjunctions); the values of syntactic similarity can also be parameterized by the number of patterns in which two terms appear simultaneously. Lexical similarity can be generalized (in particular, for single-word terms) by combining alternative methods for lexical comparison (e.g., approximate string matching or character-based n-gram comparisons). Finally, the three measures were integrated by a linear combination, but other approaches (such as polynomial combinations) could improve performance. A further challenge is to automatically discover the type of the link among semantically related terms.

The suggested method for the extraction of semantically related terms can be used in several biomedical text-mining scenarios. Apart from mining links among terms, the extracted similarities can be used as a basis for term classification [cf. Spasic et al. 2004; Spasic and Ananiadou 2005] and for term sense disambiguation (e.g., by comparing a contextual pattern corresponding to a given ambiguous term occurrence with the patterns relevant to each of the term senses). Furthermore, the most significant CPs extracted from a domain corpus may be used to semiautomatically suggest patterns relevant for various information extraction tasks. For example, the pattern V:inhibit TERM1 PREP:of TERM2, associated with transcription factor and other related terms, can be used as a template to extract information about the inhibition of certain bioprocesses:


TERM1 typically contains the information about the process¹¹ in question, which is influenced by a given transcription factor, while TERM2 fills the "slot" corresponding to the respective target¹² of the inhibition.

Finally, the demand from the biomedical user community is directed toward systems that are able to recognize, extract, and relate entities and events from a large body of literature, so that they can be visualized and analyzed [Fukuda and Takagi 2004; Camon et al. 2005]. For this, the extraction of term relationships is essential. For example, many researchers are interested in information latent in huge repositories of biomedical text that can help in the generation of new hypotheses, and are mainly interested in wider coverage of possible links between entities. Once relatedness between two terms is identified, it can be used either to propagate knowledge that we have about one of them to the other, or to hypothesize a novel link between them. This would be particularly feasible when text mining is harnessed with experimental data derived by postgenomic techniques, such as expression arrays and sequence analysis. Outlier detection between textual and nontextual information can also be a very powerful method for knowledge discovery. If, for instance, entities that appear linked from the results of text mining behave very differently under a particular set of experimental conditions, then this can suggest that the experiment is uncovering something that was previously unknown and is worthy of further investigation. Similarly, relationships mined from text can reveal inconsistencies or contradictions in the literature, or identify gaps in existing knowledge by suggesting possible links among entities.

6. CONCLUSION

In this article, we presented and evaluated a method for the automatic mining of semantically related terms that is based on the comparison of their lexical, syntactic, and contextual profiles. One of the most important advantages of the approach is that it is entirely data-driven, as the terminological information is collected automatically from the literature without using external resources.

Lexical similarities are based on the level of sharing of constituents among terms and are highly accurate. However, lexical similarity has low coverage: only 5% of the semantically closest terms are lexically related. Syntactic similarities rely on co-occurrences in specific expressions (such as enumerations and term conjunctions), which provide a term similarity measure with high precision, but with extremely low recall. These two methods are rather limited when it comes to discovering new relationships among terms, as they mainly rely on explicitly expressed associations (within term names or specific phrases). Therefore, if we aim at supporting systematic knowledge acquisition and discovery, then higher recall and contextual usage patterns are essential. In our approach, contextual similarities are extracted by automatic pattern mining. Such patterns are used as an approximation of the contexts

¹¹In the Genia corpus, processes found in this pattern include transcription, activation, apoptosis, differentiation, translation, cell death, proliferation, etc.
¹²For example, mRNA, carcinoma cells, macrophages, plasmids, etc.


in which terms appear. Contextual similarities are not confined to information represented only in individual documents, as the method collects patterns from several documents. Compared to the other measures, the recall of contextual similarity was significantly higher (at similar precision values). Overall, the combined CLS metric achieved F measures of 68% for semantically related terms and 37% for highly related entities.

As opposed to approaches that are designed to extract only predefined and specific types of relations, our method can reveal not only named and explicit relationships, but also some latent links among terms. Since it is based on capturing unrestricted recurrent contextual patterns, such links can be beneficial for mining new or unknown relationships among entities. This can assist in the process of discovering and formulating new hypotheses or predictions by suggesting possible new relations among terms. Term relationships can also help in bridging the gap that exists between collective knowledge (represented by the domain literature) and the individual requirements (or acquaintance) of domain specialists. They can be used to support the systematic curation, updating, and adjustment of existing terminological resources, the resolution of term ambiguities, semantic document indexing (annotation), similarity-based document retrieval, document categorization/classification, as well as the learning of domain-specific patterns relevant for information extraction. All of these comprise future research directions.

ACKNOWLEDGMENTS

The UK National Centre for Text Mining (NaCTeM, http://www.nactem.ac.uk) is funded by the Joint Information Systems Committee (JISC), the Biotechnology and Biological Sciences Research Council (BBSRC), and the Engineering and Physical Sciences Research Council (EPSRC).

REFERENCES

ANANIADOU, S. 1994. A methodology for automatic term recognition. In Proceedings of COLING94, Kyoto, Japan. 1034–1038.

ANANIADOU, S. AND NENADIC, G. 2006. Automatic terminology management in biomedicine. In S. Ananiadou and J. McNaught (Eds.), Text Mining for Biology and Biomedicine. Artech House Books, 67–98.

ANDRADE, M. AND VALENCIA, A. 1997. Automatic annotation for biological sequences by extraction of keywords from Medline abstracts. Development of a prototype system. In Proceedings of Intelligent Systems for Molecular Biology 5, 1, 25–32.

BLAKE, C. AND PRATT, W. 2001. Better rules, fewer features: a semantic approach to selecting features from text. In Proceedings of IEEE Data Mining Conf., San Jose, California. 59–66.

BLAKE, J. A., RICHARDSON, J. E., BULT, C. J., KADIN, J. A., EPPIG, J. T., AND THE MOUSE GENOME DATABASE GROUP. 2003. MGD: the Mouse Genome Database. Nucleic Acids Research 31, 193–195.

BLASCHKE, C., ANDRADE, M., OUZOUNIS, C., AND VALENCIA, A. 1999. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of Intelligent Systems for Molecular Biology 99, 60–67.

BOURIGAULT, D. AND JACQUEMIN, C. 1999. Term extraction + term clustering: an integrated platform for computer-aided terminology. In Proceedings of the 8th EACL, Bergen. 15–22.

CAMON, E., MAGRANE, M., BARRELL, D., BINNS, D., FLEISCHMANN, W., KERSEY, P., MULDER, N., OINN, T., MASLEN, J., COX, A., AND APWEILER, R. 2003. The Gene Ontology Annotation (GOA) project: implementation of GO in Swiss-Prot, TrEMBL and InterPro. Genome Research 13, 4, 662–672.

ACM Transactions on Asian Language Information Processing, Vol. 5, No. 1, March 2006.

CAMON, E., BARRELL, D., DIMMER, E., LEE, V., MAGRANE, M., MASLEN, J., BINNS, D., AND APWEILER, R. 2005. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 6, Suppl 1, S17.

COLLIER, N., NOBATA, C., AND TSUJII, J. 1999. Automatic term identification and classification in biological texts. In Proceedings of Natural Language Pacific Rim Symposium. Beijing, China. 369–374.

CRAVEN, M. AND KUMLIEN, J. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of Intelligent Systems for Molecular Biology 99, 77–86.

DING, J., BERLEANT, D., NETTLETON, D., AND WURTELE, E. 2002. Mining Medline: abstracts, sentences, or phrases? In Proceedings of Pacific Symposium on Biocomputing 7. 326–337.

DONALDSON, I., MARTIN, J., DE BRUJIN, B., WOLTING, C., ET AL. 2003. PreBIND and Textomy—mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4, 11.

FRANTZI, K., ANANIADOU, S., AND MIMA, H. 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 3, 2, 115–130.

FRIEDMAN, C., LIU, H., SHAGINA, L., ET AL. 2001a. Evaluating UMLS as a source of lexical knowledge for medical language processing. In Proceedings of AMIA 2001. 189–193.

FRIEDMAN, C., KRA, P., YU, H., KRAUTHAMMER, M., AND RZHETSKY, A. 2001b. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, 1, S74–S82.

FUKUDA, K. AND TAKAGI, T. 2004. A pathway editor for literature-based knowledge curation. In Proceedings of Asia-Pacific Bioinformatics Conf., Dunedin, New Zealand. 339–344.

FUKUDA, K., TSUNODA, T., TAMURA, A., AND TAKAGI, T. 1998. Toward information extraction: identifying protein names from biological papers. In Proceedings of Pacific Symposium on Biocomputing 98, 707–718.

GAIZAUSKAS, R., DEMETRIOU, G., ARTYMIUK, P., AND WILLETT, P. 2003. Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19, 1, 135–143.

GREFENSTETTE, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, MA.

HARKEMA, H., GAIZAUSKAS, R., HEPPLE, M., ROBERTS, A., ROBERTS, I., DAVIS, N., AND GUO, Y. 2004. A large-scale terminology resource for biomedical text processing. In Proceedings of BioLINK 2004. 53–60.

HATZIVASSILOGLOU, V., DUBOUE, P., AND RZHETSKY, A. 2001. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 17, 1, S97–S106.

HEARST, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING 92, Nantes, France. 539–545.

HEYMAN, R. A., MANGELSDORF, D. J., DYCK, J. A., STEIN, R. B., EICHELE, G., EVANS, R. M., AND THALLER, C. 1992. 9-cis retinoic acid is a high affinity ligand for the retinoid X receptor. Cell 68, 2, 397–406.

HINDLE, D. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL 1990. 268–275.

HIRSCHMAN, L., PARK, J., TSUJII, J., WONG, L., AND WU, C. 2002. Accomplishments and challenges in literature data mining for biology. Bioinformatics 18, 12, 1553–1561.

HIRSCHMAN, L., YEH, A., BLASCHKE, C., AND VALENCIA, A. 2005. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6, Suppl 1, S1.

JENSSEN, T., LAEGREID, A., KOMOROWSKI, J., AND HOVIG, E. 2001. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28, 21–28.

KAZAMA, J., MAKINO, T., OHTA, Y., AND TSUJII, J. 2002. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the Workshop on NLP in Biomedical Domain, ACL 2002. Philadelphia, PA. 1–8.

KIM, J., OHTA, T., TATEISI, Y., AND TSUJII, J. 2003. GENIA corpus—a semantically annotated corpus for bio-text-mining. Bioinformatics 19, Suppl 1, i180–i182.

KRAUTHAMMER, M. AND NENADIC, G. 2004. Term identification in the biomedical literature. Journal of Biomedical Informatics (Special Issue on Named Entity Recognition in Biomedicine) 37, 6, 512–526.


KREBS, K. E., PROUTY, S. M., ZAGON, I. S., AND GOODMAN, S. R. 1987. Structural and functional relationship of red blood cell protein 4.1 to synapsin I. Am J Physiol. 253, 4 Pt 1, C500–505.

LEON, S., TOURAINE, B., RIBOT, C., BRIAT, J. F., AND LOBREAUX, S. 2003. Iron-sulphur cluster assembly in plants: distinct NFU proteins in mitochondria and plastids from Arabidopsis thaliana. Biochem. J. 371, 823–830.

MACK, R. AND HEHENBERGER, M. 2002. Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today 7, 11 (Suppl.), S89–S98.

MARCOTTE, E., XENARIOS, I., AND EISENBERG, D. 2001. Mining literature for protein-protein interactions. Bioinformatics 17, 4, 359–363.

MAYNARD, D. AND ANANIADOU, S. 1999. A linguistic approach to terminological context clustering. In Proceedings of Natural Language Pacific Rim Symposium 99. Beijing, China.

NG, S. AND WONG, M. 1999. Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics 10, 104–112.

NENADIC, G., MIMA, H., SPASIC, I., ANANIADOU, S., AND TSUJII, J. 2002. Terminology-based literature mining and knowledge acquisition in biomedicine. International Journal of Medical Informatics 67, 1–3, 33–48.

NENADIC, G., SPASIC, I., AND ANANIADOU, S. 2003a. Terminology-driven mining of biomedical literature. Bioinformatics 19, 8, 938–943.

NENADIC, G., RICE, S., SPASIC, I., ANANIADOU, S., AND STAPLEY, B. 2003b. Selecting text features for gene name classification: from documents to terms. In Proceedings of the Workshop on NLP in Biomedicine, ACL 2003. Japan. 121–128.

NENADIC, G., SPASIC, I., AND ANANIADOU, S. 2004a. Mining term similarities from corpora. Terminology 10, 1, 55–80.

NENADIC, G., ANANIADOU, S., AND MCNAUGHT, J. 2004b. Enhancing automatic term recognition through recognition of variation. In Proceedings of COLING 2004. Geneva. 604–610.

NENADIC, G., SPASIC, I., AND ANANIADOU, S. 2004c. Mining biomedical abstracts: what's in a term? In Proceedings of IJCNLP 2004. 247–254.

OGREN, P., COHEN, K., ACQUAAH-MENSAH, G., EBERLEIN, J., AND HUNTER, L. 2004. The compositional structure of Gene Ontology terms. In Proceedings of Pacific Symposium on Biocomputing 2004. 214–225.

PUSTEJOVSKY, J., CASTAÑO, J., ZHANG, J., KOTECKI, M., AND COCHRAN, B. 2002. Robust relational parsing over biomedical literature: extracting inhibit relations. In Proceedings of Pacific Symposium on Biocomputing 2002. 362–373.

RAYCHAUDHURI, S., CHANG, J., SUTPHIN, P., AND ALTMAN, R. 2002. Associating genes with Gene Ontology codes using a maximum entropy analysis of biomedical literature. Genome Research 12, 203–214.

SHATKAY, H. AND FELDMAN, R. 2003. Mining the biomedical literature in the genomic era: an overview. Journal of Computational Biology 10, 6, 821–856.

SKUCE, D. AND MEYER, I. 1991. Terminology and knowledge engineering: exploring a symbiotic relationship. In Proceedings of 6th Workshop on Knowledge Acquisition for Knowledge-Based Systems. Banff. 29.1–29.21.

SPASIC, I. AND ANANIADOU, S. 2005. A flexible measure of contextual similarity for biomedical terms. In Proceedings of Pacific Symposium on Biocomputing 2005. 197–208.

SPASIC, I., NENADIC, G., MANIOS, K., AND ANANIADOU, S. 2002. Supervised learning of term similarities. In Intelligent Data Engineering and Automated Learning (IDEAL'02), Lecture Notes in Computer Science 2412, Springer Verlag. 429–434.

SPASIC, I., NENADIC, G., AND ANANIADOU, S. 2003. Term classification using domain-specific verb complementation patterns. In Proceedings of NLP in Biomedicine, ACL 2003. 17–24.

SPASIC, I., NENADIC, G., AND ANANIADOU, S. 2004. Learning to classify biomedical terms through literature mining and genetic algorithms. In Intelligent Data Engineering and Automated Learning (IDEAL'04), Lecture Notes in Computer Science 3177, Springer Verlag. 345–351.

STAPLEY, B. J. AND BENOIT, G. 2000. Bibliometrics: information retrieval and visualization from co-occurrence of gene names in Medline abstracts. In Proceedings of Pacific Symposium on Biocomputing 2000. 526–537.


STAPLEY, B. J., KELLEY, L. A., AND STERNBERG, M. J. E. 2002. Predicting the sub-cellular location of proteins from text using support vector machines. In Proceedings of Pacific Symposium on Biocomputing 2002. 374–385.

TANABE, L. AND WILBUR, W. 2002. Tagging gene and protein names in biomedical text. Bioinformatics 18, 8, 1124–1132.

TAO, Y. AND LEIBEL, R. 2002. Identifying functional relationships among human genes by systematic analysis of biological literature. BMC Bioinformatics 3, 16.

YAKUSHIJI, A., TATEISI, Y., MIYAO, Y., AND TSUJII, J. 2001. Event extraction from biomedical papers using a full parser. In Proceedings of Pacific Symposium on Biocomputing 2001. 408–419.

YANG, Y. 1997. An evaluation of statistical approaches to text categorization. Information Retrieval 1, 1/2, 69–90.

YEGANOVA, L., SMITH, L., AND WILBUR, W. J. 2004. Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28, 2, 97–107.

Received December 2004; revised April 2005; accepted June 2005


