A text-mining system for knowledge discovery from biomedical documents

by N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, and K. Takeda

This paper describes the application of IBM TAKMI for Biomedical Documents to facilitate knowledge discovery from the very large text databases characteristic of life science and healthcare applications. This set of tools, designated MedTAKMI, is an extension of the TAKMI (Text Analysis and Knowledge MIning) system originally developed for text mining in customer-relationship-management applications. MedTAKMI dynamically and interactively mines a collection of documents to obtain characteristic features within them. By using multifaceted mining of these documents together with biomedically motivated categories for term extraction and a series of drill-down queries, users can obtain knowledge about a specific topic after seeing only a few key documents. In addition, the use of natural language techniques makes it possible to extract deeper relationships among biomedical concepts. The MedTAKMI system is capable of mining the entire MEDLINE database of 11 million biomedical journal abstracts. It is currently running at a customer site.

The life science industry is an emerging market in which application spaces, such as drug discovery and development in the pharmaceutical sector and clinical record management in health care, have become areas of significant recent interest.1 Documents in the scientific literature play an important role in life science by serving as a potential source for underlying knowledge discovery. These documents are a rich repository of information on relationships among biomedical concepts such as genes, proteins, diseases, and a variety of other key topics.

Text mining is a technology that makes it possible to discover patterns and trends semiautomatically from huge collections of unstructured text.2–6 It is based on technologies such as natural language processing, information retrieval, information extraction, and data mining.7 Early papers in this area mentioned the possibility of knowledge discovery from the biomedical literature. Hearst, one of the founders of text mining, proposed a system for predicting the functions of unknown genes using biomedical documents.2 Swanson also described the idea of discovering new knowledge from the biomedical literature.4 Subsequently, considerable research has been done in the areas of biomedical concept extraction (named-entity extraction), relationship extraction, and network/pathway construction for protein-protein interaction. However, although text mining has proved a promising approach for knowledge discovery from text sources, certain specific problems are encountered when trying to apply it to the realm of life science.

First, existing approaches are incapable of handling the vast amount of textual domain-specific information available. Indeed, there is more data available than anyone could possibly read or digest. For example, MEDLINE**8 is a database of over 11 million citations (abstracts) of biomedical articles dating back to the 1960s. MEDLINE is widely used as a gold standard for text-mining systems in life science, and several text-mining applications using MEDLINE have been proposed. The MedMeSH Summarizer9 extracts MeSH** (Medical Subject Headings) terms10 that can summarize the nature of a cluster of gene names obtained from DNA microarrays (also called DNA chips). MedMiner11 is a system that filters information for the PubMed** search engine.12 Obviously, any approach that applies text-mining methods to such a large document collection must be highly scalable and robust.

Second, existing information extraction systems only provide extracted concepts and relationships in a fixed way. Because these systems are noninteractive, it is difficult to iteratively apply mining processes on their results directly. With an interactive text-mining system, users are better able to discover hidden knowledge by using a combination of mining functions and a trial-and-error approach.

To address these problems we have developed a text-mining system called IBM TAKMI* for Biomedical Documents (designated MedTAKMI hereafter), which is capable of mining the entire MEDLINE database in an interactive manner. The predecessor of this system, TAKMI (Text Analysis and Knowledge MIning), is a text-mining system for customer relationship management (CRM), which has been successfully used in call centers to mine customer support call logs.13 The MedTAKMI system extends TAKMI to provide a useful set of tools for knowledge discovery from biomedical documents. MedTAKMI is designed to handle large document sets and is thus capable of mining the entire set of MEDLINE citations.

The development of methods for extracting information on such biomedical concepts as genes, proteins, and diseases from text is an active area of research14–19 and typically involves the following two primary subtasks:

1. Entity extraction: the recognition of gene, protein, and chemical names from biomedical text

2. Relation extraction: the extraction of relationships among these entities

Thus, architecturally, MedTAKMI consists of two main components designed to handle information extraction and entity/relationship mining.

The MedTAKMI system performs entity extraction based on dictionary lookup. This approach is simple conceptually and can recognize entities very quickly. We have developed a large domain dictionary that contains two million biomedical entities. These entities and their associated category names are used as keywords in the MedTAKMI system so that users can search for documents that contain a keyword within a specific category, for example, a query on the keyword “p53” within the gene category.
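The category-qualified lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-entry dictionary, the whitespace tokenizer, and the names `extract_entities` and `build_index` are hypothetical and do not reflect MedTAKMI's actual dictionary or index format.

```python
from collections import defaultdict

# Hypothetical toy dictionary: surface term -> category (not MedTAKMI's real data).
DICTIONARY = {
    "p53": "gene",
    "aml1": "gene",
    "leukemia": "disease",
}

def extract_entities(text):
    """Return (term, category) pairs for dictionary terms found among the tokens."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [(tok, DICTIONARY[tok]) for tok in tokens if tok in DICTIONARY]

def build_index(docs):
    """Map (category, term) -> set of ids of documents mentioning that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term, category in extract_entities(text):
            index[(category, term)].add(doc_id)
    return index

docs = {
    1: "Mutations of p53 are frequent in leukemia.",
    2: "AML1 defects are thought to cause acute leukemia.",
}
index = build_index(docs)
# Query: documents that contain "p53" within the gene category.
gene_p53_docs = index[("gene", "p53")]
```

Keying the index on the (category, term) pair is what lets a query distinguish “p53” as a gene name from the same string in any other category.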

In a preprocessing stage, input documents are parsed by a shallow syntactic parser which extracts keywords (entities) with category labels, as well as any binary and ternary relationships that may exist among these entities. The MedTAKMI runtime engine then uses this information to provide mining functions to users. Categories are constructed from public ontological knowledge, for example, using the MeSH terms in MEDLINE or the resources provided by Gene Ontology**.3 User-defined resources may also be employed.

There has been extensive research in relation extraction,20–35 wherein the goal is to extract relationships among biomedical entities (e.g., proteins and genes) from patterns such as “A inhibits B” and “A activates B,” where A and B represent specific entities. Such relationships may be extracted by using one or more of the following information and methods:

● Surface string patterns20

● Syntactic information from shallow parsing21,22 and full parsing23–27

● Templates and rules28–30

● Statistical information with machine learning32–35

In particular, the MedTAKMI system uses syntactic information with a shallow parser to extract binary (a noun and a verb) and ternary (two nouns and a verb) relationships. These relationships from the document collection are aggregated and can be displayed by category viewers as described later.

As previously noted, MedTAKMI is an extension of TAKMI, a text-mining system for CRM.13 The main differences between these two systems are the following:

● The use of hierarchical categories: TAKMI only supports flat categories such as product names. For MedTAKMI we developed a hierarchical category viewer because most biomedical entities (e.g., genes and diseases) are defined hierarchically.

● The extraction of ternary relationships to capture protein-protein interaction by using deeper language analysis

● The introduction of support for domain-specific mining functions

● The development of a new system architecture and componentization structure

This paper is organized as follows: The next section describes the key features of MedTAKMI, including the system architecture and the information extraction process. We then introduce the searching and mining functionalities of MedTAKMI. The following section provides an example of the application of the MedTAKMI system to a specific user scenario. Finally, we summarize our work and describe directions for future research.

Features of the MedTAKMI mining system

The MedTAKMI architecture consists of two main components: a preprocessing information extraction stage and a runtime search/mining server, as shown in Figure 1. In this section we briefly discuss these main components and then describe methods for information extraction.

pair of forms for each term: a surface form and a canonical form.) The obtained canonical words are embedded in the text document as annotations in XML (eXtensible Markup Language). In Step 2, the annotated text is passed to a syntactic parser. The parser outputs segments of phrases labeled with their syntactic roles, for example NP (noun phrase) or VG (verb group). The category annotator then assigns categories to the terms in these segments and phrases. The category dictionary consists of a set of canonical forms and their categories, which also indicates the node label in the hierarchy of categories. The hierarchical categories are in turn imported from existing hierarchies, such as the MeSH terms in MEDLINE, or from user-designed resources. Syntactic relationships among these entities, for example, subject-verb (S-V) or subject-verb-object (S-V-O), are also extracted from the output of the parser. All extracted information is finally encoded into an index file that is used by the runtime part of the system.

[Figure: information extraction for the example sentence “Repetitive sequence-based polymerase chain reaction effects deoxyribonucleic acids.” Step 1: Term annotation. Step 2: Parsing and assigning categories, in which “repetitive sequence-based polymerase chain reaction” is labeled as a proper noun and the relations “repetitive sequence-based polymerase chain reaction ... effect” (S ... V) and “repetitive sequence-based polymerase chain reaction ... effect ... DNA” (S ... V ... O) are extracted.]

Search/mining server process. The search and mining server, shown in the lower part of Figure 1, provides users with the searching and mining services described in detail later. The MedTAKMI system is a Web application and its client code is loaded into Internet Explorer** as an applet (a servlet version is also available now). Note that hierarchical category definitions are shared by both the information-extraction and the runtime-server preprocessing components.

MEDLINE: a collection of biomedical documents. In their work, life-science researchers typically use MEDLINE,8 a bibliography database that covers the biomedical area. To understand our approach to the extraction of information from MEDLINE, some understanding of MEDLINE itself is required. MEDLINE is administered by the National Center for Biotechnology Information (NCBI)36 of the United States National Library of Medicine (NLM).37 It contains approximately 11 million biomedical citations, dating from the mid 1960s to the present. Citations in MEDLINE are collected from over 4600 biomedical journals published worldwide. Biomedical citations in MEDLINE are available to the general public at the PubMed Web site.12 Figure 3 shows an example of one such citation.

To make citation lookup easy, NLM indexes the articles in MEDLINE for retrieval and classification using the Medical Subject Headings (MeSH) thesaurus.10 MeSH headings consist of sets of descriptors in a hierarchical structure. Subheadings, or qualifiers, provide additional specificity for each descriptor. The major MeSH headings indicate the main contents of the article, and the minor MeSH headings are used to describe secondary topics. MeSH also contains other features such as check tags and age tags.

Each citation contains the article title, abstract, authors' names, MeSH headings, affiliations, publication date, journal name, and other information. The title and abstract are text strings that can be manipulated by natural-language-processing text-mining techniques. For example, the MEDLINE citation shown in Figure 3 contains the information shown in Table 1.

Figure 4 shows the result of retrieving the MeSH descriptor “Amino Acid Sequence” using the MeSH Browser.38 These results indicate that there are two nodes labeled “Amino Acid Sequence” in the descriptor thesaurus: one at G06.184.603.060 and the other at L01.453.245.667.060. These codes represent locations; for example, G06.184.603.060 is the child node of Molecular Structure (G06.184.603). The letter G at the beginning of the location represents the major category Biological Sciences, while L represents another major category, namely, Information Science.

Dictionary-based information extraction. The preprocessing stage thus extracts meta-data information from MEDLINE documents. In addition to information such as author and publication date, the system can extract information from other text fields, for example, title or abstract, using natural language processing. This information can include sets of keywords (e.g., protein names), predicate-argument binary relations (e.g., “activate-protein”), and ternary relations (e.g., subject-verb-object dependency triplets).

In the preprocessing phase of the MedTAKMI system, document titles and abstracts are parsed by CCAT, a shallow syntactic parser developed at the IBM Thomas J. Watson Research Center using an approach originally proposed by Charniak.39 Because this is a general-purpose parser that has not been trained for biomedical documents, it is difficult to obtain optimized results by parsing documents from the medical domain. We solve this problem by first annotating the text with domain dictionaries. The term dictionaries are constructed by users and employ resources such as UMLS** (Unified Medical Language System)40 or the users' own proprietary resources. The annotations facilitate the parsing of medical-domain text even when the parser has not been specifically trained for this domain. This annotation process is needed for two primary reasons:

1. Identification of term boundaries: Most technical terms in the medical domain, for example, protein names, are compound words. Thus, biomedical terms tend to consist of a combination of numerals, symbols, and verbs, making it very difficult to find term boundaries. For instance, the compound noun “repetitive sequence-based polymerase chain reaction” consists of an adjective (repetitive), a past participle of a verb (sequence-based) and three nouns (polymerase, chain, reaction). The annotations made by the technical term dictionary are based upon a part-of-speech (POS) analysis. Note that words, such as chain, which can be a noun or a verb, can further complicate this process.

2. Aggregation of synonymous expressions and spelling variations: There can be multiple expressions that are synonymous with a particular technical term. These can arise from abbreviations or acronyms as well as from spelling variations. If these variations are recognized as different entities, it can often cause problems for text mining. For instance, “DNA” and “deoxyribonucleic acid” are synonyms; thus, they should be counted as the same entity in the mining process. The dictionary contains spelling/abbreviation variants and their canonical forms. By reducing these variants to a single canonical form, we can treat them as the same entity.

Figure 3 PubMed Web site
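The two reasons above, finding term boundaries and collapsing variants to one canonical form, can be illustrated with a greedy longest-match dictionary lookup. The following Python sketch is an assumption about how such an annotator might work, not a description of the MedTAKMI implementation; the `CANONICAL` table and the `annotate` function are hypothetical.

```python
# Hypothetical variant -> canonical-form dictionary (illustrative entries only).
CANONICAL = {
    ("dna",): "DNA",
    ("deoxyribonucleic", "acid"): "DNA",
    ("deoxyribonucleic", "acids"): "DNA",
    ("repetitive", "sequence-based", "polymerase", "chain", "reaction"):
        "repetitive sequence-based polymerase chain reaction",
}
MAX_LEN = max(len(key) for key in CANONICAL)

def annotate(text):
    """Greedy longest-match lookup; returns (surface form, canonical form) pairs."""
    tokens = text.split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multiword terms keep their boundaries.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in CANONICAL:
                found.append((" ".join(tokens[i:i + n]), CANONICAL[key]))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return found

sentence = ("Repetitive sequence-based polymerase chain reaction "
            "effects deoxyribonucleic acids")
annotations = annotate(sentence)
```

Because the five-word term is matched as a unit, the ambiguous word “chain” never reaches the part-of-speech tagger on its own, and “deoxyribonucleic acids” is counted under the same canonical entry as “DNA”.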

After text is annotated with a technical term dictionary, it is parsed by a parser, or tagger. In this system, we use the CCAT parser to assign a POS to each word. This determination is based on the statistical distribution of candidate POSs for a word and the probability of POS transitions (from/to adjoining words) that are extracted from a training corpus. After the annotation and parsing processes, phrases are determined by using head-driven models,41,42 and category identifiers are assigned. The current MedTAKMI system maintains approximately 270,000 hierarchical category identifiers. Of these, roughly 170,000 are MeSH category identifiers; the rest are user-defined categories. MeSH terms are updated annually, and the revisions are incorporated into MedTAKMI.

Relation extraction. In general, a conventional text analysis system counts the frequency of occurrence of a given term and calculates its importance. The system can then find documents that contain the query terms or their synonyms. It may also display relationships among various terms using visualization tools. This would allow us to discover terms that are strongly related, but we still would not know how or in what way they are related. For example, we could find all documents containing the word “smoking.” But even though “smoking” may appear frequently within these documents, conventional text analysis systems are unable to tell us, for example, what the effects of smoking are, who is affected, and to what extent. We could next identify terms that are strongly related to the word “smoking,” but simple co-occurrence of two terms (e.g., “smoking” and “risk”) reveals very little about the semantic relationship between them. Conventional text analysis systems ignore hints that indicate whether a verb is negative or not and whether it is active or passive. Such systems are unable to distinguish between the meanings of the following two sentences: “Smoking increases the risk of lung cancer,” and, “The risk of lung cancer increases the consequences of smoking.” Both sentences contain the terms “smoking” and “risk”; simple co-occurrence detection is not sufficient to derive relationships.

Table 1 Meta-data for a MEDLINE citation

| Symbol | Meaning | Value |
|---|---|---|
| PMID | PubMed Identifier | 12060689 |
| TIS | ISSN for the journal | 1362-4962 |
| VI | Volume | 30 |
| DP | Published Date | 2002 Jun 15 |
| TI | Title | Dictionary-driven prokaryotic gene finding |
| AB | Abstract | Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects... (Snipped.) |
| AD | Affiliation | Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan. |
| AU | Author | Shibuya, Tetsuo |
| AU | Author | Rigoutsos, Isidore |
| MH | MeSH heading | Algorithms |
| MH | MeSH heading | Amino Acid Sequence |
| MH | MeSH heading | Base Sequence |
| MH | MeSH heading | Codon, Initiator |
| MH | MeSH heading | Computational Biology/*methods (* represents Major MeSH) |
| MH | MeSH heading | *Genes, Archaeal |
| TA | Journal Title Abbreviation | Nucleic Acids Res |

Figure 4 MeSH browser

The TAKMI system for CRM13 is able to extract “subject ... verb” or “verb ... object” relationships in a sentence. It can extract such concepts as “modem ... broken,” “file ... not found,” and “hard disk ... slow” from customer queries. These extractions were instrumental in facilitating problem detection and workload reduction for analysts at customer help centers. Although the binary “subject ... verb” or “verb ... object” relationships suffice for customer support purposes, a more detailed concept extraction is needed for biomedical articles. This is especially true because the “what subject influences what object” relationship is a very important one in the biomedical domain. For instance, it is clear that extraction of the relationship “protein_A ... activates ... protein_B ... with ... enzyme_C” provides more detailed information than either “protein_A ... activates ... protein_B” or “protein_B ... with ... enzyme_C.” Furthermore, we need to distinguish between “protein_A ... activates ... protein_B ... with ... enzyme_C” and “protein_B ... activates ... protein_A ... with ... enzyme_C.” In the MedTAKMI system, we extract these ternary relationships (as well as binary relationships) using natural-language-processing technology. These extracted relationships are used as keywords by various MedTAKMI mining functions, as described in the next section. Table 2 shows actual examples of ternary relationships extracted from MEDLINE documents.
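The point that direction matters can be made concrete with a small Python sketch. It assumes that a parser has already produced canonicalized (subject, verb, object) triples; the triples below are invented for illustration and are not output of the MedTAKMI parser.

```python
from collections import Counter

# Hypothetical parser output: (subject, verb, object) triples, already canonicalized.
triples = [
    ("protein_A", "activate", "protein_B"),
    ("protein_B", "activate", "protein_A"),
    ("protein_A", "activate", "protein_B"),
]

# Ternary relations keep the direction: A activates B differs from B activates A.
ternary_counts = Counter(triples)
# Binary "subject ... verb" relations drop the object.
binary_counts = Counter((s, v) for s, v, _ in triples)
# Plain co-occurrence of the two entities loses both the verb and the order.
cooccurrence = Counter(frozenset((s, o)) for s, _, o in triples)
```

Here co-occurrence reports a single pair seen three times, while the ternary counts separate two mentions of “protein_A ... activate ... protein_B” from one mention of the reverse relation. Aggregated counts of this kind are what a category viewer could display.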

Searching and mining functions

The MedTAKMI system is a text-mining system that can be integrated with search engines. It supports keyword-based search engines and can also be used with other search engine implementations such as IBM's GTR (Global Text Retrieval) and DB2* Net Search Extender. The mining process can be applied to the whole database or to a subset document collection obtained by a series of searches. Users can submit a query and receive a document collection in which each document contains the query keywords or their synonyms. Mining functions can then be applied to the collection in order to discover underlying information, such as protein-protein relationships. Alternatively, users can continue the search process by using the results of previous searching and mining operations. Interactivity is a very important MedTAKMI feature because it allows users to switch between searching and mining in a flexible manner.

MedTAKMI provides various mining functions for large document collections. Some of these functions are general, but others are tailored to the life-science domain. The following functions are introduced in this section:

● Keyword-based and full-text searching

● Hierarchical category viewer

● Chronological viewer

● Two-dimensional viewer (term-association)

● Trend analysis viewer

● Other analytical tools

Most of these viewers can function interactively, a unique feature of the MedTAKMI system. Experiences with text mining in the CRM domain13 have previously demonstrated the importance of interactivity in that application, and interactivity also proves to be an important requirement for users in the life-science domain. For example, in knowledge-discovery and data-mining (KDD) tasks, such as on-line analytical processing (OLAP), users may wish to use a fact discovered in a previous mining result to structure a new query from a different point of view. Indeed, a user often cannot define a complete query beforehand but rather must iteratively refine the query based on previous results. Furthermore, when a document collection is very large, a single query may not be able to sufficiently narrow the collection, requiring a user to submit a sequence of queries to obtain the desired subset of documents. MedTAKMI's viewers help users to navigate toward this goal interactively.

Table 2 Examples of ternary relationships extracted from MEDLINE

apoptosis ... induce ... p53

apoptosis ... inhibit ... tumor protein p53 (Li-Fraumeni syndrome)

tumor protein p53 ... play ... role

adhesion receptors ... activate ... extracellular signal-regulated kinase

Ang II ... activated ... p38

Keyword-based and full-text searching. The MedTAKMI system provides two types of searching: keyword and full text. A keyword index which associates keywords with category codes is built by the information extraction phase, as described above. Users can thus submit a query such as “search for documents that contain the word p53 as a gene name.” A disadvantage of keyword search, however, involves inaccuracies in keyword extraction. If a term consisting of multiple words is not recognized as a keyword in the indexing phase, then this term cannot be matched later in a query submitted by a user. Thus, in MedTAKMI, users can also choose a full-text search which uses different indexes built separately to retrieve documents in which the term appears literally. The MedTAKMI system allows users to switch between these two search engines seamlessly.
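The complementary roles of the two search types can be sketched as follows. This Python example is a simplified assumption: the functions `keyword_search` and `full_text_search`, the in-memory index, and the automatic fallback are hypothetical (in MedTAKMI the user chooses the search type, and full-text search uses separately built indexes rather than a linear scan).

```python
def keyword_search(keyword_index, category, term):
    """Look up a term within a category in a prebuilt keyword index."""
    return set(keyword_index.get((category, term.lower()), set()))

def full_text_search(docs, phrase):
    """Return ids of documents in which the phrase appears literally."""
    needle = phrase.lower()
    return {doc_id for doc_id, text in docs.items() if needle in text.lower()}

docs = {
    1: "p53 induces apoptosis.",
    2: "Ang II activated p38 in a novel multiword factor assay.",
}
# Hypothetical index in which the multiword term was never recognized as a keyword.
keyword_index = {("gene", "p53"): {1}}

hits = keyword_search(keyword_index, "gene", "novel multiword factor")
if not hits:
    # The term is missing from the keyword index, so match it literally instead.
    hits = full_text_search(docs, "novel multiword factor")
```

The keyword lookup is precise but only as good as the extraction step; the literal match recovers terms the dictionary missed at the cost of losing the category constraint.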

Hierarchical category viewer. The hierarchical category viewer shows keyword distribution in a data collection over a predefined hierarchy. For example, Figure 5 shows the distribution of keywords for the “Disease” node and its child nodes in the MeSH hierarchy.

Blue bars at the bottom of the viewer show the frequency of occurrence of a keyword. In Figure 5, there are 35 documents that contain the term “neoplasms” or any of the terms in its child nodes. (Note that for efficiency reasons this frequency was calculated for a subset of documents sampled from the full document collection. The cardinality of such a subset can be specified by the user.)

Red bars indicate relative frequency. This measure compares the current document subcollection to the initial document collection indexed in the MedTAKMI system. In other words, searching is a process that narrows down a document collection. Assume D is the initial document collection. A keyword search due to query q1 returns D1, the subset of D that satisfies q1. Similarly, let (q1, q2, ..., qn) be a sequence of successive queries and let (D1, D2, ..., Dn) be document collections such that Di is the result of query qi, also made on collection D. The relative frequency for a keyword w in the document collection Di is calculated using the following formula:

\[ \mathrm{relfreq}(w, D_i) = \frac{c(w, D_i)}{|D_i| \, c(w, D)} \]

where |Di| is the number of documents in the document collection Di and c(w, Di) is the number of documents that contain the word w in the collection Di.
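A minimal Python sketch of this measure follows. It rests on one labeled assumption: the value is computed as a ratio of proportions, that is, the share of documents containing w in the current subcollection divided by the corresponding share in the initial collection, so that a result of 2.0 reads as “twice as often.” This additional normalization by the size of the initial collection, the function names, and the toy data are all illustrative choices, not details confirmed by the text.

```python
def doc_count(word, collection):
    """c(w, D): number of documents in the collection that contain the word."""
    return sum(1 for keywords in collection if word in keywords)

def relfreq(word, subset, initial):
    """Share of the subset containing the word, relative to the initial collection.

    Assumption: both shares are normalized by collection size. The word must
    occur at least once in the initial collection, otherwise this divides by zero.
    """
    share_subset = doc_count(word, subset) / len(subset)
    share_initial = doc_count(word, initial) / len(initial)
    return share_subset / share_initial

# Toy collections: each document is represented by its set of keywords.
initial = [{"leukemia", "AML1"}, {"AML1"}, {"asthma"}, {"asthma"},
           {"leukemia"}, {"diabetes"}, {"diabetes"}, {"asthma"}]
subset = [doc for doc in initial if "AML1" in doc]  # result of the query "AML1"
score = relfreq("leukemia", subset, initial)
```

In this toy data “leukemia” appears in half of the two-document subset but in only a quarter of the initial collection, so the score is 2.0. A value well above 1 marks a keyword that is characteristic of the current subcollection.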

Two types of hierarchies are registered with the MedTAKMI system. The flat type has no children; for example, in Figure 5, “Compound,” “Amino Acid,” and “Organ” are flat. The tree type is marked with an expansion symbol; thus “Dry Lab Methods” and “MeSH Minor” in Figure 5 are tree type. Users can define the hierarchy set using public hierarchies such as MeSH and Gene Ontology3 as well as user-specific or company-specific hierarchies. We have developed a hierarchy that currently consists of 95,000 nodes for the various concepts in the life-science domain.

Chronological viewer. This viewer allows a user to discover trends by viewing the chronological distribution of a set of documents. Figure 6 shows the monthly distribution of approximately 330,000 documents selected from MEDLINE in 2002. MedTAKMI supports yearly, monthly, and daily distributions. Using this viewer, one can determine when a certain keyword began to appear in the biomedical literature and how its frequency of occurrence changed with time. For example, the term HIV does not appear in MEDLINE documents in the 1970s, but by the 1990s it occurs with high frequency.

Two-dimensional maps (term-association). The two-dimensional viewer allows a user to visualize the strength of association between keywords. Figure 7 shows protein-protein associations in a mouse document collection. The value in each cell represents the strength of association of two keywords: the higher the value, the stronger the association. For example, the proteins “Bcl2-associated X protein” and “G elongation factor” have a strong association. The numbers “19 (36.54%)” in the cell mean that there were 19 documents which mentioned both Bcl2-associated X protein and G elongation factor, and that these 19 documents comprise 36.54 percent of the documents mentioning Bcl2-associated X protein. Because 36.54 percent is much higher than the other percentages, the cell is automatically highlighted.

Figure 5 Hierarchical category view

Formally, the value for each cell v(wxi, wyj, D) is calculated by using the following formula:

\[ p(w_{xi}, w_{yj}, D) = \frac{c(w_{xi} \cap w_{yj}, D)}{c(w_{yj}, D)} \]

\[ v(w_{xi}, w_{yj}, D) = \frac{p(w_{xi}, w_{yj}, D)}{\frac{1}{M} \sum_{k=1}^{M} p(w_{xk}, w_{yj}, D)} \]

where c(w, D) is the number of documents that contain the word w in the collection D, and c(wxi ∩ wyj, D) is the number of documents that contain both words. The words wxi and wyj belong to the categories of the x and y axes, respectively. M and N are the sizes of the matrix along the x and y axes. Clicking a cell leads to further searching using the two keywords.
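The cell values can be computed directly from per-document keyword sets, as in the Python sketch below. It assumes that the normalizing average is taken over the M words on the x axis for a fixed y word; that reading, the function names, and the toy documents are assumptions for illustration.

```python
def count(docs, *words):
    """Number of documents that contain all of the given words."""
    return sum(1 for keywords in docs if all(w in keywords for w in words))

def association_matrix(docs, x_words, y_words):
    """Return (p, v), where v[i][j] is p[i][j] divided by the mean of p over x for y_j.

    Assumption: every y word occurs in at least one document, so c(w_yj, D) > 0.
    """
    m = len(x_words)
    p = [[count(docs, x, y) / count(docs, y) for y in y_words] for x in x_words]
    v = [[0.0] * len(y_words) for _ in x_words]
    for j in range(len(y_words)):
        column_mean = sum(p[i][j] for i in range(m)) / m
        for i in range(m):
            v[i][j] = p[i][j] / column_mean if column_mean else 0.0
    return p, v

# Toy collection: each document is the set of keywords it mentions.
docs = [{"A", "X"}, {"A", "X"}, {"A", "Y"}, {"B", "X"}, {"B", "Y"}, {"B", "Y"}]
p, v = association_matrix(docs, x_words=["A", "B"], y_words=["X", "Y"])
```

In this toy data, two of the three documents mentioning X also mention A, so p for the (A, X) cell is 2/3. Dividing by the column mean of 1/2 gives a value above 1, which is the kind of cell a viewer would highlight.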

Users are not limited to protein-protein interactions. Associations for protein-gene, gene-disease, disease-organ, and so forth, are also possible. In fact, any combination of hierarchy terms is possible.

Other analytical tools. The viewers described thus far are interactive, allowing for a flexible mining process. However, some tools require a much longer analysis time to produce a response. These tools are available to the user as off-line tools in the MedTAKMI system. During the mining process, intermediate results can be stored and used as input to an off-line tool. The final output of the off-line tool is shown independently of the runtime MedTAKMI process. This section introduces two off-line functions provided by MedTAKMI.

Similar entry search tool. The first function is one that permits searching of similar biological entities. This function uses the term-association table data described in the previous section. Each row of the table is represented by a vector whose attributes correspond to the keywords in the columns and whose elements represent the joint occurrence probabilities of the two corresponding entities. The degree of similarity between row entities is calculated based on the distance between their corresponding vectors using a cosine measure. Principal component analysis is used to reduce dimensions and improve precision because the number of columns tends to be large. Figure 8 shows the ranking result of signaling proteins similar to protein 14_3_3.
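The ranking step can be sketched with a plain cosine measure over row vectors. This Python example omits the principal component analysis mentioned above and uses invented protein names and probabilities; `cosine` and `rank_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_similar(rows, target):
    """Rank all other rows by decreasing cosine similarity to the target row."""
    others = [(name, cosine(rows[target], vec))
              for name, vec in rows.items() if name != target]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

# Hypothetical rows of a term-association table: joint occurrence probabilities
# of each protein (row) with three column keywords.
rows = {
    "protein_1": [0.5, 0.1, 0.0],
    "protein_2": [0.4, 0.2, 0.0],
    "protein_3": [0.0, 0.0, 0.9],
}
ranking = rank_similar(rows, "protein_1")
```

Rows that co-occur with the same column keywords in similar proportions rank first, regardless of how frequent each entity is overall, because the cosine measure depends only on the direction of the vectors.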

Subset analysis tool. The second function allows the category view data or term association view data of two different queries to be compared and analyzed to identify significant differences. Consider two data sets A and B. The occurrence probabilities of entries that appear in both data set A and data set B are compared. Entries are first organized into the following three categories based on the AIC (Akaike Information Criterion) approach to statistical model selection.43

● Entries frequently appearing in data set A

● Entries frequently appearing in data set B

● Entries common to both data set A and data set B

Figure 6 Chronological view

For each category, entries are ranked by a score based on a statistical test. We use the following statistical test:

\[ H_0 : p_A = p_B \]

where pA and pB are the probabilities of occurrence of the entry in data set A and data set B, respectively. The chi-square statistic is calculated from the occurrence data. Under the null hypothesis H0, this statistic has a chi-square distribution with one degree of freedom, \( \chi^2_1 \). We define the following ranking score,

\[ \text{Ranking score} = -\log_{10} p \]

using the p-value calculated from the observed data.
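The ranking score can be reproduced with the Python standard library alone. The sketch below assumes a Pearson chi-square statistic on a 2x2 table (entry present or absent, in data set A or B) without continuity correction; the text does not say which variant of the statistic is used, so this choice and the function names are assumptions. For one degree of freedom, the upper tail probability of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math

def chi_square_2x2(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square statistic (no continuity correction) for H0: pA = pB.

    Assumption: the entry occurs in at least one document and is absent from
    at least one, so no expected cell count is zero.
    """
    table = [[hits_a, size_a - hits_a], [hits_b, size_b - hits_b]]
    total = size_a + size_b
    row_sums = [size_a, size_b]
    col_sums = [hits_a + hits_b, total - hits_a - hits_b]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_sums[r] * col_sums[c] / total
            stat += (table[r][c] - expected) ** 2 / expected
    return stat

def ranking_score(hits_a, size_a, hits_b, size_b):
    """-log10 of the p-value under a chi-square distribution with 1 degree of freedom."""
    stat = chi_square_2x2(hits_a, size_a, hits_b, size_b)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail for 1 d.o.f.
    return -math.log10(p_value)

# Toy counts: the entry occurs in 30 of 100 documents in A and 10 of 100 in B.
stat = chi_square_2x2(30, 100, 10, 100)
score = ranking_score(30, 100, 10, 100)
```

A larger score means stronger evidence that the entry occurs at different rates in the two data sets. For extremely large statistics the p-value can underflow to zero, so a production version would need to guard the logarithm.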

Figure 9 shows the result of subset analysis by this function using disease-experiment two-dimensional-map data extracted by the MedTAKMI system. Documents are selected using the term “Adult” in the age category in data set A and “Child” in data set B. Figure 9A shows the common entries in both A and B. Figure 9B shows the entries which appear frequently in data set A, and Figure 9C shows those appearing frequently in data set B. In Figures 9B and 9C, entries are ranked by the statistics-based score described earlier.

Figure 9 Result of subset analysis tool: (A) common entries; (B) entries frequently appearing in data set A; (C) entries frequently appearing in data set B

Usage scenario

The MedTAKMI system can be used to aid in drug discovery and clinical information retrieval. For example, defects in the AML1 gene are thought to cause acute leukemia. Suppose we are interested in developing a treatment for leukemia and are looking for potential drug targets. We describe below the application of the MedTAKMI system to this problem. For this example a subset of approximately 330,000 of the latest abstracts from the 2003 MEDLINE distribution was used. A keyword search reveals that 54 papers in the sample set mention AML1. These are depicted in Figure 10. Each line shows the publication year, PubMed ID, and title of the paper.

A category view (Figure 10B) based on the NCBI LocusLink44 phenotype information indicates that leukemia is indeed the term most frequently associated with AML1, appearing in 17 of the 54 papers. Furthermore, the relative frequency of the term leukemia indicates that it appears 54.41 times more often in this subset of 54 papers than it does in the entire sample database.

If we sort the entries in Figure 10B in decreasing order of relative frequency (Figure 10C), we see that myelodysplasia syndrome 1 (a protein that appears to be amplified in human prostate cancer)45,46 and 3q21q26 syndrome (a syndrome associated with overexpression of the Evi-1 gene, an event in turn associated with myeloid leukemia) are the phenotypes that are strongly associated with AML1. Similarly, we can discover that various fusion genes and proteins such as AML1-MTG16, MTG8, AML1-ETO, and TEL-AML1 are also strongly associated with AML1 when we consider the category view for prime names of substances (Figure 10D).

Now, consider a search using the keyword leukemia, which selects 1051 papers from the entire sample database. We have already established that there appears to be a close association between AML1 and leukemia; if we can find another protein that is closely associated with leukemia, that protein might in turn serve as a potential target for drug design. A two-dimensional (2D) map analysis between LocusLink phenotypes and signaling proteins in this subcollection of 1051 papers shows very interesting results (Figure 11).

We find that some signaling proteins such as STYKc and TyrKc are commonly referred to in papers about leukemia, HMG-CoA lyase deficiency, hepatic lipase deficiency, and Miller-Dieker syndrome. Similarly, SAM is associated with the same phenotypes. (In fact, fusion of the SAM domain of the TEL oncogene to AML1 has been shown.45,46) In addition, HATPase_c appears to be associated with all the phenotypes. ITAM, on the other hand, might reveal a potentially interesting relationship between leukemia and other phenotypes such as osteosarcoma. We can now click on a cell in the 2D map to access the corresponding papers if we are interested in pursuing a particular association. These results could then be explored as possible leads for drug development. Thus, the MedTAKMI system provides a contextual overview for identifying potential targets for drug design and facilitates the navigation of relevant papers.
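A 2D map of this kind can be viewed as a co-occurrence table between two categories, with each cell keeping the documents that support it. The following sketch assumes each document has already been annotated with a set of phenotype terms and a set of signaling-protein terms; the record layout, identifiers, and function name are hypothetical and do not reflect MedTAKMI's internal index.

```python
from collections import defaultdict


def build_2d_map(docs, row_key, col_key):
    """Map each (row term, column term) pair to the IDs of documents containing both.

    docs: iterable of dicts with an "id" plus one collection of terms per category.
    """
    cells = defaultdict(list)
    for doc in docs:
        for row_term in set(doc.get(row_key, ())):
            for col_term in set(doc.get(col_key, ())):
                cells[(row_term, col_term)].append(doc["id"])
    return dict(cells)


# Toy annotations (invented IDs and assignments, for illustration only).
docs = [
    {"id": "doc1", "phenotype": ["leukemia"], "protein": ["TyrKc", "SAM"]},
    {"id": "doc2", "phenotype": ["leukemia", "osteosarcoma"], "protein": ["ITAM"]},
    {"id": "doc3", "phenotype": ["leukemia"], "protein": ["TyrKc"]},
]
cells = build_2d_map(docs, "phenotype", "protein")

# "Clicking" a cell corresponds to looking up its supporting documents.
leukemia_tyrkc = cells[("leukemia", "TyrKc")]
cell_counts = {pair: len(ids) for pair, ids in cells.items()}
```

Storing document identifiers rather than bare counts in each cell is what makes the drill-down from a cell to its papers a simple lookup.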

Conclusions and future work

There are many useful applications for text mining in the life-science industry, particularly because of the vast amount of technical data and the myriad relationships contained therein that are waiting to be inferred, identified, and collated. There are also many textual databases other than MEDLINE to be explored. For example, laboratory notebooks, internal reports, and patent documents are all very important resources that contain valuable information about new gene, protein, and drug discoveries.

The next stages of the MedTAKMI system evolution include: (1) expanding the range of document sources to include these other databases, (2) developing a layer that allows integration of the MedTAKMI system with external and internal tools that are currently in daily use, and (3) improving the performance of the resulting system by developing a distributed version of MedTAKMI running on a grid architecture.47,48 Because both the information-extraction process and the search/mining process handle several files independently, we believe that the file I/O and calculation costs can be shared across multiple processing units. If MedTAKMI can be extended to run on a grid architecture, the performance of both processes will be improved.
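The independence of per-file work can be illustrated with a small map-and-merge sketch: each shard of documents is counted separately and the partial results are combined afterwards. Threads on one machine stand in for grid nodes here, and the shard layout and function names are assumptions made for illustration, not a description of the planned distributed MedTAKMI.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_terms(shard):
    """Count, for one shard, how many documents contain each term."""
    counts = Counter()
    for doc_terms in shard:
        counts.update(set(doc_terms))  # document frequency, not raw occurrences
    return counts


def count_terms_parallel(shards, max_workers=4):
    """Process shards independently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_terms, shards):
            total.update(partial)
    return total


# Two toy shards, each a list of per-document term lists (invented data).
shards = [
    [["AML1", "leukemia"], ["leukemia"]],
    [["AML1", "AML1", "TEL-AML1"]],
]
merged = count_terms_parallel(shards)
```

Because each shard is reduced to a small table of counts before merging, only those tables would need to cross node boundaries in a distributed setting.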

Compliance with the Unstructured Information Management Architecture (UIMA) proposed by a research community in IBM is also a major challenge.49 The UIMA is a common framework for text analytics tools that are defined as text analysis engines (TAEs). We have already redefined our information extractors as TAEs. We are also working on developing text-mining middleware for life-science applications.50 In this project, the MedTAKMI system runs as a part of UIMA-based middleware.

In summary, this paper has described the MedTAKMI text-mining system, a set of tools for knowledge discovery derived from millions of biomedical documents. The MedTAKMI system is able to parse 11 million MEDLINE citations with a syntactic shallow parser and extract relationships from these parsed entities with hierarchical category identifiers. It also provides a toolkit of interactive viewers to aid in the discovery of underlying knowledge in the literature of life science.

[Figure: two-dimensional map of phenotypes; row labels include HMG-CoA lyase deficiency, Hepatic lipase deficiency, Miller-Dieker lissencephaly syndrome, Colorectal cancer, Lupus erythematosus, and Li-Fraumeni syndrome]


We would like to thank Celestar Lexico-Sciences, Inc. for collaborative work during the development and deployment of MedTAKMI.

