+ All Categories
Home > Documents > Vocabularies and Retrieval Tools in Biomedicine - ResearchGate

Vocabularies and Retrieval Tools in Biomedicine - ResearchGate

Date post: 09-Feb-2022
Category:
Upload: others
View: 3 times
Download: 0 times
Share this document with a friend
17
ORIGINAL PAPER Vocabularies and Retrieval Tools in Biomedicine: Disentangling the Terminological Knot Klaar Vanopstal & Robert Vander Stichele & Godelieve Laureys & Joost Buysschaert Received: 21 October 2008 / Accepted: 6 October 2009 # Springer Science + Business Media, LLC 2009 Abstract Terms like thesaurus, taxonomy, classifi- cation, glossary, ontologyand controlled vocab- ularycan be used in diverse contexts, causing confusion and vagueness about their denotation. Is a thesaurus a tool to enrich a writer s style or an indexing tool used in bibliographic retrieval? Or can it be both? A literature study was to clear the confusion, but rather than giving us consensus definitions, it provided us with conflicting descriptions. We classified these defi- nitions into three domains: linguistics, knowledge management and bibliographic retrieval. The scope of the terms is therefore highly dependent on the context. We propose one definition per term, per context. In addition to this intra-conceptual confusion, there is also inter-conceptual vagueness. This leads to the introduc- tion of misnomers, like ontologyin the Gene Ontology . We examined some important (bio)medical systems for their compatibility with the definitions proposed in the first part of this paper. To conclude, an overview of these systems and their classification into the three domains is given. Keywords Information retrieval . Medical terminology . Medical coding systems . Taxonomy . Thesaurus . Ontology . Controlled vocabulary . Classification Introduction Terms such as thesaurus, taxonomy, ontology and con- trolled vocabulary, and even glossary, dictionary and lexicon at first sight seem to be unambiguous terms. However, they are used in different ways in different contexts, causing continual confusion. Moreover, the distinction between the terms themselves is not always straightforward. A look at the information about the term deathin three different thesauri (see Table 1), tells us that not all thesauri give the same kind of information: The Unesco Thesaurus [1] includes information such as narrower terms (NT), broader terms (BT), related terms (RT), other language equivalents (SP, FR) and related terms (RT). In Rogets Thesaurus [2], by contrast, other information is given: function, derivations, and related terms. The ICPC2- ICD10 thesaurus, a system used for medical classification which links concepts of ICPC2 to ICD-10 concepts, gives the classification codes R99 (ICD-10) and A96 (ICPC) for death. It is clear that these thesauri differ considerably in their structure and scope. Does this mean that for one of them, the denomination thesaurusis notor lessapt? K. Vanopstal (*) : J. Buysschaert Faculty of Translation Studies, University College Ghent, Ghent, Belgium e-mail: [email protected] J. Buysschaert e-mail: [email protected] K. Vanopstal Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium R. Vander Stichele Faculty of Medicine and Health Sciences, Heymans Institute of Pharmacology, Ghent University, Ghent, Belgium e-mail: [email protected] G. Laureys Faculty of Arts and Philosophy, Department of Nordic Studies, Ghent University, Ghent, Belgium e-mail: [email protected] J Med Syst DOI 10.1007/s10916-009-9389-z
Transcript

ORIGINAL PAPER

Vocabularies and Retrieval Tools in Biomedicine:Disentangling the Terminological Knot

Klaar Vanopstal & Robert Vander Stichele &

Godelieve Laureys & Joost Buysschaert

Received: 21 October 2008 /Accepted: 6 October 2009# Springer Science + Business Media, LLC 2009

Abstract Terms like “thesaurus”, “taxonomy”, “classifi-cation”, “glossary”, “ontology” and “controlled vocab-ulary” can be used in diverse contexts, causingconfusion and vagueness about their denotation. Is athesaurus a tool to enrich a writer’s style or an indexingtool used in bibliographic retrieval? Or can it be both?A literature study was to clear the confusion, but ratherthan giving us consensus definitions, it provided uswith conflicting descriptions. We classified these defi-nitions into three domains: linguistics, knowledgemanagement and bibliographic retrieval. The scope ofthe terms is therefore highly dependent on the context.We propose one definition per term, per context. Inaddition to this intra-conceptual confusion, there is alsointer-conceptual vagueness. This leads to the introduc-

tion of misnomers, like “ontology” in the GeneOntology. We examined some important (bio)medicalsystems for their compatibility with the definitionsproposed in the first part of this paper. To conclude, anoverview of these systems and their classification into thethree domains is given.

Keywords Information retrieval .Medical terminology .

Medical coding systems . Taxonomy . Thesaurus . Ontology .

Controlled vocabulary . Classification

Introduction

Terms such as thesaurus, taxonomy, ontology and con-trolled vocabulary, and even glossary, dictionary andlexicon at first sight seem to be unambiguous terms.However, they are used in different ways in differentcontexts, causing continual confusion. Moreover, thedistinction between the terms themselves is not alwaysstraightforward.

A look at the information about the term ‘death’ in threedifferent thesauri (see Table 1), tells us that not all thesaurigive the same kind of information:

The Unesco Thesaurus [1] includes information such asnarrower terms (NT), broader terms (BT), related terms (RT),other language equivalents (SP, FR) and related terms (RT).In Roget’s Thesaurus [2], by contrast, other information isgiven: function, derivations, and related terms. The ICPC2-ICD10 thesaurus, a system used for medical classificationwhich links concepts of ICPC2 to ICD-10 concepts, givesthe classification codes R99 (ICD-10) and A96 (ICPC) for‘death’. It is clear that these thesauri differ considerably intheir structure and scope. Does this mean that for one ofthem, the denomination “thesaurus” is not—or less—apt?

K. Vanopstal (*) : J. BuysschaertFaculty of Translation Studies, University College Ghent,Ghent, Belgiume-mail: [email protected]

J. Buysschaerte-mail: [email protected]

K. VanopstalDepartment of Applied Mathematics and Computer Science,Ghent University,Ghent, Belgium

R. Vander SticheleFaculty of Medicine and Health Sciences,Heymans Institute of Pharmacology, Ghent University,Ghent, Belgiume-mail: [email protected]

G. LaureysFaculty of Arts and Philosophy, Department of Nordic Studies,Ghent University,Ghent, Belgiume-mail: [email protected]

J Med SystDOI 10.1007/s10916-009-9389-z

The main problem is that the terms taxonomy, classifi-cation, thesaurus, ontology and controlled vocabulary areused in many different contexts, including linguistics,bibliographic information retrieval (IR) and knowledgemanagement, including medical coding. As Kagolovskyand Moehr [3] point out, “information retrieval” has nocommon definition, due to the different research back-grounds of the authors who use the term. Kagolovsky andMoehr, propose the following definition, citing Harter andHert [4]: a system that “retrieves documents, or referencesto them, rather than data”. This definition corresponds towhat we will call in this paper bibliographic retrieval.Medical registration systems, on the other hand, areestablished in the first place to represent and storeinformation—rather than documents—and in the secondplace to later retrieve and re-use that information.

The first section of this paper gives an overview of thedifferent fields in which the terms “glossary”, “lexicon”,“dictionary”, “taxonomy”, “classification”, “thesaurus”,“ontology” and “controlled vocabulary” can be used. Onthe basis of these observations, definitions will besuggested and recommendations made for a more consis-tent and unambiguous use of the relevant terminology.In the second section, these insights will be applied tothe biomedical domain, where these issues are partic-ularly relevant. To conclude, an overview of the existingtools in the three dimensions (linguistics, knowledgemanagement—including medical coding—and bibliographicretrieval) is presented.

Domains of application of the terms

As mentioned above, terms such as taxonomy, thesaurus,ontology, controlled vocabulary etc. can be defined invarious ways depending on the domain of application. Wewill discuss three domains, namely linguistics, knowledgemanagement—including medical coding systems—andbibliographic retrieval.

There are several linguistic tools which can help to findthe right terms, or to find an explanation or definition for acertain term, viz. dictionaries, lexicons, glossaries, thesauriand controlled vocabularies. These systems (can) have apurely linguistic function. However, thesauri and controlledvocabularies can also be used for the retrieval of documentsor data.

A second domain which will be discussed here, isthat of the storage and retrieval of knowledge. Weespecially focus on medical coding systems, such asICPC and ICD. Medical coding systems can bedescribed as classifications or nomenclatures of health-and medicine-related phenomena. These concepts arestructured and usually given a code which indicates theplace of the concept in the nomenclature, as can be seenin Fig. 1.

Bibliographic retrieval can be defined as the scienceof searching a database for journal or magazine articles,containing citations, abstracts and often full texts orlinks to the full texts. The underlying structures tosearch for articles in databases include taxonomies,

Table 1 The word “death” in several thesauri

Unesco thesaurus Roget’s II: the new thesaurus ICPC2-ICD10 thesaurus

Death [93] Death Death

Terme français: Mort See also 2 (non-existence); 62 (end); 32 (killing);325 (burial).

ICD10: R99 Other ill-defined andunspecified causes of mortality

Término español: Muerte n. death, mortality, fatality, casualty, losses, death toll;extinction, decease, departure, exit, demise, release;natural death, accidental death, cot death, stillbirth,miscarriage, brain death, abortion; unnatural death, [...]

ICPC: A96 Death

Русский термин: Смерть adj. dying, moribund, half-dead, not long for this world,done for, slipping away, in extremis; dead, [...]

MT 2.70 Biology vb. Die, perish, expire, pass over/away, fall asleep,give up the ghost, depart this life, croak (colloq.),peg out (colloq.), pop one’s clogs (colloq.), [...]

UF Causes of death

BT Life cycle [77]

RT Ageing [88]

RT Birth rate [91]

RT Euthanasia [24]

RT Homicide [24]

RT Mortality [242]

RT Suicide [29]

J Med Syst

thesauri, ontologies, controlled vocabularies and topicmaps.

Linguistic tools

Glossaries, dictionaries and lexicons

The term ‘glossary’ originates from the Latin wordglossarium, a collection of glosses. ‘Gloss’, in its turn,originates from the Greek word glossa (γλωσσα) whichdenotes the explanation of a specialized expression ordifficult word. Hence, ‘glossary’ can be defined as a list ofterms in a particular field of knowledge, with definitions orexplanations.

Glossaries are usually arranged alphabetically. Theterms in monolingual glossaries usually refer to LSP(Language for Specific Purposes) and are furnished withdefinitions. These definitions generally apply to onedomain only, and thus rarely include variant meanings.In practice, however, these definitions are often omittedin multilingual glossaries.

Glossaries can be integrated into a book or a website, butthey can also be stand-alone lists. They can be used as, butare not, per se, controlled vocabularies (see infra). Theycan be monolingual (e.g. Wikipedia’s Glossary of medicalterms related to communications disorders1 or the DutchRIZIV glossary2), bilingual (e.g. the TERMISTI glossariesof abortion3 and autism4 terms) or multilingual (e.g.

Multilingual Glossary of Technical and Popular MedicalTerms in Nine European Languages5). Definitions are oftenomitted in bilingual and multilingual glossaries.

The term glossary is used interchangeably with lexiconand dictionary. This presumed equivalence, however, leadsto a blurring of the conceptual boundaries of the terms.Ananiadou [5] defines ‘lexicon’ as a list containing “thelexical elements (either as full forms or as canonical baseforms), together with additional linguistic informationabout them, which is required for further morphological,syntactic, and semantic processing.” She adds that lexiconsare not fully standardized, which allows its makers tomodel them so that they best suit their own purposes. Wewill adopt Ananiadou’s definition.

Dictionaries, both monolingual and multilingual, canrefer to general language or to a specialized terminology.They often give limited morphological and grammaticalinformation (e.g. gender, part of speech, plural form) andsometimes also a phonetic transcription, next to a defini-tion. Bi- and multilingual general language dictionariesusually provide a translation—or several translations usedin different contexts—, collocations and idiomatic expres-sions. Conversely, specialized multilingual dictionariesusually offer translations, and very little further informa-tion. An example from the Wörterbuch für Industrie undTechnik (French–English/English–French) [6]:

Reprofilage n:m: Neuprofilierung n:f : Bbatiments et travaux publics6

6

Summarizing, it can be said that the boundaries betweenthe terms glossary, lexicon and dictionary have blurred tosome extent. However, we define ‘glossary’ as “a list ofwords or terms with their explanations”, ‘lexicon’ as “a listof words or terms, together with linguistic informationabout them” and ‘dictionary’ as “a list of words or termswith limited linguistic information, usually a definition,and, in the case of bi- or multilingual dictionaries, one ormore translations”.

Thesauri

The word ‘thesaurus’ is derived form the ancient Greek‘thesauros’ (θησαυρός), or ‘treasure’. In the sixteenthcentury, its meaning was narrowed to ‘treasure of words’,like a dictionary or an encyclopaedia. The word ‘thesaurus’fell into disuse for some time, but revived with the releaseof Roget’s Thesaurus of English Words and Phrases in thenineteenth century. Roget adopted an onomasiologicalapproach—providing the word for a given idea—in his

1 http://en.wikipedia.org/wiki/Glossary_of_medical_terms_related_to_communications_disorders2 http://www.riziv.fgov.be/nl/glossary.htm3 http://www.termisti.refer.org/data/ivg/index.htm4 http://www.termisti.refer.org/data/autisme/frame.html

Fig. 1 Extract of the ICD10 classification: “diseases of appendix”

6 The first column refers to the French term, the second to the Germantranslation and the third column refers to the corresponding domain.

5 http://users.ugent.be/∼rvdstich/eugloss/welcome.html

J Med Syst

thesaurus, whereas most dictionaries were, and still are,characterised by a semasiological approach, i.e. theydescribe the referential meaning denoted by words. Rogetdid not organize his thesaurus alphabetically, but systemat-ically, i.e. according to ideas or concepts.

The purpose of an ordinary dictionary is simply toexplain the meaning of words; and the problem ofwhich it professes to furnish the solution may bestated thus:—The word being given, to find itssignification, or the idea it is intended to convey.The object aimed at in the present undertaking isexactly the converse of this: namely,—The idea beinggiven, to find the word, or words, by which that ideamay be most fitly and aptly expressed [7].

A thesaurus can thus be a purely linguistic tool, whichprovides a standard language of a particular field ofknowledge and contains information about nuances ofconcepts. This type of thesaurus is referred to by Kilgarriffand Yallop [8] as the ‘Roget-style thesaurus’. Its objectiveis to improve the effectiveness of communication: therelationships outlined in the thesaurus help to fine-tunestyle or to obviate misunderstandings.

Later, in the mid-twentieth century, the term experiencedanother shift in meaning, adopting the information retrievalaspect (see infra).

Controlled vocabulary

A controlled vocabulary is a set of terms which provides astandard language for a specific domain. It consists of twotypes of terms: preferred terms, which are designed tocontrol a domain-specific language, and non-preferredterms used as “access vocabulary”, “lead-in” or “entry”terms. The use of preferred and non-preferred terms isillustrated by Wodtke [9]:

In our restaurant we had the preferred term, “firstcourse”, and all the terms our patron might use,“starter, first course, hors d’oeuvres, appetizer”,neatly tucked into our head. So if a patron wantedan appetizer of smoked salmon, we would write in thecheck “first course: smoked salmon”.

A controlled vocabulary can be used as a prescriptiveterminology, as a means to ensure language hygiene and/orconsistency in the use of terminology. The Plain EnglishCampaign7 is an independent British organization whichhelps businesses, local governments and governmentdepartments to improve their communication by providingediting services, training courses and glossaries. They also

published a controlled vocabulary, The A to Z of alternativewords, which is a list of words with their simpleralternatives designed for writers of all text types to ensurereadability.

Knowledge management and medical coding

Taxonomies and classifications

A literature search for the term taxonomy proves thatGarshol [10] is right in saying that the term has been “usedand abused to the point that when something is referred toas a taxonomy it can be just about anything” and that thebasic denominator is that of an “abstract [hierarchical]structure”.

Taxonomy is derived from the Greek words taxis (τάξις),‘order’ and nomos (νόμος), ‘rules, law’ and is oftendescribed in a very broad sense, as “the science ofclassification of organisms” [11]. However, the termtaxonomy can also be defined in terms of its structuralcharacteristics: “a taxonomy provides a classificationstructure that adds the power of inheritance of meaningfrom generalized taxa to specialized taxa” [12]. Thisinheritance implies that subclasses take over characteristicsof their ancestor classes. Agro [13] and Beck [14] also usethe term in the sense of a hierarchical structure whichrepresents (a part of) reality. Dictionaries such as OxfordEnglish Dictionary and Merriam-Webster and other refer-ence works such as WordNet and Roget’s Thesaurusdifferentiate between the two meanings, i.e. taxonomy asa science and taxonomy as a hierarchical representation ofreality. Sterkenburg [15] combines both meanings in hisdefinition: “study of the theory, practice and rules ofclassification of terms, objects and concepts”.

The term taxonomy originated in biology, where itreferred to the classification of the names of organisms. Itwas the Swedish scholar Carolus Linnaeus who combinedthe loose principles of the existing taxonomies into the‘Linnaean taxonomy’ (Systema Naturae 1735). In thishierarchical classification, nature was divided into king-doms, phyla (for animals) and divisions (for plants),classes, orders, families, genera and species. In the figurebelow (Fig. 2), modern humans (homo sapiens) are definedaccording to the Linnaean taxonomy.

Linnaeus’ taxonomy, which is now called the alphataxonomy or classical taxonomy is still a model forbiological classifications.

The designations “taxonomy” and “classification” areused interchangeably, whereas they are not completelysynonymous. Agro [13] and Van Rees [16] argue thattaxonomies distinguish themselves from classifications inthat they group concepts according to essential, internalattributes, i.e. according to relationships between the7 http://www.plainenglish.co.uk/

J Med Syst

concepts. Taxonomies, unlike classifications, are createdfrom the bottom up, are based on actual content and guideusers through a body of information. A classification, onthe other hand, is a grouping of concepts according toarbitrary, external attributes [16]. These external attributescan be colour, shape, geography, size, usability, etc.Classifications are created from the top down and are basedon conceptual frameworks [13, 16]. Table 2 summarizes thecharacteristics of taxonomies versus classifications accord-ing to Agro and Van Rees.

Cann [17], however, uses other criteria to define theconcepts of classification and taxonomy. He describesspecial versus general, analytical versus documentary andenumerative versus faceted classifications. Firstly, a classi-fication describes either general knowledge, e.g. theUniversal Decimal Classification (UDC) or a specificknowledge domain, e.g. the International Classification ofDiseases (ICD). Secondly, a classification can be analyticalor documentary. Analytical implies that physical phenom-ena are systematized into an understandable scheme. Cann[17] also designates this type of classification as “taxono-mies”. In his opinion, “taxonomy” and “classification” arenot, as argued by Agro and Van Rees, co-hyponyms, rather“taxonomy” is hyponymous to “classification”, or ataxonomy is a ‘kind of’ classification. Documentary

classifications are used as information management andretrieval tools (e.g. UDC). Thirdly, classifications can beeither enumerative or faceted. An enumerative classificationlists certain classes and all their subclasses of interest [17],is created from the top down and allows for compoundsubjects. This type of classification is often called hierar-chical, which is a common misunderstanding, as facetedclassifications can also have a hierarchical structure.Faceted classifications are created from the bottom up anddo not provide “ready-made class numbers for compoundand complex subjects” [18]. In enumerative classifications,there is usually only one path the user can follow to find hissubject, i.e. from a broad category to the specific concept.In faceted classifications, the concepts are organized intoclasses according to several principles of division. Anexample of a faceted classification can be found inSpringerlink’s8 organization of documents, where docu-ments can be retrieved using different principles ofdivision. The collection can be searched by the facets“content type”, “featured library” or “subject collection”.

Cann’s view (see Fig. 3) seems to be more solid andlogical. Here, a classification is considered as a hypernymfor all types of concept categorization. However, Cann stilloverlooks the fact that analytical classifications, or taxon-omies, have also come to play a role in informationretrieval, i.e. they have adopted the function of documen-tary classifications.

We propose a definition for “taxonomy” in data retrieval,based on ISO/IEC 11179–2 [12]: “a taxonomy provides ahierarchical classification structure that adds the power ofinheritance of meaning from generalized taxa to specializedtaxa”. Classification is a more general term which can bedefined as the grouping together of concepts on the basis ofshared characteristics. Both structures can be used inmedical coding systems.

Ontologies

A closer look at the concept of ‘ontology’ shows that,again, its meaning depends on the domain or the (historical)context in which it is used: philosophy or informationscience. When used in the context of philosophy, Ontologyis often written with an upper-case ‘O’, whereas ontologywith a lower-case ‘o’—and with a plural form, ontologies—refers to a representation of reality or to an informationretrieval system.

The term ‘Ontology’ is derived from the Greek words o_0ν(being) and λογία (science, study, theory) and literallytranslates into “the science of being”. This branch ofmetaphysics organizes, or attempts to organize the universe

8 http://www.springerlink.com

Fig. 2 Modern humans in the Linnaean taxonomy

Table 2 Taxonomy versus classification according to Agro and VanRees

Taxonomy Classification

Grouping of concepts accordingto essential, internal attributes

Grouping of concepts accordingto arbitrary, external attributes

Created from the bottom up Created from the top down

Based on actual content Based on conceptual frameworks

Created by a multidisciplinaryteam

Created by domain experts

Flexible, dynamic Static

J Med Syst

and its components into a scheme with explicit formulationof their possible relations. Most dictionaries, such asLONGMAN Dictionary of Contemporary English [19],Oxford English Dictionary [20] and Merriam-Webster [21]define Ontology in this context. As a derived meaningused within the context of knowledge management, an‘ontology’ can be described as a representation of whatexists. Some ontologies, like SNOMED or OpenGalen,are more than just a representation of the concepts withina specific domain with their relationships; they are de-signed to use as a coding system or for clinical decisionsupport.

Bibliographic retrieval

Taxonomies

With the advent of the Internet, taxonomies started coveringother purposes than those described above: they now alsofunction as metadata for information retrieval. Thesetaxonomies are no longer used to retrieve only data, ratherthe concepts are used as keywords for tagging documents,or for referencing to these documents. Cann [17] refers tothis type of taxonomy as “documentary classifications” (seesupra).Their structure offers more transparent and moreefficient search options, including explosion of the searchterm. Term explosion allows the system to search forinformation about not only the concept itself, but also aboutits narrower, hyponymic concepts.

Taxonomies can be included in thesauri and ontologies[14, 22], and taxonomies and thesauri are often bracketedtogether as one and the same concept. So what distin-guishes the taxonomy from thesauri and ontologies?Basically, ‘taxonomy’ can refer to any hierarchical classi-fication of elements of a group into subgroups according tospecific criteria, often visualized as a tree. Its relationshipsare not specified, i.e. broader and narrower terms candesignate the obvious subsumption relationship (parent/child), but also a mereologic relationship (part/whole).Taxonomies do not cover any relationships other thanhierarchical. Thesauri and ontologies compensate for this

lacuna and give explicit or implicit indications as to thenature of the relationship.

Thesauri

Peter Luhn (IBM) conceived the idea of using a thesaurus,which was previously a purely linguistic tool, for informa-tion retrieval. In the 1960s, the first thesauri for informationretrieval were published. The Thesaurus of Engineeringand Scientific Terms [23] sketched the broad outlines of thestandard format for thesauri. In this period, thesauri evolvedtowards their current form, defined by ISO 2788 [24] as“the vocabulary of a controlled indexing language, formallyorganized so that the a priori relationships betweenconcepts (for example as “broader” and “narrower”) aremade explicit.” Controlled means that the vocabulary ispredetermined and is used as a prescriptive terminology.This implies that the terminology of the subject field issubdivided into preferred terms—also called descriptors—and non-preferred terms or entry terms (see infra). Athesaurus is usually organized hierarchically, which meansthat the relationships ‘broader term’ and ‘narrower term’ arevisible in a tree-like structure or made explicit by theabbreviations BT and NT respectively. ISO 2788 states thatthere are various ways in which the terms in a thesaurus canbe displayed, the most common of which are alphabetical,systematic and graphic display. The standardized relation-ships in thesauri are the hierarchical, associative and theequivalence relationship. These are a priori relationships,which means that they are context-independent, rather thanbeing inferred from the documents they describe.

When used in the context of information and library science,‘thesaurus’ refers to a retrieval instrument, used to index and/orsearch documents. This is often the main or only purpose ofpresent-day thesauri, and most authors [14, 24–30] definethesaurus in this context. Chowdhury [28] describes thefollowing main objectives of thesauri for information retrieval:

1. vocabulary control: a translation of natural languageinto a more constrained language

2. consistency between different indexers3. limitation of the number of terms needed to label the

documents4. search aid in information retrieval

The historical and interdomain shifts—from the linguis-tic field to the field of information science—describedabove are reflected in the definitions given by Landau [31]:

1. A “storehouse” of knowledge such as exhaustiveencyclopaedia or dictionaries,

2. Exhaustive lists of words from the general language,without definitions, arranged systematically accordingto the ideas they express.

Fig. 3 Types of classification according to Cann

J Med Syst

3. A list of subject headings for a particular field ofknowledge, arranged in alphabetic or classified orderand used for information retrieval and related purposes.

Due to these shifts, the term ‘thesaurus” carries severalmeanings, and it is thus recommendable to study thecontext and subject field in which the term occurs beforedrawing any conclusions as to its meaning.

There are several standards for thesauri. ISO 2788 wascreated for the design of monolingual thesauri and ISO5964 [32] documents the design of multilingual thesauri.These standards, however, are outdated [33], as they onlyrefer to printed thesauri. Both standards will be replaced bya new standard, ISO 25964, based on BS 87239 [27), thecorresponding British standard. ANSI/NISO, the US stan-dardization organization, created its own standard, Z39.19.These guidelines have a somewhat broader scope: theycomprise all monolingual controlled vocabularies, includ-ing lists, taxonomies, thesauri and synonym rings. There isno single ‘worldwide’ standard, as the US and otherstandards (BS, ISO) departed from each other in previouseditions. In an interview [34], Dr. Amy J. Warner10 statedthat the new ANSI/NISO standard should be morecompatible with the existing standards.

In conclusion, the term thesaurus can be used in differentcontexts, related to different fields of knowledge whichcame into existence at different points in time. When usedin the context of information science, a thesaurus can bedefined as a “controlled vocabulary, which is usuallyorganized hierarchically and which includes standardized,a priori, hierarchical, associative and equivalence relation-ships between concepts” [24].

Controlled vocabularies

According to the ANSI/NISO Guidelines [26], a controlledvocabulary, which is a list of preferred and non-preferredterms, is—or should be—exempt of ambiguities, homony-my and polysemy and all terms should have “an unambig-uous, non-redundant definition”. Controlled vocabulariescan be used for consistent indexing and searching ofinformation. For instance, using a controlled vocabularyin medical information retrieval can help health professio-nals to describe and classify medical information, optimiz-ing the work of both searchers and indexers.

Compared to natural language, a controlled vocabularyhas some weaknesses and some strengths, as stated by

Aitchison et al. [25]. Its weaknesses include the relativelack of exhaustivity and specificity, the laboriousness ofkeeping it accurate and up-to-date and the cost of doing so.Moreover, this language has to be learned by the searcherand efficient exchange is often hampered by the incompat-ibility of the existing controlled vocabularies. Aitchison etal., however, add that over-exhaustivity may provoke a lossof precision. In addition, a controlled vocabulary canfacilitate the search process considerably by expanding thequery to its synonyms and excluding ambiguity. Acontrolled vocabulary is usually incorporated into athesaurus, an ontology, a topic map, which, in turn, canbe used in an information retrieval system.

Ontologies

In the late twentieth century, the term “ontology” adoptedsome new properties as it saw its introduction intoinformation architecture and science. Most recent sources[14, 22, 26, 35–39] describe ontology in this field. Its best-known definition is that by Gruber [40]: “an explicit,formal specification of a shared conceptualisation”. Ananalysis of this definition is expedient, as it concentratessome important components. Firstly, ‘explicit’means that theconcepts included in the ontology are clearly defined, as arethe constraints on their use. ‘Formal’ refers to the language ofthe ontology. A formal language is computer-readable: thecomputer ‘understands’ the relationships—also called ‘formalsemantics’—within the ontology. This way, they can be usedto support computer applications. Examples of formalrepresentation languages for ontologies include RDF [41](Resource Description Framework; cf. the Nautilus ontology[42]), F-Logic [43], or Frame Logic (e.g. FLORID [44]),KIF (Knowledge Interchange Format, e.g.), a later version ofwhich—Common Logic—has been submitted to and ap-proved by ISO, OIL [45] (Ontology Inference Layer),DAML+OIL, a combination of DAML (DARPA11 AgentMarkup Language) and OIL, and OWL [46] (Web OntologyLanguage; e.g. Basic Clinical Ontology for breast cancer12),which combines OIL and DAML+OIL. Ontologies writtenin these formal languages can be used for inferencing or tosupport other software applications.

There are, however, ontologies which do not use aformal language, but a—restricted—natural language. Usc-hold [47] argues that “we must rely on NL [naturallanguage] definitions to be sure of what something means”.

The last components of the definition, shared and concep-tualization, imply that this abstract model of phenomena in theworld has been agreed upon by a group of users or experts.

10 Project Leader for NISO's Thesaurus Development Team

9 The BS 8723 standard consists of five parts, the first two of whichbroadly correspond to ISO 2788, whereas the combination of part oneand four have approximately the same scope as ISO 5964 (multilin-gual thesauri). BS 8723–3 covers vocabularies other than thesauri, BS8723–4 gives recommendations concerning interoperability of vocab-ularies and BS 8723–5 discusses exchange formats.

12 http://acl.icnet.uk/∼mw/MDM0.73.owl

11 DARPA stands for Defense Advanced Research Projects Agency

J Med Syst

As observed by Garshol [10], an ontology usuallyconsists of concepts, relations and properties, but “exactlywhat is provided around this varies”. The basic elements ofan ontology are concepts, grouped into classes. The actualobject referred to by the concept, is an individual orinstance. Relations between concepts and instances areoften called roles. Attributes or properties are assigned tothe concepts or instances.

Thesauri and taxonomies, and even glossaries are oftenconsidered bedfellows within the category of—simple—ontologies: they organize the concepts or terms of aknowledge domain, and all four can be used for indexingand searching information. An ontology, however, distin-guishes itself from the other tools mainly by allowing moretypes of semantic relationships, which makes the ontologymuch more versatile, more powerful. In addition, anontology usually structures its concepts not as a hierarchy,but as a network or a web.

Ontologies, as observed above, were initially conceivedas a way to represent knowledge; they are “intended tosupport the vision of the semantic web through providingstructured metadata about resources and a foundation forlogical inferencing” [48]. They are aimed at giving atruthful reflection of reality, and this has repercussions ontheir further development for use in information retrieval.

In conclusion, the term ‘ontology’ is polysemous due tohistorical and interdomain shifts. Originally, it was thestudy of being, the outcome of which was a representationof what exists, or ‘an ontology’. This later became aschematic representation of fields of knowledge withconcepts and their interrelationships. In information sci-ence, this structure is formalized and can be used forcomputer applications, including information indexing andretrieval.

Topic maps

Taxonomies, thesauri and ontologies were originally designedto represent knowledge. Later, especially with the advent ofthe Internet, they started being used as indexing vocabularies,facilitating information and document retrieval. Topic maps,on the other hand; were specifically designed for informationindexing and retrieval and consist of a knowledge layer—comparable to an ontology—and a resources layer. Theknowledge layer (called “topic space” in Fig. 4 [49]) isusually a semantic network deduced from the resources layeror pool and not—as an ontology—designed by experts as arepresentation of reality.

The distinction between ontologies and topic maps runsparallel to that between knowledge management andinformation management: ontologies cover only the knowl-edge itself, whereas topic maps also involve storing andtracking resources in which this knowledge may be found.

The idea of topic maps emerged in the early ninetieswhen the Davenport Group met to discuss ways to mergeindexes, glossaries, thesauri, cross references, etc. This newindex was to reflect the structure of the knowledge itrepresented. Their efforts resulted in ‘topic navigationmaps’, which were adopted as an ISO work item in 1996.In 2000, these topic navigation maps were renamed ‘topicmaps’ and became a new ISO standard.13

Ontologies describe concepts—represented by terms—with their attributes and relationships and divide them intoclasses. These classes consist of concrete or abstractindividuals or instances. Topic maps have subjects repre-sented by topics and described by associations andoccurrences. Topics are described in more detail by topicnames and topic types, association types and occurrenceroles (see also Pepper, 2000 [50]). In addition to thisdifference in structuring the knowledge layer, topic mapshave some other important distinguishing characteristics,mainly concerning their development, initial purpose and

13 The definition of topic maps proposed in ISO/IEC 13250 is acircular definition, thus not helping to grasp the exact meaning of‘topic maps’:

a) A set of information resources regarded by a topic mapapplication as a bounded object set whose hub document is atopic map document conforming to the SGML architecturedefined by this International Standard.

b) Any topic map document conforming to the SGML architecturedefined by this International Standard, or the document element(topicmap) of such a document.

c) The document element type (topicmap) of the topic mapdocument architecture.

Fig. 4 Structure of topic maps

J Med Syst

standards. The main differences and similarities aresummarized in Table 3.

As observed above, the knowledge framework inontologies is designed from scratch by a domain expert inorder to support the vision of the semantic web. In topicmaps, however, this knowledge layer is deduced from thedocuments or information contained in the resource layer.Pepper [50] and [51] Hummel [51] consider the separationinto two layers and the standardized format respectively asthe topic maps’ strengths. These qualities improve thenavigational function of topic maps and their interoperabil-ity with other topic maps, and even with indexes, thesauri,taxonomies, ontologies and other traditional classificationschemes. As confirmed by Garshol (2004) [10], “topicmaps do not offer more, but other possibilities with regardto the knowledge represented, i.e. a flexible model with anopen vocabulary”.

The format of topic maps is, as observed above, capturedin an ISO standard, which also improves the efficiency andinteroperability with other tools. Ontologies lack thisstandardization and are thus less suitable for exchange.The format of ontologies is not standardized, but many oftheir corresponding representation languages (XML, RDF,RDF Schema, and OWL) are.

Applications in the (bio)medical domain

The last decades have witnessed an information explosionin the (bio)medical domain, and with it the increasing needfor solid vocabularies, terminologies and classificationsystems. They include—next to the numerous medicalglossaries and dictionaries—the UMLS resources, the GeneOntology, MeSH, SNOMED and OpenGALEN. Thepresent section attempts to characterize these systems interms of the definitions given above.

Linguistic tools in the biomedical domain

Medical glossaries, lexicons and dictionaries

Wikipedia’s Glossary of medical terms related to commu-nications disorders, the Ziekenhuis.nl woordenboek areexamples of mono- and bilingual glossaries respectively.They cover terms from the field of medicine or socialservices, and comply with the definition of ‘glossary’ givenin this article in that they are lists of terms, arrangedalphabetically, with definitions.

The Specialist Lexicon, which is included the UMLS asone of the Knowledge Sources, meets the criteria forlexicons described in this article. It was designed for use innatural language processing (NLP) and is intended to be ageneral English lexicon that includes many biomedicalterms. The linguistic information includes inflectionalvariants and derivations, acronyms, spelling variants and,when applicable, verb, noun or adjective complementation.An example of a lexical record can be seen in Fig. 5.

The Pinkhof geneeskundig woordenboek and the Diccio-nari d'infermeria are examples of a monolingual and amultilingual dictionary respectively. They give definitionsand information on the origin of the word, which isgenerally Latin or Greek, and on gender.

The multilingual glossary of technical and popular medicalterms in nine European languages

The Multilingual Glossary of Technical and PopularMedical Terms in Nine European Languages is a controlledvocabulary in the form of a glossary. Each ‘technical’ termin this glossary has a popular variant which should beconsidered as the preferred term in texts intended forpatients. The glossary was initiated in the framework of the92/27/EEC Directive, which made the inclusion of patient

Table 3 Differences between ontologies and topic maps

Ontologies Topic maps

Definition An ontology is a representation of reality. A topic map is an information retrieval tool which consistsof a resources layer linked to a knowledge layer.

Differences Is an organization of knowledge Consists of a knowledge layer (comparable to an ontology)and a resources layer

Can be used as an information retrieval tool whenthe knowledge is linked to resources

Is designed as an information retrieval tool

Knowledge structure is designed by domain expert(s)and later linked to the documents or other resources

Knowledge structure is deduced from the resources

The knowledge layer is a representation of reality(within a specific domain)

The knowledge layer is a representation of the knowledgein the resources

The knowledge structure consists of concepts, classes,attributes, relations and individuals

The knowledge structure consists of subjects, topics (+ namesand types), associations (+ types) and occurrences (+ roles)

Not a standardized format as such Topic maps is an ISO standard format

J Med Syst

information leaflets (PILs) in every medication packagemandatory in the Member States of the European Commu-nity and stipulated that the leaflets had to be written inunderstandable language. As the use of terminology is oftenan important factor in the readability of these informationleaflets, a glossary with popular variants for medical ortechnical terms was very useful. This controlled vocabularywas thus intended to help writers and translators make theirPILs understandable for the general public.

The European multilingual thesaurus on health promotion

The European Multilingual Thesaurus on Health Promo-tion is a merger of three thesaurus projects in 12 languagesand is used as a linguistic tool: it should stimulate theuniform use of terms related to health promotion and healtheducation in Europe, as a such a shared language supportsthe efficient exchange of information. This thesaurus is thusused as a controlled vocabulary, with preferred and non-preferred terms. The ISO standards 2788 and 5964 wereused as construction guidelines, although it is not used forbibliographic retrieval.

Knowledge management and medical coding

The ATC classification

The ATC (Anatomical Therapeutic Chemical) classificationis a system developed by the WHO for the classification ofdrugs and other medical products. Applying Cann’s view tothis classification, one could state that this is a specific,documentary, enumerative classification. Specific, becauseit covers a part of the medical domain, namely medicalsubstances. Documentary, because it functions as aninformation management and retrieval tool, and enumera-tive because it lists the classes and subclasses in a specificdomain of interest and it is created from the top down.

The classification consists of 14 main classes, each onereferring to another anatomical main group, e.g. nervoussystem (N). The next level is indicated by two digits andcontains therapeutic subgroups, e.g. anti-parkinson drugs(N04). The third level, which is indicated by one letter, refersto the pharmacological subgroup, e.g. dopaminergic agents(N04B). The fourth level, again a letter, is a designation ofthe chemical subgroup, e.g. dopamine agonists (N04BC),and the last two digits indicate the chemical substance, e.g.pramipexole (N04BC05; see Table 4).

The ATC classification is mainly used to producestatistics about drug use, but also for the registrationprocess of drugs.

The International Classification of Diseases and relatedhealth problems (ICD)

The International Classification of Diseases and RelatedHealth Problems is published by the World HealthOrganization (WHO) and classifies diseases, signs, symp-toms, complaints, social circumstances and causes of injuryor disease. It is used in statistics, in automated decisionsupport and in reimbursement systems. ICD-10, the tenthrevision of ICD, is the most recent version of theclassification. The first level of ICD-10 consists of 22classes, each of which has several subclasses. The firstletter in the code refers to the chapter, whereas thefollowing digits specify the disease. For instance, inC18.7, C refers to malignant neoplasms, 18 refers tomalignant neoplasms of the colon, and the numeric symbolafter the decimal point further specifies the disease, in thiscase malignant neoplasm of the sigmoid colon.

ICD-10 is a specific, documentary and enumerative clas-sification: it covers a specific domain, it is used to storeand retrieve medical data and created from the top down.

The International Classification of Primary Care (ICPC)

The International Classification of Primary Care wasdesigned by the WICC (WONCA International Classifica-tion Committee) for the classification of reasons forencounter (RFE), problems, diagnoses, interventions andthe ordering of these data in an episode of care structure.

Fig. 5 Example of a lexical record in the Specialist Lexicon

Table 4 Structure of the ATC classification

ATC level ATC code ATC text

1 Anatomical main group N Nervous system

2 Therapeutic subgroup N04 Anti-parkinson drugs

3 Pharmacological subgroup N04B Dopaminergic agents

4 Chemical subgroup N04BC Dopamine agonists

5 Chemical substance N04BC05 Pramipexole

J Med Syst

Chapter ten of the second version of ICPC has beenconverted into an electronic file, i.e. ICPC-2-E, is specif-ically designed for use in electronic patient records (EPR)and for research purposes. It is to be used together with thefirst nine chapters of ICPC-2. As ICD-10 is more fine-grained and allows for documentation at the level ofindividual patients [52], this classification was the perfectcomplement to ICPC-2. When ICD-10 was made available,together with its various translations, the WICC decidedthat all translations of ICPC were to relate to ICD-10, inorder to allow for a better structuring of EPRs. For theNetherlands and the Dutch-speaking part of Belgium, thisresulted in the ICPC-2/ICD-10 thesaurus (see infra).

ICPC-2 is a specific, documentary and enumerativeclassification which has a bi-axial structure (see Fig. 6).There are 17 main classes with an alpha code referring tothe location of the complaint, and seven components with atwo-digit numeric code, which organize each of theseclasses.

ICPC-2 is included in the UMLS (see infra).

ICPC-2/ICD-10 thesaurus

The ICPC-2/ICD-10 thesaurus14 was created at the Univer-sity of Amsterdam, Department of Family Practice, in

collaboration with the Department of General Practiceand Primary Health Care of the Ghent University. Asstated above, ICD-10 is the perfect complementation forICPC-2, as it is more fine-grained. The result of thiscombination is a system with doubly encoded clinicallabels: each term has two codes, an ICD-10 and anICPC-2 code.

This bilingual (English-Dutch) terminology is called a“thesaurus” because it has a hierarchical structure andsynonyms for many of the concepts. Moreover, it is acontrolled language used to store medical information.However, not all the requirements to designate a vocabularyas a thesaurus are fulfilled: there are no associativerelationships.

SNOMED CT

The Systematized Nomenclature of Medicine, ClinicalTerms, or SNOMED CT, provides a comprehensiveterminology covering concepts in health care, i.e. diseases,clinical findings and procedures. This terminology, which isalso available in German and in Spanish, is designed tosupport data retrieval and automated inferencing (e.g. forclinical decision support). SNOMED CT is based on theSNOMED Reference Terminology (SNOMED RT) and theBritish Clinical Terms, version 3. It also cross-maps to anumber of existing terminologies and coding systems, suchas ICD-9-CM, ICD-10 and LOINC (Logical ObservationIdentifiers Names and Codes).

The clinical concepts included in SNOMED CT areorganized in 19 hierarchies—alternatively called axes—andlinked with definitions in formal logic. Each term inSNOMED CT has a unique numeric code, a unique name

14 3BT (Belgian Bilingual Biclassified Thesaurus) is a continuation ofthe ICPC2/ICD10 Thesaurus, but with French translations added to it.The designation “thesaurus” is a misnomer in this case, as the systemdoes not meet all the criteria described in the ISO standards forthesauri: it has no associative relationships. However, some terms dohave synonyms or entry terms that lead the system to the correctconcept. Like ATC, ICD and ICPC, this is a specific, enumerative,documentary classification used for medical coding.

Fig. 6 Structure of ICPC-2

J Med Syst

(‘fully specified name’), and a ‘description’ comprising onepreferred term and one or more synonyms.

Two main types of relationships are established in thisontology: hierarchical and attribute relationships. Hierar-chical ‘is-a’ relationships are defined within one axis,whereas the attributes link concepts from different hierar-chies. Attribute relationships include finding site, causativeagent, occurrence, stage, etc.

The prerequisites for an ontology in information scienceare thus fulfilled: the SNOMED CT terminology representsknowledge from a specific domain (health care), is concept-oriented, and the definitions are formalized. Moreover,almost any semantic relationship can be expressed in thisontology.

OpenGalen

OpenGALEN is a multilingual terminology and codingsystem for the classification of surgical procedures, elec-tronic healthcare records (EHCRs), clinical user interfaces,decision support systems, knowledge access systems, andnatural language processing.

The OpenGALEN Foundation (Open Galen Foundations.d.) defines ‘ontology’ as “the set of primitive, high levelcategories in a knowledge representation scheme togetherwith any taxonomy which structures those categories”. Inthis view, the OpenGALEN system is an ontology indeed.However, it also fulfils the requirements of an informationretrieval ontology in the strict sense: it represents theconcepts of a specific domain with formalized relationships,making the ontology re-usable in other applications.Moreover, the ontology allows the expression of extensivesemantic relationships, including “kind-of”, “part-of”,“connects”, “branch-of”, “serves” and laterality relations.

Bibliographic retrieval

The NCBI Entrez Taxonomy

The NCBI Entrez Taxonomy15 is a hierarchical structurewhich contains all organisms represented in GenBank, with

at least one nucleotide or protein sequence. There are seventop classes, i.e. arachea, bacteria, eukaryota, viroids,viruses, other and unclassified. The information providedfor each concept is quite elaborate and includes an ID, arank, a genetic code, synonyms, and information as to thelocation in the taxonomy (“lineage”; see Fig. 7).

The Entrez Taxonomy complies with the definition givenabove: it is a hierarchical classification structure in whichmeaning is passed from more generalized to more special-ized taxa.

MeSH

MeSH is an acronym for Medical Subject Headings, acontrolled vocabulary produced by the National Library ofMedicine (NLM), geared specifically for informationretrieval. MeSH is used for indexing and searching journalarticles in MEDLINE and other resources from the NLMCatalog.

The MeSH vocabulary consists of preferred terms, ordescriptors, and entry terms. However, MeSH is more than‘just’ a controlled vocabulary, it is a fully fledged thesaurus.The equivalence relationship is established by entry terms,which can be synonyms, near synonyms, abbreviations, oralternate forms of the MeSH term. Besides the equivalencerelationship, two other typical thesaurus relations, i.e.hierarchical and associative relations, are represented.

The concepts are structured into a hierarchy, the MeSHtree, with 16 main branches. Each descriptor can havemultiple parents and can consequently appear in severalplaces in the tree. This can be illustrated by looking at aspecific example, e.g. the Wolfram syndrome. This descriptorappears under the following subcategories: Nervous SystemDiseases [C10], Eye Diseases [C11], Male UrogenitalDiseases [C12], Female Urogenital Diseases and PregnancyComplications [C13], Congenital, Hereditary, and NeonatalDiseases and Abnormalities [C16], Nutritional and MetabolicDiseases [C18] and Endocrine System Diseases [C19]

Each descriptor has a notation—one or several MeSHnumber(s)—which is an indication of the concept’s rela-tionship to its neighbouring concepts. This type of notationis referred to by Aitchison et al. [25] as an “expressivenotation” or “hierarchical notation” (as opposed to (semi-)ordinal, synthetic and retroactive notations). The length of

15 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Root

Fig. 7 Extract from the NCBIEntrez Taxonomy

J Med Syst

the number indicates the specificity of the term: the longerthe number, the more specific the concept. Figure 8 showsthat Eye [A09.371] is broader than Anterior Eye Segment[A09.371.060], which, in turn, is broader than AnteriorChamber [A09.371.060.067].

When applied in information retrieval, the MeSHthesaurus can be an extremely valuable tool. It allowsexplosion of the search terms, and in Entrez PubMed, theterms entered by the searcher are automatically mapped tothe appropriate MeSH term [53]. Term explosion, asdescribed above, is a technique which increases the searchyield considerably by searching not only for the term itself,but also for its narrower terms.

When examined for compatibility with the definition of athesaurus as an information retrieval tool, the MeSHthesaurus proves to fulfil almost all requirements. It is acontrolled vocabulary, with its descriptors and its non-preferred entry terms, which lead the searcher to thedescriptors. The MeSH tree is organized hierarchicallyand includes the standardized relations as described in ISO2788 [24]—the hierarchical, associative and equivalencerelationship. These relationships are a priori relationships,i.e. they exist independently of the contents of the articlesindexed with MeSH terms. Moreover, each term has ascope note, which contain background information on theusage and scope of the term. Scope notes can contain adefinition formulated by the MeSH project partners orcopied from other sources, like dictionaries or biomedicalpublications.

Greenberg [54], however, mentions a slight differencebetween thesauri for information retrieval and subjectheadings: thesauri generally tend to support post-coordinate

searching, whereas subject headings have a pre-coordinatedsyntax. In pre-coordinated vocabularies, combinations ofconcepts are made at the indexing stage by the indexers,rather than at the stage of query formulation by the user.This means that the searcher can select very specific,unambiguous and “ready-made” queries instead of com-bining single-concept terms. Compare, for example, thepre-coordinated MeSH term “Physiological effects ofdrugs” and the terms “physiological”, “effect” and “drugs”in post-coordination. The advantages of pre-coordinationdescribed in [55] include proximity searches, where thesearcher uses the relationships between concepts to selectthe best query. Pre-coordinated terms can be very useful forbrowsing, as they enable hierarchical displays. Some of thedisadvantages stated in [55] are that pre-coordinationrequires human manual construction, an expensive andtime-consuming task. Another disadvantage of pre-coordi-nation might be that some end-users who are not familiarwith this method of searching, might experience someproblems. Post-coordination implies that concepts will haveto be combined at the searching stage using Booleanoperators.

Subject headings have multi-word terms, and often useinverted word order. MeSH can thus be defined as athesaurus with the syntax of a subject heading list.

Controlled vocabularies

Controlled vocabularies used in bibliographic retrieval areusually incorporated into another structure, like a thesaurus(MeSH) or an ontology (the UMLS knowledge sourcescombine several controlled vocabularies).

The UMLS knowledge sources

The UMLS (Unified Medical Language System) Knowl-edge Sources combine three of the vocabulary systemsdescribed above: a thesaurus (the Metathesaurus), a lexicon(the SPECIALIST Lexicon) and an ontological structure (theSemantic Network).

The Metathesaurus consists of a large number of sourcevocabularies, including MeSH, SNOMED CT, the GeneOntology, and other controlled vocabularies. Partly as aconsequence of this combination of vocabularies, theMetathesaurus has a polyhierarchical structure. The Meta-thesaurus can be used in a wide range of applications,including information retrieval, and it becomes morepowerful when used in combination with the SPECIALISTLexicon and the Semantic Network.

The SPECIALIST Lexicon covers both the Englishgeneral language and concepts from the field of biomedi-cine. It provides syntactic, morphological and orthographicinformation about the terms included in the lexicon.Fig. 8 Expressive or hierarchical notation (MeSH)

J Med Syst

A third component of the UMLS KnowledgeSources is the Semantic Network, which consists ofSemantic Types, or broad subject categories, and Seman-tic Relations between these Semantic Types. This toolenables a consistent categorization of the concepts in theMetathesaurus.

The combination of the Knowledge Sources could beregarded as an ontology, as it represents knowledge from aspecific field, with its concepts and extensive relationships.Furthermore, the Semantic Relations are expressed in aformal language. The combination of Semantic Types andSemantic Relationships makes this knowledge source muchmore versatile than the average thesaurus. However, not allrelationships in the Metathesaurus are described formally.There are source vocabularies which do not use a formalrepresentation language (e.g. MeSH) in the Metathesaurus.

A medical ontology is being developed by the Lister HillNational Center for Biomedical Communications, a re-search division of the U.S. National Library of Medicine.This ontology will combine the UMLS with SNOMED-RT,GALEN and MEDLINE citations and will represent a“model for proximity between medical concepts”.16

The Gene Ontology

The Gene Ontology (GO) is a controlled vocabularydeveloped by the Gene Ontology Consortium for theannotation of gene products in model organisms. Thisvocabulary consists of three separate hierarchies, eachrepresenting concepts from a different subdomain: cellularcomponents, molecular functions and biological processes.It has a polyhierarchical structure, i.e. narrower term orhyponym can have more than one broader terms orhypernyms, and it has a simple RDF syntax.

Despite its name, the GO is not an ontology as describedin this article. Two types of relationships are present in thiscontrolled vocabulary, namely the hierarchical is-a andpart-of relationships and the equivalence relationship. Theterm ‘ontology’ here refers to the fact that knowledge abouta specific domain is represented, including the relationshipsbetween the concepts.

Smith et al. [56] give an overview of the requirementsfor the Gene Ontology to become a cost-effective andsemantically consistent system. These changes wouldconvert the Gene Ontology into a system with the relationalcharacteristics of a true ontology. However, making these

changes would raise many difficulties. As a result, the GeneOntology will probably remain in its current form, i.e. acontrolled vocabulary.

Topic maps

Beier and Tesche [57] developed a medical informationretrieval system, using the Medical Subject Headings (inEnglish and German) and their classification as theknowledge layer, and the resources layer includes AHCPRGuidelines, journal articles and selected internet sites. Thisis a federated search system, i.e. a system which simulta-neously searches several databases and/or web resources.The query entered by the user is automatically expandedwith the topic name (the preferred term), synonyms,translations and definition.

The interface (see Fig. 9) clearly shows the typical topicmap structure of the resources layer and the superimposedknowledge layer. Between both layers, some extra MeSHinformation (MeSH code, definition and annotations,synonyms and translations) is displayed, in order to helpthe user find the right topic name for his or her search. Theuser can select the resources in which he wants the engineto search.

This topic map complies with the ISO standard and withthe description of topic maps given above, except that theknowledge layer was not deduced from the resources.

Conclusion

There is a need for consistent terminology in the domainsof linguistics, knowledge management and informationretrieval, as in most fields of knowledge. Terms such astaxonomy, classification, thesaurus and ontology are oftenused interchangeably, resulting in definitions which areformulated from different perspectives.

Not only are the terms used in different ways, their scopemay also change. When terms are adopted in other fields—a shift which often has a historical aspect—this may causesome confusion.

Unambiguous definitions are proposed for each of theterms in question, depending on the context they are usedin, and criteria are presented for a more consistent use ofthe various competing designations. Some of the best-known vocabularies pertaining to biomedical linguistics,knowledge management and bibliographic retrieval arereviewed and examined for their compatibility with thedefinitions given in this article. We concluded that the useof the designations ‘ontology’ or ‘thesaurus’ in thebiomedical domain—as elsewhere—is not always consis-

16 http://lhncbc.nlm.nih.gov/lhc/servlet/Turbine/template/research,langproc,MedicalOntology.vm

J Med Syst

tent. More specifically, we found that the ICPC-2/ICD-10thesaurus and 3BT are not thesauri, but bicoded classifica-tions and that the Gene Ontology is not really an ontologybut a controlled vocabulary.

Table 5 below gives an overview of the systems inbiomedicine in a two-dimensional structure: according totheir domain of application (linguistics, knowledge man-agement—including medical registration—and bibliograph-

Table 5 Overview of (bio)medical vocabulary systems

Linguistics Knowledgemanagement

Bibliographicretrieval

Glossary, lexiconand dictionary

Wikipedia’s Glossary of medical terms related tocommunications disorders, Ziekenhuis.nl dictionary,Multilingual Glossary of Technical and PopularMedical Terms in Nine European Languages

The Specialist Lexicon

Pinkhof geneeskundig woordenboek, Diccionari d'infermeria

Taxonomy Linnaeantaxonomy

NCBI EntrezTaxonomy

Classification ICD, ICPC,ICPC2/ICD10thesaurus, 3BT

Thesaurus European Multilingual Thesaurus on Health Promotion MeSH

Controlledvocabulary

Multilingual Glossary of Technical and Popular MedicalTerms in Nine European Languages (multilingual)

MeSH, severalvocabularies inthe UMLS

Ontology OpenGalen SNOMED UMLS

Topic maps HyperCis Topic Map

Fig. 9 Interface of the MeSH-based topic map created byBeier and Tesche

J Med Syst

ical retrieval) and the kind of vocabulary (taxonomy,classification, thesaurus, controlled vocabulary, ontologyor topic maps) they represent.

References

1. University of London Computer Centre (ULCC), UNESCOThesaurus [cited 2009 26/02/2009]; Available from: http://www2.ulcc.ac.uk/unesco/#brow, 2003.

2. Roget, P., Roget's II: The new thesaurus. [cited 2009 26/02/2009];Third edition. Available from: www.bartleby.com/62, 1995.

3. Kagolovsky, Y., and Moehr, J. R., Terminological problems ininformation retrieval. J. Med. Syst. 27(5):399–408, 2003.

4. Harter, S. P., and Hert, C. A., Evaluation of information retrievalsystems: Approaches, issues, and methods. Annu. Rev. Inf. Sci.Technol. (ARIST). 32:3–94, 1997.

5. Ananiadou, S., and McNaught, J., Text mining for biology andbiomedicine. Norwood, Artech House, 2006.

6. CILF, Wörterbuch für Industrie und Technik. Paris, ConseilInternational de la Langue Française, 1993.

7. Mawson, C. O. S., Roget’s international thesaurus of Englishwords and phrases, 1st edition. New York, Thomas Y. Crowell,1922.

8. Kilgarriff, A., and Yallop, C., What's in a thesaurus? InProceedings of the Second Conference on Language Resourcesand Evaluation. Athens, Greece, 2000.

9. Wodtke, C., Mind your phraseology! Using controlled vocabu-laries to improve findability. [cited 2007 24/01/2007]; Availablefrom: http://www.digital-web.com/articles/mind_your_phraseology/,2002.

10. Garshol, L. M., Metadata? Thesauri? Taxonomies? Topic maps!Making sense of it all. [cited 13/06/2006]; Available from: www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html 2004.

11. Davis, P. H., and Heywood, V. H., Principles of AngiospermTaxonomy. Princeton NJ, Van Nostrand, p. 556, 1963.

12. ISO/IEC_11179–2, Information technology—Metadata registries(MDR)—Part 2: Classification. 2nd ed. ISO/IEC 11179 MetadataRegistry (MDR) standard. Vol. 2. Geneva: ISO copyright office.16, 2005.

13. Agro, G., Classifications and Taxonomies. University of Texas,Austin, 2004.

14. Beck, H. and Pinto, H. S., Overview of approach, methodologies,standards, and tools for ontologies. University of Florida,Universidade Técnica de Lisboa, 2002.

15. Sterkenburg, P., (Ed.), A practical guide to lexicography.Terminology and lexicography. Research and practice. Vol. 6.Amsterdam/Philadelphia: John Benjamins Publishing Company,2003.

16. Van Rees, R., Clarity in the usage of the terms ontology, taxonomyand classification. In CIB73, 2003.

17. Cann, J., Principles of classification—suggestions for a procedureto be used by ICIS in developing international classification tablesfor the construction industry. NBS Services, ICIS, 1997.

18. Indira Gandhi National Open University, Part II: classificationschemes. In: Indexing languages, pp. 56–88. New Delhi: IndiraGandhi National Open University, pp. 56–88, 2006.

19. Procter, P., LONGMAN dictionary of contemporary English.London, Longman Dictionaries, 1978.

20. Simpson, J. A., and Weiner, S. C., Oxford English dictionary, 2ndedition. London, Oxford University Press, 1989.

21. Merriam-Webster Inc., Merriam-Webster online dictionary. [cited2007 20/03/2007]; Available from: http://www.merriam-webster.com, 2008.

22. Ullrich, M., Maier, A., and Angele, J., Taxonomie, thesaurus,topic map, Ontologie—ein Vergleich. Ontoprise GmbH, 2003.

23. Engineers Joint Council (Ed.), Thesaurus of engineering andscientific terms: a list of engineering and related terms and theirrelationships for use as a vocabulary reference in indexing andretrieving technical information. New York: Engineers JointCouncil and the US Department of Defense. 690, 1967.

24. ISO, Documentation—guidelines for the establishment and devel-opment of monolingual thesauri, 2nd edition. Geneva, ISO, 1986.

25. Aitchison, J., Gilchrist, A., and Bawden, D., Thesaurus construc-tion and use: a practical manual, Vol. 1, 4th edition. London,Aslib IMI, p. 218, 2000.

26. ANSI/NISO, Guidelines for the construction, format, and man-agement of monolingual controlled vocabularies. NISO Press,Bethesda, MD, p. 172, 2005.

27. BSI, Structured vocabularies for information retrieval—guide,Vol. 1. BSI British Standards, London, p. 10, 2005.

28. Chowdhury, G. G., Introduction to modern information retrieval,2nd edition. Facet Publishing, London, 2003.

29. Hagedorn, K., The information architecture glossary. [cited 200604/07/2006]; Available from: http://argus-acia.com/white_papers/ia_glossary.pdf, 2000.

30. Ribeiro-Neto, B., and Baeza-Yates, R. A., Modern informationretrieval. ACM Press, New York, 1999.

31. Landau, S., Dictionaries: the art and craft of lexicography.Charles Scribner's Sons, New York, 1984.

32. ISO, Documentation—guidelines for the establishment and devel-opment of multilingual thesauri. ISO, Geneva, 1985.

33. ISO, Information and documentation. Guidelines for the estab-lishment and development of thesauri [revision of ISO 2788 and5964]. ISO/TC 46/SC9, 2007.

34. Roe, S. K., and Thomas, A. R., The thesaurus: review, renaissanceand revision. Haworth Information Press, Binghamton, NY,2004.

35. Will, L., Glossary of terms relating to thesauri and other formsof structured vocabulary for information retrieval. [cited 200801–02–2008]; Available from: http://www.willpowerinfo.co.uk/glossary.htm, 2007.

36. Studer, R., Oppermann, H., and Schnurr, H.-P., Die Bedeutungvon Ontologien für das Wissensmanagement. Ontoprise GmbH,Karlsruhe, 2001.

37. Klein, G. O., and Smith, B., Concept systems and ontologies.Centre for medical terminology, Karolinska Institutet, Stockholm.Department of Philosophy, University at Buffalo, NY, p. 12, 2005.

38. Jonker, R., Termen en begrippen—Informatiebeheer. [cited 200715/01/2006]; Available from: http://labyrinth.opweb.nl/files/termenbegrippen.pdf, 2006.

39. Jernst, What are the differences between a vocabulary, ataxonomy, a thesaurus, an ontology, and a meta-model?. 15/01/2003 [cited 10/05/2006]; Available from: www.metamodel.com/article.php?story=2003011223271, 2003.

40. Gruber, T. R., Toward principles for the design of ontologies usedfor knowledge sharing. Int. J. Human-Comput. Stud. 43(5–6):907–928, 1995.

41. Beckett, D., RDF/XML Syntax specification (revised). [cited 200826/11/2008]; Available from: http://www.w3.org/TR/rdf-syntax-grammar/, 2004.

42. Dieng-Kuntz, R., et al., Building and using a medical ontology forknowledge management and cooperative work in a health carenetwork. Comput. Biol. Med. 36(7–8):871–892, 2006.

43. Kifer, M., Lausen, G., and Wu, J., Logical foundations of object-oriented and frame-based languages. University of Mannheim,1990.

44. Frohn, J., et al., FLORID: A prototype for F-logic. In: Proceed-ings of International Conference on Data Engineering. Birmingham,UK: IEEE Computer Science Press, 1997.

J Med Syst

45. Van Hamelen, F., et al., OIL: an ontology infrastructure for thesemantic web. IEEE Intelligent Systems. 16(2):38–45, 2001.

46. Bechhofer, S., et al., OWL web ontology language reference. In:Schreiber, Guus, M. Dean, (Ed.). W3C, 2004.

47. Uschold, M., An ontology research pipeline. Applied Ontology. 1(1):13–16, 2005.

48. Garshol, L. M., Living with topic maps and RDF. [cited 2007 05/01/2007]; Available from: http://www.ontopia.net/topicmaps/materials/tmrdf.html, 2003.

49. Ahmed, K., Introducing topic maps: a powerful, subject-orientedapproach to structuring sets of information. (Content manage-ment). XML J. 3(10):22–27, 2002.

50. Pepper, S., The TAO of topic maps. In: XML Europe 2000. Paris,France, 2000.

51. Hummel, B., Einsatz und Nutzenpotentiale von topic maps: EinState-Of-The-Art Bericht. In: Fachbereich Informationswissen-schaften. Fachhochschule Potsdam: Potsdam, 2004.

52. Okkes, I. M., et al., ICPC-2-E: the electronic version of ICPC-2.Differences from the printed version and the consequences. Fam.Pract. 17(2):101–107, 2000.

53. NN/LM, PubMed expert searching: using PubMed to getadvanced results. [cited 2007 14/03/2007]; Available from:http://nnlm.gov/ner/training/material/NER_PES.doc, 2006.

54. Greenberg, J., User comprehension and searching with informa-tion retrieval thesauri. Cat. Classif. Q. 37(3):103–120, 2004.

55. Cataloging Policy and Support Office, Pre- vs. post-coordinationand related issues. A. Management, Editor. Aquisitions andBibliographic Access Directorate, Library Services, Library ofCongress, 2007.

56. Smith, B., Williams, J., and Schulze-Kremer, S., The ontology of thegene ontology. In: AMIA Annual Symposium. Washington D.C., 2003.

57. Beier, J., and Tesche, T., Navigation and interaction in medicalknowledge spaces using topic maps. Int. Congr. Ser. 1230:384–388, 2001.

J Med Syst


Recommended