
Proceedings of Recent Advances in Natural Language Processing, pages 746–751, Hissar, Bulgaria, 12-14 September 2011.

Experiments on Term Extraction using Noun Phrase Subclassifications

Merley S. Conrado, Walter Koza, Josuka Díaz-Labrador, Joseba Abaitua, Solange O. Rezende, Thiago A. S. Pardo, Zulema Solana

Universidad de Sao ...; Universidad Nacional de Rosario; Universidad de ...

Abstract

In this paper we describe and compare three approaches for the automatic extraction of medical terms using noun phrases (NPs) previously recognized in a medical text corpus in Spanish. In the first approach, used as a baseline, we extracted all NPs, while in the second and third the extraction is directed to “specific NPs” that are determined on the basis of syntactic and positional criteria, among others. As contributions, (i) we showed that it is possible to extract medical terms using “specific NPs”, (ii) new terms were added to the software dictionary, and (iii) terms that were not in the reference lists were extracted. For the third contribution, we used the SNOMED CT® term lists, aiming at improving the IULA reference lists.

1 Introduction

According to Moreno-Sandoval (2009), noun phrases (NPs) generally correspond to specific terms of a particular domain. The terms can be formed by only a head or by a head and complements. For this reason, automatic term extraction has mainly been based on the recognition of this kind of phrase.

In this paper, we describe and compare experiments on automatic medical term extraction using noun phrases (NPs) previously recognized in a medical text corpus in Spanish. In a first stage, used as a baseline, all identified NPs are considered term candidates, while in the other stages the extraction is directed to “specific NPs” that are determined on the basis of syntactic and positional criteria, among others. The novelty of this work is that we do not use plain noun phrases, as many works do, but specific NPs, that is to say, a subclassification of phrases. We use the IULA corpus (Bach et al., 1997) of medical texts in Spanish, and the results are compared with reference lists of unigrams, bigrams, and trigrams.

According to the results, (i) we showed that it is possible to extract medical terms using “specific NPs”, (ii) the software dictionary was improved with 2,445 new terms, and (iii) other terms that were not in the reference lists were extracted. For the third contribution, we used the SNOMED CT® term lists, aiming at improving the IULA reference lists. However, it should be mentioned that we detected other expressions that were neither in the reference lists nor in SNOMED CT®, although they could be considered medical terms. This is to be expected, since new terms are added almost on a daily basis and it is practically impossible to update the term lists manually.

2 Term extraction in medicine

There are different works on term extraction that may be applied to different domains, although adaptations are sometimes necessary for each of them. For the medical domain, we may mention the contributions of Névéol and Ozdowska (2005) and Bessagnet et al. (2010) for French; Hao-Min et al. (2008) for Chinese; and Lopes et al. (2009) for Portuguese. For English, we cite the work of Krauthammer and Nenadic (2004), which gives a detailed description of automatic term recognition (ATR) systems in the medical field. Those systems are based either on internal characteristics of specific classes or on external clues that can support the recognition of word sequences that represent specific domain concepts. Different types of features are used, such as orthographic clues (capital letters, digits, Greek letters), morphological clues (specific affixes, POS tags), or syntactic information from shallow parsing. Also, different statistical measures are suggested for “promoting” term candidates into terms.

In our work, term extraction is applied to the medical domain in Spanish, so here we mention the main works in this area: the ONCOTERM Project (Bilingual System of Information and Cancer Resources), the Describe® system, the works of Vivaldi and Rodríguez, the works of Castro et al., and the large terminology developed by the SNOMED CT® Project.

ONCOTERM (López Rodríguez et al., 2006) is a project whose goal is to develop an information system for the oncology domain in which the concepts are linked to an ontology. The authors worked from Spanish texts to create a terminology database with correspondences in English and German.

The Describe® system (Sierra et al., 2009), meanwhile, applies a Defining Contexts Extractor (Alarcón, 2009) for the search, classification, and grouping of medical definitions from the web.

Vivaldi and Rodríguez (2010) created a term extraction system that uses semantic information from Wikipedia (WP). It was tested on a medical corpus and, according to its results, WP was considered a good resource for medical term extraction tasks.

Castro et al. (2010) present a semantic annotation of clinical notes and an application of an automatic tool for medical concept recognition based on the SNOMED CT® ontology. Furthermore, the tool was tested on 100 clinical notes and, according to the authors, the results are quite good.

SNOMED CT®¹ is a large medical terminology that results from the merger of SNOMED RT and the Clinical Terms Version 3, a terminology previously known as Read Codes, created by the National Health Service (NHS) in England.

3 Term extraction methodology

With the objective of identifying medical terms, we have developed rules for the recognition of “specific” NPs. They were used for extracting terms and, as a baseline, we consider the term extraction usually performed with all NPs. We applied the method to Spanish, but it may be adapted to other languages by adjusting the linguistic information of the parsers used.

¹ SNOMED CT® - http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html - “This material includes SNOMED Clinical Terms® (SNOMED CT®), which is used with permission of the International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved. SNOMED CT® was originally created by The College of American Pathologists. “SNOMED” and “SNOMED CT” are registered trademarks of the IHTSDO.”

As shown in Figure 1, the term extraction carried out in this work starts with the delimitation of the domain and the corpus. Afterwards, an orthographic normalization is performed, changing the encoding of the corpus files to UTF-8. Line breaks are also removed to prevent problems with the morphological analysis tools. Next, tokenization and morphological analysis are carried out in order to tag words and punctuation marks.
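The normalization step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tooling: the function name `normalize_text` is hypothetical, and Latin-1 is assumed as the source encoding because the paper does not say which encoding the corpus files originally used.

```python
import re

def normalize_text(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode a corpus file as UTF-8 and remove its line breaks."""
    text = raw.decode(source_encoding)  # assumed source encoding
    # Replace every run of line breaks (and the spaces around it) with one space.
    text = re.sub(r"\s*[\r\n]+\s*", " ", text).strip()
    return text.encode("utf-8")

sample = "Pruebas de provocación\nbronquial con ejercicio.\r\n".encode("latin-1")
print(normalize_text(sample).decode("utf-8"))
# Pruebas de provocación bronquial con ejercicio.
```

Joining lines with a single space keeps word boundaries intact, which matters because the tagger works on whitespace-separated tokens.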

We then developed NP recognition rules (e.g., article + noun = “NP”) to shape the NPs to be worked with. Phrase recognition allows the extraction of term candidates. At this stage, stopwords are removed from these candidates.
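A rule such as article + noun = NP can be read as a pattern over part-of-speech tags. The sketch below illustrates that reading in Python; it is not MPS rule syntax. The rule set `NP_RULES`, the function `chunk_nps`, and the tag `art` are assumptions made for the example, while `nom`, `adj`, and `prde` follow the SMORPH tag names used in this paper.

```python
# Each rule is a sequence of POS tags that together form one NP (hypothetical rule set).
NP_RULES = [
    ("art", "nom", "adj"),
    ("nom", "prde", "nom"),
    ("art", "nom"),
    ("nom", "adj"),
    ("nom",),
]

def chunk_nps(tagged):
    """Scan (word, tag) pairs left to right and return the word sequences matched as NPs."""
    rules = sorted(NP_RULES, key=len, reverse=True)  # try longer rules first
    tags = [tag for _, tag in tagged]
    chunks, i = [], 0
    while i < len(tagged):
        for rule in rules:
            if tuple(tags[i:i + len(rule)]) == rule:
                chunks.append(" ".join(word for word, _ in tagged[i:i + len(rule)]))
                i += len(rule)
                break
        else:
            i += 1  # no rule starts at this token; move on
    return chunks

tagged = [("el", "art"), ("mecanismo", "nom"), ("inmunológico", "adj"),
          ("es", "v"), ("complejo", "adj")]
print(chunk_nps(tagged))  # ['el mecanismo inmunológico']
```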

After the candidates are cleaned, they are separated into lists of unigrams, bigrams, trigrams, and candidates longer than trigrams to allow evaluation.

3.1 Experiments

For the experiments, we used the IULA-UPF technical corpus², which belongs to the health and medical domains. This corpus is composed of 12 texts in Spanish, with an average of 8,207 words per document. Along with it, IULA-UPF also provided three reference term lists, containing a total of 697 unigrams (e.g. “alergia” - allergy), 665 bigrams consisting of a noun plus an adjective (e.g. “ácido benzoico” - benzoic acid), and 82 trigrams formed by a noun plus the preposition “de” plus another noun (e.g. “grupo de riesgo” - risk group).

From the corpus, we had to recognize noun phrases (NPs), prepositional phrases (PPs), and nucleus verbal phrases (nvp).

The term extraction is detailed in Figure 1. The morphological analysis of the corpus words was carried out using the SMORPH program (Aït-Mokhtar, 1998), a finite-state part-of-speech tagger that the Infosur³ Group has adapted to Spanish. As an example, for the fragment “Pruebas de provocación bronquial con ejercicio y con histamina en niños asmáticos.” (Bronchial provocation tests with exercise and with histamine in asthmatic children.), the SMORPH⁴ output is shown in Table 1. A total of 2,445 words of this corpus were not identified by the parser. They were therefore manually analyzed and added to the original dictionary of the program.

[Figure 1 diagram. Labels shown: Corpus; Orthographic normalization; Standard texts; Tokenization and morphological analysis; Tagged text; Term extraction; Unknown words of the dictionary; Extracted term candidates; Cleaning of candidates and application of stemming; 1-gram, 2-gram, and 3-gram lists; Reference term list; SNOMED CT®; Automatic evaluation; Manual identification of interesting candidates for the domain; Extraction; Evaluation.]

Figure 1: Term extraction and evaluation methodology.

‘Pruebas’.
    [ ‘prueba’, ‘EMS’,‘nom’, ‘GEN’,‘fem’, ‘NUM’,‘pl’].
    [ ‘probar’, ‘EMS’,‘v’, ‘EMS’,‘ind’, ‘PERS’,‘2a’, ‘NUM’,‘sg’, ‘TPO’,‘pres’, ‘TR’,‘irr’, ‘TC’,‘c1’, ‘TDIAL’,‘est’].
‘de’.
    [ ‘de’, ‘EMS’,‘prde’].
‘provocación’.
    [ ‘provocación’, ‘EMS’,‘nom’, ‘GEN’,‘fem’, ‘NUM’,‘sg’].
‘bronquial’.
    [ ‘bronquial’, ‘EMS’,‘adj’, ‘GEN’,‘ ’, ‘NUM’,‘sg’].
‘con’.
    [ ‘con’, ‘EMS’,‘prep’].
‘ejercicio’.
    [ ‘ejercicio’, ‘EMS’,‘nom’, ‘GEN’,‘masc’, ‘NUM’,‘sg’].
‘y’.
    [ ‘y’, ‘EMS’,‘cop’].
‘con’.
    [ ‘con’, ‘EMS’,‘prep’].
‘histamina’.
    [ ‘histamina’, ‘EMS’,‘nom’, ‘GEN’,‘fem’, ‘NUM’,‘sg’].
‘en’.
    [ ‘en’, ‘EMS’,‘prep’].
‘niños’.
    [ ‘niño’, ‘EMS’,‘nom’, ‘GEN’,‘masc’, ‘NUM’,‘pl’].
‘asmáticos’.
    [ ‘asmático’, ‘EMS’,‘adj’, ‘GEN’,‘masc’, ‘NUM’,‘pl’].
‘.’
    [ ‘linsig’, ‘EMS’,‘pun’].

Table 1: Morphological analysis SMORPH.

² IULA-UPF technical corpus - “Data belonging to the TECHNICAL CORPUS from Institut Universitari de Lingüística Aplicada de la Universitat Pompeu Fabra (http://bwananet.iula.upf.edu/) in December 2010.”

³ Infosur - http://www.infosurrevista.com.ar

⁴ References: EMS: morphosyntactic tag; nom: noun; GEN: gender; fem: feminine; NUM: number; PL: plural; v: verb; ind: indicative; PERS: person; 2a: second; TPO: tense; pres: present; TR: type of regularity; irr: irregular; TC: type of conjugation; c1: first conjugation; TDIAL: type of dialectal variety; est: standard; prde: preposition “de”; prep: preposition; masc: masculine; cop: copulative; sg: singular; linsing: next line; pun: dot.

Next, noun phrase recognition rules were developed. These rules are loaded into the MPS syntactic parser (Abbaci, 1999), which receives the SMORPH output as input.

Three different experiments were performed considering the noun phrase subclassification.

For the first experiment (Exp. NP), all expressions previously tagged as NPs were considered term candidates. For the second one (Exp. S_NP), after manual observation of the terms, some NPs that could be relevant were subclassified. This subclassification considered the possibility that:

• the NP could be a verbal argument (NP_VARG): “detectó la bronconeumonía” (he detected bronchopneumonia). For this case, the rule corresponding to the structure NP + svn = NP_VARG was created.

• the NP could be an antecedent of a non-defining clause (NP_NONDEF): “el asma, que se traduce...” (asthma, which means...). Here we used several rules, for example NP + comma + relative + svn = NP_NONDEF. Rules for non-defining clause recognition were created. For this work, we only considered the expression from the NP antecedent up to the verb of the clause.

• the NP could be an item of an enumeration (NP_ENUM): “dolor de garganta, fiebre y tos” (sore throat, fever, and cough). An example of an enumeration rule is NP + comma + NP + conjunction + NP = NOM_COMP_ENUM (nominal complete enumeration).

• the NP could be in parentheses (NP_PARENT): “(fenoterol)”. The rule corresponding to the structure parentheses + NP + parentheses = NP_PARENT was created.

• the NP could be at the beginning of the clause (NP_INIC): “...en los últimos años. El mecanismo inmunológico es...” (...in recent years. The immunological mechanism is...). In this case, the endpoint of the previous sentence was used to build the rule: endpoint + NP = NP_INIC. An NP that appears at the beginning of a clause was regarded as a candidate because an NP in this position could be the subject or a topicalized element. This rule assumes that subjects and topicalized elements are relevant to terminology extraction.

• the NP could be an argument of a prepositional phrase (PP) at the beginning of the clause (NP_PPINIC): “...infección bacteriana. Para el diagnóstico...” (...bacterial infection. For diagnosis...). As in the previous case, the endpoint of the sentence was considered: endpoint + preposition + NP = NP_PPINIC.

In the third experiment (Exp. S_NP2), we used the subclassification of Exp. S_NP and added the NPs that are PP arguments: “en estudios epidemiológicos” (in epidemiological studies).
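The subclassification rules listed above can be pictured as patterns over a sequence of already recognized chunks. The following sketch is a hypothetical Python rendering, not the MPS rule format: the symbol names (`endpoint`, `comma`, `relative`, `conjunction`, `parentheses`, `preposition`), the `RULES` table, and the `classify` function are assumptions chosen for illustration, while the labels on the right-hand side are the ones given in the paper.

```python
# (pattern over chunk symbols, resulting label); longer patterns come first.
RULES = [
    (("NP", "comma", "NP", "conjunction", "NP"), "NOM_COMP_ENUM"),
    (("NP", "comma", "relative", "svn"), "NP_NONDEF"),
    (("endpoint", "preposition", "NP"), "NP_PPINIC"),
    (("parentheses", "NP", "parentheses"), "NP_PARENT"),
    (("endpoint", "NP"), "NP_INIC"),
    (("NP", "svn"), "NP_VARG"),
]

def classify(chunks):
    """chunks: list of (symbol, text). Return (label, [NP texts]) for each position where a rule matches."""
    symbols = [symbol for symbol, _ in chunks]
    found = []
    for i in range(len(chunks)):
        for pattern, label in RULES:
            if tuple(symbols[i:i + len(pattern)]) == pattern:
                window = chunks[i:i + len(pattern)]
                found.append((label, [text for symbol, text in window if symbol == "NP"]))
                break  # at most one rule per starting position
    return found

chunks = [("endpoint", "."), ("NP", "el asma"), ("comma", ","),
          ("relative", "que"), ("svn", "se traduce")]
print(classify(chunks))
# [('NP_INIC', ['el asma']), ('NP_NONDEF', ['el asma'])]
```

Because matching restarts at every position, one NP can receive more than one subclass, as in the example. For the third experiment, a further pattern such as `("preposition", "NP")` could be appended; the paper does not name a label for it.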

In all experiments, the extracted terms were cleaned in order to remove numerals. This cleaning also consists of discarding candidates composed of only one letter, stopwords at the extremities of the candidates, and candidates that fully correspond to stopwords. We used the stoplist available from the Snowball Project⁵, to which we added the conjugated forms of the verbs poder and deber and some words such as año (year), días (days), algún (any), etc., totaling 733 stopwords.
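The cleaning just described can be sketched as below. The details are assumptions made for the example: `STOPWORDS` is a tiny stand-in for the 733-word list, candidates are lowercased before comparison, and the regular expression used to recognize numerals is a guess at what counts as a numeral.

```python
import re

# Tiny illustrative stand-in for the 733-word stoplist (assumption).
STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "con", "se", "año", "días", "algún"}

def clean_candidate(candidate, stopwords=STOPWORDS):
    """Return the cleaned candidate, or None if it must be discarded."""
    # Drop numerals such as "636" or "8,207".
    tokens = [t for t in candidate.lower().split()
              if not re.fullmatch(r"\d+([.,]\d+)*", t)]
    # Trim stopwords from both extremities.
    while tokens and tokens[0] in stopwords:
        tokens.pop(0)
    while tokens and tokens[-1] in stopwords:
        tokens.pop()
    if not tokens:
        return None  # only stopwords and/or numerals
    if len(tokens) == 1 and len(tokens[0]) == 1:
        return None  # single-letter candidate
    return " ".join(tokens)

for raw in ["los 636 asmáticos", "de la", "el mecanismo inmunológico"]:
    print(repr(raw), "->", clean_candidate(raw))
```

Stopwords inside a candidate are kept on purpose, so that a trigram such as “grupo de riesgo” survives the cleaning.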

Also, in the case of NP_VERB, the right extremities svn were removed. For example, in the NP_VERB “se detectan 636 asmáticos” (636 asthmatics were detected), after removing “se detectan” and cleaning this example, the candidate was reduced to “asmáticos” (asthmatics).

Subsequently, in order to allow further evaluation, term candidates were separated into lists of unigrams, bigrams, and trigrams.
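Separating candidates by length amounts to counting whitespace-separated words. A minimal sketch follows, in which the function name `split_by_length` and the extra `longer` bucket are illustrative choices rather than part of the described system:

```python
from collections import defaultdict

def split_by_length(candidates):
    """Group candidate terms into unigram, bigram, trigram, and longer lists."""
    names = {1: "unigrams", 2: "bigrams", 3: "trigrams"}
    lists = defaultdict(list)
    for candidate in candidates:
        n = len(candidate.split())
        lists[names.get(n, "longer")].append(candidate)
    return dict(lists)

print(split_by_length(["alergia", "ácido benzoico", "grupo de riesgo"]))
# {'unigrams': ['alergia'], 'bigrams': ['ácido benzoico'], 'trigrams': ['grupo de riesgo']}
```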

⁵ Snowball Project - http://snowball.tartarus.org/algorithms/spanish/stop.txt

3.2 Results and evaluation of experiments

The number of extracted candidates is shown in Table 2.

                    Unigrams   Bigrams   Trigrams
Experiment NP           1744      2684       1999
Experiment S_NP          856      1172        824
Experiment S_NP2        1188      1913       1419

Table 2: Number of extracted candidates.

Two automatic tests were carried out (Figure 1). In the first one, the IULA reference lists were used to verify the quality of the extracted candidates.

First of all, it was necessary to apply stemming (with the PreTexT II tool (Soares et al., 2008)) to the extracted terms and to the reference term lists, due to morphological variation in the words. It was then possible to compare the extracted terms with the reference term lists.
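The comparison after stemming can be sketched as set operations. Two assumptions are made loudly here. First, `toy_stem` is a crude plural stripper standing in for PreTexT II, whose actual stemming is not reproduced. Second, the paper does not give formulas for accuracy and coverage, so the sketch assumes the usual precision-like and recall-like definitions: matched terms over extracted terms, and matched terms over reference terms.

```python
def toy_stem(word):
    """Very rough plural stripper; a stand-in, NOT the PreTexT II stemmer."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stem_term(term):
    return " ".join(toy_stem(w) for w in term.lower().split())

def evaluate(extracted, reference):
    """Return (accuracy, coverage) in percent, under the assumed definitions."""
    ext = {stem_term(t) for t in extracted}
    ref = {stem_term(t) for t in reference}
    hits = ext & ref
    accuracy = 100 * len(hits) / len(ext) if ext else 0.0
    coverage = 100 * len(hits) / len(ref) if ref else 0.0
    return accuracy, coverage

extracted = ["alergias", "asmáticos", "peso", "visita"]
reference = ["alergia", "asmático", "histamina"]
accuracy, coverage = evaluate(extracted, reference)
print(round(accuracy, 1), round(coverage, 1))  # 50.0 66.7
```

Stemming both sides before intersecting is what lets “alergias” match the reference entry “alergia”.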

The accuracy and coverage for all three experiments (NP, S_NP, and S_NP2) are shown in Figures 2, 3, and 4 for unigrams, bigrams, and trigrams, respectively. The figures are adapted from Vivaldi and Rodríguez (2010), who used the same corpus in their experiments, so we also present a comparison between our results and theirs. In their work, EWN corresponds to the group of terms extracted using the YATE method (Vivaldi, 2001). The other terms were extracted with the Wikipedia (WP) categories, having “Medicina” as the domain name and varying the calculation of the domain coefficient: WP.lc considers the number of single steps in Wikipedia; WP.lmc takes into consideration the mean number of paths in Wikipedia; and WP.nc takes into consideration the number of paths in Wikipedia. It is important to notice that the extraction proposal of Vivaldi and Rodríguez only considered patterns with the following structures: (i) noun (for unigrams), (ii) noun + adjective (for bigrams), and (iii) noun + the preposition “de” + noun (for trigrams). This contrasts sharply with our extraction, which considers all possible combinations.

[Figure 2 plot, panel (a) Unigrams. Axes: Accuracy and Coverage, each from 0 to 100. Legend: NP, S_NP, S_NP2, EWN, WP.lmc, WP.lc, WP.nc. Labeled points: (34 ; 88), (33 ; 58), (34 ; 44).]

Figure 2: Accuracy and coverage values obtained for unigrams.

[Figure 3 plot, panel (a) Bigrams. Axes: Accuracy and Coverage, each from 0 to 100. Legend: NP, S_NP, S_NP2, EWN, WP.lmc, WP.lc, WP.nc.]

Figure 3: Accuracy and coverage values obtained for bigrams.

[Figure 4 plot, panel (a) Trigrams. Axes: Accuracy and Coverage, each from 0 to 100. Legend: NP, S_NP, S_NP2, EWN, WP.lmc, WP.lc, WP.nc.]

Figure 4: Accuracy and coverage values obtained for trigrams.

For the second test, the quality of the candidates was verified against the SNOMED CT® list, which has 1,060,632 Spanish terms. The candidates that could be interesting for the medical domain were manually identified and, afterwards, we checked whether those candidates were present in the SNOMED CT® list. The verification was done separately for each experiment (Exp. NP, Exp. S_NP, and Exp. S_NP2), and the results were separated into unigrams, bigrams, and trigrams. The candidates that could represent terms according to the SNOMED CT® list are shown in Figure 5.
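The second test reduces to a membership check of manually selected candidates against a term list. In the sketch below, `snomed_terms` is a two-entry in-memory stand-in for the SNOMED CT® Spanish list, and the case-insensitive exact match is an assumption, since the paper does not describe how candidates were matched against the list.

```python
def split_by_term_list(candidates, term_list):
    """Partition candidates into those present in the term list and those absent."""
    known = {t.strip().lower() for t in term_list}
    present = [c for c in candidates if c.strip().lower() in known]
    absent = [c for c in candidates if c.strip().lower() not in known]
    return present, absent

snomed_terms = ["enfermedad crónica", "peso corporal"]  # stand-in list (assumption)
candidates = ["Enfermedad crónica", "paciente asmático atópico"]
present, absent = split_by_term_list(candidates, snomed_terms)
print(present)  # ['Enfermedad crónica']
print(absent)   # ['paciente asmático atópico']
```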

[Figure 5 diagram. Legend: Exp. SN, Exp. S_SN, Exp. S_SN2. Terms shown per panel:
(a) Unigrams: anemia, peso (weight), afección (disease), estimulante (stimulant), sistema (system), emergencia (emergency), visita (visit), penicilina (penicillin), espasmo (spasm), hematoma, hiperlipidemia (hyperlipidemia).
(b) Bigrams: enfermedad crónica (chronic disease), enfermedad cardiopulmonar (cardiopulmonary disease), peso corporal (body weight), cirugía torácica (thoracic surgery), teofilina anhidra (theophylline anhydrous), enfermedad venérea (sexually transmitted disease).
(c) Trigrams: infección respiratoria aguda (acute respiratory infection), ácaro del polvo (dust mite), enfermedad pulmonar crónicas (chronic lung disease).]

Figure 5: Extra terms obtained.

It is quite difficult to keep medical terminology constantly and immediately updated (Krauthammer and Nenadic, 2004). This fact motivated us to manually identify candidates that are interesting for the medical domain. These candidates were present neither in the reference lists nor in SNOMED CT®, although they seem to be important for this specific domain. Some examples are: “insuficiencia ventilatoria obstructiva” (obstructive ventilatory failure), “paciente asmático atópico” (atopic asthmatic patient), (respiratory atopic diseases), “traumatismo encéfalo craneano” (traumatic brain injury), etc.

4 Conclusions

If we compare the three experiments carried out (NP, S_NP, and S_NP2), only small variations in accuracy are found for unigrams, bigrams, and trigrams, although the coverage varies in each case. We obtained the best coverage in the first experiment, in which we took all NPs as term candidates. We expected this result, because the largest number of candidates is obtained when all NPs are extracted, which allows for large coverage. However, we expected better accuracy rates for the cases with “specific NPs”.

In the comparison, we may see that the results obtained were similar to those of Vivaldi and Rodríguez in the case of unigrams, although they obtained better results for bigrams and trigrams. In this regard, we observed that the best accuracy rate was achieved in the experiments in which the NPs were part of an enumeration. Also, we emphasize the simplicity of our extraction method, which does not require external knowledge and worked well using only the SMORPH dictionary and the MPS recognition rules, and which considers all possible patterns rather than only the reference list patterns. In addition, better accuracy is expected from new and more specific MPS rules.

According to the results, we obtained three interesting contributions: (i) we showed that it is possible to extract medical terms by recognizing “specific NPs”, even though the method still needs improvement; (ii) the SMORPH dictionary was improved with 2,445 new terms, so we expect better results in future experiments in the medical domain with this tool; (iii) other terms that were not present in the reference lists were also extracted. Those terms were checked against SNOMED CT®, and we obtained terms that could be added to the IULA reference lists, which would improve these lists. At the same time, we observed other terms with a structure different from “noun + the preposition ‘de’ + noun”. This shows that there exist important trigrams that do not necessarily fit that pattern.

As future work, we intend to improve the accuracy with new filtering rules, to enlarge the SMORPH dictionary, and to test the extraction rules on larger corpora and in other domains.

Acknowledgments

Thanks to Erasmus Mundus, CNPq, FAPESP, and CONICET for financial support and to Vivaldi and Rodríguez for making the dataset available.

References

F Abbaci. 1999. Développement du module post-smorph. In Memoria del DEA de Linguistique et Informatique. Groupe de Recherche dans les Industries de la Langue - Universidad Blaise-Pascal - Clermont-Ferrand.

Rodrigo Alarcón. 2009. Extracción automática de contextos definitorios en corpus especializados. Ph.D. thesis, Universidad Pompeu Fabra, Barcelona.

S Aït-Mokhtar. 1998. L’analyse présyntaxique en une seule étape. Ph.D. thesis, Groupe de Recherche dans les Industries de la Langue - Universidad Blaise-Pascal - Clermont-Ferrand.

Carme Bach, Roser Saurí Colomer, Jordi Vivaldi, and M. Teresa Cabré Castellví. 1997. El corpus de l’IULA: descripció. Technical Report 17, Universitat Pompeu Fabra – Institut Universitari de Lingüística Aplicada, Barcelona - Spain.

Marie-Noëlle Bessagnet, Eric Kergosien, and Mauro Gaio. 2010. Extraction de termes, reconnaissance et labellisation de relations dans un thésaurus. CoRR, abs/1002.0215.

Elena Castro, Ana Iglesias, Paloma Martínez, and Leonardo Castaño. 2010. Automatic identification of biomedical concepts in Spanish-language unstructured clinical texts. In Proceedings of the 1st ACM International Health Informatics Symposium, pages 751–757, New York, NY, USA. ACM.

Li Hao-Min, Ying Li, Hui-Long Duan, and Xu-Dong Lv. 2008. Term extraction and negation detection method in Chinese clinical document. Chinese Journal of Biomedical Engineering, 27(5).

Michael Krauthammer and Goran Nenadic. 2004. Term identification in the biomedical literature. J. of Biomedical Informatics, 37:512–526, December.

Lucelene Lopes, Renata Vieira, Maria Finatto, Daniel Martins, Adriano Zanette, and Luiz Ribeiro Jr. 2009. Extração automática de termos compostos para construção de ontologias: um experimento na área da saúde - doi: 10.3395/reciis.v3i1.244pt. Revista Eletrônica de Comunicação, Informação e Inovação em Saúde, 3(1).

Clara Inés López Rodríguez, Maribel Tercedor, and Pamela Faber. 2006. Gestión terminológica basada en el conocimiento y generación de recursos de información sobre el cáncer: el proyecto Oncoterm. Revista E Salud, 2(8).

A. Moreno-Sandoval. 2009. Terminología y Sociedad del conocimiento. pages 99–116. Peter Lang.

Aurélie Névéol and Sylwia Ozdowska. 2005. Extraction bilingue de termes médicaux dans un corpus parallèle anglais/français. In EGC, pages 655–666.

Gerardo Sierra, Rodrigo Alarcón, Alejandro Molina, and Edwin Aldana. 2009. Web exploitation for Definition extraction. In Proceedings of the 2009 Latin American Web Congress, pages 217–223, Washington, DC, USA. IEEE Computer Society.

M. V. B. Soares, R. C. Prati, and M. C. Monard. 2008. Pretext II: Descrição da reestruturação da ferramenta de pré-processamento de textos. Technical Report 333, Instituto de Ciências Matemáticas e de Computação (ICMC) - USP - São Carlos, São Carlos - SP.

Jorge Vivaldi and Horacio Rodríguez. 2010. Using Wikipedia for term extraction in the biomedical domain: first experiences. Procesamiento del Lenguaje Natural, 45:251–254.

Jorge Vivaldi. 2001. Extracción de candidatos a término mediante combinación de estrategias heterogéneas. Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.
