Corpus-based Terminology Extraction applied to Information Access
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
NLP Group, Dpto. Lenguajes y Sistemas Informáticos,
UNED, Spain
Corpus Linguistics 2001, Lancaster, UK
Content
Introduction
Resources, Tools and Corpora
Terminology Extraction (TE)
Evaluation of the TE procedure
Terminology-based Information Access
Conclusions
Introduction: Framework
The European Treasury Browser (ETB) project
• Web site of Educational Resources (primary and secondary school)
• Context of New Technologies
• Objective: to build the structures to organise and retrieve educational resources
Similar systems
• The Educational Resources Information Centre
• The British Education Index
Introduction: use of Thesauri
Thesauri
Definition: controlled vocabulary, structured in relations
Structure: descriptors and relations (NT, BT, RT)
Existing educational thesauri
• Don’t cover primary and secondary school vocabulary within the new technologies context
Construction of a multilingual thesaurus is needed for the ETB project purposes
Terminology Lists
Objectives of the work
To build the Spanish list of candidate terms for the ETB multilingual thesaurus.
To develop a general procedure to obtain terminology lists
• In an automatic way
• Independently of the application domain
To explore effective ways of Information Retrieval
• using the terminology lists instead of a thesaurus
• to bridge the gap between users’ and collection languages
Resources and Tools
Resources
• Semantic network: EuroWordNet
• Monolingual dictionary (VOX)
• Bilingual dictionary (VOX)
Tools
• Tokeniser
• Morphological analyser
• POS tagger
• Shallow parser (based on syntactic patterns)
Corpora
Corpus of educational resources
1,075 documents (670,646 words) from
– Programa de Nuevas Tecnologías (http://www.pntic.mec.es/main_recursos.html)
– Aldea Global (http://sauce.pntic.mec.es/~alglobal)
Corpus of international news
7,364 documents (2.9 million words)
– (http://www.elpais.es/internac)
Pre-processing
(HTML tag treatment, language detection, detection of repeated pages and chunks, etc.)
Terminology Extraction (TE)
Terminology List:
List of mono-lexical and poly-lexical terms that are commonly used in a specific domain
Steps of Terminology Extraction
1. Term detection
2. Term weighting
3. Term selection
1. Term Detection (mono-lexical)
(Over both corpora, Educational Resources and International News)
Processing
• Tokenising
• Lemmatising, tagging
• Removal of erroneous strings, abbreviations and words from other languages
• Extraction of nouns, verbs and adjectives
Result
List of candidate lemmas, each with its:
• Term frequency (any form) in both collections
• Document frequency in both collections
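The two statistics kept for each candidate lemma can be sketched as follows. This is a minimal illustration, assuming each document has already been tokenised, lemmatised and filtered into a list of lemmas; the function name `count_frequencies` and the toy documents are hypothetical and not part of the original system.

```python
from collections import Counter

def count_frequencies(documents):
    """Return (term_frequency, document_frequency) over lists of lemmas."""
    term_freq = Counter()
    doc_freq = Counter()
    for lemmas in documents:
        term_freq.update(lemmas)       # every occurrence counts
        doc_freq.update(set(lemmas))   # each document counts once per lemma
    return term_freq, doc_freq

# Toy collection (illustrative data only)
docs = [
    ["alumno", "aprender", "alumno"],
    ["profesor", "alumno"],
]
tf, df = count_frequencies(docs)
print(tf["alumno"], df["alumno"])
```

Running the same function over each corpus separately would give the per-collection figures listed above.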
1. Term Detection (poly-lexical)
(Over Educational Resources corpus)
Processing
• Tokenising, lemmatising, tagging
• Shallow parsing (syntactic pattern recognition)
Result
List of candidate terminological phrases, each with its:
• Term frequency in the collection
• Document frequency in the collection
... como/CS en/Prep la/Art educación/N a/Prep distancia/N ,/Punc el/Art ministerio/N ...
Pattern: N Prep N
Detected term: educación a distancia
Syntactic Patterns for Spanish terminological phrases
N N
N A
N [A] Prep N [A]
N [A] Prep Art N [A]
N [A] Prep V
N [A] Prep V N [A]
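Pattern recognition over a tagged sentence can be sketched as below. This is only an illustration, not the shallow parser used in the project: it assumes the `word/TAG` notation of the example above, treats bracketed tags as optional, and reads the last line of the pattern slide as two separate patterns. The names `PATTERNS`, `parse`, `match_ends` and `detect` are hypothetical.

```python
PATTERNS = [
    "N N", "N A",
    "N [A] Prep N [A]", "N [A] Prep Art N [A]",
    "N [A] Prep V", "N [A] Prep V N [A]",   # assumed split of the last slide line
]

def parse(pattern):
    """Turn 'N [A] Prep N' into [(tag, is_optional), ...]."""
    return [(p.strip("[]"), p.startswith("[")) for p in pattern.split()]

def match_ends(tags, start, pattern):
    """Yield every end index at which `pattern` matches `tags` from `start`."""
    if not pattern:
        yield start
        return
    (tag, optional), rest = pattern[0], pattern[1:]
    if start < len(tags) and tags[start] == tag:
        yield from match_ends(tags, start + 1, rest)   # consume one token
    if optional:
        yield from match_ends(tags, start, rest)       # skip the optional tag

def detect(tagged):
    """Return the set of word sequences matching any pattern."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    found = set()
    for pattern in map(parse, PATTERNS):
        for start in range(len(tags)):
            for end in match_ends(tags, start, pattern):
                found.add(" ".join(words[start:end]))
    return found

sentence = ("como/CS en/Prep la/Art educación/N a/Prep distancia/N "
            ",/Punc el/Art ministerio/N")
tagged = [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]
terms = detect(tagged)
print(terms)
```

On the example sentence only the pattern N Prep N fires, so the single detected phrase is "educación a distancia".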
2. Term weighting
Empirical measure
• Proportional to (in the domain corpus)
– term frequency
– document frequency
• Inversely proportional to
– term frequency in other domain
• Normalisation

\[
\mathrm{Relevance}(t, sc, gc) = 1 - \frac{1}{\log_2\left(2 + \dfrac{F_{t,sc} \cdot D_{t,sc}}{F_{t,gc}}\right)}
\]

where
F_{t,sc}: relative frequency of the term t in the specific corpus sc
F_{t,gc}: relative frequency of the term t in the general corpus gc
D_{t,sc}: relative number of documents in sc where t appears
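The weighting can be sketched in a few lines, reading the measure as Relevance = 1 - 1 / log2(2 + F_sc · D_sc / F_gc). The slides do not say what happens when a term never occurs in the general corpus; returning the limiting value 1.0 in that case is an assumption made here, as is the function name `relevance`.

```python
import math

def relevance(f_sc, d_sc, f_gc):
    """Relevance of a term given its relative frequencies.

    f_sc: relative term frequency in the specific (domain) corpus
    d_sc: relative document frequency in the specific corpus
    f_gc: relative term frequency in the general corpus
    """
    numerator = f_sc * d_sc
    if numerator == 0:
        return 0.0        # log2(2 + 0) = 1, so the measure is 0
    if f_gc == 0:
        return 1.0        # assumption: limit of the measure as f_gc -> 0
    return 1.0 - 1.0 / math.log2(2.0 + numerator / f_gc)

# Toy values (illustrative only)
print(relevance(0.002, 0.5, 0.001))
print(relevance(0.002, 0.5, 0.010))
```

The measure grows with the domain term and document frequencies, shrinks as the term becomes frequent in the general corpus, and stays within the interval from 0 to 1.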
3. Term Selection
• Removal of infrequent terms in the study domain
• Removal of very frequent terms in other domains
• Ranking of terms according to their weight
• Selection of top terms in the terminology list
(thresholds to obtain 2,000 / 3,000 terms from the 75,000 detected terms)
Addition of phrases with relevant components
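The selection steps can be sketched as a filter, a sort and a cut-off. Everything concrete below is an assumption for illustration: the threshold values, the toy weights, the dictionary layout, and in particular the reading of "phrases with relevant components" as phrases containing at least one selected term.

```python
def select_terms(candidates, min_domain_freq, max_general_freq, top_n):
    """candidates: term -> {'domain_freq', 'general_freq', 'weight'}."""
    kept = [
        (term, info) for term, info in candidates.items()
        if info["domain_freq"] >= min_domain_freq      # drop infrequent terms
        and info["general_freq"] <= max_general_freq   # drop terms common elsewhere
    ]
    kept.sort(key=lambda item: item[1]["weight"], reverse=True)
    return [term for term, _ in kept[:top_n]]

def add_phrases(selected, phrases):
    """Append phrases that contain at least one selected term (assumed reading)."""
    relevant = set(selected)
    extra = [p for p in phrases
             if p not in relevant and any(w in relevant for w in p.split())]
    return selected + extra

# Toy candidates (illustrative numbers only)
candidates = {
    "currículo":  {"domain_freq": 40,  "general_freq": 1,   "weight": 0.8},
    "alumno":     {"domain_freq": 120, "general_freq": 30,  "weight": 0.6},
    "país":       {"domain_freq": 50,  "general_freq": 900, "weight": 0.1},
    "escandallo": {"domain_freq": 1,   "general_freq": 0,   "weight": 0.9},
}
selected = select_terms(candidates, min_domain_freq=5,
                        max_general_freq=500, top_n=2)
final = add_phrases(selected, ["alumno inglés", "tratado de paz"])
print(final)
```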
Evaluation: Visual exploration
Automatic generation of result pages in HTML
Purpose
• To support decisions during prototype development
• To evaluate the measures and techniques and to suggest improvements or modifications
• To give further information to documentalists in order to assist final decisions in thesaurus construction
Evaluation: Precision
Manual classification of the 2,856 selected terms

Category            Terms    %
Adequate            1235     43.24%
Specific domain     513      17.96%
Computers domain    59       2.07%
Variants            78       2.73%
Incorrect           151      5.29%
Not lexicalised     515      18.03%
Not domain          305      10.68%
Total of terms      2856     100%
66% of terms are appropriate
Examples:
• Proyecto curricular
• Ciencias sociales
• Sistema operativo
• Proyectos curriculares (Proyecto curricular)
• Profesorado materiales ¿?
• Alumnos ingleses
• Biblioteca nacional
With little effort, a large number of accurate terms are proposed to documentalists
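The 66% figure can be reproduced from the classification counts. The sketch below assumes that the counts map to the categories in the order they are listed, and that "appropriate" covers the first four categories (Adequate, Specific domain, Computers domain, Variants); that grouping is inferred from the arithmetic, not stated on the slide.

```python
counts = {
    "Adequate": 1235,
    "Specific domain": 513,
    "Computers domain": 59,
    "Variants": 78,
    "Incorrect": 151,
    "Not lexicalised": 515,
    "Not domain": 305,
}
# Assumed grouping of the categories counted as appropriate terms
APPROPRIATE = ("Adequate", "Specific domain", "Computers domain", "Variants")

total = sum(counts.values())
appropriate = sum(counts[c] for c in APPROPRIATE)
precision = appropriate / total
print(f"{appropriate}/{total} = {precision:.1%}")  # 1885/2856 = 66.0%
```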
Evaluation: Precision
[Figure: precision against the number of selected candidates]
Precision: % of selected terms which are appropriate terms
Higher precision at the top of the ranking
With a lower number of candidates, the precision increases
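Precision at a given cut-off of the ranking can be computed as in this minimal sketch. The list of judgements is toy data, and the name `precision_at` is hypothetical; in the evaluation above each judgement would come from the manual classification of a ranked term.

```python
def precision_at(judgements, k):
    """Fraction of appropriate terms among the top k of the ranking.

    judgements: booleans in ranking order (True = appropriate term).
    """
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

# Toy ranking (illustrative only)
ranked = [True, True, False, True, False, False]
for k in (2, 4, 6):
    print(k, precision_at(ranked, k))
```

With this toy ranking, precision falls from 1.0 to 0.75 to 0.5 as the cut-off moves down the list, which is the behaviour described for the real ranking.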
Terminology-based Information Access
Terminology Extraction in Information Retrieval provides:
At Indexing: to add poly-lexical terms to the indexes without the explosion of n-grams
Term browsing: to navigate through the terminology and access the documents from the terms (without the use of thesauri)
Terminology-based Information Access
A difference with TE: terminology list truncation
(since the query supplies the relevant terms, the task is now concerned with recall rather than precision of terms)
A new task: to retrieve terminology
• Poly-lexical terms are retrieved from mono-lexical ones
Indexing levels: Lemma, Phrase, Document
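The three indexing levels can be sketched as two mappings: lemma to phrases, and phrase to documents. This is a minimal illustration on toy data; ranking phrases by the number of query lemmas they contain is an assumption, and the names `build_lemma_index` and `retrieve_phrases` are hypothetical.

```python
from collections import defaultdict

def build_lemma_index(phrase_docs):
    """Map each lemma to the phrases containing it.

    phrase_docs: phrase (space-separated lemmas) -> set of document ids.
    """
    lemma_to_phrases = defaultdict(set)
    for phrase in phrase_docs:
        for lemma in phrase.split():
            lemma_to_phrases[lemma].add(phrase)
    return lemma_to_phrases

def retrieve_phrases(query_lemmas, lemma_to_phrases):
    """Return phrases sharing lemmas with the query, best matches first."""
    hits = defaultdict(int)
    for lemma in set(query_lemmas):
        for phrase in lemma_to_phrases.get(lemma, ()):
            hits[phrase] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))

# Toy phrase-level index (illustrative only)
phrase_docs = {
    "educación a distancia": {1, 3},
    "educación primaria": {2},
    "sistema operativo": {4},
}
index = build_lemma_index(phrase_docs)
phrases = retrieve_phrases(["educación", "distancia"], index)
documents = phrase_docs[phrases[0]]
print(phrases, documents)
```

The user types mono-lexical terms, receives the poly-lexical terms that contain them, and reaches the documents through the chosen phrase.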
Terminology-based Information Access
Terminology retrieval
To bridge the gap between
• Collection terminology
• Query terms
Requires
• Query expansion
• Query translation
But produces noise in the retrieval
However, phrases provide an excellent way to reduce ambiguity (Ballesteros & Croft, 1998)
Terminology-based Information Access
Example: expansion and translation of the query words
Tratados
• Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
• Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
de
Prohibición
• Expansion: embargo, entredicho, interdicción, interdicto, proscripción
• Translation: ban, interdiction, prohibition, proscription
de
Pruebas
• Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
• Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
Nucleares
• Expansion: nuclear
• Translation: nuclear
Candidate phrases
• Nuclear test ban treaty?
• Nuclear fitting interdiction manage?
• Nuclear taste proscription process?
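The way phrases cut down the noise of expansion and translation can be sketched as follows: every combination of word translations is generated, and only those that match a known phrase are kept. This is an illustration under stated assumptions, not the project's method: the translation lists are a subset of those in the example, the target terminology is hypothetical toy data, and phrases are compared as unordered sets of words to sidestep word-order differences between the languages.

```python
from itertools import product

def known_phrases(translations_per_word, terminology):
    """Keep only the translation combinations that form a known phrase."""
    by_words = {frozenset(p.split()): p for p in terminology}
    found = set()
    for combo in product(*translations_per_word):
        phrase = by_words.get(frozenset(combo))
        if phrase is not None:
            found.add(phrase)
    return sorted(found)

# Subset of the candidate translations for each content word of the query
translations = [
    ["treaty", "manage", "process"],          # Tratados
    ["ban", "interdiction", "proscription"],  # Prohibición
    ["test", "fitting", "taste"],             # Pruebas
    ["nuclear"],                              # Nucleares
]
# Hypothetical terminology list of the target collection
terminology = {"nuclear test ban treaty", "operating system"}

result = known_phrases(translations, terminology)
print(result)
```

Of the 27 combinations generated here, only one corresponds to a phrase in the terminology list, so implausible candidates such as "nuclear fitting interdiction manage" are discarded.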
Conclusions
Extraction of relevant terms in Spanish for the ETB project domain (primary and secondary school / new technologies)
– Automatic process from free resources such as web pages
– Exploring contexts and statistical data via Internet
Development of a search engine based on terminology extraction
– Using terminology lists as an intermediate way between free searching and thesaurus-guided searching
– Without needing to construct a thesaurus
– Bridging the distance between the terms used in the query and the terminology used in the collection (even in different languages)
Thanks for your attention