Combining terminology resources and statistical
methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
presented by George Demetriou
Natural Language Processing Group, University of Sheffield, UK
Introduction
Combining techniques for entity recognition:
Dictionary-based term recognition
Filtering of ambiguous terms
Statistical entity recognition
How do the techniques compare: separately and in combination?
When combined, can we retain the advantages of both?
Semantic annotation of clinical text
Our basic task is semantic annotation of clinical text
For the purposes of this paper, we ignore:
Modifiers such as negation
Relations and coreference
These are the subject of other papers
Punch biopsy of skin. No lesion on the skin surface following fixation.
Entity recognition in specialist domains
Specialist domains, e.g. medicine, are rich in:
Complex terminology
Terminology resources and ontologies
We might expect these resources to be of use in entity recognition
We might expect annotation using these resources to add value to the text, providing additional information to applications
Ambiguity in term resources
Most term resources have not been designed with NLP applications in mind
When used for dictionary lookup, many suffer from problems of ambiguity
"I": Iodine, an Iodine test, or the personal pronoun
"be": bacterial endocarditis, or the root of a verb
Various techniques can overcome this:
Filtering or elimination of problematic terms
Use of context: in our case, statistical models
Corpus: the CLEF gold standard
For experiments, we used a manually annotated gold standard
Careful construction of a schema and guidelines
Double annotation with a consensus step
Measurement of Inter Annotator Agreement (IAA) (Roberts et al 2008, LREC bio text mining workshop)
For the experiments reported, we use 77 gold standard documents
Entity types
Entity type     Brief description                           Instances
Condition       Symptom, diagnosis, complication, etc.      739
Drug or device  Drug or some other prescribed item          272
Intervention    Action performed by a clinician             298
Investigation   Tests, measurements and studies             325
Locus           Anatomical location, body substance, etc.   490
Total                                                       2124
[Figure: Termino architecture. External terminologies, ontologies and databases are loaded into the Termino database; matchers and annotators are compiled from it, with links back to the source resources.]
Dictionary lookup: Termino
Termino is loaded from external resources
Finite state machine (FSM) matchers are compiled from Termino
Finding entities with Termino
[Figure: GATE application pipeline. Application texts pass through linguistic pre-processing and then Termino term recognition, producing annotated texts.]
Termino loaded with selected terms from UMLS (600K terms)
Pre-processing includes tokenisation and morphological analysis
Lookup is against the roots of tokens
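Lookup of this kind can be sketched as a longest-match scan over token roots. The term list, entity types and root function below are illustrative stand-ins, not the actual Termino or UMLS contents.

```python
# Minimal sketch of dictionary lookup against token roots, in the style of
# Termino's FSM matchers. TERMS and root() are toy stand-ins.

TERMS = {
    ("punch", "biopsy"): "Investigation",
    ("lesion",): "Condition",
    ("skin",): "Locus",
}
MAX_LEN = max(len(t) for t in TERMS)

def root(token):
    """Toy morphological root: lower-case and strip a plural 's'."""
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def lookup(tokens):
    """Longest-match dictionary lookup over token roots."""
    roots = [root(t) for t in tokens]
    i, matches = 0, []
    while i < len(roots):
        for n in range(min(MAX_LEN, len(roots) - i), 0, -1):
            key = tuple(roots[i:i + n])
            if key in TERMS:
                matches.append((i, i + n, TERMS[key]))
                i += n
                break
        else:
            i += 1
    return matches

print(lookup("Punch biopsy of the skin showed no lesions".split()))
# → [(0, 2, 'Investigation'), (4, 5, 'Locus'), (7, 8, 'Condition')]
```

Matching on roots rather than surface forms lets "lesions" hit the dictionary entry for "lesion".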
Filtering problematic terms
Many UMLS terms are not suitable for NLP
Ambiguity with common general language words
To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results
A supplementary list of missing terms was compiled by domain experts (6 terms)
Creation of these lists took a couple of hours
Creating the filter list
1. Add all unique terms of 1 character to the list
2. For all unique terms of <= 6 characters:
i. Add to the list if it matches a common general language word or abbreviation
ii. Add to the list if it has a numeric component
iii. Reject from the list if it is an obvious technical term
iv. Reject from the list if none of the above apply
3. Filter list size: 232 terms
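The mechanical parts of these rules can be sketched as below; steps 2i and 2iii were manual judgments in the paper, so the common-word list here is a hypothetical stand-in.

```python
# Sketch of the filter-list heuristics. COMMON_WORDS is illustrative; the
# paper's list was built by manual inspection of development-corpus output.

COMMON_WORDS = {"i", "be", "all", "was", "deep"}  # hypothetical examples

def should_filter(term):
    """Return True if a dictionary term should go on the filter list."""
    if len(term) == 1:                       # rule 1: single characters
        return True
    if len(term) <= 6:                       # rule 2: short terms only
        if term.lower() in COMMON_WORDS:     # 2i: common general-language word
            return True
        if any(c.isdigit() for c in term):   # 2ii: numeric component
            return True
    return False                             # 2iii/iv: keep technical terms

print([t for t in ["I", "be", "T4", "aorta", "stenosis"] if should_filter(t)])
# → ['I', 'be', 'T4']
```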
Entities found by Termino
        UMLS     UMLS+filter  UMLS+filter+supplementary  IAA
P       0.2458   0.5224       0.5238
R       0.6999   0.6939       0.7042
F1      0.3638   0.5961       0.6008                     0.7373
UMLS alone gives poor precision, due to term ambiguity with general language words
Adding in the filter list improves precision with little loss in recall
Statistical entity recognition
Statistical entity recognition allows us to model context
We use an SVM implementation provided with GATE
Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE
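The one-vs-rest scheme can be sketched as: one binary model per entity type, with the highest-scoring model winning at application time. The "models" below are stand-in scoring functions, not trained SVMs, and the feature strings are hypothetical.

```python
# Sketch of one-vs-rest mapping of a multi-class task to binary classifiers.
# Stand-in scorers replace real SVMs; GATE handles this internally.

def train_one_vs_rest(examples):
    """examples: list of (feature_set, label). One binary 'model' per label."""
    labels = {label for _, label in examples}
    models = {}
    for label in labels:
        pos = [f for f, l in examples if l == label]
        # Stand-in binary scorer: feature overlap with positive examples.
        models[label] = lambda feats, pos=pos: max(len(feats & p) for p in pos)
    return models

def predict(models, feats):
    """Apply every binary model; the label with the highest score wins."""
    return max(models, key=lambda label: models[label](feats))

examples = [
    ({"root=stenosis", "orth=lower"}, "Condition"),
    ({"root=aorta", "orth=lower"}, "Locus"),
]
models = train_one_vs_rest(examples)
print(predict(models, {"root=stenosis", "orth=lower"}))
# → Condition
```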
Features for machine learning
Token kind (e.g. number, word)
Orthographic type (e.g. lower case, upper case)
Morphological root
Affix
Generalised part of speech (the first two characters of the Penn Treebank tag)
Termino recognised terms
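The feature set above can be sketched per token as below. The helpers are simplified stand-ins for GATE's tokeniser, morphological analyser and POS tagger; the Termino flag marks tokens inside a dictionary-recognised term.

```python
# Toy per-token feature extraction mirroring the list above.

def token_features(token, pos_tag, in_termino_term):
    kind = "number" if token.isdigit() else "word"   # token kind
    if token.isupper():                              # orthographic type
        orth = "upper"
    elif token[0].isupper():
        orth = "initcap"
    else:
        orth = "lower"
    root = token.lower().rstrip("s")                 # toy morphological root
    return {
        "kind": kind,
        "orth": orth,
        "root": root,
        "suffix": token[-3:].lower(),                # affix feature
        "pos": pos_tag[:2],                          # generalised POS
        "termino": in_termino_term,                  # Termino-recognised term
    }

print(token_features("Lesions", "NNS", True))
```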
Finding entities: ML
[Figure: GATE training pipeline: gold standard texts (human annotated) pass through linguistic processing and term model learning, producing a statistical model of text. GATE application pipeline: application texts pass through linguistic processing and term model application, producing annotated texts.]
Finding entities: ML + Termino
[Figure: GATE training and application pipelines as for ML alone, with Termino term recognition added alongside linguistic processing in both pipelines.]
Entities found by SVM
        Best UMLS  SVM+tokens  SVM+tokens+Termino  IAA
P       0.5238     0.7931      0.8065
R       0.7042     0.5417      0.6308
F1      0.6008     0.6423      0.7071              0.7373
Statistical entity recognition alone gives a higher P than dictionary lookup, but a lower R
The combined system gains from the higher R of dictionary lookup, with no loss in P
Linkage to external resources
The peritoneum contains deposits of tumour... the tumour cells are negative for desmin.
Semantic annotation allows us to link texts to existing domain resources
Giving more intelligent indexing and making additional information available to applications
Linkage to external resources
UMLS links terms to Concept Unique Identifiers (CUIs)
Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI
If the SVM finds an entity when Termino has found nothing, the entity cannot be linked to a CUI
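The linkage step can be sketched as a lookup from the matched Termino term to its candidate CUIs. The CUI values below are placeholders, not real UMLS identifiers, and ambiguous terms return several candidates.

```python
# Sketch of entity-to-CUI linkage via the underlying Termino term.
# CUI strings are placeholders, not real UMLS identifiers.

TERM_TO_CUIS = {
    "aorta": ["C0000001"],
    "cold": ["C0000002", "C0000003"],  # ambiguous: two candidate CUIs
}

def link_entity(entity_text, termino_term):
    """Return candidate CUIs; [] when the SVM found the entity alone."""
    if termino_term is None:           # SVM-only entity: no link possible
        return []
    return TERM_TO_CUIS.get(termino_term, [])

print(link_entity("aorta", "aorta"))    # → ['C0000001']
print(link_entity("swelling", None))    # → []
```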
CUIs assigned  Number of terms  % of terms
0              146              16.94
1              486              56.38
2              190              22.04
3              31               3.60
4              6                0.70
5              3                0.35
>0             716              83.06
At least one CUI can be automatically assigned to 83% of the terms in the gold standard
Some are ambiguous, and resolution is needed
Availability
Most of the software is open source and can be downloaded as part of GATE
We are currently packaging Termino for public release
We are currently preparing a UK research ethics committee application for release of the annotated gold standard
Conclusions
Dictionary lookup gives good recall but poor precision, due to term ambiguity
Much ambiguity is due to a small number of terms, which can be filtered with little loss in recall
Combining dictionary lookup with statistical models of context improves precision
A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system
Questions?
http://www.clinical-escience.org
http://www.clef-user.com