1
Representing Meaning in Unsupervised Word Sense Disambiguation
Bridget T. McInnes
5 September 2008
University of Minnesota Twin Cities
2
What is WSD?
The culture count doubled.
Sense Inventory for "culture":
Laboratory Culture
Anthropological Culture
3
Approaches to WSD
Supervised
Advantages: obtains high accuracy
Disadvantages: requires manually annotated training data for each word that needs to be disambiguated, so it does not scale
Unsupervised
Advantages: does not require manually annotated training data
Disadvantages: generally does not obtain as high an accuracy as supervised approaches
4
Unsupervised Approaches
Similarity and Relatedness Based
Patwardhan, Banerjee and Pedersen, 2005
Pedersen, et al 2006
Budanitsky and Hirst, 2006
Vector-based
Mohammad and Hirst, 2006
Patwardhan, 2003
Pedersen, et al 2006
Humphrey, et al 2006
Clustering
Pedersen and Bruce, 1997
Schütze, 1998
Pedersen and Bruce, 1998
Purandare and Pedersen, 2004
Kulkarni and Pedersen, 2005
10
Road Map
Previous Approaches
Our vector approach
Future Work
11
Previous Approaches
Similarity and Relatedness Based
SenseRelate (Banerjee and Pedersen, 2003)
Vector-based
Semantic Type Indexing (Humphrey et al 2006)
Clustering
SenseClusters (Kulkarni and Pedersen, 2005)
12
Banerjee and Pedersen 2003
SenseRelate
13
SenseRelate
Target Word: Transport
Concept 1: Biological Transport (C0005528)
Concept 2: Patient Transport (C0150390)
Transport of glutathione S-linked conjugates.
glutathione S-linked conjugates.
C0017817 C0522529 C0301869
C0005528 = SS + SS + SS = Total SS for Concept 1
C0150390 = SS + SS + SS = Total SS for Concept 2
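The scoring shown on this slide can be sketched as follows. This is a minimal illustration, not the SenseRelate implementation: the `SIMILARITY` table and its values are made up, the helper names are hypothetical, and it assumes "SS" is a pairwise score between a candidate concept and each concept found in the context.

```python
# Hypothetical pairwise scores between candidate and context concepts.
# The numbers are invented for illustration only.
SIMILARITY = {
    ("C0005528", "C0017817"): 0.6,
    ("C0005528", "C0522529"): 0.4,
    ("C0005528", "C0301869"): 0.5,
    ("C0150390", "C0017817"): 0.1,
    ("C0150390", "C0522529"): 0.2,
    ("C0150390", "C0301869"): 0.1,
}


def similarity(cui_a, cui_b):
    """Look up a symmetric score; unknown pairs score 0.0."""
    return SIMILARITY.get((cui_a, cui_b), SIMILARITY.get((cui_b, cui_a), 0.0))


def total_scores(candidates, context_cuis):
    """Sum the scores between each candidate concept and every context concept."""
    return {
        cand: sum(similarity(cand, ctx) for ctx in context_cuis)
        for cand in candidates
    }


def disambiguate(candidates, context_cuis):
    """Pick the candidate concept with the highest total score."""
    scores = total_scores(candidates, context_cuis)
    return max(scores, key=scores.get)


context = ["C0017817", "C0522529", "C0301869"]
chosen = disambiguate(["C0005528", "C0150390"], context)
```

With these invented scores, the total for C0005528 is larger than the total for C0150390, so Biological Transport would be selected.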
15
Humphrey et al, 2006
Semantic Type Indexing for WSD
16
Semantic Type Indexing (STI)
Target Word: Transport
Concept 1: Biological Transport. Semantic type: Cell Function
Concept 2: Patient Transport. Semantic type: Health Care Activity
Transport of glutathione S-linked conjugates.
[Diagram: the Target Word Vector (TW) is compared with the Concept 1 Vector (CV1) and the Concept 2 Vector (CV2) by cosine (Cosine 1, Cosine 2). CV1, CV2 and TW are JDI vectors.]
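The cosine comparison in the diagram can be sketched as below. The vectors are invented three-dimensional examples and the function names are hypothetical; this only illustrates the general idea of picking the concept whose vector has the highest cosine with the target word vector, not how the JDI vectors themselves are built.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_concept(target_vector, concept_vectors):
    """Return the concept name whose vector has the highest cosine with the target."""
    return max(
        concept_vectors,
        key=lambda name: cosine(target_vector, concept_vectors[name]),
    )


# Invented example vectors; real vectors would have one dimension per feature.
target = [20, 4, 7]
concepts = {
    "Biological Transport": [18, 2, 9],
    "Patient Transport": [1, 12, 0],
}
best = closest_concept(target, concepts)
```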
17
Target Word Vector
Transport of glutathione S-linked conjugates.
Contains the words surrounding the ambiguous word
18
STI - Target Word Vectors
Transport of glutathione S-linked conjugates.
Contains the words surrounding the ambiguous word
19
STI - Concept Vectors
The concept vectors are created based on their semantic type(s)
Transport:
C0005528: Biological Transport
Cell Function: one-word terms in the Metathesaurus associated with Cell Function
C0150390: Patient Transport
Health Care Activity: one-word terms in the Metathesaurus associated with Health Care Activity
20
Kulkarni and Pedersen, 2005
SenseClusters
21
SenseClusters (SC)
Target Word: Transport
Concept 1: Biological Transport
Concept 2: Patient Transport
Transport of glutathione S-linked conjugates.
[Diagram: Instance 1 through Instance 13, … of the target word are grouped into two clusters, Concept 1 and Concept 2.]
23
SenseClusters
Target Word: Transport
Concept 1: Biological Transport
Concept 2: Patient Transport
Transport of glutathione S-linked conjugates.
[Diagram: the Target Word Vector is compared with the Concept 1 Vector and the Concept 2 Vector by cosine (Cosine 1, Cosine 2).]
24
SC - Vectors
Contain the words surrounding the ambiguous word
Created using:
First order co-occurrences
Second order co-occurrences
25
First Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione        50        6     ...      5
S-linked            5        6     ...      1
conjugates          5        0     ...     15
Target Vector      20        4     ...      7
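In the table above, each Target Vector entry equals the column average of the three context-word rows. A minimal sketch of that step, using the slide's numbers (the variable and function names are hypothetical):

```python
# First-order co-occurrence counts for each context word, taken from the slide.
# Columns stand for Word 1, Word 2, and Word N.
FIRST_ORDER = {
    "glutathione": [50, 6, 5],
    "S-linked": [5, 6, 1],
    "conjugates": [5, 0, 15],
}


def average_vector(vectors):
    """Average a list of equal-length vectors column by column."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


target_vector = average_vector(list(FIRST_ORDER.values()))
print(target_vector)  # [20.0, 4.0, 7.0]
```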
26
Second Order Co-occurrence Vectors
1st order glutathione (Word 1, Word 2, ..., Word N): 10 30 ... 0
Word-by-word co-occurrence matrix:
           Word 1   Word 2   ...   Word N
Word 1       20       10     ...      0
Word 2       10        0     ...      0
...
Word N        2       50     ...      2
2nd order glutathione: 0 2 2 …
27
Second Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione        10       30     ...      2
S-linked            0        6     ...      0
conjugates          5        0     ...     13
Target Vector       5       13     ...      5
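One common way to read "second order" is that a word is described by the words that co-occur with its co-occurring words. The sketch below illustrates that reading on a made-up two-sentence corpus. It is an assumption about the general technique, not a description of how SenseClusters or CuiTools compute their vectors, and all names in it are hypothetical.

```python
from collections import Counter, defaultdict


def first_order_counts(sentences):
    """Count, for each word, the other words that appear in the same sentence."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    counts[word][other] += 1
    return counts


def second_order_vector(word, counts):
    """Sum the first-order vectors of every word that co-occurs with `word`."""
    total = Counter()
    for neighbour in counts[word]:
        total.update(counts[neighbour])
    return total


# Invented toy corpus, already tokenized.
corpus = [
    ["glutathione", "transferase"],
    ["transferase", "enzyme"],
]
counts = first_order_counts(corpus)
second = second_order_vector("glutathione", counts)
```

In this toy corpus "enzyme" never appears with "glutathione", so its first-order count is zero, but it does receive a non-zero weight in the second-order vector through "transferase".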
28
Our unsupervised approach
29
CuiTools Approach
Our approach uses a general vector approach with SenseClusters vectors
30
CuiTools
Target Word: Transport
Concept 1: Biological Transport (C0005528)
Concept 2: Patient Transport (C0150390)
Transport of glutathione S-linked conjugates.
[Diagram: the Target Word Vector is compared with the Concept 1 Vector and the Concept 2 Vector by cosine (Cosine 1, Cosine 2).]
31
CuiTools Approach
Our approach uses a general vector approach with SenseClusters vectors
We explore using
First-order co-occurrence vectors
Second-order co-occurrence vectors
32
Target Word Vector
Contains the words surrounding the ambiguous word
Transport of glutathione S-linked conjugates.
33
CuiTools - Concept Vectors
How can we create a vector that represents the meaning of a concept for word sense disambiguation?
34
To answer this question
We explore information in the UMLS that can be used to represent the meaning of a concept.
35
CuiTools - Concept Vectors
Adjustment
CUI definition
Individual Adjustment: Conceptually broad term referring to a state of harmony between internal needs and external …
Adjustment Action: The act of making necessary corrections or modifications …
Psychological Adjustment: A state of harmony between internal needs and external demands and the processes used …
36
CuiTools - Concept Vectors
Blood Pressure
CUI definition
Blood Pressure: Force exerted by the blood on the walls of the arteries and other vessels.
Blood Pressure Determination: Actions performed to measure the diastolic and systolic pressure of the blood.
Arterial Pressure: NO DEFINITION
37
CuiTools - Concept Vectors
CUI definition
  Use the CUI definition, but if it doesn’t exist:
  PARent definition
  Semantic Type definition
SYNonymous terms
  For example: C0430400: Laboratory Culture
  laboratory culture, microbial culture, sample culture
38
CuiTools - Concept Vectors
CUI definition
  If the CUI definition doesn’t exist:
  PARent definition
  Semantic Type definition
SYNonymous terms
SIBlings
  For example: C0010453: Anthropological Culture
  archeology, family, social groups
39
CuiTools - Concept Vectors
CUI definition
  If the CUI definition doesn’t exist:
  PARent definition
  Semantic Type definition
SIBlings
SYNonymous terms
TOP 50 most frequent words surrounding the terms associated with the CUI
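The last representation, the most frequent words surrounding a concept's terms, can be sketched as below. This is only an illustration under stated assumptions: the window size, the whitespace tokenization, the restriction to one-word terms, and the function name are all hypothetical choices, not details given in the talk.

```python
from collections import Counter


def top_context_words(sentences, terms, window=2, top_n=50):
    """Return the top_n most frequent words found within `window` tokens of any term.

    Assumes one-word terms and whitespace tokenization (both simplifications).
    """
    terms = {t.lower() for t in terms}
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in terms:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                # Do not count the concept's own terms as context.
                counts.update(w for w in context if w not in terms)
    return [word for word, _ in counts.most_common(top_n)]


# Invented toy corpus.
sentences = ["culture in agar", "agar culture"]
top_words = top_context_words(sentences, ["culture"], window=2, top_n=1)
```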
40
Dataset
National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset
50 words from the 1998 MEDLINE abstracts
100 instances for each of the 50 words
The target word was manually assigned a UMLS concept or None
All instances of None were removed
Average number of concepts per ambiguous word is 2.26
41
Data subsets
Humphrey subset
Humphrey, et al 2006
45 out of the 50 words in NLM-WSD
5 words were excluded because at least two of the possible concepts associated with these words have the same semantic type
Instances that were assigned “None” were removed
42
Training Data
The training data used to create the 1st and 2nd order co-occurrence vectors is the 2005 MEDLINE baseline
43
Results
Results
45
Results of Co-occurrence Vectors
46
Results of the Representations of Meaning
47
Results of the Representations of Meaning - CUI
Adding the parent and semantic type definitions decreased the accuracy by 6 and 7 percentage points, respectively
Parent and semantic type definitions are too broad to define the meaning of a concept
48
Results of the Representations of Meaning - SYN
Using the synonymous terms associated with a concept is too narrow to represent the meaning.
Adjustment Action: Adjustment – action; Adjustments; Adjustment, NOS; Adjustment – action qualifier value; Adjustment – action procedure
49
Results of the Representations of Meaning - SIB
Using the terms associated with the siblings of a concept is too broad to represent the meaning.
Adjustment Action: Biopsy, Cauterisation, Cautery, Cold Therapy, Desiccation, Drainage procedure, Electrolysis
50
Results of the Representations of Meaning
51
Supervised versus Unsupervised
[Chart comparing: Joshi et al 04, McInnes et al 07, Stevenson et al 08, SenseClusters, Humphrey et al 06, CuiTools]
52
To recap
How can we create a vector that represents the meaning of a concept for word sense disambiguation?
53
Conclusions
To answer this we explored information in the UMLS that could be used to represent the meaning of a concept
Finding a context to represent the meaning of a concept is difficult
We found using the top 50 most frequent words surrounding the terms associated with the concept best represented the concept for the task of word sense disambiguation
54
Take away message
Unsupervised approaches are showing promise
Their disadvantage, lower disambiguation accuracy than supervised approaches, is slowly disappearing
But we are not there yet … so there is more work to do
55
Future Work
UMLS-Similarity package
Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
56
First Order Co-occurrence Vectors
Cells: FREQ (glutathione, word N). Target Vector: Average
                 Word 1   Word 2   ...   Word N
glutathione        50        6     ...      5
S-linked            5        6     ...      1
conjugates          5        0     ...     15
Target Vector      20        4     ...      7
57
First Order Co-occurrence Vectors
Cells: Similarity (glutathione, word N). Target Vector: Average
                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5
S-linked           .5       .6     ...     .1
conjugates         .5        0     ...     .15
Target Vector      .5       .4     ...     .25
58
First Order Co-occurrence Vectors
Cells: Similarity (glutathione, word N). Target Vector: Sum (like SenseRelate)
                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5
S-linked           .5       .6     ...     .1
conjugates         .5        0     ...     .15
Target Vector     1.5      1.2     ...     .75
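The two ways of combining similarity-weighted rows into a target vector, averaging or summing, can be sketched as below. The row values are the ones from the slides; the function name and the `how` parameter are hypothetical.

```python
# Similarity-weighted rows for each context word, taken from the slides.
# Columns stand for Word 1, Word 2, and Word N.
SIMILARITY_ROWS = {
    "glutathione": [0.5, 0.6, 0.5],
    "S-linked": [0.5, 0.6, 0.1],
    "conjugates": [0.5, 0.0, 0.15],
}


def combine(rows, how="sum"):
    """Combine equal-length rows column by column, by sum or by average."""
    totals = [sum(column) for column in zip(*rows)]
    if how == "sum":
        return totals
    if how == "average":
        return [total / len(rows) for total in totals]
    raise ValueError(f"unknown combination method: {how}")


rows = list(SIMILARITY_ROWS.values())
summed = combine(rows, how="sum")
averaged = combine(rows, how="average")
```

Summing and averaging differ only by a constant factor (the number of context words), so the cosine with any concept vector is the same for both; the choice matters only where raw magnitudes are compared.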
59
First Order Co-occurrences
                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5
Word N (C0005528)
Similarity = .3 + .2 = .5, where .3 and .2 are the scores for C0000000 and C0000001 against C0005528
60
Future Work
UMLS-Similarity package
Creating 2nd order co-occurrence matrices based on highly similar concepts rather than words in text
Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
61
Second Order Co-occurrence Vectors
Words come from training corpus
Cells: Frequency counts
           Word 1   Word 2   ...   Word N
Word 1       20       10     ...      0
Word 2       10        0     ...      0
...
Word N        2       50     ...      2
62
Second Order Co-occurrence Vectors
Use concepts from the UMLS
Cells: Similarity scores
           CUI 1    CUI 2    ...   CUI N
CUI 1       .20      .10     ...      0
CUI 2       .10       0      ...      0
...
CUI N       .20      .50     ...     .20
63
Future Work
UMLS-Similarity package
Creating 2nd order co-occurrence matrices based on highly similar concepts rather than co-occurrences in text
Use terms associated with CUIs that have a high similarity score with the possible concept to represent the meaning of the concept
Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
64
Similarity Scores
What is potentially gained by using the similarity or relatedness measures?
May catch words/concepts that are similar but do not frequently occur together in the training data
culture and ethnology
Ethnology is a branch of anthropology
ethnology appears with culture only five times in the training data
The concepts Anthropological Culture and Ethnology would have a high similarity score, whereas Laboratory Culture and Ethnology would not
65
Software
CuiTools version 0.19
http://cuitools.sourceforge.net
66
Thank you
Lan Aronson, François Lang, Jim Mork, Aurélie Névéol, Will Rogers
Olivier Bodenreider, Allen Browne, May Chey, Dina Demner-Fushman, Guy Divita, Kin Wah Fung, Susanne Humphrey, Dwayne McCully, Tom Rindflesch, Suresh Srinivasan