Date post: | 22-Jun-2015 |
Category: |
Science |
Upload: | ainl-conferences |
12.09.2014 | Technische Universität Darmstadt | Iryna Gurevych
Programming language is not an island: Word Sense Alignment of Lexical-Semantic Resources
Iryna Gurevych
Joint work with: Judith Eckle-Kohler, Kostadin Cholakov, Silvana Hartmann, Michael Matuschek, Christian M. Meyer
http://www.ukp.tu-darmstadt.de/data/uby
UBY
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Text Analysis Needs Lexical-Semantic Knowledge
Lexical resource
NLP application
Which lexical resource to choose?
Resources are Largely Different
Different coverage of words/word senses
Different types of information
Encyclopedic vs. linguistic knowledge
Syntactic vs. semantic knowledge
…
Resource integration can significantly influence the performance of your system! – Instead of choosing only one (best performing):
Why not combine multiple resources and benefit from all their knowledge?
Overlap of Lexical Entries
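[Venn diagram of lexical entries: Roget's Thesaurus (62,797), Wiktionary (364,663), WordNet (149,502). All three resources share 28,650 entries; the other region counts shown are 25,541, 7,885, 1,729, 56,240, 163,027, and 67,868.]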
Common vocabulary is rather small (28,650).
Each resource contains a lot of “unique” words.
Overlap of Lexical Entries
slang
dialect
natural sciences
computer science
neologisms
humanities
social sciences
biological taxonomy
named entities
math
surprisingly small overlap
Word Sense Alignment
1. To sing: To produce musical or harmonious sounds with one's voice.
2. To sing: To express audibly by means of a harmonious vocalization.
3. To sing: To confess under interrogation.
1. singen: Mit der Stimme harmonische Töne erzeugen. (German: to produce harmonious tones with the voice.)
1. To sing: Produce tones with the voice
2. To sing: divulge confidential information or secrets
1. To sing: To produce harmonious sounds with one's voice.
Prior Work on Linked Lexical Resources (LLR)
Meaning Multilingual Central Repository, Atserias et al. (2004)
Yago, Suchanek et al. (2007)
SemLink (Palmer, 2009)
Universal Wordnet (UWN), Gerard de Melo and Gerhard Weikum (2009)
eXtended WordFrameNet, Laparra and Rigau (2010)
BabelNet, Navigli and Ponzetto (2010)
NULEX, McFate and Forbus (2011)
UBY, Gurevych et al. (2012)
… many more, e.g. on the Semantic Web
Potential of Linked Lexical Resources
Increased coverage and the enriched sense representation
Linking FrameNet, VerbNet, and WordNet for semantic parsing (Shi and Mihalcea, 2005)
Linking VerbNet, FrameNet and PropBank for semantic role labeling (Palmer, 2009)
Linking WordNet and Wikipedia for word sense disambiguation (Navigli and Ponzetto, 2010)
Linking WordNet and Wiktionary for measuring verb similarity (Meyer and Gurevych, 2012)
Linking OmegaWiki and Wiktionary for mining translations (McCrae and Cimiano, 2013)
The Challenge: Heterogeneity of Resources
Different coverage: missing entities in one of the resources
Different granularity: entities are defined at different levels
Different perspectives: entities are defined for a different purpose
(Euzenat/Shvaiko, 2007)
Lemma Alignment
Wiktionary
WordNet
Content integration at the lemma level is easy, but…
Word Sense Alignment
Wiktionary
WordNet
Content integration at the lemma level is easy, but…
…integration at the sense level is hard!
Word Sense Alignment
plant in Wiktionary:
(botany) An organism of the kingdom Plantae […]
(proscribed as biologically inaccurate) Any creature that grows on soil or similar surfaces, including plants and fungi.
A factory or other industrial or institutional building or facility.
(snooker) A play in which the cue ball knocks one (usually red) ball onto another […]
plant in WordNet:
buildings for carrying on industrial labor
(botany) a living organism lacking the power of locomotion
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
The Alignment Process
Matching takes two resources r (resource 1) and r' (resource 2), an initial alignment A (possibly empty), parameters p, and knowledge k, and produces an output alignment A':
A' = f(r, r', A, p, k)
Can be generalized for multiple resources ("multi-alignment"):
A' = f(r1,…,rn, A, p, k)
(Euzenat/Shvaiko, 2007)
Construction of aligned lexical resources
[Overview chart: sense alignment publications, color-coded by topic on the slide]
Niemann & Gurevych, IWCS 2011
Meyer & Gurevych, IJCNLP 2011
Matuschek & Gurevych, TACL, 2013
Matuschek & Gurevych, COLING, 2014
Hartmann & Gurevych, ACL 2013
Miller & Gurevych, LREC 2014
Topics: graph-based alignment; resource-independent alignment; text similarity-based alignment; exploitation of existing LR alignments to produce new ones
Christian M. Meyer and Iryna Gurevych. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In: Proceedings of IJCNLP, pp. 883-892, November 2011.
Similarity-based Word Sense Alignment
Enriched sense representations
Increased coverage
Aligning Wiktionary and WordNet
A two-step approach:
1. Candidate extraction
2. Candidate disambiguation
[Diagram: the WordNet synset {plant, works, industrial plant} is compared with Wiktionary senses such as plant (factory), plant (organism), plant (person), works (machine), works (factory), bird (animal), to fly (move), reddish (color), …]
Bag of Words Representation
[Diagram: a WordNet synset, together with its hypernyms and hyponyms, and a Wiktionary sense, with its lemma, sense definition, usage examples, synonyms, and hyper- & hyponyms, are each turned into a bag-of-words.]
Synsets are represented by synonyms, gloss, examples
Candidate Disambiguation
Compute a score s between the two bag-of-words representations with a semantic relatedness measure:
COS: Cosine similarity
PPR: Personalized PageRank
s ≥ threshold: Align this pair of WordNet synset and Wiktionary sense!
s < threshold: No alignment!
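The COS variant of the disambiguation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the function names, and the threshold value are assumptions made here, and only the glosses quoted on the earlier "plant" slide are used as input.

```python
import math
from collections import Counter

def bag_of_words(*texts):
    """Count lowercased whitespace tokens over all given text fields."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(synset_bow, candidate_bows, threshold):
    """Keep the candidate senses whose score s reaches the threshold."""
    return [sense for sense, bow in candidate_bows.items()
            if cosine(synset_bow, bow) >= threshold]

THRESHOLD = 0.05  # hypothetical value; the slides do not state the tuned threshold

synset = bag_of_words("buildings for carrying on industrial labor")
candidates = {
    "plant (factory)": bag_of_words(
        "A factory or other industrial or institutional building or facility."),
    "plant (organism)": bag_of_words("An organism of the kingdom Plantae"),
}
aligned = disambiguate(synset, candidates, THRESHOLD)
print(aligned)
```

With such short glosses only one token overlaps, which illustrates why the real representation also adds synonyms, examples, and related senses to each bag-of-words.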
Evaluation Dataset
Dataset creation:
No previous alignments = no other evaluation datasets
We created a new dataset with 2,423 sense pairs
10 human raters (students/researchers from CS, math, linguistics)
Annotate each pair as "same meaning" or "different meaning"
Dataset reliability:
Inter-rater agreement: AO = .93, κ = .70
Removing two biased raters: AO = .94, κ = .74
Gold standard:
Majority vote of the 8 raters, additional tie breaker
Publicly available!
Evaluation Results
Method     A     P     R     F1
RAND       .662  .212  .594  .313
MFS        .802  .329  .508  .399
COS only   .901  .598  .703  .646
PPR only   .915  .684  .636  .659
COS&PPR    .914  .674  .649  .661
A: accuracy, P: precision, R: recall, F1: F1 score
RAND: Random baseline
MFS: Baseline always aligning the first sense (≈ most frequent sense)
Our approach significantly outperforms the baseline (at 1% level)
COS highest recall; PPR highest precision; COS&PPR highest F1
Significant difference of PPR, COS&PPR over COS (at 1% level)
No significant difference between PPR and COS&PPR
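As a quick consistency check, the F1 column can be recomputed from the P and R columns, assuming F1 is the usual harmonic mean of precision and recall (the slide does not define it explicitly). The function name is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above.
cos_f1 = round(f1_score(0.598, 0.703), 3)
ppr_f1 = round(f1_score(0.684, 0.636), 3)
print(cos_f1, ppr_f1)
```

Small deviations in the third decimal are possible for other rows, since the table reports P and R already rounded.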
Error Analysis
110 false negatives: "same meaning, but was not aligned"
Very different wording: "good discernment" vs. "ability to notice what others might miss"
Similar senses but slightly below threshold: "plants of the genus Centaurea" vs. "common weeds of the genus Centaurea"
Pointing to another entry rather than a content-based gloss: pacification: "the process of pacifying"
Error Analysis
98 false positives: "different meaning, but have been aligned"
Similar wording, but refer to different concepts: "a computer that provides client stations with access to files and printers as shared resources to a computer network" vs. "any computer attached to a network"
High relatedness, but generic versus domain-specific vocabulary: "any computer attached to a network" vs. "any organization that provides resources and facilities for a function or event"
Increased Coverage: Parts of Speech
Part of speech    Wiktionary AND WordNet   Additionally in Wiktionary   Additionally in WordNet
Nouns             34,464                   158,085                      47,651
Verbs             8,252                    29,119                       5,515
Adj./Adv.         14,236                   60,977                       7,541
Other POS         0                        16,778                       0
Inflected Forms   0                        106,328                      0
Our alignment: 56,970 sense pairs
Final resource contains 488,988 word senses
Substantial increase in the coverage of senses
Wiktionary is not restricted to nouns/verbs/adjectives: proverbs, idioms, collocations, particles, determiners, inflected forms, etc.
Increased Coverage: Domains
Columns: Domain, Wiktionary AND WordNet, Additionally in Wiktionary, Additionally in WordNet
Biology 4,465 4,067 12,869
Chemistry 2,561 8,260 2,268
Engineering 1,108 940 1,080
Geology 2,287 2,898 2,479
Humanities 4,949 2,700 5,060
IT 439 3,032 557
Linguistics 1,249 1,011 1,576
Math 615 2,747 483
Medicine 3,613 3,728 3,058
Military 574 426 585
Physics 1,246 2,835 1,252
Religion 733 1,154 781
Social Sciences 3,745 2,907 4,458
Sport 905 2,821 807
Enriched Sense Representation
Synonyms
Gloss
Example sentence
Subsumption hierarchy
Synset organization
…
Pronunciation
Etymology
Syntactic knowledge
Quotations
Related terms
Translations
…
Selected Conclusions
Aligned Wiktionary – WordNet is characterized by:
(1) Increased coverage
Different parts of speech, not only nouns
e.g. humanities and social sciences from WordNet
e.g. technical domains and leisure from Wiktionary
(2) Enriched sense representation
Pronunciation, etymology, related terms, translations, etc.
Novel evaluation dataset annotated by 10 human raters
Better results with resource-structure-based and hybrid techniques in later work (Matuschek & Gurevych, TACL '13)
Construction of aligned lexical resources
Michael Matuschek and Iryna Gurevych: Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment, in: Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013
Similarity-Based Approaches Suffer From…
Different vocabulary employed by definitions
Example: English noun eye/discernment, e.g., "she has an eye for fresh talent", "he has an artist's eye"
"good discernment (either visually or as if visually)" vs. "ability to notice what others might miss"
Low semantic relatedness score…
Solution: Use the Graph Topology
Word senses of Java: Java1, Java2, Java3
Intuition of Graph Topology
Related senses are in the same region of the graph
[Diagram: the word senses of Java (Java1, Java2, Java3), the word senses of Ruby (Ruby1), and the monosemous lexeme "programming language" (programming language1).]
Dijkstra-WSA
Graph-based word sense alignment approach
Key ideas:
Represent lexical resources as graphs
Rely on trivial alignments as "reference nodes" and "bridges"
Use Dijkstra's shortest path algorithm to find alignments
Steps:
1. Graph construction
2. Computing sense alignments
(Matuschek/Gurevych, 2013)
Step 1: Graph Construction
Represent each lexical resource as an undirected graph L = (V, E) with
the set of nodes V representing senses or synsets
the set of edges E ⊆ V × V representing some kind of (semantic) similarity between a pair of nodes
An edge connects sense S1 and sense S2 if, for example…
There exists a semantic relation between S1 and S2
A lexeme W2 occurs in the sense definition of S1, and W2 is monosemous
S1 and S2 share the same syntactic behavior
…
(Matuschek/Gurevych, 2013)
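The graph construction can be sketched with a plain adjacency dictionary. This is an illustrative sketch under assumptions: the sense identifiers and glosses are invented for the example, lemma matching is a naive substring test on whitespace-padded text, and the filtering of highly frequent lexemes is omitted.

```python
from collections import defaultdict

def build_graph(senses, relations=()):
    """senses: sense id -> (lemma, definition); relations: (id1, id2) pairs."""
    graph = {sense_id: set() for sense_id in senses}
    # Edges from explicit semantic relations.
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    # Lemmas with exactly one sense are monosemous.
    by_lemma = defaultdict(list)
    for sense_id, (lemma, _) in senses.items():
        by_lemma[lemma].append(sense_id)
    monosemous = {lemma: ids[0] for lemma, ids in by_lemma.items() if len(ids) == 1}
    # Monosemous linking: connect a sense to the single sense of every
    # monosemous lexeme that occurs in its definition.
    for sense_id, (_, definition) in senses.items():
        text = f" {definition.lower()} "
        for lemma, target in monosemous.items():
            if target != sense_id and f" {lemma} " in text:
                graph[sense_id].add(target)
                graph[target].add(sense_id)
    return graph

# Hypothetical toy resource; the glosses are made up for illustration.
senses = {
    "Java1": ("java", "an object-oriented programming language"),
    "Java2": ("java", "an island of indonesia"),
    "Java3": ("java", "coffee"),
    "programming language1": ("programming language", "a formal language for writing programs"),
    "Ruby1": ("ruby", "a dynamic programming language"),
}
graph = build_graph(senses)
print(graph["programming language1"])
```

In this toy graph Java1 and Ruby1 end up two steps apart via programming language1, which is the "same region of the graph" intuition from the earlier slides.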
Step 1: Graph Construction
[Diagram: graph of resource 1 and graph of resource 2, with nodes such as Java1, Java2, Java3, programming language1, and espresso1; edges represent some kind of (semantic) similarity between nodes.]
Step 2: Computing Sense Alignments
a) Create trivial alignments between the resources:
Trivial = lexeme is unique/monosemous in both resources
Example: programming language
Precision: >0.95
b) Identify alignment candidates
For example: nodes representing the same lemma
c) For all nodes still unaligned, find shortest paths to the candidate nodes in the other graph
Trivial alignments serve as "bridges" between the graphs
Align the node pair with the shortest path
(Matuschek/Gurevych, 2013)
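Steps a) to c) can be sketched as below. This is a simplified reading of the slide, not the published implementation: edges have unit weight, candidates are nodes with the same lemma, only the closest candidate is kept (1:1 alignment), and all node names and the default max_len are assumptions for the example.

```python
import heapq

def shortest_distances(graph, source, max_len):
    """Dijkstra with unit edge weights, not expanding beyond max_len."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node] or d == max_len:
            continue
        for neighbour in graph.get(node, ()):
            if d + 1 < dist.get(neighbour, float("inf")):
                dist[neighbour] = d + 1
                heapq.heappush(queue, (d + 1, neighbour))
    return dist

def monosemous(lemmas):
    """Map each lemma that has exactly one node to that node."""
    nodes_by_lemma = {}
    for node, lemma in lemmas.items():
        nodes_by_lemma.setdefault(lemma, []).append(node)
    return {l: ns[0] for l, ns in nodes_by_lemma.items() if len(ns) == 1}

def dijkstra_wsa(graph1, graph2, lemma1, lemma2, max_len=5):
    """Align nodes of graph1 to nodes of graph2 (node ids must be disjoint)."""
    joint = {n: set(nbs) for g in (graph1, graph2) for n, nbs in g.items()}
    mono1, mono2 = monosemous(lemma1), monosemous(lemma2)
    alignments = {}
    # a) Trivial alignments become bridge edges between the two graphs.
    for lemma in mono1.keys() & mono2.keys():
        a, b = mono1[lemma], mono2[lemma]
        alignments[a] = b
        joint[a].add(b)
        joint[b].add(a)
    for node in graph1:
        if node in alignments:
            continue
        # b) Candidates: nodes in the other resource with the same lemma.
        candidates = [n for n, l in lemma2.items() if l == lemma1[node]]
        # c) Align with the candidate reached by the shortest path, if any.
        dist = shortest_distances(joint, node, max_len)
        reachable = [(dist[c], c) for c in candidates if c in dist]
        if reachable:
            alignments[node] = min(reachable)[1]
    return alignments

# Hypothetical toy graphs modelled on the Java example of the slides.
graph_a = {"a:Java1": {"a:programming language1"}, "a:Java2": set(),
           "a:Java3": {"a:espresso1"},
           "a:programming language1": {"a:Java1"}, "a:espresso1": {"a:Java3"}}
graph_b = {"b:Java1": {"b:programming language1"},
           "b:programming language1": {"b:Java1"}, "b:espresso1": set()}
lemma_a = {"a:Java1": "java", "a:Java2": "java", "a:Java3": "java",
           "a:programming language1": "programming language", "a:espresso1": "espresso"}
lemma_b = {"b:Java1": "java", "b:programming language1": "programming language",
           "b:espresso1": "espresso"}
result = dijkstra_wsa(graph_b, graph_a, lemma_b, lemma_a)
print(result)
```

Here b:Java1 reaches a:Java1 in three steps across the "programming language" bridge, while the other two Java senses are unreachable from it and stay unaligned.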
Step 2: Computing Sense Alignments (example)
[Diagram sequence over the graphs of resource 1 and resource 2: one graph contains Java1, Java2, Java3, programming language1, and espresso1; the other contains Java1, programming language1, and espresso1.]
Step 2a: Create Trivial Alignments
Step 2b: Identify Alignment Candidates
Step 2c: Shortest Paths to the Candidates (path lengths shown: 3, 5, ∞)
Step 2c: Align the Nodes
Parameter Choices
Restricting the number of alignments
Stop when the first candidate is found (1:1 alignment)
Keep going and align everything you can reach (1:n alignment), possibly with a restricted search depth
Graph construction
Use semantic relations, monosemous linking, or both
Get rid of relations to highly frequent monosemous lexemes (e.g., there is)
Limiting to rare lexemes avoids "explosion" of edges
Rare = only appearing in 1 / N of the definitions (e.g., N = 200)
Computing sense alignments
Path length L: unbounded L yields unmanageable runtime!
Best F1 score for L between 5 and 8, depending on the resource pair
Hybrid Approach
Main issue of Dijkstra-WSA
Low recall due to missing edges / sparse graph
Hybrid approach
Try to align using the graph first, parameterized for high precision
Align those with no match using a similarity-based approach
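The two-stage backoff can be sketched in a few lines. The function and argument names are hypothetical; `similarity_fallback` stands in for any similarity-based aligner that returns a target sense or None.

```python
def hybrid_align(nodes, graph_alignments, similarity_fallback):
    """Keep graph-based alignments; back off to similarity for the rest."""
    result = dict(graph_alignments)
    for node in nodes:
        if node not in result:
            target = similarity_fallback(node)
            if target is not None:
                result[node] = target
    return result

# Hypothetical inputs: one node aligned via the graph, two left over.
graph_alignments = {"s1": "t1"}
fallback_table = {"s2": "t7"}  # stand-in for a thresholded similarity aligner
hybrid = hybrid_align(["s1", "s2", "s3"], graph_alignments, fallback_table.get)
print(hybrid)
```

Graph-based decisions are never overwritten here, which matches the idea of tuning the graph step for precision and using similarity only to recover recall.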
Evaluation Datasets
Sampled datasets:
WordNet – Wikipedia (1,815 sense pairs)
WordNet – Wiktionary (2,423 sense pairs)
FrameNet – Wiktionary (2,789 sense pairs)
WordNet – OmegaWiki (683 sense pairs)
Wiktionary – OmegaWiki (586 sense pairs)
Wiktionary – Wikipedia English (367 sense pairs)
Full datasets:
GermaNet – Wiktionary (45,636 sense pairs)
Wiktionary – Wikipedia German (31,808 sense pairs)
Datasets Display Different Properties
WordNet, OmegaWiki, Wikipedia: sense definitions and semantic relations
Wiktionary: no disambiguated semantic relations => sparse graphs
GermaNet: very few sense definitions
Evaluation
[Chart (Matuschek/Gurevych, 2013) with the labels: Random baseline, 1:1 1st, Similarity-based (SB), Semantic Relations (SR), Linking Monosemes (LM), SR + LM, SR + SB, LM + SB, SR + LM + SB, Hybrid, Human performance]
Significant improvement in recall… and F-measure… also on all other datasets!
Selected Conclusions
Dijkstra-WSA ≥ gloss similarity for densely linked LSRs
⇒ Generic alignment approach is valid
But: low recall for sparse LSRs (English Wiktionary, OmegaWiki)
Dijkstra-WSA + similarity-based backoff outperforms previous work on all datasets
⇒ The two notions of similarity are complementary
⇒ Could they be combined in a smarter way?
Construction of Aligned Lexical Resources
Michael Matuschek and Iryna Gurevych: High Performance Word Sense Alignment by Joint Modeling of Sense Distance and Gloss Similarity, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin, Ireland.
Joint Usage of Features
Similarity- and graph-based approaches both have weaknesses
Different formulation of glosses
Sparse / disconnected graphs
Two-step hybrid approach already helped improve recall
But: No real combination of both notions
Idea: Combine them using Machine Learning
Exploit the complementary strengths more effectively
Setup - Features
Features:
Gloss similarity (COS, PPR)
Dijkstra-WSA distances
Infinite distance if no target can be found
Other possible features:
Part of speech, sense index, translation overlap, example sentence patterns
No significant improvement by using them!
⇒ Glosses and structure are sufficient
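One way to turn these features into a numeric vector for a classifier is sketched below. The slides only state that the distance is infinite when no target is found; capping it and adding a reachability flag is an assumption made here so that classifiers expecting finite inputs can consume the vector.

```python
import math

def feature_vector(cos_sim, ppr_sim, distance, cap=10):
    """Return [COS, PPR, capped distance, reachable flag] for one sense pair.

    `cap` is a hypothetical constant replacing infinite (or very long) distances.
    """
    reachable = 0.0 if math.isinf(distance) else 1.0
    capped = float(cap) if math.isinf(distance) else min(float(distance), float(cap))
    return [cos_sim, ppr_sim, capped, reachable]

# Hypothetical sense pairs: one close in the graph, one unreachable.
close_pair = feature_vector(0.42, 0.35, 3)
unreachable_pair = feature_vector(0.80, 0.60, math.inf)
print(close_pair, unreachable_pair)
```

The second pair corresponds to the "borderline" case of high gloss similarity combined with a large graph distance.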
Setup - Classifiers
Classifiers used:
Naive Bayes
Bayesian Networks
Perceptrons
Support Vector Machines (SVMs)
Decision Trees
Evaluation using 10-fold cross validation
Same datasets as before
Evaluation
[Chart with the labels: Random, 1:1 1st, SB, DWSA, Hybrid, SVM, Naive Bayes, Bayesian Network, Perceptron, Decision Tree, Human performance]
General improvement in precision… but in F-measure only for some of the datasets!
Selected Conclusions
Better overall results on 4 out of 8 datasets
Machine learning helps most for sparse and incomplete LSRs like OmegaWiki and Wiktionary
For "complete" LSRs like WordNet, we cannot gain much
Better precision on 7 out of 8
Most robust: Bayesian Networks
Complex classifiers (e.g. SVMs) challenged by skewed values
Main source of improvements:
Better classification of "borderline" examples
High gloss similarity & distance or vice versa
Borderline Example
Genome:
1. "The non-redundant genetic information stored in DNA sequences that defines an individual organism"
2. "In the context of a genetic algorithm, the information that defines an individual entity"
Very similar description
But: Far apart in the graph => No alignment!
Linked Lexical Resources
[Overview chart: publications on LLRs, color-coded by topic on the slide]
Gurevych et al., EACL 2012
Eckle-Kohler et al., LREC 2012
Eckle-Kohler & Gurevych, EACL 2012
Eckle-Kohler et al., LMF, 2013
Eckle-Kohler et al., SWJ, 2014
Topics: large-scale unified LR based on LMF; standardizing heterogeneous LRs; standardized format for subcat frames; language independence of lexicon models
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth: UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF, in: Proceedings of the 13th Conference of the European chapter of the Association for Computational Linguistics (EACL), April 2012.
UBY: Linking Lexical Resource
[Diagram: resources linked in UBY, with the labels Web 2.0 and IMSLex-Subcat]
UBY: two main characteristics:
- Word Sense Alignments
- Standardized Representation
Heterogeneity of Lexical Resources
Complementary information types
Different terminology
Incompatible Data formats
Unified Lexical Resource UBY
Unified lexicon model
Extensible
Preserves variety of lexical information
Structure Integration
Standardized representation frameworks
Lexical Markup Framework (LMF): http://www.lexicalmarkupframework.org
Text Encoding Initiative (TEI): http://www.tei-c.org
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <sense n="1">
    <def>facts that disprove something.</def>
  </sense>
  <sense n="2">
    <def>the act of disproving.</def>
  [..]
Structure Integration in UBY
(Eckle-Kohler et al. 2012)
Sense Alignments Enable Semantic Interoperability
1. To sing: To produce musical or harmonious sounds with one's voice.
2. To sing: To express audibly by means of a harmonious vocalization.
3. To sing: To confess under interrogation.
1. singen: Mit der Stimme harmonische Töne erzeugen. (German: to produce harmonious tones with the voice.)
1. To sing: Produce tones with the voice
2. To sing: divulge confidential information or secrets
1. To sing: To produce harmonious sounds with one's voice.
Senses linked by SenseAxis class (over 1,000,000 instances)
English alignments, e.g. WordNet-Wikipedia
German alignments, e.g. GermaNet-Wiktionary
Cross-lingual alignments, e.g. WordNet-OmegaWiki DE
Available Alignments
Wikipedia English—WordNet 83,192
Wiktionary English—WordNet 138,282
GermaNet—Wiktionary German 32,850
FrameNet—Wiktionary English 12,340
Wiktionary English—OmegaWiki English 34,509
WordNet—OmegaWiki German 27,529
Wiktionary German—Wikipedia German 21,872
Wiktionary English—Wikipedia English 66,050
WordNet—VerbNet 40,716
FrameNet—VerbNet 17,529
Wikipedia English—OmegaWiki English 3,960
Wikipedia German—OmegaWiki German 1,097
Wikipedia English—Wikipedia German 463,311
OmegaWiki English—OmegaWiki German 58,785
Resource Integration Workflow in UBY
[Diagram: human users and machines access the resources through separate APIs: JWNL, FN API, JWPL, JWKTL]
Step 1. Structure Integration
[Diagram: human users and machines access each resource in UBY through the UBY API]
Step 2. Content Integration
[Diagram: human users and machines access UBY through a single UBY-API]
UBY Web UI – Textual View
Textual view: lists senses across all resources, displays sense details, and supports sense comparisons.
UBY Web UI – Visual View
Visual view: lets users explore the sense alignments.
UBY Java API
The UBY API is open source at Google Code: http://code.google.com/p/uby/
Getting Started:
1. Download a UBY database dump
2. Import the dump into a MySQL database
3. Start using the UBY API
The UBY API is work in progress! Many API methods need to be added – consider contributing!
http://uby.ukp.informatik.tu-darmstadt.de/uby/UBY
UBY – Data and Tools
http://code.google.com/p/uby/
https://uby.ukp.informatik.tu-darmstadt.de/webui/
Web Interface
Open Source API (JAVA)
Database Dumps UBY
Utilizing Linked Lexical Resources
[Overview chart: publications utilizing LLRs, color-coded by topic on the slide]
Cholakov et al., EACL 2014
Matuschek et al., KONVENS 2014
Matuschek et al., TC3, 2013
Hartmann & Gurevych, ACL 2013
Hartmann et al., 2014 (in preparation)
Topics: sense annotation/disambiguation; machine/computer-assisted translation; semantic role labelling; cross-language transfer of lexical-semantic resources
Michael Matuschek and Tristan Miller and Iryna Gurevych: A Language-independent Sense Clustering Approach for Enhanced WSD, in: Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), to appear
Michael Matuschek and Christian M. Meyer and Iryna Gurevych: Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications, in: Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-118, July 2013
Kostadin Cholakov and Judith Eckle-Kohler and Iryna Gurevych: Automated Verb Sense Labelling Based on Linked Lexical Resources, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pp. 68-77, April 2014
April 28, 2014 | Computer Science Department | UKP Lab Prof. Iryna Gurevych | Dr. Judith Eckle-Kohler
Automatic Verb Sense Labelling of Corpora
Motivation
Automatically create verb sense-annotated corpora as training data for supervised approaches
Approach
1. Create sense patterns from UBY (combining WordNet, FrameNet, VerbNet, Wiktionary)
2. Compare these to patterns derived from corpus instances
3. Assign word sense in corpus if similarity is above a threshold
4. Use this data to train supervised systems (distant supervision)
Results
Significant improvement over MFS baseline for verb sense disambiguation on MASC and Senseval-3
Using Alignments for Word Sense Clustering
Motivation
Cluster fine-grained word senses in expert-built resources to improve WSD performance
Approach
1. Create alignments between resources using Dijkstra-WSA, allowing 1:n alignments
Source: GermaNet, WordNet
Target: Wiktionary, Wikipedia, OmegaWiki
2. If two or more senses are aligned to the same sense in the other resource, merge them into one coarse sense
3. Rescore state-of-the-art WSD algorithms on clustered sense inventory
Results
Significant improvement over random clusters of same granularity on WebCAGe (GermaNet) and Senseval-3 (WordNet)
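The merging rule in step 2 can be sketched with a small union-find structure. This is an illustrative sketch, not the published clustering code; the sense identifiers are hypothetical, and the input is assumed to map each source sense to the set of target senses it was aligned to (1:n).

```python
def cluster_senses(alignments):
    """Merge source senses that share at least one aligned target sense."""
    parent = {source: source for source in alignments}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_source_for = {}
    for source, targets in alignments.items():
        for target in targets:
            if target in first_source_for:
                # Same target seen before: merge the two source clusters.
                parent[find(source)] = find(first_source_for[target])
            else:
                first_source_for[target] = source

    clusters = {}
    for source in alignments:
        clusters.setdefault(find(source), set()).add(source)
    return sorted(sorted(cluster) for cluster in clusters.values())

# Hypothetical alignments of three fine-grained senses to a coarser resource.
alignments = {
    "wn:bass1": {"wkt:bass_fish"},
    "wn:bass2": {"wkt:bass_fish"},
    "wn:bass3": {"wkt:bass_voice"},
}
coarse = cluster_senses(alignments)
print(coarse)
```

Because merging is transitive, two senses also end up in one coarse sense when they are connected only through a chain of shared targets.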
Using Aligned Resources for Computer-aided Translation
Motivation
SMT systems help, but are not smart enough to replace manual translation
Approach
1. Create sense alignments between multilingual resources
2. Display information from all resources for a particular meaning
Results
Substantially more available translations and other information types
Example: "bass" in Wiktionary and OmegaWiki
Programming language is not an island!
Word Sense Alignment is vital for increasing coverage and richness of sense representations
But: It is a hard problem!
Various approaches
  Similarity-based, graph-based, combined
Performance depends on resources
  Sparsity, availability of glosses, …
Machine learning shows most robust results
Aligned resources help improve performance for various applications
  VSD, coarse-grained WSD, computer-aided translation
Future Work
1. Linked lexical resources (LLRs)
   Integrating and aligning further resources in UBY
   Special focus: cross-lingual alignment
2. Construction of aligned lexical resources
   Investigating more elaborate similarity measures for glosses
   Using different graph algorithms to better express similarity
   Aligning several resources at once (n-way alignment)
3. Utilizing LLRs for language processing
   Unified deep learning framework utilizing linked resources
   Distant supervision applied to semantic role labeling
   Word sense disambiguation and lexical substitution for German
Thank you. Questions?
Sense Alignment of Lexical Resources (References)
Elisabeth Niemann and Iryna Gurevych. The People’s Web Meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In: Proceedings of the 9th International Conference on Computational Semantics (IWCS), p. 205-214, January 2011.
Christian M. Meyer and Iryna Gurevych. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), p. 883-892, November 2011.
Michael Matuschek and Iryna Gurevych. Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment. Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Tristan Miller and Iryna Gurevych. WordNet-Wikipedia-Wiktionary: Construction of a Three-way Alignment. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 2014. (to appear)
Linked Lexical Resources @ UKP (References)
Judith Eckle-Kohler and Iryna Gurevych. Subcat-LMF – Fleshing out a Standardized Format for Subcategorization Frame Interoperability. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), p. 550-560, April 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF - A Uniform Model for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), p. 275-282, May 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF - Exploring the Boundaries of Language-Independent Lexicon Models. In: LMF Lexical Markup Framework, chap. 10, p. 145-156, ISTE - HERMES - Wiley, 2013. ISBN 978 184 821 4309.
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth. UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), p. 580-590, April 2012.
Judith Eckle-Kohler, John Philip McCrae, and Christian Chiarcos. lemonUby - A Large, Interlinked, Syntactically-rich Lexical Resource for Ontologies. Semantic Web Journal, March 2014.
Utilizing Linked Lexical Resources (References)
Kostadin Cholakov, Judith Eckle-Kohler, and Iryna Gurevych. Automated Verb Sense Labelling Based on Linked Lexical Resources. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), p. 68-77, April 2014.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Michael Matuschek, Tristan Miller, and Iryna Gurevych. A Language-independent Sense Clustering Approach for Enhanced WSD. In: Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing, October 2014. (in submission)
Michael Matuschek, Christian M. Meyer, and Iryna Gurevych. Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications. Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-118, July 2013.