CSE Department, I.I.T. Bombay
Automatic Lexicon Generation through WordNet
by
Nitin Verma and Pushpak Bhattacharyya
Jan 21, 2004
Introduction
A lexicon is the heart of any natural language processing system.
It is difficult to construct, requiring an enormous amount of time and manpower.
Document-specific dictionary generation:
– Given a document D and a word W in it, which sense S of W should be picked for that document?
– Can one construct a document-specific dictionary in which a single sense of each word is stored?
Introduction
UW Dictionary: an important machine-readable lexical resource used by the enconverter and deconverter software.
[Figure: Natural Language → Enconverter → UNL; the Enconverter draws on the UW Dictionary and the Analysis Rules]
Introduction (UW dictionary)
Format of dictionary entries:
[crane] “crane(icl>bird)” (N, ANIMT, FAUNA, BIRD);
– [crane] is the head word (HW).
– “crane(icl>bird)” is the universal word (UW); (icl>bird) is its restriction.
– (N, ANIMT, FAUNA, BIRD) are the attributes (both syntactic and semantic).
Attributes:
– Semantic attributes (derived from the ontology).
– Syntactic attributes (POS, person, number, tense).
– Used for the firing of appropriate analysis rules.
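The entry format above can be read mechanically. The following is a minimal sketch of a parser for lines shaped like the crane example; the regular expression and the `UWEntry` and `parse_entry` names are illustrative assumptions, not part of the UNL tools.

```python
import re
from typing import NamedTuple


class UWEntry(NamedTuple):
    head_word: str
    universal_word: str
    restriction: str
    attributes: tuple


# Assumed shape: [HW] "UW(restriction)" (ATTR, ATTR, ...);
ENTRY_RE = re.compile(
    r'\[(?P<hw>[^\]]+)\]\s*'
    r'"(?P<uw>[^"(]+?)\s*(?:\((?P<restr>[^)]*)\))?"\s*'
    r'\((?P<attrs>[^)]*)\)\s*;'
)


def parse_entry(line: str) -> UWEntry:
    """Split one dictionary line into head word, UW, restriction and attributes."""
    # Normalise typographic quotes so that slide text can be pasted in directly.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    match = ENTRY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    attrs = tuple(a.strip() for a in match.group("attrs").split(",") if a.strip())
    return UWEntry(
        head_word=match.group("hw").strip(),
        universal_word=match.group("uw").strip(),
        restriction=match.group("restr") or "",
        attributes=attrs,
    )


entry = parse_entry('[crane] "crane(icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry.head_word, entry.restriction, entry.attributes)
```

The first attribute is the part of speech and the rest are semantic attributes, matching the example entry.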
Introduction
Ontology*
Animate (ANIMT)
– Flora (FLORA)
  Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
  Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
  ….
– Fauna (FAUNA)
  Mammals (MML)
  Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
  Birds (ANIMT, FAUNA, BIRD)
  Fish (ANIMT, FAUNA, FISH)
  Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
  ……
*Dictionary group, CFILT, IIT Bombay.
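In the ontology fragment shown, the attribute list of a class looks like the path of codes from the root down to that class, for example (ANIMT, FAUNA, BIRD). The sketch below assumes that reading. The nested dictionary holds only the codes visible on the slide, and `attribute_path` is a hypothetical helper.

```python
# Only the part of the ontology visible on the slide; elided branches are left out.
ONTOLOGY = {
    "ANIMT": {
        "FLORA": {"SHRB": {}, "AQTC": {}},
        "FAUNA": {"MML": {}, "RPTL": {}, "BIRD": {}, "FISH": {}, "INSCT": {}},
    }
}


def attribute_path(target, tree=ONTOLOGY, prefix=()):
    """Return the codes from the root down to `target`, or None if it is absent."""
    for code, children in tree.items():
        path = prefix + (code,)
        if code == target:
            return path
        found = attribute_path(target, children, path)
        if found:
            return found
    return None


print(attribute_path("BIRD"))
print(attribute_path("SHRB"))
```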
English-UW dictionary generation
English-UW dictionary generation
Resources used:
– English WordNet, a WSD* system (soft word sense disambiguation method), the UNL KB and an inferencer.
Knowledge-based approach.
* G. Ramakrishnan and P. Bhattacharyya. Soft Word Sense Disambiguation, GWN 2004
English-UW dictionary generation
Method
Stage 1 –
Stage 2 –
[Figure: Input Document (Word1 Word2 ...) → WSD* → POS and sense tagged document (Word1:N:1, Word2:N:3, ...)]
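The tagged document in the figure uses tokens of the form word:POS:sense. A minimal sketch of reading such tokens follows, assuming they are separated by whitespace; the slide does not specify the separator. `TaggedWord` and `parse_tagged` are illustrative names.

```python
from typing import NamedTuple


class TaggedWord(NamedTuple):
    word: str
    pos: str
    sense: int


def parse_tagged(text):
    """Turn 'Word1:N:1 Word2:N:3' into a list of TaggedWord records."""
    tagged = []
    for token in text.split():
        # Split from the right so a colon inside the word itself is preserved.
        word, pos, sense = token.rsplit(":", 2)
        tagged.append(TaggedWord(word, pos, int(sense)))
    return tagged


doc = parse_tagged("Word1:N:1 Word2:N:3 crane:N:4")
print(doc[2])
```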
English-UW dictionary generation (Method)
[Figure: Tagged Document (Word1:pos1:sense1, Word2:pos2:sense2, ...) → Inference Engine → UW Dictionary and Explanation; the Inference Engine queries the KB (WordNet, UNL KB, Database of rules)]
UW generation for nouns
UW generation
UW generation for nouns
[Figure: Tagged Document (crane:N:4, Word2:pos2:sense2, ...) → Inference Engine, with a KB made up of WordNet and the UNL KB]
1. The entry crane:N:4 is read from the tagged document.
2. A query to collect semantic information.
3. Result: crane → bird → fauna, animal → organism.
4. A query to collect relevant rules.
5. Result:
   depth  word          relation  restriction
   6      bird          icl       animal
   5      animal        icl       living thing
   4      living thing  null      null
6. Output: crane(icl>bird)
7. Explanation
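The walk-through for crane can be sketched as follows. The slides do not state how the restriction is chosen, so this sketch assumes the nearest hypernym that also appears in the UNL KB is used. `UNL_KB` is a toy copy of the three rows shown, and `noun_uw` is a hypothetical helper, not the actual inference engine.

```python
# Toy stand-in for the UNL KB rows on the slide: word -> (relation, restriction).
UNL_KB = {
    "bird": ("icl", "animal"),
    "animal": ("icl", "living thing"),
    "living thing": (None, None),
}


def noun_uw(word, hypernym_chain, unl_kb=UNL_KB):
    """Return (uw, explanation); the selection rule is an assumption."""
    explanation = [f"hypernym chain of {word}: {' -> '.join(hypernym_chain)}"]
    for hypernym in hypernym_chain:
        if hypernym in unl_kb:
            explanation.append(f"'{hypernym}' is present in the UNL KB")
            return f"{word}(icl>{hypernym})", explanation
        explanation.append(f"'{hypernym}' is not in the UNL KB, moving up")
    explanation.append("no hypernym found in the UNL KB, returning the bare word")
    return word, explanation


uw, why = noun_uw("crane", ["bird", "fauna", "animal", "organism"])
print(uw)  # crane(icl>bird)
for line in why:
    print(" ", line)
```

The returned explanation list plays the role of the Explanation output in step 7.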
UW generation for verbs
UW generation
UW generation for verbs
Input word
1. Is {hypernyms(word)} ∩ {‘be’, ‘continue’, etc.} ≠ ∅ ?
   true: (icl>be), e.g. exist(icl>be)
   false: go to 2
2. Is {hypernyms(nominal word)} ∩ {‘phenomenon’, ‘natural event’, etc.} ≠ ∅ ?
   true: (icl>occur), e.g. rain(icl>occur)
   false: (icl>do), e.g. make(icl>do)
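The verb decision procedure amounts to two set-intersection tests. A minimal sketch follows, assuming the hypernym sets have already been fetched from WordNet. The two trigger sets contain only the members named on the slide, and the hypernyms in the usage lines are toy values.

```python
# Only the members named on the slide; the real sets are longer ("etc.").
BE_LIKE = {"be", "continue"}
OCCUR_LIKE = {"phenomenon", "natural event"}


def verb_uw(word, verb_hypernyms, nominal_hypernyms):
    """Choose the restriction for a verb: be, occur, or do (the default)."""
    if set(verb_hypernyms) & BE_LIKE:
        return f"{word}(icl>be)"
    if set(nominal_hypernyms) & OCCUR_LIKE:
        return f"{word}(icl>occur)"
    return f"{word}(icl>do)"


# Toy hypernym sets, not actual WordNet output.
print(verb_uw("exist", {"be"}, set()))
print(verb_uw("rain", {"precipitate"}, {"precipitation", "phenomenon"}))
print(verb_uw("make", {"create"}, set()))
```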
UW generation for adjectives
Input word
1. Is a UW present in the UNL KB?
   Yes: pick the UW, e.g. broad(aoj>thing)
   No: go to 2
2. Is the is_a_value_of relation defined (IS_DEFINED) on the input word?
   Yes: (aoj>thing), e.g. good(aoj>thing)
   No: (mod>thing), e.g. green(mod>thing)
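The adjective procedure can be sketched in the same way. Here `unl_kb_uws` and `has_value_of` are toy stand-ins for the UNL KB lookup and the WordNet is_a_value_of check; they are populated only with the slide's own examples.

```python
def adjective_uw(word, unl_kb_uws, has_value_of):
    """Pick a stored UW if there is one, else aoj>thing or mod>thing."""
    if word in unl_kb_uws:
        return unl_kb_uws[word]
    if word in has_value_of:
        return f"{word}(aoj>thing)"
    return f"{word}(mod>thing)"


# Toy data mirroring the three examples on the slide.
UNL_KB_UWS = {"broad": "broad(aoj>thing)"}
HAS_VALUE_OF = {"good"}

for adjective in ("broad", "good", "green"):
    print(adjective_uw(adjective, UNL_KB_UWS, HAS_VALUE_OF))
```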
Semantic attribute generation
English-UW dictionary generation (Method)
Semantic attribute generation
[Figure: Tagged Document (crane:N:4, Word2:pos2:sense2, ...) → Inference Engine, with a KB made up of WordNet and the Database of rules]
1. The entry crane:N:4 is read from the tagged document.
2. A query to collect semantic information.
3. Result: crane → bird → fauna, animal → organism.
4. A query to collect relevant rules.
5. Result:
   IF hypernym=‘organism’ THEN generate ‘ANIMT’ ELSE generate ‘INANI’;
   IF hypernym=‘fauna’ THEN generate ‘FAUNA’;
   IF hypernym=‘bird’ THEN generate ‘BIRD’;
   ...
6. Output: (N, ANIMT, FAUNA, BIRD)
Semantic attribute generation
Database of rules
No. of such rules: 4344

Table 1. Rules for nouns (96)
HYPERNYM     ATTRIBUTE
organism     ANIMT
flora        FLORA
fauna        FAUNA
bird         BIRD

Table 2. Rules for verbs (405)
HYPERNYM     ATTRIBUTE
change       VOA,CHNG
communicate  VOA,COMM
move         VOA,MOTN
complete     VOA,CMPLT

Table 3.1. Rules for adjectives (29)
IS_A_VALUE_OF  ATTRIBUTE
weight         DES,WT
strength       DES,STRNGTH
qual           DES,QUAL

Table 3.2. Rules for adjectives (3258)
SYNONYMY OR ANTONYMY  ATTRIBUTE
bright                DES,APPR
deep                  DES,DPTH
shallow               DES,DPTH

Table 4. Rules for adverbs (556)
SYNONYMY     ATTRIBUTE
backward     DRCTN
always       FREQ
frequent     FREQ
beautifully  MAN
Experiments and Results
Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary)
[Figure: two plots of Precision (%) against Document No. (1 to 10), one for nouns and one for verbs]
Precision for nouns – 93.9%
Precision for verbs – 84.4%
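The precision measure is a plain ratio of correct entries to total entries. A minimal sketch follows; the counts in the usage line are hypothetical and are not taken from the experiments.

```python
def precision(correct_entries, total_entries):
    """Precision as a percentage: correct entries over total entries."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return 100.0 * correct_entries / total_entries


# Hypothetical counts for one document, for illustration only.
print(precision(47, 50))  # 94.0
```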
Experiments and results
Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary)
[Figure: two plots of Precision (%) against Document No. (1 to 10), one for adjectives and one for adverbs]
Precision for adjectives – 90.06%
Precision for adverbs – 86%
Implementation details
Subtasks identified:
– A MySQL database is used for storing the rules and the UNL KB: 7540 entries in the UNL KB and 4344 entries in the rule base.
– Inference engine in C++.
– Web interface of the DDG in CGI & PHP.
– Other utilities such as the UNL KB organizer, rule entry interface and WSD integrator are implemented in Perl.
– LOC: 4761
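To illustrate how rules stored in a relational database might be queried, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL. The table layout, the column names and the `lookup` helper are assumptions, since the slides do not show the actual schema; the rows come from the rule tables in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per rule, keyed by part of speech and cue word.
conn.execute(
    "CREATE TABLE rules (pos TEXT, relation TEXT, cue TEXT, attributes TEXT)"
)
conn.executemany(
    "INSERT INTO rules VALUES (?, ?, ?, ?)",
    [
        ("N", "HYPERNYM", "organism", "ANIMT"),
        ("N", "HYPERNYM", "fauna", "FAUNA"),
        ("N", "HYPERNYM", "bird", "BIRD"),
        ("V", "HYPERNYM", "move", "VOA,MOTN"),
        ("ADV", "SYNONYMY", "always", "FREQ"),
    ],
)


def lookup(connection, pos, cues):
    """Return {cue: attributes} for the rules of `pos` that match any cue."""
    if not cues:
        return {}
    placeholders = ",".join("?" * len(cues))
    rows = connection.execute(
        f"SELECT cue, attributes FROM rules WHERE pos = ? AND cue IN ({placeholders})",
        [pos, *cues],
    ).fetchall()
    return dict(rows)


print(lookup(conn, "N", ["bird", "fauna", "animal", "organism"]))
```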
Demo
Hindi-UW dictionary generation
Method
Hindi-UW dictionary generation
1. The WordNet API is used to obtain all possible parts of speech and all possible senses for every word.
2. The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes.
3. The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW.
4. The irrelevant entries are disabled and the incorrect ones are corrected manually by the lexicographer.
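The four steps can be sketched as a candidate generator followed by a manual review flag. Everything below is a toy stand-in: the three dictionaries replace the WordNet API, the Hindi WN API and the Hindi UW dictionary database, none of whose real interfaces appear in the slides. The head word "pakshi" is a romanized placeholder, and its sense, attributes and UW are illustrative values.

```python
# Toy stand-ins for the three resources queried in steps 1 to 3.
SENSES = {"pakshi": [("N", 1)]}                                      # step 1
ATTRIBUTES = {("pakshi", "N", 1): ("N", "ANIMT", "FAUNA", "BIRD")}   # step 2
UW_DATABASE = {("pakshi", "N"): "bird(icl>animal)"}                  # step 3


def candidate_entries(word):
    """Generate one candidate entry per (POS, sense) pair of the word."""
    entries = []
    for pos, sense in SENSES.get(word, []):
        entries.append({
            "head_word": word,
            "pos": pos,
            "sense": sense,
            "uw": UW_DATABASE.get((word, pos)),
            "attributes": ATTRIBUTES.get((word, pos, sense), (pos,)),
            "enabled": True,  # step 4: the lexicographer may switch this off
        })
    return entries


def disable(entries, index):
    """Step 4: mark an irrelevant candidate as disabled instead of deleting it."""
    entries[index]["enabled"] = False


entries = candidate_entries("pakshi")
print(entries[0]["uw"], entries[0]["attributes"])
```

Candidates are kept and flagged rather than deleted, so a lexicographer's decision can be reversed later.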
Demo
Conclusion and future work
– The burden of lexicography has been reduced considerably.
– The system is routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi).
– Future work will be directed towards implementing a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.