Automatic Translation of Nominal Compound into Hindi
Prashant Mathur
IIIT Hyderabad
Soma Paul
IIIT Hyderabad
Outline
What is a Nominal Compound (NC)?
Translation variation of English NC into Hindi
Motivation
Approach
Results
Future Work
Bibliography
Nominal Compound
A construct of two or more nouns. The rightmost noun is the head; the preceding nouns are its modifiers.
Oil pump: a device used to pump oil
Customer satisfaction indices: indices that indicate the satisfaction rate of customers
Two-word nominal compounds are the object of study here.
Frequency of NCs in English corpora (Baldwin et al 2004)
Corpus     Words    NC frequency
BNC        84M      2.6%
Reuters    108M     3.9%
Variation in translating English NC into Hindi
As nominal compound: ‘Hindu texts’ hindU SastroM; ‘milk production’ dugdha utpAdana
As genitive construction: ‘rice husk’ cAval kI bhUsI; ‘room temperature’ kamare ka tApamAna
As one word: ‘cow dung’ gobar
As adjective-noun construction: ‘nature cure’ prAkratik cikitsA; ‘hill camel’ pahARI UMTa
As other syntactic phrase: ‘wax work’ mom par kalAkArI (‘work on wax’); ‘body pain’ SarIr meM dard (‘pain in body’)
Others: ‘hand luggage’ haat meM le jaaye jaane vaale saamaan
Motivation
Issues in translation:
  Choice of the appropriate target lexeme during lexical substitution
  Selection of the right target construct type
NCs as a class are frequent in a corpus, but each individual compound occurs only a few times.
NCs are too varied to be precompiled into an exhaustive list of translated candidates.
Therefore …
NCs have to be handled on the fly.
Translating NCs from English into Hindi is therefore a challenging NLP task.
With Google Translate
Tested on the same dataset that was used to evaluate our system:
Translation formation        Precision
Overall                      45%
Eng NC → Hindi NC            29%
Eng NC → Hindi genitive      10%
Others                       6%
Approach
1. Translation template generation
2. Extraction of NCs from the English corpus
3. Sense disambiguation of components
4. Lexical substitution of the component nouns using a bilingual dictionary
5. Preparing translation candidates
6. Corpus search of translation candidates and their ranking
Translation Template Generation
We surveyed 50,000 sentences of parallel corpora and found the following construction types.
Construction type                  No. of occurrences    Percentage
Nominal Compound                   3959                  42.9%
Genitive                           1976                  21.4%
Long Phrases                       581                   6.284%
Adjective Noun Phrase              557                   6.024%
Single Word                        766                   8.285%
Transliterated Nominal Compound    1208                  13.065%
None                               199                   2.152%
Some Templates
A total of 44 templates were formed; some of them are shown below.
Nominal Compound: H1 H2
Genitive: H1 kA H2, H1 ke H2, H1 kI H2
Long Phrases: H1 pe H2, H1 meM H2, H1 par H2, H1 ke xvArA H2, H1 se prApwa H2
Adjective: H1-ikA H2
Single Word: H1
Extraction
1. Tree-Tagger is a POS tagger that also gives extra information, such as the lemma of each word.
   Word    Tree-Tagger output (word_POS tag_lemma)
   rods    rods_NNS_rod
2. As assumed previously, only noun-noun formations are considered nominal compounds.
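The extraction step can be illustrated with a minimal sketch. It assumes the tagger output has already been flattened into `word_TAG_lemma` strings as in the `rods_NNS_rod` example, and that `NN` and `NNS` are the noun tags of interest; the function names and the hand-written tagged sentence are hypothetical, not part of the original system.

```python
from itertools import groupby

# Assumption: only common-noun tags count as nouns for this sketch.
NOUN_TAGS = {"NN", "NNS"}


def parse_token(token):
    """Split a 'word_TAG_lemma' string into its three fields."""
    word, tag, lemma = token.rsplit("_", 2)
    return word, tag, lemma


def extract_two_word_ncs(tagged_tokens):
    """Return (modifier_lemma, head_lemma) for every run of exactly two nouns."""
    parsed = [parse_token(t) for t in tagged_tokens]
    compounds = []
    # Group the sentence into maximal runs of noun / non-noun tokens.
    for is_noun, run in groupby(parsed, key=lambda p: p[1] in NOUN_TAGS):
        run = list(run)
        # Longer runs are multi-word compounds, which are out of scope here.
        if is_noun and len(run) == 2:
            compounds.append((run[0][2], run[1][2]))
    return compounds


# Hand-tagged toy input (hypothetical tagger output).
tagged = [
    "Campaigns_NNS_campaign", "for_IN_for", "road_NN_road",
    "safety_NN_safety", "are_VBP_be", "organized_VBN_organize",
]
print(extract_two_word_ncs(tagged))
```

Grouping into maximal noun runs, rather than scanning adjacent pairs, keeps three-noun compounds such as "customer satisfaction indices" from being split into two spurious two-word compounds.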
Lexical Substitution
Step 3: Sense Disambiguation of Components
Purpose: to reduce the number of translation candidates.
Example: Campaigns for road safety are organized to keep everyone safer on the Indian roads
Noun component    No. of WN senses    Sense selected    Synset
Road              2                   #1                <road, route>
Safety            6                   #2                <safety, refuge>
WordNet::SenseRelate by Ted Pedersen: 80% accuracy in the case of NC disambiguation.
Lexical Substitution
How do we now translate the disambiguated components into Hindi?
We do not have a direct WordNet mapping from English to Hindi, so we use an alternative method to translate.
Step 4: Lexical Substitution
Acquire all possible translations for all the words within a synset.
Road: path, maarg, saDak, raastaa
Route: maarg, saDak, raastaa
Safety: ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge: ASraya sthAna, ASraya, sahArA, SaraNa, CipanA
Contd…
Select those Hindi words that are common translations of all the English words of a synset, if there are any.
Road: path, maarg, saDak, raastaa
Route: maarg, saDak, raastaa
Selected words are: maarg, saDak, raastaa
Safety: ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge: ASraya sthAna, ASraya, sahArA, SaraNa, CipanA
There is no common translation here, so all words are selected.
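The selection rule on this slide can be sketched as a set intersection with a fallback. The function name and the dictionary layout are assumptions made for illustration; the word lists are the ones shown on the slide, and the fallback to the union reflects the "all words are selected" case.

```python
def select_translations(synset_translations):
    """Pick the Hindi words shared by every English word of a synset.

    synset_translations maps each English word of the synset to its list of
    Hindi translations (assumed non-empty). If no word is shared, every
    translation is kept.
    """
    lists = list(synset_translations.values())
    common = set(lists[0]).intersection(*lists[1:])
    if common:
        # Keep the order of the first word's translation list.
        return [w for w in lists[0] if w in common]
    # No shared translation: fall back to all words, in first-seen order.
    selected = []
    for translations in lists:
        for w in translations:
            if w not in selected:
                selected.append(w)
    return selected


road_synset = {
    "road": ["path", "maarg", "saDak", "raastaa"],
    "route": ["maarg", "saDak", "raastaa"],
}
safety_synset = {
    "safety": ["ahAnikArakatA", "suraksita sthAna", "suraksA", "salAmatI", "suraksA sAdhana"],
    "refuge": ["ASraya sthAna", "ASraya", "sahArA", "SaraNa", "CipanA"],
}
print(select_translations(road_synset))
print(len(select_translations(safety_synset)))
```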
Step 5: Preparing Translation Candidates
For “road safety”, the candidates generated from the templates are:
mArga para surakRA
mArga surakRA
SaDak para surakRA
SaDak kI surakRA
...
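Candidate preparation amounts to filling every template with every pair of component translations. The sketch below assumes a small subset of the 44 templates, written as Python format strings; the `TEMPLATES` list and the function name are illustrative only.

```python
from itertools import product

# Assumption: a handful of the templates, with H1/H2 as format fields.
TEMPLATES = [
    "{h1} {h2}",       # nominal compound
    "{h1} kA {h2}",    # genitive
    "{h1} ke {h2}",
    "{h1} kI {h2}",
    "{h1} par {h2}",   # long phrase
    "{h1} meM {h2}",
]


def prepare_candidates(h1_words, h2_words, templates=TEMPLATES):
    """Fill each template with each (modifier, head) translation pair."""
    return [
        template.format(h1=h1, h2=h2)
        for h1, h2, template in product(h1_words, h2_words, templates)
    ]


candidates = prepare_candidates(["mArga", "saDaka"], ["surakRA"])
print(len(candidates))
print(candidates[:3])
```

The number of candidates grows as the product of the two translation lists and the template count, which is why pruning the translation lists through sense disambiguation pays off before the corpus search.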
Step 6: Corpus Search
Hindi corpus (raw): 28 million words
Indexed search: pattern match
Example
election time: cunAva ke samaya
temple community: maMxira kA samAja
marriage customs: vivAha kI praWA
…
But we did not find any translation for
road safety: ∅
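A toy version of the indexed pattern match might look as follows. It assumes whitespace-tokenized corpus sentences and an in-memory n-gram count index; the actual index used on the 28-million-word corpus is not described in the slides, and the corpus lines here are made-up filler around the phrases from the example.

```python
from collections import Counter


def build_ngram_index(sentences, max_n=4):
    """Count every word n-gram (n = 1..max_n) in the corpus."""
    index = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[" ".join(tokens[i:i + n])] += 1
    return index


def search_candidates(index, candidates):
    """Return the attested candidates with their counts, most frequent first."""
    hits = {c: index[c] for c in candidates if index[c] > 0}
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus: 'x' and 'y' are placeholder tokens, not real Hindi words.
corpus = [
    "x cunAva ke samaya y",
    "y cunAva ke samaya x",
    "x vivAha kI praWA y",
]
index = build_ngram_index(corpus)
print(search_candidates(index, ["cunAva samaya", "cunAva ke samaya", "cunAva kA samaya"]))
print(search_candidates(index, ["mArga surakRA", "mArga kI surakRA"]))
```

An empty result, as for the "road safety" candidates, corresponds to the ∅ case above.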
CTQ (Corpus based Translation Quality)
Rates a given translation candidate on both
  the fully specified translation, and
  its parts in the context of the translation template in question.
$$\mathrm{CTQ}(w_1^H, w_2^H, t) = \alpha P(w_1^H, w_2^H, t) + \beta P(w_1^H, t)\, P(w_2^H, t)\, P(t)$$
$t$ is the translation template used
$w_1^H, w_2^H$ are the translations of the components of the NC
$\alpha = 1$, $\beta = 0$ if $P(w_1^H, w_2^H, t) > 0$ (we did not vary the constants $\alpha$, $\beta$)
Contd..
Example: road safety, for which $P(w_1^H, w_2^H, t) = 0$
road: mArga, mArga ke, mArga meM, saDaka, saDaka par …
safety: surakRA, ke surakRA, meM surakRA, … and so on
$P(\text{mArga}, \text{meM}) \cdot P(\text{meM}, \text{surakRA}) \cdot P(\text{meM}) = (2.28 \times 10^{-5}) \cdot (9.14 \times 10^{-6}) \cdot 0.286 \approx 6 \times 10^{-11}$
$P(\text{mArga}, \text{kI}) \cdot P(\text{kI}, \text{surakRA}) \cdot P(\text{kI}) = (1.35 \times 10^{-5}) \cdot (3.82857143 \times 10^{-5}) \cdot 0.228 \approx 1.18 \times 10^{-10}$
Higher probability for “mArga kI surakRA”
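The comparison above can be reproduced with a small sketch of the CTQ score. The function signature is hypothetical. The slides state α = 1, β = 0 only for the case where the full candidate is attested; using α = 0, β = 1 otherwise is an assumption made here so that the back-off term is what gets compared, as in the example.

```python
def ctq(p_full, p_w1_t, p_w2_t, p_t, alpha=None, beta=None):
    """CTQ = alpha * P(w1, w2, t) + beta * P(w1, t) * P(w2, t) * P(t)."""
    if alpha is None or beta is None:
        if p_full > 0:
            alpha, beta = 1.0, 0.0   # setting given on the slide
        else:
            alpha, beta = 0.0, 1.0   # assumption: back off to the parts
    return alpha * p_full + beta * p_w1_t * p_w2_t * p_t


# Probabilities taken from the "road safety" example; P(w1, w2, t) = 0 for both.
scores = {
    "mArga meM surakRA": ctq(0.0, 2.28e-5, 9.14e-6, 0.286),
    "mArga kI surakRA": ctq(0.0, 1.35e-5, 3.82857143e-5, 0.228),
}
best = max(scores, key=scores.get)
for candidate, score in scores.items():
    print(f"{candidate}: {score:.3e}")
print("best:", best)
```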
Ranking
Baseline ranking: count-based ranking
A stronger ranking measure: CTQ (borrowed from Baldwin and Tanaka (2004))
Results
[Bar chart, y-axis 0 to 100 (%). Settings: Dictionary, 1st Sense + Dict, WSD + Dict. Series: Baseline Recall, Baseline Precision, CTQ Recall, CTQ Precision. Data labels in extraction order: 14, 50, 24, 46.1, 24.6, 53.6, 19, 56.2, 28, 54.1, 28.5, 62.1]
Contd..
Measure taken to improve recall: use the genitive as the default construct when no translation is found for an NC.
Motivation: in an experiment on development data, we verified whether the NCs for which no translation was found during corpus search can legitimately be translated as a genitive construct.
We found that the heuristic works in 59% of the cases.
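The default-to-genitive heuristic can be sketched as a fallback around the ranked corpus hits. The function name is hypothetical, and so is the way the genitive marker is chosen: the slides do not say how kA, ke, or kI is selected, so this sketch simply takes it as a parameter.

```python
GENITIVE_MARKERS = ("kA", "ke", "kI")  # markers from the genitive templates


def translate_with_fallback(ranked_candidates, h1, h2, marker="kA"):
    """Return the top-ranked candidate, or a genitive construct if none exists.

    Assumption: the caller supplies the genitive marker; choosing it
    automatically is outside the scope of this sketch.
    """
    if ranked_candidates:
        return ranked_candidates[0]
    if marker not in GENITIVE_MARKERS:
        raise ValueError(f"unknown genitive marker: {marker}")
    return f"{h1} {marker} {h2}"


print(translate_with_fallback(["cunAva ke samaya"], "cunAva", "samaya"))
print(translate_with_fallback([], "mArga", "surakRA", marker="kI"))
```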
Results
[Bar chart, y-axis 0 to 60 (%). Series: Recall, Precision. Data labels in extraction order: 24.8, 54, 44.5, 57. Caption: Using genitive as default construct where the system fails to produce a translation]
Related works
Similar approaches (searching for translation templates in a corpus) were adopted in:
Bungum and Oepen (2009), for Norwegian-to-English nominal compound translation
Tanaka and Baldwin (2004), for English-to-Japanese nominal compound translation and vice versa
Conclusion
Novelty of our approach: using a WSD tool on the source language to select the correct sense of the nominal components.
Result: the number of possible translation candidates to be searched in the target-language corpus is significantly reduced.
Future Work
Multinary NC translation
Using semantic features provided in the UW-Dictionary
Varying α and β in the ranking technique to produce more effective results
Bibliography
Tanaka and Timothy Baldwin. Translation by Machine of Complex Nominals: Getting it Right.
Tanaka, Takaaki and Timothy Baldwin. Translation Selection for Japanese-English Noun-Noun Compounds.
Rackow, Ido Dagan, and Ulrike Schwall. Automatic Translation of Noun Compounds.
Bungum and Oepen. Norwegian to English nominal compound translation.