A Knowledge-Light Approach to Luo
Machine Translation and Part-of-Speech Tagging
Guy De Pauw ([email protected])
Naomi Maajabu ([email protected])
Peter Waiganjo Wagacha ([email protected])
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging - AfLaT 2010
Outline
• Resource-scarce language engineering
  - The case of Luo (Dholuo)
• A trilingual parallel corpus English – Swahili – Luo
• Machine Translation experiments
• Projection of Annotation experiments
• Conclusion & Future Work
Resource-Scarce Languages
• Limited financial, political, … resources
• Few digital resources: digital lexicons, corpora
• Bottleneck of linguistic expertise (in LT)
• Two approaches:
  - Rule-based approaches
    • Advantages: meticulous design, linguistically relevant
  - Corpus-based, data-driven approaches
    • Growing importance and availability of digital text material
    • Advantages: performance models, fast development, automatic quantitative evaluation
Dholuo
[Map of the region; country labels: KE, TZ, UG, DRC, RW, BU]
- Western Nilotic language
- Spoken by +3M Luo people
- Kenya, Uganda, Tanzania
- No official dialect
- Tonal, but not marked in orthography
- Latin alphabet, no diacritics
- Resource-scarce (not official language)
- Web-mined corpus of 200k words
- De Pauw, Wagacha & Abade (2007) Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning
Most famous Luo
• Nilotic language
• Spoken by +3M people in Kenya, Tanzania and Uganda
• Not an official language
• Latin script
• Most famous Luo:
AfLaT 2009 / LRE (submitted)
SAWA CORPUS
• 2 million word parallel corpus English – Swahili
• Competitive machine translation results
• Projection of annotation of part-of-speech tags from English into Swahili is viable
But what about truly resource-scarce languages?
Parallel Data for Luo
• International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo
• Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus
• Preprocessing:
  - PDF-to-text conversion
    • Koolwire.com
  - Tokenization
  - Sentence alignment
    • R.C. Moore (2002) Fast and accurate sentence alignment of bilingual corpora.

Example (tokenized Luo text):
Luk 1:1 Ji mang'eny osebedo ka chano weche mane otimore e dierwa ,
Luk 1:2 Mana kaka nochiwgi ne wan kod jogo motelo mane joneno wang'giwang kendo jotich wach .
Luk 1:3 Kuom mano , an bende kaka asenono tiend wechegi malong'o nyaka a chakruok , en gimaber bende mondo andik ni e yo mochanore maler , in mulour Theofilo .
Luk 1:4 Mondo ing'e adier mar gik mosepwonji .
Luk 1:5 E ndalo ma Herode ruodh Judea , ne nitie jadolo ma nyinge Zakaria , ma ne en achiel kuom oganda jodolo mag Abija ; chiege Elizabeth bende ne nyar dhood Harun .
Luk 1:6 Gi duto jariyo ne gin joma kare nyim Nyasaye , ne girito chike madongo gi matindo mag Ruoth Nyasaye , maonge ketho .
Luk 1:7 To ne gi onge gi nyithindo , nikech Elizabeth ne migumba , kendo gin jariyogo hikgi : nose niang'
Luk 1:8 Chieng' moro kane oganda gi Zakaria ne ni e tich to notiyo kaka jadolo e nyim Nyasaye ,
Luk 1:9 Noyiere gi ombulu kaka chik mar jodolo , mondo odonji ei hekalu mar Ruoth kendo owang ubani .
Luk 1:10 To ka sa mar wang'o ubani ochopo , jolemo duto nochokore oko kendo negilamo .
Luk 1:11 Eka malaika mar Ruoth Nyasaye nofwenyorene , kochungo bath kendo mar ubani korachwich .
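The tokenization and alignment steps can be illustrated with a small Python sketch. It is not the pipeline used for the corpus: the slides cite Moore (2002) for sentence alignment, whereas this sketch simply pairs verses by their identifier (e.g. "Luk 1:4"), which only works because Bible text carries such keys. The function names and regular expressions are illustrative assumptions.

```python
import re

# Assumption: a token is a run of word characters and apostrophes
# (so Luo forms like "mang'eny" stay whole) or one punctuation mark.
TOKEN_RE = re.compile(r"[\w'’]+|[^\w\s]")
# Assumption: each line starts with a verse key such as "Luk 1:4".
VERSE_RE = re.compile(r"^(\w+ \d+:\d+)\s+(.*)$")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def read_verses(lines):
    """Map verse keys to token lists, skipping lines without a key."""
    verses = {}
    for line in lines:
        match = VERSE_RE.match(line.strip())
        if match:
            verses[match.group(1)] = tokenize(match.group(2))
    return verses


def align_by_verse(*corpora):
    """Keep only verses present in every corpus, as tuples of token lists."""
    shared = set(corpora[0]).intersection(*corpora[1:])
    return {key: tuple(c[key] for c in corpora) for key in sorted(shared)}
```

Because apostrophes are treated as word characters, quotation marks written with an apostrophe would stick to the neighbouring word; a real tokenizer would need a rule for that case.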
Trilingual corpus
                English   Swahili   Luo
New Testament   192k      156k      170k
• 80% training set
• 10% validation set
• 10% test set (partly annotated for pos-tags)
Tiny, register-specific parallel corpus
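A minimal Python sketch of an 80/10/10 split over aligned sentence triples follows. The slides do not say how sentences were assigned to each set, so the seeded random shuffle and the function name `split_corpus` are assumptions made for illustration.

```python
import random


def split_corpus(sentence_triples, seed=0):
    """Split aligned (English, Swahili, Luo) triples into 80/10/10 parts.

    Assumption: sentences are shuffled with a fixed seed before splitting;
    the original work may have split the data differently.
    """
    items = list(sentence_triples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test
```

Keeping the three languages together in one triple ensures that a sentence lands in the same set for every language pair.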
Word alignment
• Goal: word aligned corpus
• Misconception: morphologically rich languages cannot be used in statistical machine translation, since word-alignment is word-based
Example: Swahili "nimemkatalia" aligns with English "I have turned him down"
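The one-to-many alignment in this example can be written down as a set of index pairs, which is the usual way word alignments are stored. The sketch below is only a data-structure illustration in Python; the links are entered by hand rather than produced by an aligner, and the helper name is hypothetical.

```python
from collections import defaultdict

source = ["nimemkatalia"]
target = ["I", "have", "turned", "him", "down"]

# Hand-written links (source index, target index): the single Swahili
# word is linked to all five English words.
links = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]


def group_by_source(links, source, target):
    """Collect, for each source word, the target words it is linked to."""
    grouped = defaultdict(list)
    for src_idx, tgt_idx in sorted(links):
        grouped[source[src_idx]].append(target[tgt_idx])
    return dict(grouped)
```

Nothing in this representation prevents one word from linking to many, which is why word-based alignment can still be applied to morphologically rich languages.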
Factored Data
• Goal: word aligned corpus
• Misconception: morphologically rich languages cannot be used in statistical machine translation, since it is word-based
• Word alignment and language modeling can be enhanced by using factored data
• General idea: use extra annotation layers (part-of-speech tagging, lemmatization) to aid discovery of possible translation pairs
English: You/PP/you have/VBP/have let/VBN/let go/VB/go of/IN/of the/DT/the commands/NNS/command ..
  • TreeTagger
Swahili: Ninyi/PRON/ninyi mnaiacha/V/mnai amri/N/amri ya/GEN-CON/ya Mungu/PROPNAME/Mungu ...
  • De Pauw et al. (2006)
  • De Pauw & de Schryver (2008)
Luo: Useweyo/useweyo Chike/chike Nyasaye/nyasaye mi/mi koro/koro umako/mako ...
  • MORFESSOR
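The factored notation above can be read mechanically. The Python sketch below parses the slash-separated tokens shown on the slide and rewrites them with "|" between factors, the separator Moses expects in factored training data. The function names and the factor labels ("surface", "pos", "lemma", "stem") are assumptions chosen for illustration.

```python
def parse_factored(line, factor_names):
    """Turn 'word/POS/lemma ...' into a list of {factor name: value} dicts."""
    tokens = []
    for token in line.split():
        parts = token.split("/")
        if len(parts) != len(factor_names):
            raise ValueError(f"unexpected number of factors in {token!r}")
        tokens.append(dict(zip(factor_names, parts)))
    return tokens


def to_pipe_format(tokens, factor_names):
    """Serialize tokens with '|' between factors and spaces between tokens."""
    return " ".join("|".join(tok[name] for name in factor_names) for tok in tokens)


# English and Swahili carry three factors; the Luo example carries two.
english = parse_factored("You/PP/you have/VBP/have let/VBN/let",
                         ["surface", "pos", "lemma"])
luo = parse_factored("Useweyo/useweyo Chike/chike", ["surface", "stem"])
```

A word form that itself contains "/" would break this simple split, so real data would need an escaping convention.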
Machine Translation Experiments
• English ↔ Luo and Swahili ↔ Luo
• Use standard SMT tool MOSES (Koehn et al. 2007)
  - Phrase-based machine translation
  - Can handle factored data
  - Uses SRILM language modeling tool (Stolcke 2002)
• English: Gigaword corpus
• Swahili: TshwaneDJe Kiswahili Internet Corpus
• Luo: 200k Luo corpus + Training/Evaluation Set of New Testament data
Results
• OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model)
• BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation)
• NIST: modification of BLEU, taking into account information value of n-grams
BLEU & NIST attempt to optimize correlation with human evaluation
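To make the OOV and BLEU definitions concrete, here is a simplified Python sketch. It is not the evaluation script used for the reported scores: it assumes a single reference per sentence, corpus-level counts, n-grams up to length 4 and no smoothing. NIST is left out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not occur in the given vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocabulary) / len(tokens)


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU: clipped n-gram precision with a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

The function takes tokenized sentences, for example `corpus_bleu([hyp.split()], [ref.split()])` for one machine translation and its reference.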
SMT Experiments

                              OOV     NIST   BLEU
Luo → English                 4.4%    5.39   0.23
Luo → English [F]             4.4%    6.52   0.29
English → Luo                 11.4%   4.12   0.18
English → Luo [F]             11.4%   5.31   0.22
Luo → Swahili                 6.1%    2.91   0.11
Luo → Swahili [F]             6.1%    3.17   0.15
Swahili → Luo                 11.4%   2.96   0.10
Swahili → Luo [F]             11.4%   3.36   0.15
English → Swahili (Google)    --      3.96   0.26
English → Swahili (SAWA)      --      3.05   0.22
Swahili → English (Google)    --      4.14   0.29
Swahili → English (SAWA)      --      4.52   0.35
English → Spanish (SMT07)     --      --     0.35
Spanish → English (SMT07)     --      --     0.32
English → Czech (SMT07)       --      --     0.13
Czech → English (SMT07)       --      --     0.24

[F] = factored data
• Translation dictionary did not significantly improve results, but factored data did
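The effect of factored data can be read off the table by comparing each baseline row with its [F] counterpart. The short Python sketch below does that arithmetic for the BLEU column; the scores are copied from the table, while the dictionary layout and function name are illustrative.

```python
# (baseline BLEU, factored BLEU) per translation direction, from the table.
BLEU_SCORES = {
    "Luo -> English": (0.23, 0.29),
    "English -> Luo": (0.18, 0.22),
    "Luo -> Swahili": (0.11, 0.15),
    "Swahili -> Luo": (0.10, 0.15),
}


def factored_gain(baseline, factored):
    """Return the absolute and relative BLEU gain of the factored system."""
    absolute = factored - baseline
    return absolute, absolute / baseline


for pair, (baseline, factored) in BLEU_SCORES.items():
    absolute, relative = factored_gain(baseline, factored)
    print(f"{pair}: +{absolute:.2f} BLEU ({relative:.0%} relative)")
```

Every direction gains between 0.04 and 0.06 BLEU, which is a large relative change given how low the baselines are.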
Examples
Source: en ng’a moloyo piny ? mana jalo moyie ni yesu en wuod nyasaye
Translation: who is more than the earth ? only he who believes that he is the son of god
Reference: who is it that overcomes the world ? only he who believes that jesus is the son of god

Source: atimo erokamano kuom thuoloni
Translation: do thanks about this time
Reference: I am thankful for your leadership
Projection of annotation
• Use word alignment to bootstrap annotation in a resource-scarce language
• Project part-of-speech tags from resource-rich(er) language
• Direct correspondence assumption (Hwa 2002)
• Tag-projection list + Tag priority list
• 2000-word subset of test set manually annotated for pos-tags
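The projection idea can be sketched in a few lines of Python. The sketch assumes the direct correspondence assumption in its simplest form: a target word inherits the tag of the source word it is aligned to. The contents of `TAG_PROJECTION` and `TAG_PRIORITY`, and the alignment links in the usage example, are hypothetical placeholders; the actual lists used in the experiments are not given on the slides.

```python
from collections import defaultdict

# Hypothetical tag-projection list: source (English) tag -> target tag.
TAG_PROJECTION = {"NN": "N", "NNS": "N", "VB": "V", "VBP": "V", "VBN": "V",
                  "PP": "PRON", "DT": "DET", "IN": "PREP"}
# Hypothetical tag priority list: earlier entries win when several
# source words with different tags align to the same target word.
TAG_PRIORITY = ["V", "N", "PRON", "PREP", "DET"]


def project_tags(source_tags, alignment, target_length):
    """Project tags across (source index, target index) alignment links.

    Unaligned target words receive None, i.e. they stay untagged.
    """
    candidates = defaultdict(set)
    for src_idx, tgt_idx in alignment:
        mapped = TAG_PROJECTION.get(source_tags[src_idx])
        if mapped is not None:
            candidates[tgt_idx].add(mapped)
    projected = []
    for tgt_idx in range(target_length):
        options = candidates.get(tgt_idx)
        projected.append(min(options, key=TAG_PRIORITY.index) if options else None)
    return projected


# English: You have let go of the commands -> Luo: Useweyo Chike Nyasaye
english_tags = ["PP", "VBP", "VBN", "VB", "IN", "DT", "NNS"]
hypothetical_links = [(0, 0), (1, 0), (2, 0), (3, 0), (5, 1), (6, 1)]
luo_tags = project_tags(english_tags, hypothetical_links, target_length=3)
```

With these placeholder lists, the first Luo word receives both PRON and V as candidates and the priority list resolves the conflict; the third word is unaligned and stays untagged.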
Projection of annotation - Results
                  Rate    Precision   Accuracy
English → Luo     73.6%   69.7%       51.3%
Swahili → Luo     71.5%   68.4%       48.9%
Exclusive → Luo   66.5%   78.5%       52.2%
Inclusive → Luo   75.4%   69.5%       52.4%
• Difference English – Swahili diminishes (word alignment)
• Errors made mainly on function words (closed class)
• Possible improvements:
  - Translation dictionary
  - Data-driven, morphology-aware part-of-speech tagger
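The three columns of the results table can be related to each other. The Python sketch below assumes that rate is the share of words that received a projected tag, precision is the share of tagged words whose tag is correct, and accuracy is the share of all words tagged correctly. These definitions are an assumption, but they fit the reported figures: rate × precision reproduces the accuracy column in every row.

```python
def projection_scores(gold_tags, projected_tags):
    """Return (rate, precision, accuracy); None marks an untagged word."""
    if len(gold_tags) != len(projected_tags):
        raise ValueError("gold and projected tag lists differ in length")
    total = len(gold_tags)
    tagged = sum(1 for tag in projected_tags if tag is not None)
    correct = sum(1 for gold, tag in zip(gold_tags, projected_tags)
                  if tag is not None and tag == gold)
    rate = tagged / total if total else 0.0
    precision = correct / tagged if tagged else 0.0
    accuracy = correct / total if total else 0.0
    return rate, precision, accuracy


# Reported (rate, precision, accuracy) in percent, copied from the table.
REPORTED = {
    "English": (73.6, 69.7, 51.3),
    "Swahili": (71.5, 68.4, 48.9),
    "Exclusive": (66.5, 78.5, 52.2),
    "Inclusive": (75.4, 69.5, 52.4),
}

for name, (rate, precision, accuracy) in REPORTED.items():
    product = round(rate * precision / 100, 1)
    print(f"{name}: rate x precision = {product}, reported accuracy = {accuracy}")
```

Read this way, the "Exclusive" setting tags fewer words but tags them more reliably, which is why its accuracy matches the "Inclusive" setting despite the lower rate.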
Work-load
Data preprocessing: 1 week
MOSES configuration: 2 days (5 mins CPU)
(annotation: 1 week)
• Re-inventing the wheel not necessary to arrive at decent results
• Only modest MT performance, pos-tagging performance
  - But free and easily extendible (more data, languages, …)
  - Starting point for other efforts
Conclusion
• First proof-of-principle experiments (machine translation, projection of annotation) for a Nilotic language
“If you have a digital bible, you have an MT system and other NLP components”
• Small register-specific parallel corpus English – Swahili – Luo
• Modest, but encouraging BLEU & NIST scores
  - SMT for Luo is possible
  - No alternatives (cf. other African languages?)
• Factored data can overcome limitations of pure word-based methods for word-alignment
• Morphological generation on the target language side is still a bottleneck
Future Work
• Make trilingual, annotated corpus available through OPEN-CONTENT TEXT CORPUS
  - (SAWA corpus will be made available soon as well)
• Use translation model as seed for bilingual web mining
• Tweak & tune MOSES parameters to improve quality
• Better morphological analysis, generation for Dholuo
  - Unsupervised morphology induction
• Use automatically induced annotation as training data for supervised data-driven taggers
• Repeat experiment for other resource-scarce languages
Demonstration System
http://aflat.org/luomt