
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging

Guy De Pauw, Naomi Maajabu, Peter Waiganjo Wagacha


A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging - AfLaT 2010

Outline

• Resource-scarce language engineering
- The case of Luo (Dholuo)
• A trilingual parallel corpus English – Swahili – Luo
• Machine Translation experiments
• Projection of Annotation experiments
• Conclusion & Future Work



Resource-Scarce Languages

• Limited financial, political, … resources
• Few digital resources: digital lexicons, corpora
• Bottleneck of linguistic expertise (in LT)
• Two approaches:
- Rule-based approaches
  • Advantages: meticulous design, linguistically relevant
- Corpus-based, data-driven approaches
  • Growing importance and availability of digital text material
  • Advantages: performance models, fast development, automatic quantitative evaluation



Dholuo

[Map of the Luo-speaking region; country labels: KE, TZ, UG, DRC, RW, BU]

- Western Nilotic language
- Spoken by +3M Luo people
- Kenya, Uganda, Tanzania
- No official dialect
- Tonal, but not marked in orthography
- Latin alphabet, no diacritics
- Resource-scarce (not official language)
- Web-mined corpus of 200k words
- De Pauw, Wagacha & Abade (2007) Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning



Most famous Luo

• Nilotic language
• Spoken by +3M people in Kenya, Tanzania and Uganda
• Not an official language
• Latin script

• Most famous Luo:



AfLaT 2009 / LRE (submitted)

SAWA CORPUS
• 2 million word parallel corpus English – Swahili
• Competitive machine translation results
• Projection of annotation of part-of-speech tags from English into Swahili is viable

But what about true resource-scarce languages?



Parallel Data for Luo

• International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo

• Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus

• Preprocessing:
- Pdftext conversion
- Tokenization
- Sentence alignment






• Pdftext conversion: Koolwire.com







• Tokenization: sample of the tokenized Luo text (Luke 1:1-11)


Luk 1:1 Ji mang'eny osebedo ka chano weche mane otimore e dierwa ,

Luk 1:2 Mana kaka nochiwgi ne wan kod jogo motelo mane joneno wang'giwang kendo jotich wach .

Luk 1:3 Kuom mano , an bende kaka asenono tiend wechegi malong'o nyaka a chakruok , en gimaber bende mondo andik ni e yo mochanore maler , in mulour Theofilo .

Luk 1:4 Mondo ing'e adier mar gik mosepwonji .

Luk 1:5 E ndalo ma Herode ruodh Judea , ne nitie jadolo ma nyinge Zakaria , ma ne en achiel kuom oganda jodolo mag Abija ; chiege Elizabeth bende ne nyar dhood Harun .

Luk 1:6 Gi duto jariyo ne gin joma kare nyim Nyasaye , ne girito chike madongo gi matindo mag Ruoth Nyasaye , maonge ketho .

Luk 1:7 To ne gi onge gi nyithindo , nikech Elizabeth ne migumba , kendo gin jariyogo hikgi : nose niang'

Luk 1:8 Chieng' moro kane oganda gi Zakaria ne ni e tich to notiyo kaka jadolo e nyim Nyasaye ,

Luk 1:9 Noyiere gi ombulu kaka chik mar jodolo , mondo odonji ei hekalu mar Ruoth kendo owang ubani .

Luk 1:10 To ka sa mar wang'o ubani ochopo , jolemo duto nochokore oko kendo negilamo .

Luk 1:11 Eka malaika mar Ruoth Nyasaye nofwenyorene , kochungo bath kendo mar ubani korachwich .
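The sample above shows punctuation split off from words, while word-internal and word-final apostrophes (mang'eny, Chieng') stay attached. The slides do not say which tokenizer was used; the following is only a minimal sketch of a regular-expression tokenizer that would produce this kind of output. The token classes in `TOKEN_RE` and the function name `tokenize` are assumptions made for illustration.

```python
import re

# Assumed token classes (not taken from the slides):
#   verse references such as "1:4",
#   words that may contain or end in an apostrophe (ing'e, Chieng'),
#   plain numbers, and single punctuation marks.
TOKEN_RE = re.compile(r"\d+:\d+|[^\W\d_]+(?:'[^\W\d_]*)*|\d+|[^\w\s]")


def tokenize(line: str) -> list[str]:
    """Split a verse line into tokens, detaching punctuation from words."""
    return TOKEN_RE.findall(line)


if __name__ == "__main__":
    raw = "Luk 1:4 Mondo ing'e adier mar gik mosepwonji."
    print(" ".join(tokenize(raw)))
```

A rule this simple would also attach a closing single quote to the preceding word, so quoted speech would need extra handling.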







• Sentence alignment: R.C. Moore (2002) Fast and accurate sentence alignment of bilingual corpora.



Trilingual corpus

                 English   Swahili   Luo
New Testament    192k      156k      170k

• 80% training set

• 10% validation set

• 10% test set (partly annotated for pos-tags)

Tiny, register-specific parallel corpus
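The 80/10/10 partition can be sketched as follows. The slides do not say whether the split was contiguous or random, so this sketch assumes a contiguous split over aligned (English, Swahili, Luo) units; the function name `split_corpus` and its parameters are hypothetical.

```python
def split_corpus(units, train_pct=80, dev_pct=10):
    """Split aligned units into train / validation / test, keeping order.

    `units` is any sequence of aligned items, e.g. (english, swahili, luo)
    triples. Integer percentages avoid floating-point rounding surprises.
    Whatever remains after train and validation becomes the test set.
    """
    n = len(units)
    n_train = n * train_pct // 100
    n_dev = n * dev_pct // 100
    train = units[:n_train]
    dev = units[n_train:n_train + n_dev]
    test = units[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder data standing in for aligned verse triples.
    verses = [(f"en {i}", f"sw {i}", f"luo {i}") for i in range(50)]
    train, dev, test = split_corpus(verses)
    print(len(train), len(dev), len(test))
```

Splitting all three languages through the same triples keeps the partitions aligned across languages.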



Word alignment

• Goal: word aligned corpus

• Misconception: morphologically rich languages cannot be used in statistical machine translation, since word-alignment is word-based

Example: Swahili "nimemkatalia" aligns with English "I have turned him down"



Factored Data

• Goal: word aligned corpus

• Misconception: morphologically rich languages cannot be used in statistical machine translation, since it is word-based

• Word alignment and language modeling can be enhanced by using factored data

• General idea: use extra annotation layers (part-of-speech tagging, lemmatization) to aid discovery of possible translation pairs



Factored Data

English: You/PP/you have/VBP/have let/VBN/let go/VB/go of/IN/of the/DT/the commands/NNS/command ..
• TreeTagger

Swahili: Ninyi/PRON/ninyi mnaiacha/V/mnai amri/N/amri ya/GEN-CON/ya Mungu/PROPNAME/Mungu ...
• De Pauw et al. (2006)
• De Pauw & de Schryver (2008)

Luo: Useweyo/useweyo Chike/chike Nyasaye/nyasaye mi/mi koro/koro umako/mako ...
• MORFESSOR
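The factored tokens above are written as slash-separated fields (word/tag/lemma for English and Swahili, word/stem for Luo). As a rough illustration, the sketch below parses such a line and rewrites it with `|` between factors, which is the separator Moses expects in factored input. The function names `parse_factored` and `to_moses` are hypothetical, and the slides do not describe the authors' actual conversion step.

```python
def parse_factored(line: str, n_factors: int) -> list[tuple[str, ...]]:
    """Parse slash-separated factored tokens, e.g. 'have/VBP/have'.

    Splitting from the right keeps any '/' inside the surface form intact.
    """
    tokens = []
    for item in line.split():
        factors = tuple(item.rsplit("/", n_factors - 1))
        if len(factors) != n_factors:
            raise ValueError(f"expected {n_factors} factors in {item!r}")
        tokens.append(factors)
    return tokens


def to_moses(tokens) -> str:
    """Join the factors of each token with '|' and tokens with spaces."""
    return " ".join("|".join(t) for t in tokens)


if __name__ == "__main__":
    english = "You/PP/you have/VBP/have let/VBN/let go/VB/go"
    luo = "Useweyo/useweyo Chike/chike Nyasaye/nyasaye"
    print(to_moses(parse_factored(english, 3)))
    print(to_moses(parse_factored(luo, 2)))
```

Note that the Luo side carries only two factors, since no part-of-speech tagger was available for it at this stage.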



Machine Translation Experiments

• English ↔ Luo and Swahili ↔ Luo

• Use standard SMT tool MOSES (Koehn et al. 2007)
- Phrase-based machine translation
- Can handle factored data
- Uses SRILM language modeling tool (Stolcke 2002)
  • English: Gigaword corpus
  • Swahili: TshwaneDJe Kiswahili Internet Corpus
  • Luo: 200k Luo corpus + Training/Evaluation Set of New Testament data



Results

• OOV: percentage of out-of-vocabulary words (i.e. words unknown to the language model)

• BLEU: Bilingual Evaluation Understudy (calculates n-gram overlap between reference and machine translation)

• NIST: modification of BLEU, taking into account information value of n-grams

BLEU & NIST attempt to optimize correlation with human evaluation
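To make the two simplest metrics concrete, here is a minimal sketch of an OOV rate and a single-reference, sentence-level BLEU (clipped n-gram precisions for n = 1 to 4, geometric mean, brevity penalty). It is not the scoring script used for the reported results, which are computed over the whole test set; the function names are hypothetical and no smoothing is applied.

```python
import math
from collections import Counter


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not occur in the known vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference.

    Any n-gram order without overlap makes the score 0.0.
    """
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_sum += math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_sum / max_n)


if __name__ == "__main__":
    hyp = "only he who believes that he is the son of god".split()
    ref = "only he who believes that jesus is the son of god".split()
    print(round(bleu(hyp, ref), 3))
    print(oov_rate(hyp, set(ref)))
```

NIST differs from this mainly by weighting each matching n-gram by how informative (rare) it is, which the sketch does not attempt.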



SMT Experiments ([F] = factored data)

                      OOV      NIST    BLEU
Luo → English         4.4%     5.39    0.23
Luo → English [F]     4.4%     6.52    0.29
English → Luo         11.4%    4.12    0.18
English → Luo [F]     11.4%    5.31    0.22
Luo → Swahili         6.1%     2.91    0.11
Luo → Swahili [F]     6.1%     3.17    0.15
Swahili → Luo         11.4%    2.96    0.10
Swahili → Luo [F]     11.4%    3.36    0.15

• Translation dictionary did not significantly improve results, but factored data did
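The size of the gain from factored data can be read off the BLEU column directly. The short sketch below only redoes that arithmetic on the scores reported in the table; the helper name `relative_gain` is hypothetical, and no significance testing is implied.

```python
# BLEU scores copied from the results table: (baseline, factored).
BLEU = {
    "Luo -> English": (0.23, 0.29),
    "English -> Luo": (0.18, 0.22),
    "Luo -> Swahili": (0.11, 0.15),
    "Swahili -> Luo": (0.10, 0.15),
}


def relative_gain(baseline: float, factored: float) -> float:
    """Improvement of the factored system as a fraction of the baseline."""
    return (factored - baseline) / baseline


if __name__ == "__main__":
    for pair, (base, fact) in BLEU.items():
        gain = relative_gain(base, fact)
        print(f"{pair}: +{fact - base:.2f} BLEU ({gain:.0%} relative)")
```

Because the baselines are low, small absolute differences translate into large relative ones, so the relative figures should be read with care.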


Comparison: English – Swahili

                              OOV    NIST    BLEU
English → Swahili (Google)    --     3.96    0.26
English → Swahili (SAWA)      --     3.05    0.22
Swahili → English (Google)    --     4.14    0.29
Swahili → English (SAWA)      --     4.52    0.35


Comparison: SMT07 results

                              OOV    NIST    BLEU
English → Spanish (SMT07)     --     --      0.35
Spanish → English (SMT07)     --     --      0.32
English → Czech (SMT07)       --     --      0.13
Czech → English (SMT07)       --     --      0.24



Examples

Source: en ng’a moloyo piny ? mana jalo moyie ni yesu en wuod nyasaye
Translation: who is more than the earth ? only he who believes that he is the son of god
Reference: who is it that overcomes the world ? only he who believes that jesus is the son of god

Source: atimo erokamano kuom thuoloni
Translation: do thanks about this time
Reference: I am thankful for your leadership



Projection of annotation

• Use word alignment to bootstrap annotation in a resource-scarce language

• Project part-of-speech tags from resource-rich(er) language

• Direct correspondence assumption (Hwa 2002)

• Tag-projection list + Tag priority list

• 2,000-word subset of test set manually annotated for pos-tags
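The projection step can be sketched as follows. The slides name a tag-projection list and a tag priority list but do not spell out how they are used; this sketch assumes that the projection list maps source tags to target tags and that the priority list resolves conflicts when one target word is aligned with several source words. The function name, the tag names and the example alignment are all hypothetical.

```python
def project_tags(src_tags, alignment, n_target, tag_map, priority):
    """Project POS tags from source tokens onto target tokens.

    src_tags:  one POS tag per source token
    alignment: (source_index, target_index) pairs from word alignment
    n_target:  number of tokens in the target sentence
    tag_map:   source tag -> target tag (assumed "tag-projection list")
    priority:  target tags, most preferred first (assumed "tag priority list")
    Returns one tag per target token, or None where nothing was projected.
    """
    candidates = [[] for _ in range(n_target)]
    for s, t in alignment:
        tag = tag_map.get(src_tags[s])
        if tag is not None:
            candidates[t].append(tag)
    rank = {tag: i for i, tag in enumerate(priority)}
    # Tags missing from the priority list rank last.
    return [
        min(c, key=lambda tag: rank.get(tag, len(rank))) if c else None
        for c in candidates
    ]


if __name__ == "__main__":
    # Hypothetical toy input: three source words align with target word 0.
    src_tags = ["PP", "VBP", "VBN", "NNS"]
    alignment = [(0, 0), (1, 0), (2, 0), (3, 1)]
    tag_map = {"PP": "PRON", "VBP": "V", "VBN": "V", "NNS": "N"}
    priority = ["V", "N", "PRON"]
    print(project_tags(src_tags, alignment, 3, tag_map, priority))
```

Target words left as `None` are the unaligned ones; they lower the share of words that receive a tag at all.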



Projection of annotation - Results

                    Rate     Precision   Accuracy
English → Luo       73.6%    69.7%       51.3%
Swahili → Luo       71.5%    68.4%       48.9%
Exclusive → Luo     66.5%    78.5%       52.2%
Inclusive → Luo     75.4%    69.5%       52.4%

• Difference English – Swahili diminishes (word alignment)

• Errors made mainly on function words (closed class)

• Possible improvements:
- Translation dictionary
- Data-driven, morphology-aware part-of-speech tagger
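The slides do not define the three result columns. The reported figures are, however, consistent with reading Rate as the share of words that received a projected tag, Precision as the share of those tags that are correct, and Accuracy as their product over all words. The sketch below checks that reading against the reported numbers; it is an observation about the figures, not a definition taken from the source.

```python
# (rate %, precision %, accuracy %) copied from the results table.
RESULTS = {
    "English -> Luo": (73.6, 69.7, 51.3),
    "Swahili -> Luo": (71.5, 68.4, 48.9),
    "Exclusive -> Luo": (66.5, 78.5, 52.2),
    "Inclusive -> Luo": (75.4, 69.5, 52.4),
}


def implied_accuracy(rate: float, precision: float) -> float:
    """Accuracy over all words if `rate` % are tagged and
    `precision` % of the tagged words are correct."""
    return rate * precision / 100


if __name__ == "__main__":
    for name, (rate, prec, acc) in RESULTS.items():
        implied = implied_accuracy(rate, prec)
        print(f"{name}: {implied:.1f}% implied vs {acc}% reported")
```

Under this reading, the exclusive combination trades coverage for precision, while the inclusive one does the opposite, and both end up at nearly the same overall accuracy.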



Work-load

Data preprocessing: 1 week
MOSES configuration: 2 days (5 mins CPU)
(annotation: 1 week)

• Re-inventing the wheel not necessary to arrive at decent results

• Only modest MT performance, pos-tagging performance
- But free and easily extendible (more data, languages, …)
- Starting point for other efforts



Conclusion

• First proof-of-principle experiments (machine translation, projection of annotation) for a Nilotic language

“If you have a digital bible, you have an MT system and other NLP components”

• Small register-specific parallel corpus English – Swahili – Luo
• Modest, but encouraging BLEU & NIST scores
- SMT for Luo is possible
- No alternatives (cf. other African languages?)

• Factored data can overcome limitations of pure word-based methods for word-alignment.

• Morphological generation on the target language side is still a bottleneck



Future Work

• Make trilingual, annotated corpus available through OPEN-CONTENT TEXT CORPUS
- (SAWA corpus will be made available soon as well)
• Use translation model as seed for bilingual web mining
• Tweak & tune MOSES parameters to improve quality
• Better morphological analysis, generation for Dholuo
- Unsupervised morphology induction

• Use automatically induced annotation as training data for supervised data-driven taggers

• Repeat experiment for other resource-scarce languages




Demonstration System

http://aflat.org/luomt