Abu-MaTran Automatic building of Machine Translation PIAP- GA-2012-324414 D2.3. Workshop on data creation Dissemination level Public Delivery date 2015/01/31 Status and version Final, v1.0 Authors and affiliation Gema Ramírez-Sánchez (Prompsit), Nikola Ljubešić (UZ) Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development
Page 1: Workshop on the Apertium free/open-source machine translation platform. Gema Ramírez-Sánchez, Prompsit Language Engineering, S.L. Campus UMH. Edifici Quorum III.

Abu-MaTran: Automatic Building of Machine Translation

PIAP-GA-2012-324414

D2.3. Workshop on data creation

Dissemination level: Public

Delivery date: 2015/01/31

Status and version: Final, v1.0

Authors and affiliation: Gema Ramírez-Sánchez (Prompsit), Nikola Ljubešić (UZ)

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development


PIAP-GA-2012-324414 Abu-MaTran Deliverable 2.3: Workshop on data creation

Table of Contents

Executive Summary
Workshop Description
Conclusions
Annex A. Workshop Materials


Executive Summary

This deliverable corresponds to task T.2.2. (Workshop on data creation for the languages of the case study) within work package 2 (Dissemination and outreach), which aims to disseminate the project to different audiences (industrial stakeholders, academia and the general public).

The task was carried out by the industrial partner, Prompsit Language Engineering, during the secondment of Gema Ramírez-Sánchez at the University of Zagreb (UZ) in October/November 2014. It was developed under the supervision of Dr. Nikola Ljubešić and with the support of UZ technical and administrative staff. This workshop on data creation for rule-based machine translation systems will be complemented by a second one for advanced users, scheduled for May/June 2015.

The workshop took place on 5th and 6th November 2014 in the computer room of the Library of the Faculty of Humanities and Social Sciences at the University of Zagreb, with 20 participants. The audience was very heterogeneous: students, researchers, freelance translators, computer engineers and translation company managers. It was organised as a two-day, 6-hour workshop (3 hours each day) combining theory and hands-on activities on how to add data to the language pairs in the Apertium rule-based machine translation (RBMT) platform (footnote 1).

1 www.apertium.org


1 Workshop execution

Prior to the workshop,

• we prepared and tested the language pair structure for some of the languages of the case study as Apertium-style language pairs

• we collaborated with UZ and the Apertium community to create data and user interfaces to ease the task of the contributors and the attendees (footnote 2)

• we gave a talk about Apertium and Prompsit at the UZ Natural Language Processing circle to encourage potential attendees to participate in the workshop and to introduce them to the field of RBMT

• we prepared the workshop materials (workshop guide, slides, data, examples, etc.)

• we managed the call, registration and communication with attendees

After the workshop, we had the following tangible outputs:

• all workshop (footnote 3) and talk (footnote 4) materials: they were made available to the attendees and to the general public through the Abu-MaTran website under a free license.

• a new language pair in Apertium (footnote 5): the language pair to which the attendees contributed during the workshop, Croatian to Serbian and the reverse direction, has been contributed to the Apertium platform.

• an enhanced Abu-MaTran online translation application (footnote 6): the new language pair and the rest of the South-Slavic-related language pairs in Apertium have been added to the Abu-MaTran online translation portal, which is available to the general public.

• improved user interfaces to add data to Apertium (footnote 7): the creation and enhancement of user interfaces, done partly for this workshop, is a contribution that seeks to join forces with other attempts inside the Apertium community to lower the bar for newcomers to the platform.

• motivated potential future contributors and commercial exploiters: some participants wanted to keep contributing, and some of the company attendees expressed interest in being kept updated about the progress of the Abu-MaTran project with a view to future joint actions.

2 In particular, UZ contributed more than 22,000 new entries to the Croatian, Bosnian and Serbian dictionaries and prepared the entries to be added to the monolingual dictionaries, as well as a set of entries that differ between the Croatian and Serbian lexicons. Prompsit's recruit worked on the first prototype of an interface to associate monolingual entries with paradigms. Members of the Apertium community who had contributed to South-Slavic language pairs in the past shared their impressions and answered questions during a meeting held in Zagreb. Another Apertium contributor improved an application to annotate corpora and train a disambiguation module inside Apertium.

3 http://www.abumatran.eu/?p=292

4 http://www.abumatran.eu/?p=285

5 https://svn.code.sf.net/p/apertium/svn/staging/apertium-hbs_HR-hbs_SR/

6 http://translator.abumatran.eu/

7 The Paradigm Association Tool (http://paradigm.abumatran.eu/) and Annotatrix (http://abumatran.eu:28000/) are still experimental user interfaces. They will be improved as part of the Abu-MaTran project.


Workshop Description

The organisation of a Workshop on Data Creation for South-Slavic languages in the Abu-MaTran project had several objectives: introducing Apertium to researchers, students and companies; looking for new contributors; obtaining data for the language pairs of the case study (as Apertium is particularly suitable for closely related languages); and testing a new approach to adding data based on experimental user interfaces.

The workshop, split into two (basic and advanced level) after a change in the original plan, was presented as a “Workshop on the Apertium free/open-source machine translation platform: basics on how to control the engine through linguistics”.

The topics covered by the workshop were taken from Prompsit's experience working within the Apertium (footnote 8) platform project over the last 9 years and organising training sessions for companies and individuals (trainees, translators and computer engineers) who want to contribute to it.

Apertium is a modular system with dictionaries and transfer rules as its knowledge base. Dictionaries are monolingual and bilingual and, due to the nature of languages, adding entries to monolingual dictionaries may introduce ambiguity into the system. Deciding whether book is a noun or a verb is an example of the ambiguity introduced when the entry book is added to an English monolingual dictionary. A module decides which reading (noun or verb) is the most probable for an ambiguous word in its context, based on statistics computed on a corpus. Manually disambiguated corpora give better statistics, so they are a valuable (although also expensive) resource. Whichever reading is chosen, it needs a translation, so both book-libro (noun) and book-reservar (verb) have to be defined as entries in the English-Spanish dictionary.
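The idea of choosing the most probable reading from corpus statistics can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Apertium's actual tagger: the tiny hand-tagged corpus, the function names and the "most frequent tag after the previous tag" criterion are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical manually disambiguated corpus: lists of (word, tag) pairs.
TAGGED_CORPUS = [
    [("I", "prn"), ("book", "vblex"), ("a", "det"), ("table", "n")],
    [("the", "det"), ("book", "n"), ("is", "vbser"), ("new", "adj")],
    [("a", "det"), ("book", "n"), ("fell", "vblex")],
    [("we", "prn"), ("book", "vblex"), ("rooms", "n")],
]

def train(corpus):
    """Count how often each tag follows each previous tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        previous = "<s>"  # sentence-start marker
        for _word, tag in sentence:
            counts[previous][tag] += 1
            previous = tag
    return counts

def disambiguate(candidates, previous_tag, counts):
    """Pick the candidate tag seen most often after previous_tag."""
    return max(candidates, key=lambda tag: counts[previous_tag][tag])

counts = train(TAGGED_CORPUS)
print(disambiguate(["n", "vblex"], "det", counts))  # book after a determiner
print(disambiguate(["n", "vblex"], "prn", counts))  # book after a pronoun
```

With this toy corpus, book is read as a noun after a determiner and as a verb after a pronoun, which shows why a larger, carefully annotated corpus yields more reliable decisions.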

This first workshop focused on how to add new monolingual and bilingual entries to dictionaries and how to manually disambiguate a corpus. It will be followed by a second workshop focused on more advanced knowledge: how to add transfer rules to Apertium.

In all, the following 6 objectives were covered during the two sessions:

• O1. Have a general idea of how machine translation works: you will test some machine translation systems and understand what's going on behind them

• O2. Understand how Apertium works: you will see and touch the inner parts of Apertium, module by module

• O3. Understand Apertium monolingual dictionaries: you will help us improve Apertium monolingual dictionaries

• O4. Understand why ambiguity is our main problem and how we cope with it: you will explore ambiguous sentences and try to define some rules to disambiguate them

• O5. Understand Apertium bilingual dictionaries: you will help us improve Apertium bilingual dictionaries

• O6. Understand Apertium from the developer's point of view: you will work with frequency estimates, defined tasks and corpus-driven knowledge

8 www.apertium.org


During the workshop the session topics were addressed as follows:

• all participants received a workshop guide containing step-by-step exercises to be completed after an introduction to each topic

• each topic was introduced with the support of slides, leading to a discussion of the topic

• after the discussion, participants were asked to complete the exercises for each topic, which were then also discussed together

The guide and the slides were distributed to all participants and are available under a free/open-source Creative Commons license through the Abu-MaTran website.

A copy of both materials is also attached to this deliverable.

Conclusions

We can highlight some very positive aspects that arose from the organisation of this workshop:

• Attendee interest: we had 20 participants in the two-day workshop and many of them expressed interest in participating in the second one. We were seeking 25 participants for this outreach activity. Taking into account that there will be two workshops instead of one and that the preceding talk had 50 people in the audience, we will easily exceed the initial target.

• Useful content: the workshop guide, the slides and the talk allow casual off-line users to be introduced to the topic, follow the workshop and test the machine translation systems.

• Useful data: we worked on enhancing the monolingual and bilingual dictionaries for Croatian and Serbian. We started with 16 entries and ended with 485 in the bilingual dictionaries, with their corresponding monolingual entries already defined or added. A review of them was needed after the workshop (completed by the UZ partner), but in general, people understood how to complete the task.

• Feedback: we received positive feedback about the workshop from the participants, many of them expressed interest in enrolling in the next workshop, and we opened the way to exploring collaboration with companies in the area.

• User interfaces: the time that would otherwise have been needed to teach the technical knowledge required to edit Apertium files directly was instead spent on explaining how to interact with the user interfaces, which are still new and experimental. The attendees were able to use them to complete the tasks, which strongly supports going ahead and getting them ready for production. They also suggested modifications to improve the user experience and capabilities.


Annex A. Workshop Materials

The written materials used at the workshop (guide and slides) are delivered together with this document in PDF format. The source files for the slides (LibreOffice OpenDocument format) and the guide (LaTeX format) will be delivered on demand under a Creative Commons Attribution-ShareAlike 3.0 license (footnote 9).

9 http://creativecommons.org/licenses/by-sa/3.0/


Workshop on the Apertium free/open-source machine translation platform

Gema Ramírez-Sánchez

Prompsit Language Engineering, S.L.

www.prompsit.com

Campus UMH. Edifici Quorum III.

Av. de la Universitat, s/n. 03203. Elx (Alacant). Spain

5/6 November 2014. Zagreb.

Contents

1 Introduction to machine translation
1.1 Rule-based vs corpora-based approaches

2 Apertium
2.1 How does Apertium work?

3 Dictionaries
3.1 Monolingual entries
3.2 Bilingual entries

4 Part-of-speech tagger data
4.1 Annotated corpora
4.2 Tagsets

About this workshop

The materials of this workshop on the ”Apertium free/open-source machine translation platform” have been created by Prompsit Language Engineering, S.L., as part of the Abu-MaTran (Automatic Building of Machine Translation) project (footnote 1) funded by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement number PIAP-GA-2012-324414. Special thanks to Nikola Ljubešić for his help.

1 www.abumatran.eu


Overview

This guide is intended to be your best friend during this workshop for the hands-on and hands-up practical exercises you’ll be working on to meet the following objectives:

1. Have a general idea of how machine translation works: you will test some translators and understand what’s going on behind them

2. Understand how Apertium works: you will see and touch the inner parts of Apertium, module by module

3. Understand Apertium monolingual dictionaries: you will help us improve Apertium monolingual dictionaries

4. Understand why ambiguity is our main problem and how we cope with it: you will explore ambiguous sentences, annotate corpora and see some rules for disambiguation

5. Understand Apertium bilingual dictionaries: you will help us improve Apertium bilingual dictionaries

6. Understand Apertium from the developer’s point of view: you will work with frequency estimates, defined tasks and corpus-driven knowledge

For every section there will be a basic introduction to the topic before we get our hands on it.

1 Introduction to machine translation

1.1 Rule-based vs corpora-based approaches

We’ve reviewed together the basics of machine translation: definition, main uses and types of machine translation. Before going on, let’s take a look at some machine-translated texts.

Task 1. Taking a look at machine translation systems [30 min.]

In the next table you are presented with 2 texts translated by 4 different machine translation systems from Croatian into English. The translations have been sorted randomly for each text.


For each text:

• try to guess whether each translation comes from an SMT or an RBMT system and justify your answer by giving two reasons for your classification

• choose the best candidate for assimilation purposes (that is, the one that would be better for getting the meaning of the original sentence)

• choose the best candidate for dissemination purposes, more specifically, for post-editing (that is, the one that would be most useful for producing an adequate translation by applying the minimum number of changes to it)

Text: 2. The Raveonettes!

Croatian: Prvi put u Zagreb na samostalan koncert stizu The Raveonettes, danski indie rock dvojac u kojem su basistica i pjevacica Sharin Foo i gitarist Sune Rose Wagner.

MT1: That’s the first time in Zagreb to act independently concert are The Raveonettes, Danish indie rock pair in which they basistica and singer Sharin Foo and guitarist Sune Rose Wagner.

MT2: First time in Zagreb on solo concert arrive The Raveonettes, Danish indie rock dvojac in which are basistica and singer Sharin Foo and gitarist Sune Dew Wagner.

MT3: For the first time in Zagreb on standalone concert coming The Raveonettes, a Danish indie rock in which the bass player and singer Sharin Foo and guitarist Sune Rose Wagner.

MT4: First into a Zagreb at an substantive concert stizu The Raveonettes , Danish indie rock brace into a which have been basistica plus songstress Sharin Foo plus guitar Sune Scarlet-rash Wagner.

System | RBMT or SMT | Reason | Best for assimilation | Best for dissemination
MT1 | | | |
MT2 | | | |
MT3 | | | |
MT4 | | | |

Text: 3. Family Fazlinovic!

Croatian: Nova, osma sezona kultne serije ”Lud, zbunjen, normalan” krece od ponedjeljka na Novoj TV! Ne propustite nove zgode u legendarnoj obitelji Fazlinovic!

MT1: Nova, the eighth seasons screen cult series ”Lud, zbunjen, normalan” ranges from Monday on Nova TV! There miss new convenience in legendary family Fazlinovic!

MT2: Money , eight high season cult serial ” witless , confused , unexceptional kree with Monday at an Learner TV! Does not let off nove time into a legendary families Fazlinovi!

MT3: New, eighth season kultne series ”Lud, zbunjen, normal” moves from Monday on New TV! Not propustite new zgode in legendary family Fazlinovic!

MT4: The new, eighth season of the cult series ”Lud, normal” ranges from Monday on Nova TV! Do not miss the next game in the legendary family Fazlinovic!


System | RBMT or SMT | Reason | Best for assimilation | Best for dissemination
MT1 | | | |
MT2 | | | |
MT3 | | | |
MT4 | | | |

2 Apertium

2.1 How does Apertium work?

We’ve seen that Apertium is an engine with a modular architecture. Each module performs an action on the input it receives from the preceding module. Let’s see what the output of each module looks like.

Task 4. Taking a look at Apertium with apertium-viewer [30 min.]

Apertium-viewer (footnote 2) is a tool that shows the translation process in Apertium module by module. To access it:

• Open a browser and copy/paste the following URL or click on it: http://tinyurl.com/nwj97bl

• A menu to Open or Save a file will appear. Let’s just open it. Click on Accept.

• A security warning window will pop up. Click on Run.

• A confirmation window will then pop up. Click on Yes.

• A final reconfirmation window will pop up. Click on Yes again.

You’ll finally see an interface like the one shown below.

Please follow these instructions:

• First of all, make sure that the option Online (and not Local) at the top right of the screen is selected. Otherwise, click on Online and wait a few seconds.

2 Further reading: http://wiki.apertium.org/wiki/Apertium-viewer


Figure 1: apertium-viewer

• Next to it you have a menu called Mode which says SELECT A MODE. In Apertium terminology, a mode is a translation direction. Open the menu and select the mode ingles-espanol. Wait a few seconds: it takes a while to load all the dictionaries...

• For a better user experience, let’s change the font of the user interface. Go to the left menu and click on View. Go to Font and set it to Dialog-Bold-28. Click on Done.

We are ready for testing! Let’s start:

• Write a simple sentence: Hello world.

• You’ll see the translation appear as you type, and the final translation at the end: Hola mundo.

• Easy, isn’t it? But even the simplest sentences can be ambiguous. That’s why, before you type the final dot, your translation is Hola mundial and not Hola mundo. Note that world can be a noun or an adjective.

• So, how does Apertium know what to do? If you click on any of the bars appearing on the screen and drag it down, you’ll start to see the output of all the intermediate modules in Apertium.


Figure 2: View of apertium-viewer modules

• Don’t be scared: here is how to read those strangely named commands. You’ll also find it useful to open in a separate tab the wiki page that specifies how part-of-speech and other morphological features (footnote 3) are denoted in Apertium:

1. Morphological analyser output: lt-proc data/en-es.automorf.bin

2. Part-of-speech tagger: apertium-tagger -g $2 data/en-es.prob

3. Multiple-word unit handler: apertium-pretransfer

4. Saxon genitive handler: apertium-transfer -n data/apertium-en-es.en-es.genitive.t1x data/en-es.genitive.bin

5. First transfer step: apertium-transfer data/apertium-en-es.en-es.t1x data/en-es.t1x.bin data/en-es.autobil.bin

6. Second transfer step: apertium-interchunk data/apertium-en-es.en-es.t2x data/en-es.t2x.bin

7. Third transfer step: apertium-postchunk data/apertium-en-es.en-es.t3x data/en-es.t3x.bin

8. Morphological generator output: lt-proc $1 data/en-es.autogen.bin

9. Post-generator output: lt-proc -p data/en-es.autopgen.bin

3 http://wiki.apertium.org/wiki/List_of_symbols


• Let’s take a look at some sentences.

Hello world travelers

– Inspect module 2 to see ambiguity.

– Inspect module 5 to see a rule ADJECTIVE + NOUN = NOUN + ADJECTIVE. Note that we don’t know the gender yet (<GD>).

– Inspect module 6 to see how agreement between NOUN + ADJECTIVE is propagated.

– Inspect module 7 to see the sequence of lexical forms in the target language.

– Inspect module 9 for the final translation!

I saw Lily’s shoes

– Inspect module 2 to see ambiguity: all words are ambiguous!

– Inspect module 3 to see how ambiguity was solved: well done in this case...

– Inspect modules 5 and 6 side by side to see that the pronoun disappears because it is not needed in Spanish, and how the Saxon genitive rule is applied: PROPER NOUN + ’S + NOUN = NOUN + DE + PROPER NOUN

– Inspect module 7 to see the sequence of lexical forms in the target language.

– Inspect module 9 for the final translation!

In the end, I’ll take the soup of the day

– Inspect module 2 to see a multiple-word unit (in the end)

– Inspect module 8 to see the output of the morphological generator: note the marks for the postgenerator ( ).

– Inspect module 9 for the final translation, where the contraction de + el = del has been applied
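Two of the operations inspected above, the ADJECTIVE + NOUN reordering and the de + el = del contraction, can be mimicked with a small Python sketch. It is only a toy model under stated assumptions: the dictionary entries, the (lemma, tag) representation and the function names are hypothetical, and real Apertium rules are written in its own XML rule files rather than in Python.

```python
# Toy lexical units: (lemma, tag) pairs standing in for Apertium lexical forms.
BILINGUAL = {  # hypothetical English-to-Spanish entries, for this sketch only
    ("world", "adj"): ("mundial", "adj"),
    ("traveler", "n"): ("viajero", "n"),
}

def transfer(units):
    """Translate lemmas, then turn each ADJECTIVE + NOUN pair into NOUN + ADJECTIVE."""
    translated = [BILINGUAL.get(unit, unit) for unit in units]
    output, i = [], 0
    while i < len(translated):
        is_adj_noun = (
            i + 1 < len(translated)
            and translated[i][1] == "adj"
            and translated[i + 1][1] == "n"
        )
        if is_adj_noun:
            output.extend([translated[i + 1], translated[i]])
            i += 2
        else:
            output.append(translated[i])
            i += 1
    return output

def postgenerate(words):
    """Apply the contraction de + el = del to a list of surface words."""
    result = []
    for word in words:
        if word == "el" and result and result[-1] == "de":
            result[-1] = "del"
        else:
            result.append(word)
    return result

print(transfer([("world", "adj"), ("traveler", "n")]))
print(postgenerate(["la", "sopa", "de", "el", "día"]))
```

Keeping reordering and contraction in separate functions mirrors the modular design seen in the viewer, where each step only transforms the output of the previous one.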

As you have seen, when a user clicks on the Translate button of a rule-based machine translation system, a number of linguistically motivated processing steps are applied before the machine-translated output is delivered. But even if the information is accurate, rule-based machine translation is not fully capable of solving the four big problems already reviewed in this session: analysis, synthesis, transfer and description.


3 Dictionaries

3.1 Monolingual entries

Apertium monolingual dictionaries contain the information about words that is needed throughout the translation process. Correspondences between surface forms (toe, toes) and lexical forms (toe, singular noun and toe, plural noun) are defined in Apertium’s dictionaries in a compact way: by associating words with an inflection paradigm.
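As a rough illustration of what associating a word with a paradigm buys us, the following Python sketch expands a stem into its surface forms. It does not use the real dictionary format: the paradigm table, the paradigm name toe__n and the expand function are assumptions made for this example.

```python
# Hypothetical paradigm table: each ending maps to the tags of the lexical form it yields.
PARADIGMS = {
    "toe__n": {"": ("n", "sg"), "s": ("n", "pl")},
}

def expand(stem, paradigm_name, lemma_ending=""):
    """Return {surface form: (lemma, tags)} for a stem attached to a paradigm."""
    lemma = stem + lemma_ending
    return {
        stem + ending: (lemma, tags)
        for ending, tags in PARADIGMS[paradigm_name].items()
    }

entries = expand("toe", "toe__n")
for surface, (lemma, tags) in entries.items():
    print(surface, "->", lemma, tags)
```

One paradigm can serve every word that inflects the same way, so adding a new regular noun only requires its stem and the name of the paradigm.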

To ease this task, we’ve created a user interface to work with nouns, verbs and adjectives that are not in the dictionaries yet. And now we need your help to choose the correct paradigm for them.

Let’s think about this task as a real situation. Take a look at this text translated from Serbian into Croatian (footnote 4).

*MIROSLAV Raduljica, CENTAR REPREZENTACIJE SRBIJE:

Do *juce sam bio bradati majmun, a sada sam car!

*Miroslav Raduljica se prije nekoliko nedjelja *otisnuo u novu avanturu u Kinu, a srpski centar koji trenutno brani boje *Sandogana, ne krije da mu je ovo *leto bilo jedno od najzanimljivijih.

Raduljica trenutno brani boje kineskoga *Sandogana

Do srebra na *SP u *Spaniji, Raduljica je bio poznat kao super *talentovani centar, ali na koga se ne moze uvijek *racunati, pa ga je tako i nekadasnji *selektor ’orlova’ Duda Ivkovic *precrtao sa spiska. Sada je situacija potpuno drukcija:

- Vrlo mi je zanimljivo kako sam sad car, bog, legenda, a do *juce sam bio *istetovirani bradati majmun i splavar. Pa, ja sam isti taj Raduljica, koji sam bio i 2010. Dobro, malo sam unaprijeden sto se tice karaktera, stabilniji sam, ali sam potpuno isti momak, istih *rezonovanja, iste licnosti i percepcije - rekao je Raduljica u intervjuu za *novembarsko izdanje srpskoga ’*Eskvajera’.

4 Text from the Serbian www.alo.rs portal at http://tinyurl.com/ob4yrbc


There are still many problems with it: some words are not in our dictionaries (the ones preceded by a star, such as *juce) and some of them should have a different translation (*juce should be jucer in Croatian). After this session and the one on bilingual dictionaries, this text should look much better...

So, let’s start working towards this goal.

Task 5. Paradigm association tool [40 min.]

Open a browser and navigate to http://paradigm.abumatran.eu. Log in with the user name and password corresponding to your surname without diacritics, and you’ll see the Overview tab of the tool, as shown below.

Figure 3: Paradigm association tool overview

For now, this tab shows just a description of the tool and some statistics about the Serbian words that have already been associated with a paradigm. Later on, you’ll also find your completed sessions here (one for every 10 words in a given category) so that you can review them.

To get started, please go to the Noun tab.

• You’ll see a word below the tab name. This is the surface form of the word you will be working on. Below it, a set of probable paradigms is shown.

• For each paradigm we show 4 things: the lemma or base form that the surface form would have according to this paradigm, the paradigm name as it appears in the Apertium dictionaries, all the surface forms this paradigm would generate, and some morphological information.


• The goal is to choose the correct paradigm for the given surface form. Once you’ve clicked on it, you can go to the next surface form by clicking on the Next button. In case of doubt, please select the first paradigm that seems to fit.

• Note that there is a menu – Change Category – in the upper-right part of the interface, next to the surface form, that allows you to reassign the surface form you are working on to another category if needed.

• Note also that the morphological information provided for each entry combines 4 or 5 pieces of information, in this order:

1. Main category: Nc denotes Noun, common

2. Gender: n denotes neuter, m masculine, f feminine

3. Number: s denotes singular, p plural

4. Case: n denotes nominative, a accusative, v vocative, g genitive, d dative, l locative, i instrumental

5. y denotes animacy

• When you complete 10 entries, a session will be saved for you and you will be able to access it from the Overview page.
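To practise reading the noun codes listed above, here is a small Python sketch that decodes them. It assumes that a code is simply the concatenation of the five positions in the order given (for example Ncmsn, or Ncmsay with the animacy mark); the function name and the exact shape of the codes shown by the tool are assumptions.

```python
GENDER = {"n": "neuter", "m": "masculine", "f": "feminine"}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {
    "n": "nominative", "a": "accusative", "v": "vocative", "g": "genitive",
    "d": "dative", "l": "locative", "i": "instrumental",
}

def decode_noun(code):
    """Decode a noun code such as 'Ncmsn' or 'Ncmsay' into readable features."""
    if not code.startswith("Nc") or len(code) not in (5, 6):
        raise ValueError(f"not a common-noun code: {code!r}")
    if len(code) == 6 and code[5] != "y":
        raise ValueError(f"unexpected sixth position in {code!r}")
    return {
        "category": "noun, common",
        "gender": GENDER[code[2]],
        "number": NUMBER[code[3]],
        "case": CASE[code[4]],
        "animacy_marked": len(code) == 6,  # True only when the trailing y is present
    }

print(decode_noun("Ncmsn"))
print(decode_noun("Ncmsay"))
```

Note that the same letter means different things in different positions: n is neuter in the gender position but nominative in the case position.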

Once you’ve completed the 10 noun entries, please go to the Verb tab.

• This tab is very similar to the Noun tab. The only difference is the set of check boxes next to the – Change Category – menu. With them you can give additional information about a verb besides the paradigm, to fully cover what we need to know about a verb that can be, for example, both transitive and intransitive. We don’t need you to provide this information for the purposes of this workshop. Just do it if you feel like it.

• Regarding the morphological information, you should be able to read it as follows:

1. Main category: Vm denotes Verb main

2. Tense: n denotes infinitive, m imperative, a aorist, r present, e imperfect, f future, p participle

3. Person: 1 denotes first person, 2 second person, 3 third person

4. Number: s or -s denotes singular, p or -p plural

5. Gender: m masculine, f feminine


• Note that all forms beginning with App and Rr are the adjective and adverb forms that can be derived from the verb.

• Note also that the name of the paradigm gives you information about transitivity (tv - transitive, iv - intransitive) and aspect (perf - perfective, imperf - imperfective).

Once you’ve completed the 10 Verb entries, please go to the Adjective tab.

• This tab is very similar to the Noun and Verb tabs. In this case, the check boxes next to the – Change Category – menu tell us whether the adjective has comparative and superlative forms and whether the form contains the yat variant. Again, we don’t need you to provide this information for the purposes of this workshop. Just do it if you feel like it.

• The morphological information for adjectives should be read as follows:

1. Main category: Agp denotes Adjective general positive

2. Gender: n denotes neuter, m masculine, f feminine

3. Number: s denotes singular, p plural

4. Case: n denotes nominative, a accusative, v vocative, g genitive, d dative, l locative, i instrumental

5. y denotes animacy

If you have completed 10 forms for each category, congratulations: you have improved the coverage of the Apertium HBS dictionaries a lot!

You can always recheck your work through the Overview tab, or carry on assigning paradigms in your preferred category.

Any ideas for improvement? Let’s discuss them together.

3.2 Bilingual entries

To complete the work we started by adding entries to the monolingual dictionaries, bilingual equivalents now have to be defined. So, where is the user interface? I’m afraid that we still don’t have one.

Instead, we’ve created a spreadsheet that will help you give Apertium the information needed to perform lexical transfer: lemma equivalence, translation direction, part-of-speech and changes from source to target in the rest of the morphological information, e.g. a gender change.

We will collaboratively define translation equivalents for a number of words, some of them for entries added yesterday to the monolingual dictionaries and others for already existing entries that do not have a translation yet. We will work on our Serbian to Croatian language pair, using a list of frequent unknown words created from Serbian corpora. Let’s start!

Task 6. Translation equivalents [40 min.]

Open the shared spreadsheet by clicking on http://tinyurl.com/mk2fgey. Take a moment to understand it through the examples provided for user gramirez:

• column A: contains information about the user

• column B: contains the unknown surface form in Serbian (green)

• column C: is for the lemma in Croatian (red)

• column D: is for the lemma in Serbian (green)

• column E: is for the part-of-speech or main category of the equivalents. Please indicate: n - for nouns, adj - for adjectives, vblex - for verbs, adv - for adverbs, pr - for prepositions, num - for numbers, and np - for proper names.

• column F: is to indicate if the translation works in both directions or just in one of them. Please indicate: yes - for both directions, HR-SR - for only from Croatian to Serbian, SR-HR - for only from Serbian to Croatian.

• column G: is a free comment area to clarify or specify information about the entry.

If you scroll down, you’ll see blocks of 30 entries and your name assigned to one of them. Please work on the 30 entries to provide the missing information, taking the following instructions into account:

• When lemmas are the same in both languages: as Croatian and Serbian share a big portion of their vocabulary, we don’t need translation equivalents for all words, only when there is a difference. When no translation equivalent is needed, just leave the row empty. A special case is words that differ only in the yat phoneme. We consider them to be different words sharing the same lemma, so in this case, please just indicate it in the Comments column by writing YAT.

Figure 4: Bilingual equivalent definition

• When multiple translations are possible: first, recheck the example provided in the spreadsheet for tomato. When more than one translation is possible, we will choose the most general and frequent one (see footnote 5) to work in both directions (Column F set to yes) and add the other(s), indicating the appropriate translation direction: HR-SR - from Croatian to Serbian, SR-HR - from Serbian to Croatian.

• When equivalents have different gender or number: please indicate it in the Comments column as in the example for tomato: Gender change: f (HR) – mi (SR)
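The column conventions and instructions above map quite directly onto entries of an Apertium bilingual dictionary. The following is a minimal sketch of that mapping, assuming the Serbian lemma goes on the left side of the entry and the Croatian lemma on the right. The `Row` class, the function name and the left/right assignment are hypothetical, and gender or number changes noted in the Comments column are not handled.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Part-of-speech codes accepted in column E.
VALID_POS = {"n", "adj", "vblex", "adv", "pr", "num", "np"}

# Column F value -> direction restriction on the <e> element.
# Assumption: left side = Serbian, right side = Croatian.
DIRECTION_ATTR = {"yes": "", "SR-HR": ' r="LR"', "HR-SR": ' r="RL"'}


@dataclass
class Row:
    user: str          # column A
    sr_surface: str    # column B
    hr_lemma: str      # column C
    sr_lemma: str      # column D
    pos: str           # column E
    direction: str     # column F
    comment: str = ""  # column G


def to_bidix_entry(row):
    """Return a bilingual dictionary entry, or None for a row left empty."""
    if not row.hr_lemma and not row.sr_lemma:
        return None  # same lemma in both languages: no entry needed
    if row.pos not in VALID_POS:
        raise ValueError("unknown part-of-speech: " + row.pos)
    if row.direction not in DIRECTION_ATTR:
        raise ValueError("unknown direction: " + row.direction)
    attr = DIRECTION_ATTR[row.direction]
    return (
        f'<e{attr}><p><l>{escape(row.sr_lemma)}<s n="{row.pos}"/></l>'
        f'<r>{escape(row.hr_lemma)}<s n="{row.pos}"/></r></p></e>'
    )


entry = to_bidix_entry(Row("gramirez", "juce", "jucer", "juce", "adv", "yes"))
```

The sample row reuses the juce/jucer pair discussed later in this guide, written without diacritics as in the working text.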

Congrats! You’ve completed your first bilingual task in Apertium. We should now check whether the lemmas you’ve provided in Croatian are also in the Croatian monolingual dictionaries and add the missing ones. But this will not be done now, as other topics and tasks are waiting for us.

5 To check frequency and contexts, you can use the concordance tool developed by the Natural Language Processing group at the Department of Information Sciences in the Faculty of Humanities and Social Sciences at the University of Zagreb: http://tinyurl.com/nw96nsy (thanks Tomaž Erjavec and Nikola Ljubešić!). There you’ll find the hrWaC (web-based corpus for Croatian) and srWaC (web-based corpus for Serbian) corpora to perform searches.

But let’s take a look at our working text from yesterday. If we did things properly, it should look like this:

MIROSLAV Raduljica, CENTAR REPREZENTACIJE SRBIJE:

Do jucer sam bio bradati majmun, a sada sam car!

Miroslav Raduljica se prije nekoliko nedjelja *otisnuo
u novu avanturu u Kinu, a srpski centar koji trenutno
brani boje Sandogana, ne krije da mu je ovo ljeto bilo
jedno od najzanimljivijih.

Raduljica trenutno brani boje kineskoga Sandogana

Do srebra na SP u Spanjolskoj, Raduljica je bio poznat
kao super talentirani centar, ali na koga se ne moze
uvijek racunati, pa ga je tako i nekadasnji izbornik
’orlova’ Duda Ivkovic *precrtao sa spiska.
Sada je situacija potpuno drukcija:

- Vrlo mi je zanimljivo kako sam sad car, bog, legenda,
a do jucer sam bio *istetovirani bradati majmun i splavar.
Pa, ja sam isti taj Raduljica, koji sam bio i 2010.
Dobro, malo sam unaprijeden sto se tice karaktera, stabilniji
sam, ali sam potpuno isti momak, istih *rezonovanja,
iste licnosti i percepcije - rekao je Raduljica u intervjuu
za studensko izdanje srpskoga ’*Eskvajera’.

We added: Raduljica, Sandogan and Miroslav, which appeared 6, 2 and 2 times in the text: not a big deal, they remain the same.

We also added: juce, appearing 2 times and translated differently into Croatian (jucer), and words appearing only once in this text but quite frequent in our list of monolingual unknown words: nedjelja (as nedjelja and not tjedan), leto (becomes ljeto), talentovan (becomes talentiran), selektor (becomes izbornik), novembarski (becomes studenski) and racunati (remains the same). And, of course, Spanija becomes Spanjolska!!

Remaining problems: some unknown words (*otisnuo, *precrtao, *istetovirani, *Eskvajera, *rezonovanja) and maybe some rules (next workshop!).

But overall, much better now (for post-editing purposes), isn’t it? That’s all thanks to you all and to the frequency strategy, of course!
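The frequency strategy can be reproduced with very little code. The sketch below, in Python, collects the words that the translator marked as unknown with a leading asterisk and counts them, so that the most frequent ones can be added to the dictionaries first. The function name and the short sample string are hypothetical; only the asterisk convention comes from the text above.

```python
import re
from collections import Counter

# An unknown word is marked with a leading asterisk in the translated text.
UNKNOWN_WORD = re.compile(r"\*(\w+)")


def unknown_word_frequencies(translated_text):
    """Count the asterisk-marked words, ignoring case."""
    return Counter(word.lower() for word in UNKNOWN_WORD.findall(translated_text))


sample = "prije nekoliko nedjelja *otisnuo u Kinu, pa ga je *precrtao, a onda opet *otisnuo"
frequencies = unknown_word_frequencies(sample)

# Most frequent unknown words first: these are the best candidates to add next.
ranking = frequencies.most_common()
```

Run over a large corpus instead of one article, the same count gives a list of frequent unknown words like the one used in this workshop.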


4 Part-of-speech tagger data

One of the most valuable resources for building a disambiguation module for Apertium is an annotated corpus. Just a few language pairs have one because it is expensive to build and rare to find. Incompatibility between tagsets also makes reuse difficult.

Last year we developed Annotatrix, a tool for annotating corpora according to Apertium dictionaries and for inspecting and improving tagsets. We will use it during this practice to annotate a brief corpus and to see a tagset definition file.

4.1 Annotated corpora

Annotating corpora can be really time-consuming, but with Apertium we can at least semi-automate the task: only the ambiguous words will need your help (provided that we have accurate dictionaries!). Let’s see!

Task 7. Annotate an HBS corpus [25 min.]

We are going to work on a Croatian corpus, just a paragraph, to get an idea of how annotators work. Follow these steps:

• Open a browser and go to http://abumatran.eu:28000/accounts/login/. Log in with username and password uzguest. Once you are logged in, you’ll see the main dashboard for Annotatrix.

Figure 5: Bilingual equivalent definition


• Click on Insert a new corpus and you’ll go to an interface to upload or copy/paste a corpus to be annotated.

• Paste the following text (see footnote 6), or another one, into the Corpus text box:

"Kada je fotografija dvoje mladih umotanih u hrvatsku
i srpsku zastavu izazvala veliku pozornost medija, kako
pozitivnu tako i negativnu, shvatili smo kako je ova
tema relevantna i aktualna. Problem tolerancije izrazen
je na ulicama, u medijima, na sportskim priredbama i
u svakodnevnom zivotu, pa smo se odlucili na okupljanje
mladih iz cijelog svijeta kako bismo pokazali da, neovisno
o tome iz koje zemlje dolaze, mogu ostvariti zajednicki
cilj ako se ujedine", izjavio je za SETimes Petar
Antanasovski, predstavnik AISEC-a Srbije.

• In the box named Corpus title, add your name followed by a hyphen and the word mycorpus, e.g. gramirez-mycorpus.

• In the Select the corpus language drop-down menu, choose HBS and click on the Annotate & Train option below that menu. You’ll be transferred to a new screen.

• In the drop-down menu called Language pair mode, select HBS–>NONE and finally click on the Start annotating button. You’ll be transferred to the Corpus annotator screen.

You’ll see your Corpus title and Corpus language and, below them, the text you pasted in the first screen with some words in bold. These are the ambiguous ones according to the data encoded in Apertium: the ones that have more than one lexical form for a given surface form.

Your starting point is the first ambiguous word (Kada if you chose the sample text given). A set of Word alternatives is shown on the upper right side of the screen, each one having a number. There are four in our example:

1. Kada, cnjsub

2. Kada, adv

3. Kada, n, f, sg, nom

4. Kada, n, f, pl, gen

6 From SETimes: http://www.setimes.com/cocoon/setimes/xhtml/hr/features/setimes/audio_story/2013/07/25/audio_story-04
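Behind the words in bold is the output of the morphological analyser. In Apertium's stream format, every analysis of a surface form is listed between `^` and `$`, separated by `/`. The Python sketch below shows how ambiguous words could be picked out of such output. The sample analyses are written by hand to mirror the four alternatives above, the function names are hypothetical, and the escaping rules of the real format are ignored.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2/...$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)((?:/[^/^$]*)+)\$")


def parse_units(stream):
    """Return a list of (surface form, [analyses]) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface = match.group(1)
        analyses = match.group(2).split("/")[1:]  # drop the empty first piece
        units.append((surface, analyses))
    return units


def ambiguous_units(units):
    """Keep only the words with more than one lexical form."""
    return [(surface, analyses) for surface, analyses in units if len(analyses) > 1]


# Hand-written sample, not real analyser output.
sample = (
    "^Kada/kada<cnjsub>/kada<adv>/kada<n><f><sg><nom>/kada<n><f><pl><gen>$ "
    "^fotografija/fotografija<n><f><sg><nom>$"
)
to_annotate = ambiguous_units(parse_units(sample))
```

Only the words returned by `ambiguous_units` would need a decision from the annotator, which is what makes the task semi-automatic.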


To select an option, use the number keys on your keyboard to say whether you choose alternative 1, 2, 3 or 4 as the correct one, taking into account the context of the sentence. Once you press the chosen option, e.g. 1, you’ll be transferred to the next ambiguous word. You can also use the left/right arrow keys to move from word to word.

Please go on disambiguating your text. In case of doubt, or when you don’t find the right option, just press 1 and go to the next word. Sometimes, especially for adjectives, you’ll have a hard time seeing all the options on the screen. Sorry about that, we are improving the design.

Finally, don’t forget to press the Save and Train button once you are done with the text.

Any ideas for improvement? We will discuss them together.

4.2 Tagsets

Many of the PoS taggers in Apertium rely on statistical disambiguation. Tagset definition files help the statistical disambiguation module calculate probabilities to choose the correct part-of-speech. Tagsets contain mappings that group all the morphological information delivered by the morphological analyser into supersets whose members behave in the same way in a text.

To get started, we create groups for almost all main categories, separating closed and open ones, and we then single out those that have a special role in disambiguation. Lemmas are taken into account only in special cases.

Let’s take English as an example. In the English tagset:

• auxiliary verbs such as “to be” (VSER) or “to have” (VHAVE) are not grouped with the rest of the verbs (VLEX).

• modal verbs such as “can” also have a separate group (VMOD).

• we also distinguish between tenses: infinitives (INF), past participles (PP) and gerunds (GER) are separated from present (PRES) and past (PAST).

• singular and plural adjectives are grouped together (ADJ) but we distinguish between singular and plural nouns (NOMSG, NOMPL).

• and, when a noun shares ambiguity with another lexical form that is highly frequent, we put it in a separate special category and we take the lemma into account: this is the case of the noun “can” (CANNOM).
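The grouping described in the list above can be pictured as an ordered set of rules that map a fine-grained analysis to a coarse label. The Python sketch below illustrates the idea; the tag patterns are written for illustration and are not copied from the real English tagset file, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# (coarse label, required lemma or None, tag pattern); the first match wins,
# so the lemma-specific rule for the noun "can" must come before NOMSG.
RULES = [
    ("CANNOM", "can", "n.sg"),
    ("VSER", None, "vbser.*"),
    ("VHAVE", None, "vbhaver.*"),
    ("VMOD", None, "vaux.*"),
    ("INF", None, "vblex.inf"),
    ("NOMSG", None, "n.sg"),
    ("NOMPL", None, "n.pl"),
    ("ADJ", None, "adj"),
    ("ADJ", None, "adj.*"),
]


def coarse_tag(lemma, tags):
    """Map a lemma and its dot-separated tags to a coarse label, or None."""
    for label, rule_lemma, pattern in RULES:
        if rule_lemma is not None and rule_lemma != lemma:
            continue
        if fnmatchcase(tags, pattern):
            return label
    return None


labels = [coarse_tag("can", "n.sg"), coarse_tag("dog", "n.sg"), coarse_tag("long", "adj.sint.comp")]
```

Putting the lemma-specific rule first is the design choice that lets one frequent word get its own category while every other singular noun still falls into NOMSG.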

Inside the tagsets, we may also define rules to forbid sequences of categories, to enforce a category after another one or to set a preference for a category after another one. In this section, we work with the groups previously defined.

Following our English tagset:

• we forbid the sequence verb “to have” in past participle (VHAVEPP) followed by a verb in past tense (PAST), so we avoid bad reads of sentences like: They’ve had baked potatoes in their set menu for years.

• we enforce Saxon genitive (GEN) after proper nouns (ANTROPONIM, TOPONIM, NPALTRES) and others to avoid bad reads of (’s) as a form of the verb “to be” in many sentences: James’s father. Cat’s eyes.

• we give preference to acronyms (n.acr.sg) to help appropriate readings of sentences like: I’ve been working in IT departments for a long time.
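The effect of a forbid rule can be illustrated with a small Python sketch. It takes the candidate coarse labels of each word and discards every label sequence that contains a forbidden pair. Enumerating sequences like this is only for clarity and is not how a statistical tagger would apply the constraint; the function name and the candidate lists are hypothetical, while the labels come from the examples above.

```python
from itertools import product

# Forbidden pairs of consecutive labels (first label, second label).
FORBID = {("VHAVEPP", "PAST")}


def allowed_sequences(candidates):
    """Return every label sequence that contains no forbidden pair.

    `candidates` holds one list of possible labels per word.
    """
    result = []
    for sequence in product(*candidates):
        pairs = zip(sequence, sequence[1:])
        if not any(pair in FORBID for pair in pairs):
            result.append(sequence)
    return result


# "had baked": "had" read as a past participle, "baked" ambiguous.
readings = allowed_sequences([["VHAVEPP"], ["PAST", "PP"]])
```

With the rule in place, only the reading in which "baked" is a past participle survives for the example sentence.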

Let’s take a look at the English tagset to discover other interesting groups,forbid and enforce rules:

Task 8. Inspecting a tagset [25 min.]

Having the list of Apertium symbols open is going to be highly helpful for this practice too. Open it in a separate tab or window: http://wiki.apertium.org/wiki/List_of_symbols

Go to the dashboard of Annotatrix by clicking on Annotatrix in the upper menu or by going to http://abumatran.eu:28000/ again.

Click on TSX Manager. You’ll see a screen to upload a TSX file or consult a previously uploaded one. To save you from uploading the TSX file for English, it is already available under Your latest TSX as a link to apertium-en-es.en.tsx. Click on this link and you’ll see a view of the TSX with:

• Labels: on the right side of the screen

• Tabs for Multi labels - Forbid rules - Enforce rules - Preferences: onthe left of the screen.

Take a look at Labels first:

• Explore some of the Labels and try to understand what they are defined for: you are seeing Labels given to grouped categories. If you click on any of them, e.g. ADJ, you’ll see all the categories included in it: adj (beautiful), adj.comp (more beautiful), adj.sup (the most beautiful), adj.sint (long), adj.sint.comp (longer), adj.sint.sup (the longest).

– What is VDO closed and NOT closed?

– Why is there a special group for some determiners under the DETQNT ORD closed label (much, many, enough, first, second)?

– Are ADJPOS closed so different from other adjectives (mine, yours, hers, his, etc.)?

• Try to understand some rules: if you click on them you’ll be able to see the sequences that are forbidden, enforced and preferred. Explore some of the rules and try to find an example in which the rule should be applied:

– Forbid:

1. ADJPOS {+ NOMSG + NOMPL + NOMCAN + NOMWILL}

2. SENT {+ RELAN + RELNN + RELADV}

– Enforce-after:

1. PREDET {+ NOMSG, + NOMPL, + CANNOM, + WILLNOM, + ADJ, + DET}

Congrats! Now you are an expert reader of Apertium tagsets!

Any improvement to this view? Ideas for new rules? They are morethan welcome!


Recap and useful info

In this workshop we’ve introduced you to the basics of machine translation systems, Apertium dictionaries and the part-of-speech tagger.

We thank you for your participation in this workshop and encourage you to join the Apertium community to help us improve. To do so, just subscribe to our mailing list or show up in the chat: we will help you get started. You’ll find how to contact us on our wiki page called Contact (footnote 7).

User interfaces have come to Apertium to stay. During the Abu-MaTran project, we will go on improving the ones you’ve been testing today and producing others. We want to hear from you if you have anything to say about them. Please contact us through the Abu-MaTran website form (footnote 8).

This workshop is to be continued: we will be running another one on rules and advanced dictionaries next year around May. Stay tuned!

License

This guide is released under a Creative Commons Attribution-Share Alike 3.0 licence (footnote 9).

More details: http://creativecommons.org/licenses/by-sa/3.0/deed.en.

Please contact Gema Ramírez-Sánchez (gramirez at prompsit dot com) for a copy of the source files.

7 http://wiki.apertium.org/wiki/Contact
8 http://www.abumatran.eu/?page_id=48
9 © Prompsit Language Engineering.

Workshop on the Apertium free/open-source machine translation platform: basics on how to control the engine through linguistics

5th/6th November 2014

Session 1: Introduction to machine translation [15 min.]

What is “machine translation”?

Machine translation is translating texts from one language to another with the help of computer programs.


How is it used? (Assimilation)

To get a rough idea of a text when you don’t speak the language or you speak it badly. I don’t speak Breton,

Ofis Publik ar Brezhoneg: Brudan ar yezh ha skoazellan anezhi d’en em zispakan war holl dachennou implij ur yezh zo e-touez kefridiou pennan ar benveg.

l’Office Public de la Langue Bretonne : faire connaître la langue et l’aider à déployer sur tout les terrains de l’emploi d’une langue sont parmi les missions principales de l’outil.

nor Basque,

Txillidaren obraren katalogo lehen liburukia kalean da.

Txillidaren De la obra katalogo el primer tomo en la calle es.

But with machine translation, I can get by – in a limited fashion.


How is it used? (Dissemination)

You have a text in Spanish, and you want to translate it to Portuguese. You first translate the text with the help of machine translation, and then you only need to make a few changes before it is adequate.

Cheboksary es una ciudad del centro de Rusia europea, capital de la República de Chuvashia y puerto del río Volga. Hay fábricas textiles y de artículos de madera y cuero. También hay una central hidroeléctrica. Fundada en el siglo XIV, Cheboksary se transformó en un importante núcleo económico tras finalizarse el enlace ferroviario con Kanash en 1939. En esta ciudad se encuentra la Universidad Estatal Chuvashia Ulyanov (1967).

Cheboksary é uma cidade do centro da Rússia europeia, capital da República de Chuvachia e porto do rio Volga. Há fábricas têxteis e de artigos de madeira e couro. Também há uma central hidroelétrica. Fundada no século XIV, Cheboksary transformou-se en um importante núcleo económico depois de finalizar-se o enlace ferroviário com Kanash em 1939. Nesta cidade encontra-se a Universidade Estatal Chuvachia *Ulyanov (1967).


Cheboksary é uma cidade do centro da Rússia europeia, capital da República de Chuvachia e um porto do rio Volga. Há fábricas têxteis e de artigos de madeira e couro. Também há uma central hidroelétrica. Fundada no século XIV, Cheboksary transformou-se en um importante núcleo económico depois da conexão ferroviária com Kanash ser finalizada em 1939. Nesta cidade encontra-se a Universidade Estatal Chuvachia Ulyanov (1967).

68 words, 7 changes
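A count such as 68 words and 7 changes can be turned into a rate: 7 / 68 is roughly 10.3%. Word error rate (WER), the measure used on the next slide, is one common way of computing such a figure. The Python sketch below is a minimal version, assuming whitespace tokenisation and the post-edited text as the reference; the function names are hypothetical and this is not necessarily how the figures on the slides were obtained.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            current.append(min(
                previous[j] + 1,                           # delete hyp_word
                current[j - 1] + 1,                        # insert ref_word
                previous[j - 1] + (hyp_word != ref_word),  # substitute or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(mt_output, post_edited):
    """Edit distance between the two texts divided by the reference length."""
    hypothesis = mt_output.split()
    reference = post_edited.split()
    if not reference:
        raise ValueError("the post-edited text is empty")
    return word_edit_distance(hypothesis, reference) / len(reference)


# One inserted word in a six-word reference.
rate = word_error_rate("e porto do rio Volga", "e um porto do rio Volga")

# The count reported above, expressed as a percentage.
slide_rate = round(7 / 68 * 100, 1)
```

Dividing by the length of the post-edited text is one convention among several, so figures computed with other tools may differ slightly.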


Is there a difference?

               Necessary                     Unnecessary

Assimilation   Understandability             Syntactic correctness
               Fast translation              Lexical correctness
                                             Predictable errors
                                             Happy translators

Dissemination  Adequate syntactic transfer   Understandability
               Predictable errors            Fast translation
               High accuracy (WER ≤ 15%)
               Happy translators

With the binoculars the hat-having man sees the squirrel.
The man wearing a hat sees the squirrel with the binoculars.


The migration gave a great deal of criticism when it spoke out.
The organisation received a great deal of criticism when it spoke out.

Typology of machine translation systems

Kinds of machine translation

Rule-based: dictionaries and rules

Corpus-based: existing translations of sentences

Rule-based machine translation

Strengths Weaknesses

Rule-based machine translation is like taking a set of dictionaries and a descriptive grammar, and trying to translate from one language you don’t know into another.

Corpus-based machine translation

Strengths Weaknesses

Corpus-based machine translation is like taking two documents in two languages you don’t know which are translations of each other and trying to match up words. Then you use these words to build sentences which you put into Google to see if they sound likely.

Corpus-based machine translation works best when...

• You have a big corpus of pre-translated and aligned sentences from one language to another, or programmers who don’t mind doing the alignment.

• The language to be translated into is not morphologically complex, and the language to be translated from is more morphologically complex.

• The domain you want to translate is the same as or similar to the one of your corpus.

• You lack linguists who are interested and motivated.

Rule-based machine translation works best when...

• You don’t have any pre-aligned corpora, or the pre-aligned corpora you have are bad.

• The languages to be translated are typologically similar.

• You are translating formal language.

• You have interested and motivated linguists.

Practice 1: Taking a look at machine translation systems [30 min.]

Rule-based machine translation

Strengths                          Weaknesses
+ Predictable output               - Lack of fluency
+ Predictable errors!              - Lack of idiomaticness
+ Incremental improvements         - “Mechanical” output
+ Translation errors traceable     - Development can be
+ Terminology control easy           time consuming
+ No need for large quantity
  of existing translations

Corpus-based machine translation

Strengths                          Weaknesses
+ Fluent output                    - Unpredictable
+ Idiomatic output                 - Incremental improvements
                                     are hard
+ No need for linguistic           - Development can be
  resources:                         time consuming
  - dictionaries
  - grammars
  - linguists

Session 2: Rule-based Machine Translation [20 min.]

Why do we work on rule-based machine translation?

Machine translation conferences are full of papers about corpus-based MT, so why work on rule-based MT?

• Sometimes there are no corpora, or only rubbish corpora

• When we codify translation rules, it tells us something about language(s) and translation

• We can produce useful systems! – really!

• Languages are interesting

• It’s really fun!

The machine translation pyramid

Intermediate representation

The idea of an intermediate representation is to provide an abstraction of the meaning of the text.

• Direct translation: No intermediate representation

• Syntax transfer: Intermediate representation is either a parse tree, or a graph, along with feature structures

• Semantic transfer: Intermediate representations are predicates with semantic roles.

• Interlingua: As with semantic transfer, only the same intermediate representation is shared by all languages / language pairs

This traditional division leaves out the Apertium approach:

• Shallow transfer: The intermediate representation can be
– lexical forms: combinations of lemma and part-of-speech
– chunks: collections of words into segments broadly reflecting phrases
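The two intermediate representations of shallow transfer can be pictured as simple data structures. The Python sketch below is only an illustration of the idea: the class names, the chunk label `NP` and the example phrase are hypothetical and do not reproduce Apertium's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalForm:
    """A lemma together with its part-of-speech and other tags."""
    lemma: str
    tags: tuple

    def render(self):
        # Lemma followed by its tags in angle brackets.
        return self.lemma + "".join(f"<{tag}>" for tag in self.tags)


@dataclass
class Chunk:
    """A segment of lexical forms that broadly reflects a phrase."""
    name: str
    forms: list = field(default_factory=list)

    def lemmas(self):
        return [form.lemma for form in self.forms]


noun_phrase = Chunk("NP", [
    LexicalForm("the", ("det", "def", "sp")),
    LexicalForm("squirrel", ("n", "sg")),
])
```

A transfer rule can then work on a whole chunk, for example to reorder it or to make its words agree, instead of looking at one word at a time.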

Problems in rule-based machine translation

Analysis

Form does not entirely determine content.

Many sentences in natural language can have more than one interpretation, and these interpretations may be translated differently in different languages.

Traían noticias de Grecia – theme ‘about’ or source ‘from’?
  Traziam notícias da Grécia?
  Traían noticias de Grecia?

I saw the girl with the telescope – who has the telescope?
  J’ai vu la fille avec le télescope
  J’ai vu la fille à travers le télescope

The machine only knows as much as you can explain to it.


Synthesis

Content does not entirely determine form.

A single meaning can be expressed in more than one way. A single sentence may have many adequate equivalents.

What time is it? (en)

¿Qué hora es? ¿Qué hora tienes? ¿Me dices la hora? (es)
Quelle heure est-il ? Vous avez l’heure ? (fr)
Que horas são? Que horas tem? (pt)


Transfer

The same content is represented differently in different languages.

Languages differ in how they express a particular meaning. Some languages encode facets of meaning which are not encoded by others.

Definiteness
Case
Direction and class of movement

Me gusta Croacia
Me = object, gusta = verb, Croacia = subject

I like Croatia
I = subject, like = verb, Croatia = object

Sviđa mi se Hrvatska
Sviđa = verb, mi = object, se = reflexive, Hrvatska = subject


Description

Representing knowledge about the translation process in machine-readable form

It is difficult to describe to a machine, explicitly and declaratively, the process we use to translate. Concepts such as "most common sense" or "context" are not trivial for machines.

Fortunately, sometimes none of this is necessary:

Apertium é um sistema de tradução automática. (pt)
Apertium ei un sistema de traduccion automatica. (oc)
Apertium és un sistema de traducció automàtica. (ca)
Apertium este o platformă de traducere automată. (ro)
Apertium je platforma za računalniško prevajanje. (sl)
Apertium je platforma za računarsko prevođenje. (bs)
Apertium je platforma za računalno prevođenje. (hr)
Apertium je platforma za kompjutersko prevođenje. (sr)


Apertium: free/open source RBMT platform

2005: engine, tools, language pairs = GNU GPL v2
rule-based: focus on related languages and less-resourced languages
standards: C++, XML; code and linguistic data are decoupled
modular, robust, fast: Unix pipes, works on any PC, 10,000 words/second
developed by computer engineers and linguists
extensive documentation, support and community


Apertium: free/open source RBMT platform

funding: public (research projects) and private (companies, GSoC, individuals)
opportunities:
  research: 5 masters, 2 PhDs, 70 papers, 6 research projects
  business: services around Apertium (Prompsit and others)
  languages: some "first systems": Breton, Occitan, Afrikaans


Apertium: architecture

Classic shallow-transfer system
Pipeline made up of 8 independent modules:

[Figure: the Apertium pipeline. Input document → deformatter → morphological analyser (monolingual dictionary) → tagger → pre-transfer → lexical transfer → transfer module (chunker, interchunk, postchunk) → morphological generator (monolingual dictionary) → post-generator (post-gen dictionary) → reformatter → output document.]


Apertium: performance

reasonable quality for closely related languages:

  word error rate around 5% for general-purpose texts
  naive coverage around 95%
  dictionaries with a minimum of 10,000 lemmas and some 80 frequent transfer rules


Apertium: languages

[Figure: graph of Apertium language pairs connecting es, ca, gl, pt, ast, oc, fr, eo, en, ro, cy, eu, nb, nn, sv, da, br, mk, bg and is. Each edge carries a number; the pairing of numbers and edges is not recoverable from the extracted text.]


Apertium: more info

code and languages freely available at Sourceforge: http://sourceforge.net/projects/apertium

information, developer materials, tools, interfaces, chat and much more at: http://wiki.apertium.org
testing interface at: http://www.apertium.org


Practice 2: Taking a look at Apertium with apertium-viewer [30 min.]


3. Monolingual entries [20 min.]


Which data are necessary for a language pair?

[Figure: the Apertium pipeline. Input document → deformatter → morphological analyser (monolingual dictionary) → tagger → pre-transfer → lexical transfer → transfer module (chunker, interchunk, postchunk) → morphological generator (monolingual dictionary) → post-generator (post-gen dictionary) → reformatter → output document.]


Language pair structure

Dictionaries (.dix/.metadix)
  apertium-es-pt.es.dix, Spanish monodix
  apertium-es-pt.pt.dix, Portuguese monodix
  apertium-es-pt.es-pt.dix, bidix Spanish-Portuguese
  apertium-es-pt.post-es.dix, Spanish post-generator
  apertium-es-pt.post-pt.dix, Portuguese post-generator

Tagger (.tsx)
  apertium-es-pt.es.tsx
  apertium-es-pt.pt.tsx

Rules (.t1x, .t2x, .t3x)
  apertium-es-pt.es-pt.t1x
  apertium-es-pt.pt-es.t1x


Dictionaries

MONOLINGUAL DICTIONARY
  Alphabet definition
  Symbol definition
  Paradigm declaration
  Sections of entries

BILINGUAL DICTIONARY
  Alphabet definition
  Symbol definition
  Sections of entries


English monodix

A (simplified) monodix looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n" c="noun"/>
    <sdef n="sg" c="singular"/>
    <sdef n="pl" c="plural"/>
  </sdefs>
  <pardefs>
    <pardef n="book__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="id" type="standard">
    <e><i>dream</i><par n="book__n"/></e>
    <e><i>hug</i><par n="book__n"/></e>
  </section>
</dictionary>
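
To see what a paradigm buys us, the following Python sketch expands a cut-down copy of the monodix into the surface forms it recognises and the lexical forms they map to. It is an illustration only, not how the Apertium compiler works: the paradigm name book__n, the section id and the helper name expand are assumptions made for this example, and the sketch assumes the right-hand side of a paradigm line contains tags only.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the English monodix (paradigm name "book__n" is assumed).
MONODIX = """<dictionary>
  <pardefs>
    <pardef n="book__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e><i>dream</i><par n="book__n"/></e>
    <e><i>hug</i><par n="book__n"/></e>
  </section>
</dictionary>"""


def expand(monodix_xml):
    """Return {surface form: lexical form} for every entry and paradigm line."""
    root = ET.fromstring(monodix_xml)
    # Collect each paradigm as a list of (surface ending, tag string) pairs.
    paradigms = {}
    for pardef in root.iter("pardef"):
        rows = []
        for e in pardef.findall("e"):
            ending = e.find("p/l").text or ""
            tags = "".join("<%s>" % s.get("n") for s in e.find("p/r").iter("s"))
            rows.append((ending, tags))
        paradigms[pardef.get("n")] = rows
    # Combine each stem with every row of the paradigm it points to.
    forms = {}
    for e in root.find("section").findall("e"):
        stem = e.find("i").text
        for ending, tags in paradigms[e.find("par").get("n")]:
            forms[stem + ending] = stem + tags
    return forms


forms = expand(MONODIX)
for surface, lexical in sorted(forms.items()):
    print(surface, "->", lexical)
```

Two entries and one two-line paradigm yield four surface forms; adding a new regular noun costs a single line in the section.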


Bilingual dictionary

A simplified bidix looks like:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n" c="noun"/>
    <sdef n="sg" c="singular"/>
    <sdef n="pl" c="plural"/>
    <sdef n="m" c="masculine"/>
  </sdefs>
  <section id="id" type="standard">
    <e><p><l>dream<s n="n"/></l><r>sueño<s n="n"/><s n="m"/></r></p></e>
    <e><p><l>hug<s n="n"/></l><r>abrazo<s n="n"/><s n="m"/></r></p></e>
  </section>
</dictionary>


Spanish dictionary

Another monodix:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n" c="noun"/>
    <sdef n="sg" c="singular"/>
    <sdef n="pl" c="plural"/>
    <sdef n="m" c="masculino"/>
  </sdefs>
  <pardefs>
    <pardef n="libro__n">
      <e><p><l></l><r><s n="n"/><s n="m"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="m"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="id" type="standard">
    <e><i>sueño</i><par n="libro__n"/></e>
    <e><i>abrazo</i><par n="libro__n"/></e>
  </section>
</dictionary>


Practice 3: Paradigm association tool (to extend the HBS dixes) [40 min.]


Hvala!


Workshop on the Apertium free/open-source machine translation platform: basics on how to control the engine through linguistics

5th/6th November 2014


Session 4: lexical transfer [15 min.]


Transfer stage /1

[Figure: the Apertium pipeline. Input document → deformatter → morphological analyser (monolingual dictionary) → tagger → pre-transfer → lexical transfer → transfer module (chunker, interchunk, postchunk) → morphological generator (monolingual dictionary) → post-generator (post-gen dictionary) → reformatter → output document.]


Transfer stage /2

The transfer module is where the magic happens: the intermediate representation in the source language (SL) is converted into an intermediate representation in the target language (TL).

Transfer in Apertium consists of two submodules:

Lexical transfer:
  selects the most suitable equivalent in TL for a SL word;
  marks some lexical features which will be used by the structural transfer.

Structural transfer: performs syntactic operations involving groups of words


Lexical transfer

The lexical transfer module reads each SL lexical form and delivers the corresponding TL lexical form by looking it up in a bilingual dictionary.

Bilingual dictionary
  No surface forms in this stage: input and output are lexical forms consisting of lemma, part-of-speech and inflection information.
  The dictionary contains a list of equivalent lexical forms.
  A single bilingual dictionary is used for both directions of translation.
  XML syntax similar (but simpler) to monolingual dictionaries.
  Paradigms are usually not necessary.


Translation equivalents /1

A simple task... apparently:

[fr]                         [es]
transducteur<n><m><sg>  ←→  transductor<n><m><sg>
transducteur<n><m><pl>  ←→  transductor<n><m><pl>


Translation equivalents /2

A shorter representation
Only lemma and part-of-speech are mandatory if the rest of the tags do not change:

transducteur<n> ←→ transductor<n>

XML encoding in the bilingual dictionary

<e><p><l>transducteur<s n="n"/></l><r>transductor<s n="n"/></r></p></e>

This entry can be used for both fr→es and es→fr.


Change of gender

Only the tags until the last change need to be indicated:

vallée<n><f> ←→ valle<n><m>

XML encoding in the bilingual dictionary

<e><p><l>vallée<s n="n"/><s n="f"/></l><r>valle<s n="n"/><s n="m"/></r></p></e>
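
The "only the tags until the last change" convention can be pictured as prefix matching: the entry replaces the lemma and the leading tags it lists, and any remaining tags are copied over unchanged. The Python sketch below illustrates that idea under stated assumptions; the in-memory table BIDIX_FR_ES and the helpers parse and transfer are hypothetical names for this example and do not reproduce Apertium's compiled transducers.

```python
import re

# Hypothetical in-memory bidix: (SL lemma, SL tags) -> (TL lemma, TL tags).
# Only the tags up to the last change are listed, as in the entries above.
BIDIX_FR_ES = {
    ("transducteur", ("n",)): ("transductor", ("n",)),
    ("vallée", ("n", "f")): ("valle", ("n", "m")),
}


def parse(lexical_form):
    """Split 'vallée<n><f><pl>' into ('vallée', ('n', 'f', 'pl'))."""
    lemma = lexical_form.split("<", 1)[0]
    tags = tuple(re.findall(r"<([^>]+)>", lexical_form))
    return lemma, tags


def transfer(lexical_form, bidix):
    """Translate one lexical form; tags after the matched prefix are copied."""
    lemma, tags = parse(lexical_form)
    # Try the longest tag prefix first so the most specific entry wins.
    for n in range(len(tags), 0, -1):
        hit = bidix.get((lemma, tags[:n]))
        if hit:
            tl_lemma, tl_tags = hit
            rest = tags[n:]
            return tl_lemma + "".join("<%s>" % t for t in tl_tags + rest)
    return None  # no entry for this lemma and part-of-speech


print(transfer("transducteur<n><m><sg>", BIDIX_FR_ES))
print(transfer("vallée<n><f><pl>", BIDIX_FR_ES))
```

In the second call the gender changes from f to m while the number tag pl is carried over untouched, which is why it never has to appear in the dictionary entry.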


Lexical ambiguity

Real life is a bit more complex...

Homography
English book (noun or verb) translates into French livre (noun) or réserver (verb).

Polysemy
English bank (noun) translates into Spanish banco or ribera.
Free rides do not pose any problem: English plant is planta in Spanish, whether it means the living organism or a kind of factory/installation.


Adding entries to the dictionary /1


Adding entries to the dictionary /2

<e><p><l>gare<s n="n"/></l><r>estación<s n="n"/></r></p></e>


Adding entries to the dictionary /3


Adding entries to the dictionary /4


Adding entries to the dictionary /5

<e><p><l>gare<s n="n"/></l><r>estación<s n="n"/></r></p></e>
<e r="LR"><p><l>saison<s n="n"/></l><r>estación<s n="n"/></r></p></e>


Adding entries to the dictionary /6


Adding entries to the dictionary /7


Adding entries to the dictionary /8

<e><p><l>gare<s n="n"/></l><r>estación<s n="n"/></r></p></e>
<e r="LR"><p><l>saison<s n="n"/></l><r>estación<s n="n"/></r></p></e>
<e r="RL"><p><l>saison<s n="n"/></l><r>temporada<s n="n"/></r></p></e>


Adding entries to the dictionary /9


Adding entries to the dictionary /10


Adding entries to the dictionary /11

<e><p><l>gare<s n="n"/></l><r>estación<s n="n"/></r></p></e>
<e r="LR"><p><l>saison<s n="n"/></l><r>estación<s n="n"/></r></p></e>
<e r="RL"><p><l>saison<s n="n"/></l><r>temporada<s n="n"/></r></p></e>
<e r="LR"><p><l>station<s n="n"/></l><r>estación<s n="n"/></r></p></e>
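
The effect of the r="LR" and r="RL" restrictions is easiest to see by building the two lookup tables that result from the four entries. The Python sketch below does this under the assumption that the left side is French and the right side Spanish; the function names lexical_form and build are made up for this illustration and are not part of any Apertium tool.

```python
import xml.etree.ElementTree as ET

BIDIX = """<section id="main" type="standard">
  <e><p><l>gare<s n="n"/></l><r>estación<s n="n"/></r></p></e>
  <e r="LR"><p><l>saison<s n="n"/></l><r>estación<s n="n"/></r></p></e>
  <e r="RL"><p><l>saison<s n="n"/></l><r>temporada<s n="n"/></r></p></e>
  <e r="LR"><p><l>station<s n="n"/></l><r>estación<s n="n"/></r></p></e>
</section>"""


def lexical_form(node):
    """Render an <l> or <r> element as lemma followed by <tag>s."""
    return (node.text or "") + "".join("<%s>" % s.get("n") for s in node.iter("s"))


def build(bidix_xml, direction):
    """Return the lookup table for 'LR' (fr to es) or 'RL' (es to fr)."""
    table = {}
    for e in ET.fromstring(bidix_xml).iter("e"):
        restriction = e.get("r")  # None means the entry works both ways
        if restriction not in (None, direction):
            continue
        left, right = lexical_form(e.find("p/l")), lexical_form(e.find("p/r"))
        src, tgt = (left, right) if direction == "LR" else (right, left)
        # Report clashes instead of silently overwriting an entry.
        if src in table:
            raise ValueError("two translations for " + src)
        table[src] = tgt
    return table


fr_es = build(BIDIX, "LR")
es_fr = build(BIDIX, "RL")
print(fr_es)
print(es_fr)
```

From French, gare, saison and station all map to estación; from Spanish, estación goes back to gare only and temporada to saison. Dropping any of the restrictions makes build raise an error, which mirrors the reason they are needed: each direction must have exactly one translation per lexical form.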


Disambiguation of polysemy

We may cope with lexical selection of polysemous terms by using multiwords:

gare<n> ←→ estación<n>
station <g>de ski</g><n> ←→ estación <g>de esquí</g><n>

Apertium also includes an optional module for lexical selection.


Marking lexical features for the structural transfer /1

The lexical transfer also marks some lexical features which will be used by the structural transfer.

For instance, a noun with the same surface form for its two genders.

Spanish monolingual dictionary:

estudiante −→ estudiante<n><mf><sg>
estudiantes −→ estudiante<n><mf><pl>

The structural transfer will choose the gender by looking at the surrounding context.

The lexical transfer simply marks this issue with the tag GD.

Similar things hold for number (ND).


Marking lexical features for the structural transfer /2

<e r="LR"><p><l>étudiant<s n="n"/><s n="m"/></l><r>estudiante<s n="n"/><s n="mf"/></r></p></e>
<e r="LR"><p><l>étudiant<s n="n"/><s n="f"/></l><r>estudiante<s n="n"/><s n="mf"/></r></p></e>
<e r="RL"><p><l>étudiant<s n="n"/><s n="GD"/></l><r>estudiante<s n="n"/><s n="mf"/></r></p></e>


Practice 4: Translation equivalents [40 min.]


Session 5: Morphological disambiguation [15 min.]


Part-of-speech tagger: where are we?

[Figure: the Apertium pipeline. Input document → deformatter → morphological analyser (monolingual dictionary) → tagger → pre-transfer → lexical transfer → transfer module (chunker, interchunk, postchunk) → morphological generator (monolingual dictionary) → post-generator (post-gen dictionary) → reformatter → output document.]


Lexical ambiguity and part-of-speech tagging /1

Lexical ambiguity
A surface form with more than one possible morphological analysis
Ex. [en] book (noun or verb)
  → [fr] livre (noun)
  → [fr] réserver (verb)

This is not polysemy!
Polysemy: a lemma and part-of-speech tag that have several meanings
Ex. [en] bank (noun)
  → [es] banco (institution that provides financial services)
  → [es] ribera (slope of land adjoining a river)

Lexical ambiguity and part-of-speech tagging /2

Ambiguity between parts of speech:

I (acr)
work (vblex.pres or n.sg)

Ambiguity within a part of speech:

I (prn)
see (vblex.inf or vblex.pres)


Statistical disambiguation /1

Statistics about the context in which each tag appears help to solve the part-of-speech ambiguity

These statistics are collected
  from hand-tagged texts (more accurate), or
  from untagged texts (less accurate)

Tagged text

I (prn.subj.p1.sg)
see (vblex.pres)
my (det.pos.1.sg)
screen (n.sg)


Statistical disambiguation /2

The Apertium statistical tagger is based on first-order hidden Markov models.
It chooses the combination of tags with the highest probability:

Book (verb) a (det) calm (adj) room (noun)
Book (verb) a (det) calm (vblex) room (noun)
Book (verb) a (det) calm (noun) room (noun)
Book (noun) a (det) calm (adj) room (noun)
Book (noun) a (det) calm (vblex) room (noun)
Book (noun) a (det) calm (noun) room (noun)
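
The choice among the six combinations can be sketched in a few lines of Python. Every probability below is invented purely for illustration, the article a is tagged det here, and the sketch scores all combinations by brute force, whereas a real first-order HMM tagger would use a dynamic-programming search such as Viterbi; the names CANDIDATES, TRANSITION, EMISSION, score and best_sequence are assumptions of this example.

```python
from itertools import product

# Candidate tags per word, as a morphological analyser might deliver them.
CANDIDATES = [
    ("Book", ["verb", "noun"]),
    ("a", ["det"]),
    ("calm", ["adj", "vblex", "noun"]),
    ("room", ["noun"]),
]

# Hypothetical numbers, invented for illustration only.
TRANSITION = {  # P(tag | previous tag); "START" opens the sentence
    ("START", "verb"): 0.4, ("START", "noun"): 0.6,
    ("verb", "det"): 0.6, ("noun", "det"): 0.05,
    ("det", "adj"): 0.4, ("det", "noun"): 0.5, ("det", "vblex"): 0.01,
    ("adj", "noun"): 0.8, ("noun", "noun"): 0.2, ("vblex", "noun"): 0.3,
}
EMISSION = {  # P(word | tag)
    ("Book", "verb"): 0.01, ("Book", "noun"): 0.02,
    ("a", "det"): 0.3,
    ("calm", "adj"): 0.02, ("calm", "vblex"): 0.005, ("calm", "noun"): 0.002,
    ("room", "noun"): 0.01,
}


def score(tags):
    """Probability of one tag sequence under the first-order model."""
    p, prev = 1.0, "START"
    for (word, _), tag in zip(CANDIDATES, tags):
        p *= TRANSITION[(prev, tag)] * EMISSION[(word, tag)]
        prev = tag
    return p


def best_sequence():
    """Score every combination of candidate tags and keep the best one."""
    options = [tags for _, tags in CANDIDATES]
    return max(product(*options), key=score)


print(best_sequence())
```

With these made-up numbers the verb reading of Book followed by the adjective reading of calm wins, mainly because a verb is far more likely than a noun to be followed by a determiner.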


Practice 5: Annotating a corpus [20 min.]


Tagset definition /1

To alleviate the problem of data sparseness the sequences of morphological tags are grouped into coarse tags (called Labels)

Sequence of tags     Coarse tag
noun.m.sg            NOUN
...                  ...
noun.f.pl            NOUN
verb.pres.1p.sg      VERB.PRESENT
...                  ...
verb.pres.3p.pl      VERB.PRESENT
prn.1p.sg            PRONOUN
prn.2p.sg            PRONOUN
prn.3p.sg            PRONOUN.3P.SG
...                  ...
prn.3p.pl            PRONOUN


Tagset definition /2

How to design a tagset:

Rules of thumb
  Group sequences of tags having the same syntactic role and appearing in the same contexts under the same coarse tag
  Do not group under the same coarse tag those sequences of tags among which the disambiguator needs to distinguish

Starting with a tagset borrowed from a similar language might help


Tagset definition /3

Example of tagset:

<?xml version="1.0" encoding="iso-8859-1"?>
<tagger name="English">
  <tagset>
    ...
    <def-label name="ADJ">
      <tags-item tags="adj"/>
      <tags-item tags="adj.comp"/>
      <tags-item tags="adj.sup"/>
      <tags-item tags="adj.sint"/>
      <tags-item tags="adj.sint.*"/>
    </def-label>
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    ...
  </tagset>
  ...
</tagger>
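
How such a tagset groups fine-grained tag sequences can be pictured with the Python sketch below, which reads the two labels shown and maps a tag sequence to its coarse tag. Treating * as a shell-style wildcard via fnmatchcase is an assumption made for this illustration, not a statement about how the Apertium tagger matches patterns; the names load_labels, coarse_tag and LABELS are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

TAGSET = """<tagset>
  <def-label name="ADJ">
    <tags-item tags="adj"/>
    <tags-item tags="adj.comp"/>
    <tags-item tags="adj.sup"/>
    <tags-item tags="adj.sint"/>
    <tags-item tags="adj.sint.*"/>
  </def-label>
  <def-label name="PREP" closed="true">
    <tags-item tags="pr"/>
  </def-label>
</tagset>"""


def load_labels(tagset_xml):
    """Return a list of (coarse label, [tag patterns]) in file order."""
    return [
        (label.get("name"), [item.get("tags") for item in label.findall("tags-item")])
        for label in ET.fromstring(tagset_xml).findall("def-label")
    ]


def coarse_tag(fine_tags, labels):
    """Map e.g. 'adj.sint.comp' to 'ADJ'; return None if no label matches."""
    for name, patterns in labels:
        # Assumption: '*' behaves like a shell wildcard over the rest of the tags.
        if any(fnmatchcase(fine_tags, p) for p in patterns):
            return name
    return None


LABELS = load_labels(TAGSET)
for fine in ("adj", "adj.sint.comp", "pr", "n.sg"):
    print(fine, "->", coarse_tag(fine, LABELS))
```

The last tag sequence, n.sg, matches neither label, which is a reminder that a real tagset needs a label for every sequence the analyser can produce.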


Rule-based disambiguation /1

Statistical disambiguator
  Guarantees that a sentence is completely disambiguated
  May make mistakes because it uses a limited context window

Constraint grammar rules [optional]
  Do not guarantee that a sentence is always completely disambiguated
  They must be applied before the statistical disambiguator
  Can reduce (or even solve) the ambiguity
  Can use a variable-length context window


Rule-based disambiguation /2

Este (prn.dem or det.dem) día (n.m.sg) (Spanish)
This (det.dem) day (n.sg) (English)
This one (prn.dem) day (n.sg) (English)

Example of constraint grammar rule:

LIST N = (n);
LIST DET-DEM = (det dem);
LIST PRON-DEM = (prn dem);

REMOVE PRON-DEM IF (0 PRON-DEM) (0 DET-DEM) (1C N);

Remove the demonstrative-pronoun reading IF
  the current word can be a demonstrative pronoun, AND
  the current word can also be a demonstrative determiner, AND
  the first word to the right can ONLY be a noun
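
The three conditions of the rule can be traced by hand with a small Python simulation. This is only a sketch of the logic of this one rule, not an implementation of constraint grammar: cohorts are modelled as (word, list of readings) pairs, each reading is a tuple of tags, and the helper names has_tags and remove_pron_dem are invented for the example.

```python
def has_tags(reading, *tags):
    """True if the reading carries all the given tags."""
    return all(t in reading for t in tags)


def remove_pron_dem(cohorts):
    """Return new cohorts with the demonstrative-pronoun reading removed
    wherever the three conditions of the rule hold."""
    result = []
    for i, (word, readings) in enumerate(cohorts):
        nxt = cohorts[i + 1][1] if i + 1 < len(cohorts) else []
        applies = (
            any(has_tags(r, "prn", "dem") for r in readings)      # (0 PRON-DEM)
            and any(has_tags(r, "det", "dem") for r in readings)  # (0 DET-DEM)
            and bool(nxt) and all(has_tags(r, "n") for r in nxt)  # (1C N)
        )
        if applies:
            readings = [r for r in readings if not has_tags(r, "prn", "dem")]
        result.append((word, readings))
    return result


sentence = [
    ("Este", [("prn", "dem"), ("det", "dem")]),
    ("día", [("n", "m", "sg")]),
]
disambiguated = remove_pron_dem(sentence)
print(disambiguated)
```

After the rule fires, Este keeps only its determiner reading. If the next word had any non-noun reading, the careful condition (1C N) would fail and both readings of Este would be left for the statistical tagger to resolve.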


Practice 6: Taking a look at a tagset [20 min.]


Hvala!

