
9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Date post: 11-May-2015
Transcript
Page 1: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Alex Helle / Manuel Herranz

PangeaMT

Sharing Experiences on MT System, Data management, Hybridation

Page 2: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Intro

Brief history

Pangea system introduction / features for EXPERT

Hybridation experiences at Pangeanic (+ future work)

Page 3: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Intro

Brief history

http://youtu.be/K-HfpsHPmvw

• “1-2 million words an hour”

• “quite adequate speed to cope with the whole output of the Soviet Union in a week… a few hours computer time a week”

• [full scale production] “if our experiments go well, within 5 years or so”

Page 4: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

What is PangeaMT?

The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses)

A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: “PangeaMT putting open standards to work... well”, AMTA 2010 http://bit.ly/uM8x6V

06/2011 PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC

07/2011 MT experiences Sony Europe http://slidesha.re/oxZmBS

07/2011 A harness that eases re-training and updating DIY SMT as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU

02/2012 API for hosted solutions

Page 5: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

What is PangeaMT?

2007/08

2009/10

2011/12

• DIY SMT
• Automated retraining
• API v1
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, docx, odt)

2007 and before

• RB tests with commercial software
• Insufficiently good output
• Only internal production

• EU Post-Editing Award

• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields

• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial

• TMX / XLIFF workflows

2013

• Powerful API v2 for live translation
• Confidence scores
• Compatibility with more commercial formats

Page 6: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Unrest is continuing in Cairo as protesters step up their demand for Egypt’s military rulers to resign

+ specific language rules

+ job or client glossary

+ hybrid technologies

SMT at work

Page 7: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Data? Best clean, thank you

Cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>

</tuv>

<tu creationdate="20090817T114430Z" creationid="APIACCESS"

changedate="20110617T141159Z" changeid="pat">

<tuv xml:lang="EN-US">

<seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25&quot;; width –

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg>

</tuv>

<tuv xml:lang="ES-EM">

<seg><bpt i="1">{\f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1&quot;.<ept

i="3">}</ept></seg>

</tuv>

</tu>

">
<tuv xml:lang="EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang="DE-DE">

<seg>Am 22. </seg>

More cleaning

Cleaning

Page 8: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Data? Best clean, thank you

Cleaning

More cleaning

Cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>The President of the United States visited Costa Rica.</seg>

</tuv>

">
<tuv xml:lang="ES-ES">

<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora

Michelle, visitaron Costa Rica el pasado sábado.</seg>

</tuv>

">
<tuv xml:lang="JP">

<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。

英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>

<tuv xml:lang="EN-US">

<seg>It is a journalistic point of view and strengths of the English-

language newspaper Japan Times. It includes a description of the exciting and

rewarding work of translation and interpretation, as well as the introduction of

consciousness and how to acquire the required professional skills. The road to

becoming a translator and interpreter also down to the actual work site, a

comprehensive guide to interpreting the reality of today's translation industry.

</seg>

Page 9: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Data? Best clean, thank you

Cleaning

Engine training with clean data

Having approved, terminologically sound, clean data improves engine accuracy and performance with even small sets of data.

Data cleaning modules

• Remove any “suspects”:

• Sentences that are too long

• Mismatches (of many kinds!)

• Terminological inaccuracies

• Non-useful segments, etc
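The "suspect" checks listed above could be sketched roughly as follows. This is only an illustration, not PangeaMT's actual cleaning module: the function name `suspect_reason` and the thresholds `MAX_WORDS` and `MAX_LENGTH_RATIO` are assumptions chosen for the example, and terminological checks are left out because they need a client glossary.

```python
import re
from typing import Optional

# Hypothetical thresholds; real values would be tuned per language pair.
MAX_WORDS = 80
MAX_LENGTH_RATIO = 3.0


def suspect_reason(source: str, target: str) -> Optional[str]:
    """Return why a segment pair looks suspect, or None if it passes."""
    src_words = source.split()
    tgt_words = target.split()

    # Non-useful segments: one side is empty.
    if not src_words or not tgt_words:
        return "empty segment"

    # Sentences that are too long.
    if len(src_words) > MAX_WORDS or len(tgt_words) > MAX_WORDS:
        return "too long"

    # Mismatch: one side is much longer than the other (truncated or padded).
    ratio = max(len(src_words), len(tgt_words)) / min(len(src_words), len(tgt_words))
    if ratio > MAX_LENGTH_RATIO:
        return "length mismatch"

    # Mismatch: the digit groups differ between source and target.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", target)):
        return "number mismatch"

    return None


if __name__ == "__main__":
    print(suspect_reason("On 22nd May we decided not to join the group.", "Am 22."))
```

A word-count ratio is a crude proxy and would need a character-based variant for languages written without spaces.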

Parallel text extraction / Translation input / Post-edited material

This often comes from CAT tools, document alignments, or crawling.

Data Cleaning (in-lines)

Remove all non-translation data.
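As a rough illustration of in-line cleaning, the sketch below strips TMX inline formatting elements such as `<bpt>` and `<ept>` from a segment and decodes entities. The helper name `strip_inline` and the regex-based approach are assumptions made for this example; a production cleaner would more likely work on the parsed XML.

```python
import html
import re

# TMX inline elements whose content is formatting code, not translatable text.
INLINE_CODE = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_inline(segment: str) -> str:
    """Remove inline formatting codes and normalise whitespace in a <seg> body."""
    without_codes = INLINE_CODE.sub("", segment)
    decoded = html.unescape(without_codes)   # e.g. &quot; becomes "
    return " ".join(decoded.split())         # collapse runs of whitespace


if __name__ == "__main__":
    raw = r'Width –<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.'
    print(strip_inline(raw))
```

Only the formatting codes are dropped; the text between them is kept, so the cleaned segment can still be aligned word by word.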

TMX Human approval

Some of this material may actually be OK for training. It is then input in the training set.

Page 10: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

System features – For EXPERT

Cleaning

Page 11: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

System features – For EXPERT

Domain

Page 12: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

System features – For EXPERT

Engine Creation

Page 13: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

System features – For EXPERT

Engine Training

Page 14: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Unrest is continuing in Cairo as protesters step up their demand for Egypt’s military rulers to resign

• specific language rules

• job / client glossary

• hybrid technologies

• good BLEU tracking, ideal for experimentation

System features – For EXPERT

Typically 5-gram, distortion limit (DL), table

Page 15: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Different MT Systems for Different Lang Pairs?

Related languages

SMT, with accurate n-gram training and in-domain data (typically 5-grams, distortion limit, weights and fine-tuning)

Morphology-rich languages

Data is not enough and the casuistry is too large (Baltic languages like Latvian are extreme; Turkish is regular but has too many suffixes). SMT cannot cope: rule-based or hybrid.

Syntactically distant languages

Need additional information; this is where different HYBRID TECHNIQUES come into play. NO “ONE SIZE FITS ALL”

Page 16: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

- When the syntactic distance between languages is very large (unrelated languages), patterns are lost (or not found) in monotone translation.


Hybridation Experiences at Pangeanic

Rationale

Output Translation

Data

Linguistic Information

Language Knowledge

Page 17: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

SYNTAX-BASED HYBRID SMT

Altaic languages vs English

Arabic vs European languages

Agglutinative vs Non-agglutinative

Output Translation

Data

Linguistic Information

Language Knowledge

Hybridation Experiences at Pangeanic

TWO OPTIONS

RE-ORDERING

Toshiba / Mecab benchmarking EN JP

Page 18: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

CHALLENGES

SVO vs SOV

Tokenization: no spaces between words. Mecab/KyTea for JP, Peterson Segmentor for ZH.

RBMT systems have traditionally worked with linguistic & morphological analyzers. Thus “units” were segmented.

SMT can’t, so we need to tokenize to leave a similar number of “words” on both sides. Giza++ can then relate words and groups.
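To illustrate why segmentation matters before word alignment, here is a minimal sketch of dictionary-based segmentation using greedy longest match. It is not Mecab or KyTea, which rely on trained morphological models; the `segment` function and the toy `TOY_LEXICON` are assumptions made purely for this example.

```python
from typing import List, Set

# Toy lexicon, for illustration only; a real segmenter uses a trained model.
TOY_LEXICON: Set[str] = {
    "同社", "は", "次", "の", "バージョン", "を", "提供", "する", "予定", "です",
}


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match segmentation of text written without spaces."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    print(" ".join(segment("同社は次のバージョンを提供する予定です", TOY_LEXICON)))
```

Once the Japanese side is split into space-separated units like this, both sides of the corpus have a comparable number of "words" for the aligner to relate.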

Hybridation Experiences at Pangeanic

TWO METHODS

Page 19: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

CHALLENGES

SVO vs SOV

Hybridation Experiences at Pangeanic

TWO OPTIONS

Page 20: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

CHALLENGES

SVO vs SOV

Re-ordering?

Phrase-based or hierarchical models (syntactical)?

Hybridation Experiences at Pangeanic

TWO METHODS

Continue to press the button to scroll through the components of the program until

the display shows the desired current selection.

Japanese proper word order would be

the display the desired current selection shows until the components the program of

through to scroll the button to press continue.
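The kind of head-final re-ordering shown in this example could be sketched as a simple rule over a parse tree. The tree encoding (label, children), the `reorder` function and the set `HEAD_INITIAL_LABELS` are assumptions for illustration; they are not the re-ordering rules used in the Pangeanic experiments.

```python
from typing import List, Tuple, Union

# A tree node is either a word (str) or a (label, children) pair.
Tree = Union[str, Tuple[str, list]]

# Phrase types whose head comes first in English but last in Japanese
# (an assumed, simplified rule set).
HEAD_INITIAL_LABELS = {"VP", "PP", "SBAR"}


def reorder(node: Tree) -> List[str]:
    """Return the words of the tree with heads moved to the end of their phrase."""
    if isinstance(node, str):
        return [node]
    label, children = node
    if label in HEAD_INITIAL_LABELS and len(children) > 1:
        # Move the head (first child) behind its complements.
        children = children[1:] + children[:1]
    words: List[str] = []
    for child in children:
        words.extend(reorder(child))
    return words


if __name__ == "__main__":
    tree = ("VP", ["scroll", ("PP", ["through", ("NP", ["the", "components"])])])
    print(" ".join(reorder(tree)))
```

Noun phrases are left untouched here, which mirrors the example above where "the components" keeps its internal order while "through" and "scroll" move behind it.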

Page 21: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

SYNTAX-BASED (TREE) FOR HYBRID SMT

Hybridation Experiences at Pangeanic

Syntax-based analysis & re-ordering rules

Tree depth: 10

Calc time +59% !!

Page 22: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

When available, the company plans to offer the following:

available When , the company the following : plans to offer :

発売時には、同社は次のバージョンを提供する予定です。

(VBPt3) (to) (VBinf) (DET) (NN)

(Predicate)

Nipponization module

Translation & Cleaning

(Subject) (VBPt) (to)

(ADV) (ADJ) (Punct) (DET) (NNSing)

(Cond clause),

SYNTAX-BASED RULES FOR HYBRID SMT

Hybridation Experiences at Pangeanic

Syntax-based analysis & re-ordering rules

Page 23: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

TOSHIBA vs MECAB

Toshiba’s The Honyaku is an established RB system (30+ years)

Lacks flexibility, rules contradict each other

Proposal: re-arrange the whole EN corpus for JP with Toshiba’s rules, but this meant dependency on a proprietary system for future inputs.

Hybridation Experiences at Pangeanic

TWO OPTIONS

Page 24: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

TOSHIBA vs MECAB – LESSONS LEARNT

Mecab re-ordering produced higher BLEU than Toshiba’s

5-fold structure

Hybridation Experiences at Pangeanic

TWO OPTIONS

Page 25: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

TOSHIBA vs MECAB – LESSONS LEARNT

Mecab re-ordering produced higher BLEU than Toshiba’s

Paper published December 2011, AAMT: “Going Hybrid: Pangeanic’s and Toshiba’s First Steps Toward ENJP MT Hybridation”

Hybridation Experiences at Pangeanic

TWO OPTIONS


Page 27: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Future (current) Work on Hybrids

Morphology-rich langs: RU in particular.

Improve DE

Distant languages: re-ordering for AR?

Agglutinative langs: TK – new paradigm

Page 28: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

Intro

Brief history

Pangea system introduction / features for EXPERT

Hybridation experiences at Pangeanic (+ future work)

Page 29: 9. Manuel Harranz (pangeanic) Hybrid Solutions for Translation

[email protected]

#manuelhrrnz #pangeanic pangeanic

