Andrejs Vasiļjevs, Chairman of the Board, andrejs@tilde.com

Page 1

andrejs vasiļjevs

chairman of the board

andrejs@tilde.com

data is core

LOCALIZATION WORLD PARIS, JUNE 5, 2012

Page 2

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

Page 3

MT: machine translation

Page 4

disruptive INNOVATION

Page 5

MT paradigms

rule-based MT

• High quality translation in specialized domains

• Requires highly qualified linguists, researchers and software developers

• Time and resource consuming

• Difficult to evolve

statistical MT

• Translation and linguistic knowledge is derived from data

• Relatively easy and quick to develop

• Requires huge amounts of parallel and monolingual data

• Translation quality is inconsistent and can differ dramatically from domain to domain

Page 6

CHALLENGE

Page 7

15 largest languages

50%

Page 8

domains

IT, Aerospace,
Agriculture, Automotive,
Chemistry, Coal and mining industries,
Communications, Culture,
Defence, Education,
Electronics, Energy,
Finance, Food technology,
Government affairs, Legal,
Life sciences, Logistics,
Marketing, Mechanical engineering,
Medicine, Pharmaceuticals,
Religion, Social affairs,
Trade

Page 9

one size fits all?

Page 10

DATA

Page 11
Page 12

The total body of European Union law applicable in the EU Member States

JRC-Acquis http://langtech.jrc.it/JRC-Acquis.html

Page 13

The DGT Multilingual Translation Memory of the Acquis Communautaire

DGT-TM http://langtech.jrc.it/DGT-TM.html

Page 14

Parallel data collected from the Web by Uppsala University

90 languages, 3800 language pairs

2.7B parallel units

OPUS http://opus.lingfil.uu.se

Page 15

open European language resource infrastructure

http://www.meta-net.eu

Page 16

Data for SMT training

Page 17

PLATFORM

Page 18

Moses toolkit

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
   --corpus factored-corpus/proj-syndicate \
   --root-dir unfactored \
   --f de --e en \
   --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
   in: raw-stem
   out: tokenized-stem
   default-name: corpus/tok
   pass-unless: input-tokenizer output-tokenizer
   template-if: input-tokenizer IN.$input-extension OUT.$input-extension
   template-if: output-tokenizer IN.$output-extension OUT.$output-extension
   parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
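
The training call that appears on this slide can also be assembled from a script. The following is a minimal sketch and not part of the original deck: the helper name `build_train_command` is hypothetical, and the flag values simply repeat the ones shown on the slide. It only builds and prints the argument list; it does not run Moses.

```python
import shlex


def build_train_command(corpus, root_dir, src, tgt, lm_spec):
    """Return the argument list for a train-model.perl call.

    Only flags that appear on the slide are used; the helper itself
    is a hypothetical illustration.
    """
    return [
        "train-model.perl",
        "--corpus", corpus,
        "--root-dir", root_dir,
        "--f", src,
        "--e", tgt,
        "--lm", lm_spec,
    ]


# Values copied from the slide (German to English toy setup).
cmd = build_train_command(
    corpus="factored-corpus/proj-syndicate",
    root_dir="unfactored",
    src="de",
    tgt="en",
    lm_spec="0:3:factored-corpus/surface.lm:0",
)

# shlex.join quotes arguments where needed so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if a corpus path ever contains spaces.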

Page 19

build your own MT engine

Page 20

Tilde (Coordinator), LATVIA

University of Edinburgh, UK

Uppsala University, SWEDEN

Copenhagen University, DENMARK

University of Zagreb, CROATIA

Moravia, CZECH REPUBLIC

SemLab, NETHERLANDS

Page 21

• Cloud-based self-service MT factory

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use the LetsMT! platform to tailor MT systems to their needs using their non-public data

Page 22

Resource Repository

• Stores SMT training data

• Supports different formats – TMX, XLIFF, PDF, DOC, plain text

• Converts to unified format

• Performs format conversions and alignment
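
As an illustration of the format conversion step, the sketch below reads a small TMX document and returns plain source/target pairs. It is an example under stated assumptions, not the LetsMT! implementation: the function name `tmx_to_pairs` and the sample content are made up, and the "unified format" is taken here to be simple tuples of text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory used only for this example.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save changes?</seg></tuv>
      <tuv xml:lang="lv"><seg>Vai saglabāt izmaiņas?</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) segment pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup elements.
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
for src, tgt in pairs:
    print(f"{src}\t{tgt}")
```

Tab-separated sentence pairs like these are a common intermediate form before tokenization and alignment, which is why the sketch prints them that way.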

Page 23

user-driven machine translation

• Put users in control of their data

• Fully public or fully private should not be the only choice

• Data can be used for MT generation without exposing it

• Empower users to create custom MT engines from their data

Page 24

integration

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

Page 25

Integration of MT in SDL Trados

Page 26

[Architecture diagram. Interfaces: web page, web service, web page translation widget, CAT tools, web browser plug-ins. Access: upload, anonymous access, authenticated access. Components: SMT Resource Directory, SMT System Directory, SMT Resource Repository, SMT Multi-Model Repository (trained SMT models), sharing of training data, training (Giza++, Moses SMT toolkit), using (Moses decoder), processing, evaluation ..., system management, user authentication, access rights control ...]

Page 27
Page 28
Page 29

use case: FORTERA

Page 30

EVALUATION

Page 31

Previous Work

• Keyboard-monitoring of post-editing (O'Brien, 2005)

• Productivity of MS Office localization (Schmidtke, 2008)

   5-10% productivity gain for SP, FR, DE

• Adobe (Flournoy and Duran, 2009)

   22%-51% productivity increase for RU, SP, FR

• Autodesk Moses SMT system (Plitt and Masselot, 2010)

   74% average productivity increase for FR, IT, DE, SP

Page 32

Evaluation at Tilde

• Latvian:

   About 1.6 M native speakers

   Highly inflectional: ~22M possible word forms in total

   Official EU language

• Tilde English – Latvian MT system

• IT Software Localization Domain

• Evaluation of translators’ productivity

Page 33

English-Latvian data

Bilingual corpus      Parallel units
Localization TM       1 290 K
DGT-TM                1 060 K
OPUS EMEA               970 K
Fiction                 660 K
Dictionary data         510 K
Web corpus              900 K
Total                 5 370 K

Monolingual corpus                  Words
Latvian side of parallel corpus      60 M
News (web)                          250 M
Fiction                               9 M
Total, Latvian                      319 M

Page 34

MT Integration into Localization Workflow

Evaluate original / assign Translator and Editor

Analyze against TMs

Translate using translation suggestions from TMs and MT

Evaluate translation quality / Edit

Fix errors

Ready translation

MT translate new sentences

Page 35

Evaluation of Productivity

• The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level

• Productivity was measured as the translation output of an average translator in words per hour

• 5 translators participated in the evaluation, including both experienced and new translators
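
The words-per-hour measure described on this slide reduces to simple arithmetic. The sketch below shows one way to compute it and the resulting relative gain; the function names are hypothetical and the word counts and hours are illustrative only, not figures from the evaluation.

```python
def words_per_hour(words, hours):
    """Translation output expressed as words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_increase(baseline_wph, assisted_wph):
    """Relative change in productivity, in percent of the baseline."""
    if baseline_wph <= 0:
        raise ValueError("baseline must be positive")
    return (assisted_wph - baseline_wph) / baseline_wph * 100.0


# Illustrative numbers only (assumed for this example).
baseline = words_per_hour(words=4000, hours=8)   # TM suggestions only
assisted = words_per_hour(words=4800, hours=8)   # TM and MT suggestions
gain = productivity_increase(baseline, assisted)

print(f"baseline: {baseline:.0f} words/hour")
print(f"assisted: {assisted:.0f} words/hour")
print(f"productivity increase: {gain:.1f}%")
```

Expressing the gain relative to the TM-only baseline keeps results comparable across translators with different absolute speeds.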

Page 36

Evaluation of Quality

• Performed by human editors as part of their regular QA process

• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator

• Comparison to a reference is not part of this evaluation

• The Tilde standard QA assessment form was used, covering the following text quality areas:

   Accuracy

   Spelling and grammar

   Style

   Terminology

Page 37

QA Grades

Error score (sum of weighted errors)    Resulting quality evaluation
0…9                                     Superior
10…29                                   Good
30…49                                   Mediocre
50…69                                   Poor
>70                                     Very poor

Tilde Localization QA assessment applied in the evaluation
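
The grade bands can be read as a simple lookup from error score to grade. The sketch below is one possible reading, not Tilde's assessment tool: the function name `qa_grade` is hypothetical, the bands are treated as half-open intervals so that fractional scores also get a grade, and a score of exactly 70, which the slide leaves undefined, is assumed to fall under "Very poor".

```python
def qa_grade(error_score):
    """Map an error score (sum of weighted errors) to a quality grade.

    Assumptions: bands are half-open intervals, and a score of exactly 70
    is graded "Very poor" because the slide does not cover that value.
    """
    if error_score < 0:
        raise ValueError("error score cannot be negative")
    if error_score < 10:
        return "Superior"
    if error_score < 30:
        return "Good"
    if error_score < 50:
        return "Mediocre"
    if error_score < 70:
        return "Poor"
    return "Very poor"


for score in (5, 19, 27, 45, 69, 70):
    print(score, qa_grade(score))
```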

Page 38

Evaluation data

► 54 documents in the IT domain

► 950-1050 adjusted words in each document

► Each document was split in half:

   ► the first half was translated using suggestions from TM only

   ► the second half was translated using suggestions from both TM and MT

Page 39

% productivity

Latvian: 32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

Page 40

Evaluation at Moravia

► IT Localization domain

► Systems trained on the LetsMT! platform

► English – Czech translation

   25.1% productivity increase

   Error score increase from 19 to 27, still at the GOOD grade (<30)

► English – Polish translation

   28.5% productivity increase

   Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

Page 41

% productivity

Czech: 25.1%

Polish: 28.5%

Slovak*: 25%

* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.

Page 42

MORE DATA

Page 43

ACCURAT TOOLKIT

corpora collection tools

comparability metrics

named entity recognition tools

terminology extraction tools

Page 44

use case: AUTOMOTIVE MANUFACTURER

Page 45

very small translation memories (just 3500 sentences)

no in-domain corpora in target languages

no money for expensive developments

?

Page 46

data collection workflow

Terminology extraction

Web crawling (parallel, monolingual)

Parallel data extraction from comparable corpora

Page 47

Resulting data

TMs

Terminology glossary

Parallel phrases

Parallel Named Entities

Monolingual target language corpus

Page 48

SMT Training

General domain data as a basis

Domain specific language model

Impose domain specific terminology and named entity translations

Add linguistic knowledge on top of statistical components

Page 49

right data & right tools

Page 50

tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456

