Page 1: S andrejs vasiļjevs chairman of the board andrejs@tilde.com data is core LOCALIZATION WORLD PARIS, JUNE 5, 2012.

andrejs vasiļjevs

chairman of the board
andrejs@tilde.com

data is core

LOCALIZATION WORLD PARIS, JUNE 5, 2012

Page 2

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

Page 3

MT

machine translation

Page 4

disruptive INNOVATION

Page 5

MT paradigms

rule-based MT

• High quality translation in specialized domains
• Requires highly qualified linguists, researchers and software developers
• Time and resource consuming
• Difficult to evolve

statistical MT

• Translation and linguistic knowledge is derived from data
• Relatively easy and quick to develop
• Requires huge amounts of parallel and monolingual data
• Translation quality is inconsistent and can differ dramatically from domain to domain

Page 6

CHALLENGE

Page 7

15 largest languages

50%

Page 8

domains

• IT
• Aerospace
• Agriculture
• Automotive
• Chemistry
• Coal and mining industries
• Communications
• Culture
• Defence
• Education
• Electronics
• Energy
• Finance
• Food technology
• Government affairs
• Legal
• Life sciences
• Logistics
• Marketing
• Mechanical engineering
• Medicine
• Pharmaceuticals
• Religion
• Social affairs
• Trade

Page 9

one size fits all?

Page 10

DATA

Page 11
Page 12

The total body of European Union law applicable in the EU Member States

JRC-Acquis http://langtech.jrc.it/JRC-Acquis.html

Page 13

The DGT Multilingual Translation Memory of the Acquis Communautaire

DGT-TM

http://langtech.jrc.it/DGT-TM.html

Page 14

Parallel data collected from the Web by University of Uppsala

90 languages, 3800 language pairs

2.7B parallel units

Opus

http://opus.lingfil.uu.se

Page 15

open European language resource infrastructure

http://www.meta-net.eu

Page 16

Data for SMT training

Page 17

PLATFORM

Page 18

Moses toolkit

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
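The training call on this slide can be assembled programmatically. The following is a minimal sketch that only builds the argument list; the flags and values are copied from the slide, while the helper name `build_train_command` is hypothetical and nothing is executed.

```python
import shlex


def build_train_command(corpus, root_dir, src, tgt, lm_spec):
    """Assemble the train-model.perl argument list as shown on the slide.

    The helper itself is illustrative; only the flags come from the slide.
    """
    return [
        "train-model.perl",
        "--corpus", corpus,
        "--root-dir", root_dir,
        "--f", src,
        "--e", tgt,
        "--lm", lm_spec,
    ]


cmd = build_train_command(
    corpus="factored-corpus/proj-syndicate",
    root_dir="unfactored",
    src="de",
    tgt="en",
    lm_spec="0:3:factored-corpus/surface.lm:0",
)

# Render the list as a single shell-safe command line for logging.
print(shlex.join(cmd))
```

On a machine where the Moses scripts are installed and on the PATH, the list could be handed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with corpus paths.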

Page 19

build your own MT engine

Page 20

Tilde / Coordinator, LATVIA

University of Edinburgh, UK

Uppsala University, SWEDEN

Copenhagen University, DENMARK

University of Zagreb, CROATIA

Moravia, CZECH REPUBLIC

SemLab, NETHERLANDS

Page 21

• Cloud-based self-service MT factory

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use the LetsMT! platform to tailor an MT system to their needs using their non-public data

Page 22

Resource Repository

• Stores SMT training data
• Supports different formats: TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
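As an illustration of converting one of the listed formats into a unified representation, here is a minimal sketch that reads TMX into plain (source, target) pairs using only the Python standard library. The slide does not describe the repository's actual unified format or converter, so the pair list and the function name `tmx_to_pairs` are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper: the tuple list stands in for a "unified format".
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may carry a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open file</seg></tuv>
      <tuv xml:lang="lv"><seg>Atvērt failu</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
print(pairs)
```

The sketch matches language codes exactly, so a file tagged `en-US` would need the code normalized first; units missing one side are skipped rather than padded.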

Page 23

• Put users in control of their data

• Fully public or fully private should not be the only choice

• Data can be used for MT generation without exposing it

• Empower users to create custom MT engines from their data

user-driven machine translation

Page 24

integration

• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration

Page 25

Integration of MT in SDL Trados

Page 26

[Architecture diagram. Labels recovered from the slide: Training; Using; Sharing of training data; Giza++; Moses SMT toolkit; Moses decoder; SMT Resource Repository; SMT Resource Directory; SMT Multi-Model Repository (trained SMT models); SMT System Directory; Processing, Evaluation ...; Upload; Anonymous access; Authenticated access; System management, user authentication, access rights control ...; Web page; Web service; Web page translation widget; CAT tools; Web browser plug-ins]

Page 27
Page 28
Page 29

use case: FORTERA

Page 30

EVALUATION

Page 31

Previous Work

• Keyboard-monitoring of post-editing (O'Brien, 2005)

• Productivity of MS Office localization (Schmidtke, 2008)
    5-10% productivity gain for SP, FR, DE

• Adobe (Flournoy and Duran, 2009)
    22%-51% productivity increase for RU, SP, FR

• Autodesk Moses SMT system (Plitt and Masselot, 2010)
    74% average productivity increase for FR, IT, DE, SP

Page 32

Evaluation at Tilde

• Latvian:
    About 1.6 M native speakers
    Highly inflectional: ~22 M possible word forms in total
    Official EU language

• Tilde English – Latvian MT system

• IT Software Localization Domain

• Evaluation of translators’ productivity

Page 33

English-Latvian data

Bilingual corpus                    Parallel units
Localization TM                     1 290 K
DGT-TM                              1 060 K
OPUS EMEA                             970 K
Fiction                               660 K
Dictionary data                       510 K
Web corpus                            900 K
Total                               5 370 K

Monolingual corpus                  Words
Latvian side of parallel corpus      60 M
News (web)                          250 M
Fiction                               9 M
Total, Latvian                      319 M
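The table's arithmetic can be tallied with a short sketch. The figures are copied from the slide; the dictionary names are illustrative. As listed, the bilingual rows add up to 5 390 K, slightly above the stated total of 5 370 K, and the slide does not explain the difference (rounding of the individual rows is one possibility). The monolingual rows add up to the stated 319 M.

```python
# Parallel units in thousands, as listed on the slide.
parallel_units_k = {
    "Localization TM": 1290,
    "DGT-TM": 1060,
    "OPUS EMEA": 970,
    "Fiction": 660,
    "Dictionary data": 510,
    "Web corpus": 900,
}

# Latvian monolingual words in millions, as listed on the slide.
monolingual_words_m = {
    "Latvian side of parallel corpus": 60,
    "News (web)": 250,
    "Fiction": 9,
}

listed_parallel_total_k = sum(parallel_units_k.values())
listed_mono_total_m = sum(monolingual_words_m.values())

# Share of each bilingual corpus in the sum of the listed rows, in percent.
shares = {
    name: round(100 * units / listed_parallel_total_k, 1)
    for name, units in parallel_units_k.items()
}

print(listed_parallel_total_k, listed_mono_total_m)
for name, share in shares.items():
    print(f"{name}: {share}%")
```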

Page 34

MT Integration into Localization Workflow

Evaluate original / assign Translator and Editor

Analyze against TMs

Translate using translation suggestions from TMs and MT

Evaluate translation quality / Edit

Fix errors

Ready translation

MT translate new sentences

Page 35

Evaluation of Productivity

• The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level

• Productivity was measured as the translation output of an average translator in words per hour

• 5 translators participated in the evaluation, including both experienced and new translators
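The words-per-hour measure and the resulting percentage gain can be sketched as follows. The function names and all input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not figures from the evaluation.

```python
def words_per_hour(words, hours):
    """Translation output per hour of work."""
    return words / hours


def percent_increase(baseline, with_mt):
    """Relative productivity gain over the baseline, in percent."""
    return 100.0 * (with_mt - baseline) / baseline


# Hypothetical inputs, not data from the slides.
tm_only = words_per_hour(words=4000, hours=8.0)
tm_and_mt = words_per_hour(words=5200, hours=8.0)

gain = percent_increase(tm_only, tm_and_mt)
print(f"{tm_only:.0f} -> {tm_and_mt:.0f} words/hour, gain {gain:.1f}%")
```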

Page 36

Evaluation of Quality

• Performed by human editors as part of their regular QA process

• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator

• Comparison to a reference is not part of this evaluation

• The Tilde standard QA assessment form was used, covering the following text quality areas:
    Accuracy
    Spelling and grammar
    Style
    Terminology

Page 37

QA Grades

Error Score (sum of weighted errors)    Resulting Quality Evaluation
0…9                                     Superior
10…29                                   Good
30…49                                   Mediocre
50…69                                   Poor
>70                                     Very poor

Tilde Localization QA assessment applied in the evaluation
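The grade bands can be expressed as a small lookup. This is a sketch under two assumptions that the slide leaves open: fractional scores fall into the band below the next threshold (consistent with the deck's "GOOD grade (<30)" wording for the Moravia results), and a score of exactly 70 counts as "Very poor". The function name `qa_grade` is hypothetical.

```python
def qa_grade(error_score):
    """Map an error score (sum of weighted errors) to a quality grade."""
    if error_score < 0:
        raise ValueError("error score cannot be negative")
    # Upper bounds are exclusive; assumption: 70 itself is "Very poor".
    bands = [(10, "Superior"), (30, "Good"), (50, "Mediocre"), (70, "Poor")]
    for upper, grade in bands:
        if error_score < upper:
            return grade
    return "Very poor"


# Error scores reported for the Moravia evaluation in this deck.
for score in (19, 27, 16.8, 23.6):
    print(score, qa_grade(score))
```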

Page 38

Evaluation data

► 54 documents in IT domain

► 950-1050 adjusted words in each document

► Each document was split in half:

    ► the first half was translated using suggestions from TM only

    ► the second half was translated using suggestions from both TM and MT
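The split into a TM-only half and a TM+MT half can be sketched as below. The slide does not say how the halves were cut or how "adjusted words" are counted, so this version makes two assumptions: it counts whitespace-separated words and cuts at the first segment boundary where half of the words have been reached. The function name is hypothetical.

```python
def split_in_half(segments):
    """Split segments into two consecutive parts with roughly equal word counts."""
    total = sum(len(seg.split()) for seg in segments)
    running = 0
    for i, seg in enumerate(segments):
        running += len(seg.split())
        # Cut after the first segment that reaches half of the total words.
        if running * 2 >= total:
            return segments[: i + 1], segments[i + 1 :]
    return segments, []


document = [
    "Click the Start button.",
    "Select a file to open.",
    "Save your changes before closing.",
    "Restart the application.",
]

tm_only_part, tm_and_mt_part = split_in_half(document)
print(len(tm_only_part), len(tm_and_mt_part))
```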

Page 39

productivity increase: 32.9%* (Latvian)

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A. Evaluation of SMT in localization to under-resourced inflected language. In Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT 2011), pp. 35-40, May 30-31, 2011, Leuven, Belgium.

Page 40

Evaluation at Moravia

► IT Localization domain

► Systems trained on the LetsMT platform

► English-Czech translation
    25.1% productivity increase
    Error score increase from 19 to 27, still at the GOOD grade (<30)

► English-Polish translation
    28.5% productivity increase
    Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

Page 41

productivity increase, %

Czech      25.1%
Polish     28.5%
Slovak*    25%

* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.

Page 42

MORE DATA

Page 43

corpora collection tools

comparability metrics

named entity recognition tools

terminology extraction tools

ACCURAT TOOLKIT

Page 44

use case: AUTOMOTIVE MANUFACTURER

Page 45

very small translation memories (just 3500 sentences)

no in-domain corpora in target languages

no money for expensive developments

?

Page 46

data collection workflow

Terminology extraction

Web crawling: parallel, monolingual

Parallel data extraction from comparable corpora

Page 47

TMs

Terminology glossary

Parallel phrases

Parallel Named Entities

Monolingual target language corpus

Resulting data

Page 48

General domain data as a basis

Domain specific language model

Impose domain specific terminology, named entity translations

Add linguistic knowledge on top of statistical components

SMT Training
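The slide does not say how the domain-specific language model is combined with the general-domain data. One common option, shown here purely as an illustration and not as the method used in the project, is linear interpolation of two models: P(w) = λ·P_domain(w) + (1 − λ)·P_general(w). The toy unigram models, the sample sentences and the weight λ = 0.7 below are all assumptions.

```python
from collections import Counter


def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(p_general, p_domain, lam):
    """Linear interpolation: lam * P_domain + (1 - lam) * P_general."""
    vocab = set(p_general) | set(p_domain)
    return {
        word: lam * p_domain.get(word, 0.0) + (1 - lam) * p_general.get(word, 0.0)
        for word in vocab
    }


# Toy corpora, invented for illustration only.
general = "the file is on the table near the door".split()
domain = "open the file menu and save the file".split()

p_general = unigram_probs(general)
p_domain = unigram_probs(domain)
mixed = interpolate(p_general, p_domain, lam=0.7)

# The in-domain word "file" gains probability mass relative to the general model.
print(round(p_general["file"], 3), round(mixed["file"], 3))
```

Because both inputs are proper distributions and the weights sum to one, the interpolated model is again a distribution; in practice λ would be tuned on held-out in-domain text.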

Page 49

right data & right tools

Page 50

tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456

