8.13: Public Project Presentation (Update)
Stephan Busemann
Distribution: Public
EuroMatrixPlus
Bringing Machine Translation
for European Languages to the User
ICT 231720 Deliverable 8.13
Project funded by the European Community
under the Seventh Framework Programme for
Research and Technological Development.
Project ref no.: ICT-231720
Project acronym: EuroMatrixPlus
Project full title: Bringing Machine Translation for European Languages to the User
Instrument: STREP
Thematic Priority: ICT-2007.2.2 Cognitive systems, interaction, robotics
Start date / duration: 01 March 2009 / 38 Months
Distribution: Public
Contractual date of delivery: August 31, 2011
Actual date of delivery: November 30, 2011
Date of last update: November 28, 2011
Deliverable number: 8.13
Deliverable title: Public Project Presentation (Update)
Type: Report
Status & version: Final
Number of pages: 3
Contributing WP(s): WP8
WP / Task responsible: DFKI
Other contributors: CU, LSIL
Author(s): Stephan Busemann
EC project officer: Michel Brochard
Keywords:
The partners in EuroMatrixPlus are:
DFKI GmbH, Saarbrücken (DFKI)
University of Edinburgh (UEDIN)
Charles University (CUNI-MFF)
Johns Hopkins University (JHU)
Fondazione Bruno Kessler (FBK)
Université du Maine, Le Mans (LeMans)
Dublin City University (DCU)
Lucy Software and Services GmbH (Lucy)
Central and Eastern European Translation, Prague (CEET)
Ľudovít Štúr Institute of Linguistics, Slovak Academy of Sciences (LSIL)
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences (IICT-BAS)
For copies of reports, updates on project activities and other EuroMatrixPlus-related information, contact:
The EuroMatrixPlus Project Co-ordinator
Prof. Dr. Hans Uszkoreit, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbrücken
[email protected]
+49 (681) 85775-5282 - Fax +49 (681) 85775-5338
Copies of reports and other material can also be accessed via the project’s homepage:
http://www.euromatrixplus.net/
© 2011, The Individual Authors
No part of this document may be reproduced or transmitted in any form, or by any means,
electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission from the copyright owner.
Executive Summary
This document contains the public project presentation representing the current state of the
EuroMatrixPlus project after 33 months. After motivating and describing the project goals,
it surveys the scientific progress and then details each major point. After an
overview of dissemination activities, the presentation concludes with an assessment of how the
goals are being met.
This presentation may form the basis for project presentations. The corresponding source
file will be used by the Consortium to create updates whenever needed.
The slide set is available as a PDF document from the project website at
http://www.euromatrixplus.eu/activities/.
EuroMatrix Plus - ICT 231720
Bringing Machine Translation for European Languages
to the User
March 2009 - April 2012
2 © 2011 EuroMatrix Plus Consortium – Public Project Presentation
Motivation – Approaches to MT
Different approaches to MT have complementary pros and cons.
Source: Chen & Chen: A Hybrid Approach to Machine Translation System Design, Computational Linguistics and Chinese Language Processing, 1996
Motivation – Direction of Research
The different paradigms of rule-based MT (RBMT) and statistical MT (SMT) complement each other regarding their pros and cons. Thus we
• Combine their strengths to compensate for their weaknesses
• Develop special strategies to tackle phenomena that are difficult to deal with

                        RBMT   SMT
Syntax, Morphology      ++     -
Structural Semantics    +      --
Lexical Semantics       -      +
Lexical Adaptivity      --     +
Lexical Reliability     +      -
Objectives of EuroMatrix Plus
1. Continue the rapid advance of machine translation technology, creating example systems for every official EU language, and providing other machine translation developers with our infrastructure for building statistical translation models.
2. Continue and broaden the controlled systematic investigation of different approaches and techniques to accelerate the scientific evolution of novel methods, including both selection and cross-fertilization. The aim is to arrive at scientifically well understood novel combinations of methods that are demonstrably superior to the state of the art.
3. Focus on bringing machine translation to the users – both professional translation services and lay end users.
Objectives of EuroMatrix Plus (cont’d)
4. Contribute to the growth and competitiveness of the European MT research scene and infrastructure through its open, international, competitive shared tasks and through living, community-supported surveys of resources, tools, systems and their respective capabilities.
5. Create an openly accessible sample application that enables users to automatically translate news stories and web pages from any European language into any other, and whose corrections will be exploited as data for improving translation technology.
Project Features
• FP7 ICT Grant 231720
• Budget: 5.94 M€
• Duration: 03/2009 - 04/2012 (38 months)
• Co-ordinator: DFKI GmbH
• http://www.euromatrixplus.eu
The Partners in Detail
Name                                    Country                   Research Focus
Deutsches Forschungszentrum für         Germany                   Hybrid MT
  Künstliche Intelligenz GmbH
University of Edinburgh                 United Kingdom            Statistical MT
Charles University                      Czech Republic            Tree-based MT
Johns Hopkins University                United States of America  Community-based MT
Fondazione Bruno Kessler                Italy                     Statistical MT
Laboratoire d'Informatique de           France                    Statistical MT
  l'Université du Maine
The Partners in Detail (cont’d)
Name                                    Country         Research Focus
Dublin City University                  Ireland         Translation and Localization & MT
Lucy Software and Services GmbH         Germany         Hybrid MT
CEET language solutions                 Czech Republic  MT evaluation
Ľudovít Štúr Institute of Linguistics   Slovakia        MT between closely related languages
Institute of Information and            Bulgaria        HPSG-based MT
  Communication Technologies of the
  Bulgarian Academy of Sciences
Scientific Results
Considerable improvements of SMT by enriching phrase-based and hierarchical models
Next step in hybrid MT research by adding statistical weights to intermediate representations of a commercial RBMT system
Progress in translating between data-poor language pairs by using another translation path through a pivot language, and exploiting comparable data
New training method for quickly updating the model and thus utilizing corrections provided by users
Exploiting monolingual post-editing results to improve MT
Embedding of MT technology into translation and localisation workflows, combined with Translation Memories
Improving SMT by Enriching Models
Shallow syntax modeling
• Improve statistical MT by reordering the source text to reflect the structure of the target text
Reordering for hierarchical models
• Use a maximum entropy model to score the movement proposed by rule application
Mixed source syntax model
• Hierarchical rules use general non-terminals without enforcing a particular category.
• Source syntax rules use linguistic categories which restrict the type of phrase the non-terminal can be replaced with.
• The mixed source syntax model relaxes the strict categories of the syntax model to facilitate translation
Example rules:
the x of the x → das x des x
the np of the np → das x des x
the np of the x → das x des x
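The difference between the three kinds of rules can be illustrated with a small sketch. It is not the project's implementation: the `Rule` class, the `can_fill` function and the convention that the label `X` accepts any category are assumptions made for illustration, and assigning the three example rules to the three models follows only their order on the slide.

```python
from dataclasses import dataclass

WILDCARD = "X"  # assumed convention: a general non-terminal that accepts any category


@dataclass(frozen=True)
class Rule:
    """A toy synchronous rule; only the source-side slot labels matter here."""
    name: str
    source_slots: tuple  # labels of the source-side non-terminals, left to right
    target: str


def can_fill(rule, phrase_categories):
    """Return True if phrases with the given categories may fill the rule's slots."""
    if len(phrase_categories) != len(rule.source_slots):
        return False
    return all(slot == WILDCARD or slot == category
               for slot, category in zip(rule.source_slots, phrase_categories))


hierarchical = Rule("hierarchical", ("X", "X"), "das X des X")
source_syntax = Rule("source syntax", ("NP", "NP"), "das X des X")
mixed = Rule("mixed", ("NP", "X"), "das X des X")

# The second slot is filled by a phrase that was not labelled as an NP.
fillers = ("NP", "ADJP")
for rule in (hierarchical, source_syntax, mixed):
    print(rule.name, can_fill(rule, fillers))
```

In this toy setting the strict source syntax rule rejects the fillers, while the hierarchical and mixed rules accept them, which is the kind of relaxation the last bullet describes.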
Progress in Hybrid MT
A commercial rule-based MT system was extended by statistical modules according to the "SMT feeds RBMT" hybrid architecture. Results:
1. RBMT analysis now includes a state-of-the-art stochastic parser in order to select the best from the many parse trees.
2. The transfer lexicon has been extended with bilingual terminology extracted from a parallel corpus, enriched with linguistic information including the internal structure of multiword expressions, frequency and category of the overall term.
[Figure: hybrid architecture diagram with the components Source Text, RBMT Engine, Target Text, Lexicon, Linguistic Processing, Manual Validation, Phrase Table, Parallel Corpus, and Alignment / Phrase extraction]
Other Hybrid MT Set-Ups Explored
Pivot languages
The pivot method used is composition.
Tree-based translation
In addition to morphological and shallow syntactic layers, a new TectoMT system utilizes the so-called tectogrammatical layer, which describes deep syntax including co-reference information. TectoMT is built using our new publicly available platform Treex and makes use of both rule-based and statistical processing.
HPSG-based translation
Following the usual set-up of an RBMT system, HPSG processing is used for analysis and generation, whereas the transfer between the HPSG semantic representations is modeled statistically.
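The slide does not say at which level the pivot composition takes place. One possible reading, sketched below, is that a source-to-pivot system and a pivot-to-target system are chained. The word-by-word translators and the tiny lexicons are invented stand-ins for real MT systems, and all function names are hypothetical.

```python
def make_word_translator(lexicon):
    """Build a toy word-by-word translator (a stand-in for a real MT system)."""
    def translate(sentence):
        # Unknown words are passed through unchanged.
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def compose(source_to_pivot, pivot_to_target):
    """Compose two translators: the output of the first is the input of the second."""
    def translate(sentence):
        return pivot_to_target(source_to_pivot(sentence))
    return translate


# Toy lexicons, invented for this example: Slovak -> English -> German.
sk_to_en = make_word_translator({"dom": "house", "je": "is", "malý": "small"})
en_to_de = make_word_translator({"house": "Haus", "is": "ist", "small": "klein"})

sk_to_de = compose(sk_to_en, en_to_de)
print(sk_to_de("dom je malý"))  # prints "Haus ist klein"
```

Chaining keeps the two systems independent, so each can be trained on whatever parallel data exists with the pivot language, at the price of passing errors from the first step on to the second.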
Languages With Low Resources
Updating and extending existing resources
• Europarl (57m words, 21 languages)
• News Commentary (1-2m words)
• UN corpus (300m words)
• Monolingual news corpus (1b words)
Creating new data resources
• Czech-English:
  • Corpus annotated with tectogrammatical information
• Slovak-Czech and English-Slovak:
  • Sentence-aligned corpus annotated with lemma and morphological information
• Bulgarian-English parallel tree bank
Languages With Low Resources (cont’d)
Extending parallel corpora
Non-parallel corpora can be exploited to extend a parallel corpus.
1. Translating monolingual texts
2. Extracting parallel sentences from comparable corpora (e.g. press agency releases) using information retrieval methods
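The extraction of parallel sentences from comparable corpora can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the method used in the project: it assumes that a rough automatic translation of each source sentence is already available, uses plain token overlap (Jaccard similarity) as a stand-in for a real information retrieval engine, and the threshold value and all function names are illustrative.

```python
def tokens(sentence):
    """Lowercase and split on whitespace; a real system would use a proper tokenizer."""
    return set(sentence.lower().split())


def jaccard(a, b):
    """Token-set overlap between two sentences, in the range [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def extract_pairs(source_sentences, rough_translations, target_corpus, threshold=0.5):
    """Pair each source sentence with its best-matching target-language sentence.

    rough_translations[i] is an automatic translation of source_sentences[i] and
    serves as the query against the (non-empty) target-language corpus.
    """
    pairs = []
    for source, query in zip(source_sentences, rough_translations):
        best = max(target_corpus, key=lambda candidate: jaccard(query, candidate))
        if jaccard(query, best) >= threshold:
            pairs.append((source, best))
    return pairs


source = ["Le ministre a démissionné hier .", "Il pleut à Paris ."]
rough = ["the minister has resigned yesterday .", "it rains in Paris ."]
target_corpus = [
    "the minister resigned yesterday .",
    "stocks fell sharply on monday .",
    "the match ended in a draw .",
]
print(extract_pairs(source, rough, target_corpus))
```

Only the first source sentence finds a sufficiently similar counterpart here; the second is dropped because no candidate reaches the threshold, which is how the filter keeps non-parallel material out of the extended corpus.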
The Moses SMT Decoder
• Moses is an open source SMT system that allows you to automatically train translation models for any language pair. All you need is a parallel corpus. An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.
• Moses development has been funded in the Sixth and Seventh Framework Programmes for Research and Technological Development. It is currently supported within EuroMatrix Plus.
• Moses is licensed under the LGPL
• Detailed information is found at http://statmt.org/moses/
The Moses Success Story
• More than 18,000 downloads of release packages, probably many more by SVN checkout.
• 3,785+ revisions in 31 branches inside the SVN repository.
• 718+ scientific citations reported by Google Scholar.
• Moses mailing list has around 485 members and is "one of the most active MT-related list out there".
• MT Marathons organised by EuroMatrix Plus attracted lots of Moses projects/people interested in the software.
• Autodesk used Moses for a post-editing productivity test presented at the MT Marathon 2010 in Dublin.
• Installed, tested and used by EC DGT, by EuroScript GmbH, Germany, by Spanish language service provider Pangeanic.
• TAUS Data Association: "the translation industry is steadily appropriating the Moses translation engine."
User-Centered Support for SMT
• Incremental model updates
  • Redefine SMT as a dynamic, continuous learning process
  • New methods for updates of statistical translation models
  • Allows us to incorporate user feedback immediately
• Translation aid tools for interactive MT
  • Allow monolingual users to translate sentences written in foreign languages
  • The monolingual user is shown a visualization of possible translations for each phrase in an input sentence, and chooses among them to construct a translation.
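The idea behind incremental model updates can be shown with a toy phrase table. The slides do not describe the project's actual training method, so the following is only a sketch under the assumption that translation probabilities are relative frequencies over stored counts; the class and method names are hypothetical.

```python
from collections import defaultdict


class IncrementalPhraseTable:
    """Toy phrase table whose relative-frequency estimates are derived from counts."""

    def __init__(self):
        self.pair_counts = defaultdict(int)    # (source, target) -> count
        self.source_counts = defaultdict(int)  # source -> count

    def add(self, source, target, count=1):
        """Record further observations, e.g. a correction supplied by a user."""
        self.pair_counts[(source, target)] += count
        self.source_counts[source] += count

    def prob(self, source, target):
        """Relative-frequency estimate of p(target | source)."""
        total = self.source_counts.get(source, 0)
        if total == 0:
            return 0.0
        return self.pair_counts.get((source, target), 0) / total


table = IncrementalPhraseTable()
table.add("Haus", "house", 3)
table.add("Haus", "home", 1)
print(table.prob("Haus", "home"))  # 0.25

# Four user corrections arrive; only two counters change, nothing is retrained.
table.add("Haus", "home", 4)
print(table.prob("Haus", "home"))  # 0.625
```

Because the estimate is computed from counts on demand, new feedback takes effect with the next lookup, which mirrors the "immediately" in the bullet above.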
Integrated Localization Workflow
• Translation Memories (TMs) are still the base technology in industrial localization workflows. How can MT be integrated?
• Loose integration of TM and MT: decide which to prefer
  • TM/MT recommendation model based on estimated post-editing effort
  • TM/MT reranking model for outputs
  • Improving post-editing experience
• Tight integration: use bits of TM in MT
  • Constrain the MT system in such a way that matched input bits are translated as per TM and others as per MT system
  • Tree-alignment based system
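A loose TM/MT integration can be sketched as a simple decision rule. The project's recommendation model is based on estimated post-editing effort; the sketch below replaces that estimate with a word-level fuzzy-match score, which is an assumption made purely for illustration, as are the threshold of 0.8 and all function names.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (token_a != token_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def fuzzy_match(segment, tm_source):
    """Similarity in [0, 1]; 1.0 means the TM source equals the input segment."""
    a, b = segment.split(), tm_source.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def recommend(segment, translation_memory, mt_output, threshold=0.8):
    """Return ("TM", translation) or ("MT", translation) for one input segment."""
    best_source = max(translation_memory,
                      key=lambda source: fuzzy_match(segment, source),
                      default=None)
    if best_source is not None and fuzzy_match(segment, best_source) >= threshold:
        return "TM", translation_memory[best_source]
    return "MT", mt_output


# Toy translation memory, invented for this example.
tm = {"Click the Save button .": "Klicken Sie auf die Schaltfläche Speichern ."}
print(recommend("Click the Save button .", tm, "<MT output>"))
print(recommend("Open the file menu and choose Print .", tm, "<MT output>"))
```

The first segment is an exact TM match and is served from the memory; the second has no close match, so the MT output is preferred. A learned estimate of post-editing effort would take the place of `fuzzy_match` in a more faithful version.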
Conferences and Workshops
EuroMatrix Plus organizes various conferences and workshops on a regular basis, thus disseminating its results to both the scientific community and industrial companies. For instance,
• Translingual Europe: 2010 in Berlin, Germany (in connection with Localization World)
• Joint CNGL-EuroMatrix Plus Workshop for Users (in connection with AMTA 2010)
• WMT workshop:
  • 2010 in Uppsala, Sweden, in connection with ACL2010
  • 2011 in Edinburgh, UK, in connection with EMNLP2011
• MT Marathons with papers, discussions, tutorials and hands-on experience:
  • Dublin, Ireland, January 2010
  • Le Mans, France, September 2010
  • Trento, Italy, September 2011
Scientific Publications
The work in the project has so far led to more than 80 scientific publications. Some examples:
• Parallel Sentence Generation from Comparable Corpora for improved SMT (2011) by Sadaf Abdul-Rauf, Holger Schwenk, Machine Translation Journal
• Convergence of Translation Memory and Statistical Machine Translation (2010) by Philipp Koehn, Jean Senellart, AMTA Workshop on MT Research and the Translation Industry
• Hierarchical Hybrid Translation between English and German (2010) by Yu Chen, Andreas Eisele, Proceedings of the 14th Annual Conference of the European Association for Machine Translation
A complete list of publications by the project is found at
http://www.euromatrixplus.eu/publications/
Fulfilling the Goals
1. Advancing MT Technology
Translation and language models can now incorporate new data, such as user edits, instantaneously without having to retrain on the entire corpus. A large effort went into making use of (shallow) syntactic information to increase the translation quality.
2. Investigating different approaches & technologies
Different approaches to hybrid MT are being investigated. A rule-based commercial system has been successfully extended with stochastic modules. A hybrid HPSG-based translation approach is currently being implemented.
3. Bringing MT to the users
Work on integrating MT into translation and localization workflows has been carried out to find the most useful set-up for professional users. Lay users are targeted by the WikiTrans work package.
Fulfilling the Goals (cont’d)
4. Contributing to the European MT research scene
Numerous workshops and conferences organized by the consortium increase the visibility of European MT research efforts and foster the exchange between researchers and industry.
5. Create an openly accessible MT sample application
Based on the successful open source system Moses, a broker server platform has been developed (“MT Serverland”). Work on creating an interface for WikiTrans is currently being carried out.