PangeaMT Manuel Herranz
PangeaMT: A Solution Built by the Language Industry for Language as a Business
#manuelhrrnz #pangeanic E: [email protected] pangeanic
Why Machine Translation?
The Data Deluge
✔ As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes
✔ Estimates
§ Up 50% a year (Oracle)
§ Doubles every 11 hours (IBM)
As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy http://ow.ly/jVuhZ
§ In 2011, it took the world about two days to create 5 exabytes of data, the amount that had previously taken humankind eons to generate.
§ In 2013, it took the world just 10 minutes to create 5 exabytes.
§ Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986
ComputerWorld, 2011
§ Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
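The unit arithmetic behind the figures above can be checked with a short sketch. It assumes decimal (SI) units, that is, 1 gigabyte = 10^9 bytes and 1 exabyte = 10^18 bytes; the function and variable names are illustrative only and do not come from the slides.

```python
# Assumption: decimal (SI) units, 1 GB = 10**9 bytes and 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def gigabytes_to_bytes(gigabytes: int) -> int:
    """Convert a gigabyte count to bytes."""
    return gigabytes * BYTES_PER_GB


def bytes_to_exabytes(n_bytes: int) -> float:
    """Convert a byte count to exabytes."""
    return n_bytes / BYTES_PER_EB


# Figures quoted on the slide (hypothetical variable names).
data_2009_gb = 487 * 10**9          # 487 billion gigabytes, May 2009
stored_since_1986_gb = 295 * 10**9  # 295 billion gigabytes since 1986

total_bytes = gigabytes_to_bytes(data_2009_gb)
print(f"{total_bytes:.2e} bytes")                        # 4.87e+20 bytes
print(f"{bytes_to_exabytes(total_bytes):.0f} exabytes")  # 487 exabytes

stored_bytes = gigabytes_to_bytes(stored_since_1986_gb)
print(f"{bytes_to_exabytes(stored_bytes):.0f} exabytes")  # 295 exabytes
```

Under these assumptions, a billion gigabytes is exactly one exabyte, which is why the slide can use "295 billion gigabytes" and "295 exabytes" interchangeably.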
Where is data stored?
MT Usage
Machine Translation application, NEW usage and success depend on
✔ MT for assimilation: “gisting” or “understanding”
• Practically unlimited demand; but free web-based services reduce incentive to improve technology
• Coverage + important. Instant quality
✔ MT for dissemination: “publication”
• Publishable quality that can only be achieved by humans. MT & tools a productivity booster
✔ MT for direct communication
• Current R&D, Military uses systems for spoken MT, first applications for smartphones, online help, multilingual chat systems
[Diagram labels, repeated for each use: Sports, Politics, Social, etc.; Output format]
PangeaMT System – Domain Creation
PangeaMT System – Data Cleaning
PangeaMT System – Engine Creation
PangeaMT System – Engine Training
A Success Story
Sony Professional Europe, Salomé Lopez-Lavado
Needs
- Improve publication: French, Italian, Spanish
- 8M words training set
- Time-to-market: from 3 days down to 1.5 days
- html, InDesign
- Outsourcing cost: -20%
- Volume: 1.5M words/year
Japanese Automotive manufacturer
- Spanish
- 8M words/year
- Time to market reduced by 2-3 weeks, from 8 to 6 or 5 weeks
- Team of 17 freelancers down to 4-7 post-editors
- Outsourcing cost: -30%
Spanish LSP working for banking sector
- Spanish
- 1-2M words/year
- Time to market: 1 week to 2 days!!!!
- Docx, html, tmx
- Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!!!
Successfully applied (third-party applications / beneficiaries)
Use Case -
✔ Even with small data sets!!
• PangeaMT can be self-hosted when data security is critical (all processes internal to the organization)
- commercially sensitive data
- financial, legal, institutional
- intelligence, knowledge-gathering
- product pre-release, etc.
• Control Panel + full system statistics
• Re-trainings and updates by the client for data privacy / more accuracy
Potential Uses of Machine Translation
• Information discovery: patents, unknown documents
• Automatic, on-demand creation of foreign language versions / web apps – keyword testing
• Multilingual crawling, data discovery
• Pre-translation
Myth: MT will never be as good as humans
“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein
uhmmm, it is going to get really good...
1st stage: We are creating usable engines, first PE experiences. 2009-2015 or 2020
2nd stage: PE material and more data make engines even more predictable. More specialist engines
3rd stage: Beyond 2030... no predictions