
Date post: 17-Dec-2015
The Right Data for MT Olga Beregovaya, PROMT Kerstin Bier, Sybase Melissa Biggs, Oracle Karen R. Combe, PTC Jessica Roland, EMC
Transcript

The Right Data for MT

Olga Beregovaya, PROMT
Kerstin Bier, Sybase

Melissa Biggs, Oracle
Karen R. Combe, PTC
Jessica Roland, EMC

Agenda
• Introduction
• Problem data for MT training
• SMT experience
• RBMT experience
• Pre-Editing
• Controlled Authoring – Lessons Learned
• Controlled Authoring – Signs of Success
• Recommendations

Problematic Data

Karen R. Combe, PTC

Issue: Excessive number of internal tags


Pour effectuer la plupart de ces tâches, vous pouvez utiliser {1}{2}Fichier (File){3}{4}Traitement des instances (Instance Operations){5}{6}Actualiser l'index (Update Index){7}{8} ou {9}{10}Fichier (File){11}{12}Traitement des instances (Instance Operations){13}{14}Options d'accélérateur (Accelerator Options){15}{16} afin d'ouvrir la boîte de dialogue {17}Accélérateur d'instances (Instance Accelerator){18}

You can use {1}{2}File{3}{4}Instance Operations{5}{6}Update Index{7}{8}{9}{10}File{11}{12}Instance Operations{13} {14}Accelerator Options{15}{16} (which opens the {17}Instance Accelerator{18} dialog box) to perform most instance operations.
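A corpus-cleaning step could drop training pairs like the one above, in which inline placeholders outweigh the actual text. This is a minimal sketch; the function names and the density threshold are illustrative assumptions, not PTC's actual process:

```python
import re

TAG_RE = re.compile(r"\{\d+\}")  # inline placeholders such as {1}, {2}

def tag_density(segment: str) -> float:
    """Fraction of tokens in a segment that are inline tag placeholders."""
    tags = TAG_RE.findall(segment)
    words = TAG_RE.sub(" ", segment).split()
    total = len(tags) + len(words)
    return len(tags) / total if total else 0.0

def keep_for_training(source: str, target: str, max_density: float = 0.5) -> bool:
    """Drop pairs where placeholders outweigh real text on either side."""
    return tag_density(source) <= max_density and tag_density(target) <= max_density
```

A pair with a handful of tags passes; a tag-dominated pair like the example above is filtered out.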

Issue: Irrelevant data


English: 0.31%
French: 0,31 %

English: &asm.mbr.name==part*
French: &asm.mbr.name==pièce*

English: (Windows NT/95/98/2000)D:\partlib\{1}\objects
French: (Windows NT/95/98/2000)D:\partlib\{1}\objects

Issue: homonyms


English: This figure shows that after midsurface compression, the resulting model develops a gap between the collet and the bracket.
French: Cette figure montre qu'après la compression en feuillet moyen, le modèle obtenu crée un jeu entre le collet et le gousset.

English: All data in brackets [] are optional.
French: Toutes les données entre crochets [] sont facultatives.

Bracket #1 (gousset): An overhanging member that projects from a structure (as a wall) and is usually designed to support a vertical load or to strengthen an angle.

Bracket #2 (crochet): The bracket character, such as [ or (.

Issue: Acronyms spelled out in the target


English: You cannot propagate SDTAEs and DTAEs in a DTAF.

French: Vous ne pouvez propager ni des éléments d'annotation d'étiquette de référence ni des éléments d'annotation de référence de positionnement à l'intérieur d'une FARP.

Issue: Mismatching number of sentences


English: You can have multiple entries for the same pipe size in the bend file, that is, a single pipe size can have multiple bend radius values associated with it, as shown in the following example of a bend file.

French: Vous pouvez avoir plusieurs entrées pour la même taille de tuyau dans le fichier de pliage. En d'autres termes, une même taille de tuyau peut être associée à plusieurs valeurs de rayon de pliage, comme dans le fichier de pliage d'exemple suivant.
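Pairs like this one, where one English sentence is split into two French sentences, can be flagged automatically by comparing rough sentence counts. A sketch with an intentionally simple punctuation heuristic (illustrative, not a production segmenter):

```python
import re

# terminal punctuation followed by whitespace or end of string
SENT_END = re.compile(r"[.!?](?:\s|$)")

def sentence_count(text: str) -> int:
    """Rough sentence count based on terminal punctuation."""
    return max(1, len(SENT_END.findall(text)))

def counts_match(source: str, target: str) -> bool:
    """Flag training pairs whose segmentation differs between the two sides."""
    return sentence_count(source) == sentence_count(target)
```

Flagged pairs can then be re-aligned or excluded from training.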

Issue: Inconsistent double quote usage


Ainsi, si vous créez une pièce portant le nom "bracket", elle est tout d'abord enregistrée dans le fichier {1}.

For example, if you create a part with the name bracket, it initially saves to the file name {1}.

Issue: Entity mismatch


English: One way is to create a "flexible model.
French: Une méthode consiste à créer un modèle souple.
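Unpaired quote characters like the one above are easy to catch mechanically. A minimal sketch (the check covers only straight double quotes; real TMs also need locale-specific quote pairs):

```python
def quotes_balanced(text: str) -> bool:
    """Straight double quotes should come in pairs within a segment."""
    return text.count('"') % 2 == 0

def entity_mismatch(source: str, target: str) -> bool:
    """Flag a pair when either side has an unpaired double quote."""
    return not (quotes_balanced(source) and quotes_balanced(target))
```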

Issue: Punctuation mismatch (parentheses vs. dash)


English: {1}Copy as Skeleton{2} (the option cannot be changed) to create a skeleton model.

French: Cliquez sur {1}Copier en tant que squelette (Copy as Skeleton){2} - option non modifiable - pour créer un modèle squelette.

Issue: Punctuation mismatch (dash vs. colon)


English: {1}Additional Rotation{2} — Enter a real-number value for the number of degrees to rotate the spring's Y axis.

French: {1}Rotation supplémentaire (Additional Rotation){2} : entrez un nombre réel pour indiquer le nombre de degrés de rotation de l'axe Y du ressort.

Issue: Capitalization mismatch


English: Piping Master Catalog Directory File
French: Fichier répertoire du catalogue principal de tuyauterie

Issue: English UI strings in the translation


English: Click View > Color and Appearance to create or modify colors.

French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.

The Right Data for MT
An SMT experience

Kerstin Bier, Sybase

Getting started with MT: The Sybase SMT experience(s)

Engine: Moses
Add-on: PangeaMT parser for inline markup in output

Initial language pair: EN -> DE

Data volume for training: 5 million words

A small data volume, but we do not have more; we use only our own data in order to keep better control

MT (and post-editing) in use for documentation localization for ca. 2 months now

Getting the data right: Automated cleaning and preparation

TMX data (bilingual XML with inline tags/markup)
• Cleanup: Tags – remove <ph> etc.
• Cleanup: Entities – XML entities like &copy;, &nbsp;, etc.
• Cleanup: Characters – invalid characters
• Conversion – produces two plain text files

Moses
• Cleanup: Segments – empty lines, wrong sentence ratio
• Tokenization – example: "By default," → "By default ,"
• Lower-casing – example: HOUSE → house

Result: two aligned text files, no tags, lower-cased → MT engine training
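The automated steps above might be sketched as follows. The tag pattern, the length-ratio bounds, and the function names are illustrative assumptions, not Sybase's actual scripts:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # inline XML tags such as <ph>...</ph>

def clean_segment(text: str) -> str:
    """Strip inline tags, decode XML entities, normalize whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)  # &copy; &nbsp; and similar entities
    return " ".join(text.split())

def prepare_pair(source: str, target: str, min_ratio=0.3, max_ratio=3.0):
    """Return a cleaned, lower-cased pair, or None if the pair should be dropped."""
    src, tgt = clean_segment(source), clean_segment(target)
    if not src or not tgt:
        return None  # empty lines
    ratio = len(src.split()) / len(tgt.split())
    if not (min_ratio <= ratio <= max_ratio):
        return None  # sentence length ratio wrong
    return src.lower(), tgt.lower()
```

Moses ships its own tokenizer and corpus-cleaning scripts; this sketch only shows the shape of the TMX-to-plain-text stage that precedes them.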

Got the right data?

Pilot project results: High BLEU score, good productivity test
"Good" data?
• Restricted domain
• Consistent style (authoring effort!)
• Consistent terminology (we thought)

Results of "real world" MT usage confirm the pilot results: productivity gains from 25% up to 300% compared to baseline

BUT: analysis revealed some issues in the training data with an effect on output

Three main data issues: A problem (not only) for MT

• Inline markup content: inline content translation (translate vs. notranslate), UI references
• DNTs (Do Not Translates): domain-specific terms, sample output, and more
• Source and target issues: complex sentences, inconsistencies, ambiguity

MT issue: Inline content

Problems in training data:
• XML tags are removed
• Loss of context information (e.g. DNT, protected/notranslate inline content)
• Removal of tags including their content = gaps in training sentences

Output results:
• Incorrectly translated inline content
• Output quality degraded

Possible solutions:
• Amend training data: restore content or use placeholders
• Pre-process input (add XML markup)
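The placeholder solution mentioned above could work roughly like this. The tag set and the placeholder format are assumptions for illustration:

```python
import re

# inline tags to protect (illustrative tag names)
INLINE = re.compile(r"<(/?)(ph|codeph|uicontrol)[^>]*>")

def to_placeholders(text: str):
    """Replace inline tags with numbered placeholders; return text plus tag list."""
    tags = []
    def repl(m):
        tags.append(m.group(0))
        return "{%d}" % len(tags)
    return INLINE.sub(repl, text), tags

def restore(text: str, tags):
    """Put the original tags back after translation."""
    for i, tag in enumerate(tags, start=1):
        text = text.replace("{%d}" % i, tag)
    return text
```

The engine then sees stable `{n}` tokens instead of markup, and the markup is restored in the output.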

MT issue: UI references

Problems in training data:
• XML tags removed – loss of UI information
• UI strings are "zones" that do not fit into the sentence structure

Output results:
• Many incorrect UI string translations
• Weird translations in some places

Possible solutions:
• More training data? More promising: handle UI references outside MT

MT issue: Do Not Translates

Problems in training data:
• Loss of DNT information
• Lower-casing (SELECT => select)
• Tokenization (sp_proc => sp _ proc)
• Many untranslated words in the corpus

Output results:
• DNTs translated
• English words in "translate" contexts

Possible solutions:
• Customize lower-casing (=> truecasing)
• Customize the tokenizer
• Pre-processing
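One common pre-processing tactic is to mask DNT identifiers such as `sp_proc` before tokenization so the tokenizer cannot split them, then restore them in the output. A sketch with invented token names:

```python
import re

# identifiers like sp_proc or asm.mbr.name that a generic tokenizer would split
DNT_RE = re.compile(r"\b[A-Za-z]+(?:[_.][A-Za-z]+)+\b")

def mask_dnts(text: str):
    """Replace do-not-translate identifiers with opaque tokens before tokenization."""
    dnts = []
    def repl(m):
        dnts.append(m.group(0))
        return "DNT%d" % len(dnts)
    return DNT_RE.sub(repl, text), dnts

def unmask(text: str, dnts):
    """Restore the original identifiers in the translated output."""
    for i, term in enumerate(dnts, start=1):
        text = text.replace("DNT%d" % i, term)
    return text
```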

MT issue: Source and target issues

Problems in training data:
• Long, complex sentences (source)
• Inconsistent wording/terms (both)
• Ambiguities, omissions (source)
• Translation too "creative", too "free"

Output results:
• Quality degradation, up to useless MT output

Possible solutions:
• Source: pre-editing, authoring control tool
• Target: translation control (authoring control for the target side)

Summary
• TMs (TMX) are usually a good basis for SMT training
• Automated cleaning takes out most of the "dirt"
• MT output improvements can be achieved by:
  – Improving the source: authoring control/pre-editing
  – Improving the target: translation control
  – Extensive terminology work (source and target)
  – Pre- and post-processing steps

For many special output requirements, it makes more sense to invest time in pre-processing and post-processing steps than in the training data

RBMT – Pre-processing terminology and metadata

Olga Beregovaya, PROMT

Preprocessing of Glossaries

Glossaries are one of the best ways to create a dictionary, but most of the glossaries provided by customers need to be preprocessed. Preprocessing includes extracting:

•segments with and without translation

•segments with correct and “incorrect” translation (for example, translation with comments in brackets)

•segments where the source is equal to the target (proper names)

•segments with special characters

•segments in upper case, lower case, and mixed case (comparing them and separating the common and unique strings)
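A triage step like the one described could be sketched as follows; the category names and rules are invented for illustration, not PROMT's actual preprocessing:

```python
def classify_glossary_entry(source: str, target: str) -> str:
    """Bucket a glossary row for separate preprocessing (hypothetical categories)."""
    if not target:
        return "no-translation"
    if source == target:
        return "source-equals-target"      # often proper names
    if "(" in target or "[" in target:
        return "commented-translation"     # e.g. translation with notes in brackets
    if any(not (c.isalnum() or c.isspace() or c in "-'") for c in source):
        return "special-characters"
    return "ok"
```

Each bucket can then be handled differently: proper names go to DNT lists, commented translations get the comments stripped, and so on.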


Standard TM verification/normalization process

During TM verification the following is addressed through automatic steps

• Irregular characters get flagged and replaced
• Incomplete sentences get flagged
• Punctuation suspects get flagged
• UI strings and other irregular sentences get added to phrase tables
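The flagging steps above might look roughly like this; the flag names and heuristics are assumptions for illustration:

```python
import re

IRREGULAR = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")  # control characters

def verify_unit(source: str, target: str):
    """Return a list of warning flags for one TM unit (names are illustrative)."""
    flags = []
    if IRREGULAR.search(source) or IRREGULAR.search(target):
        flags.append("irregular-characters")
    if source and source[-1].isalnum() and len(source.split()) > 3:
        flags.append("possibly-incomplete-sentence")  # no terminal punctuation
    if source.count("(") != source.count(")"):
        flags.append("punctuation-suspect")
    return flags
```

Units with no flags proceed straight to training; flagged units are repaired, routed to phrase tables, or dropped.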

Handling internal tags – not excessive but useful

• Original source segment in file: Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine whether system tray icons are supported on the current system.

• Converted to GMS segment format (after GMS-native segmentation): Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether system tray icons are supported on the current system.

• Pre-processed string in XLIFF segment format sent to PROMT: Check <ph i=1 x="&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot;&gt;">{1}</ph>NativeApplication.supportsSystemTrayIcon<ph i=2 x="&lt;/codeph&gt;">{2}</ph> to determine whether system tray icons are supported on the current system.

• Format of the translated XLIFF segment returned by PROMT to GMS: Проверить <ph i=1 x="&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot;&gt;">{1}</ph>NativeApplication.supportsSystemTrayIcon<ph i=2 x="&lt;/codeph&gt;">{2}</ph> для определения системном трее иконки поддерживает нынешнюю систему.

GMS integration with XLIFF connector – Why is metadata so important?

Better inline parsing on the GMS side
→ Better inline parsing by the MT engine
→ Better, more natural machine translation output
→ Reduced effort for post-editors
→ More productivity and reduced costs


Handling irrelevant data

• Scenario 1: We can leave the irrelevant data untouched and let it propagate from the TM or be handled through special formatting rules

• Scenario 2: We normalize it and add it to the phrase table

Our system will perform well in either scenario, and the course of action needs to be the client's call.


Handling homonyms

• PROMT system is specially tailored to handle one-to-many translations and homonymy

• The PROMT approach is to create context-based dictionary entries, whether single words or MWEs (multi-word expressions), which allows the system to properly identify the correct translation for ambiguous entries

• PROMT also uses XML metadata when assigning a semantic class to an entry


Handling expanding acronyms

• The PROMT system handles expansion of acronyms, or different acronyms between languages, by creating explicit mappings

• This is a rather standard task in the process of PROMT engine customization, along with DoNotTranslate and variable lists

• Should an abbreviation or the expanded version change, this can be fixed through the client interface in a matter of seconds


Handling locale-specific punctuation

• Quotation mark usage for a specific small group of terms can be defined on a dictionary level

• If the use of quotation marks or other punctuation is universal for a specific locale, it will be defined at the linguistic rules level


Handling Entity and Capitalization mismatch

• The differences in locale setting for Entities and Capitalization rules are already pre-built in the baseline engine and are regulated through regional settings in the product interface

• All additional differences between locales are learnt from the TM during the engine customization phase and then are added to the client profile template


Suggestion for UI string handling

• All the UI strings will be automatically added to DoNotTranslate lists when appearing in the appropriate context

• The context can be detected semantically, or through formatting and punctuation


PROMT handling of internal markup

• This step is not necessary for the PROMT translation process

• Scenario 1: the markup is handled by PROMT's extensive TMX Level 2 TM metadata support

• Scenario 2: if we need to create phrase table entries from these strings, we normalize them, but the markup is still preserved in the translation process


PROMT handling of empty fields

• Scenario 1: "Red flag": during TM verification an automatic script will render a warning message, and the empty unit will not be propagated

• Scenario 2: We can also send the empty segment to the customized engine and obtain a translation, which will be propagated into the TM for further verification

Pre-Editing

Olga Beregovaya, PROMT

Pre-editing

Definitions:

• Pre-editing - preprocessing the source language before it is sent to automated translation. Typical modifications of the source language include reducing complexity and ambiguity to achieve a more fluent automated translation.

• Normalization (in this context) – pre-processing of marked-up data to train MT systems

Examples of incorrect translation caused by poor source

• English > Spanish
  Incorrect: Are you going to school → Son usted yendo a la escuela
  Correct: Are you going to school? → ¿Va usted a la escuela?

• German > English
  Incorrect: wie funktioniert das übersetzen mit dem "clipboard"? → how does this function translate with "clipboard"?
  Correct: Wie funktioniert das Übersetzen mit dem "clipboard"? → How does the translation with "clipboard" function?

• Russian > English
  Incorrect: Я часто использую это ПО → I frequently use it ON
  Correct: Я часто использую это программное обеспечение → I frequently use this software

PROMT-specific pre-editing tips

For best translation quality, the following are to be avoided in the source:

• Adjacent identical clauses (with standard and non-standard passive, i.e. "he was asked and helped"); similar participles are not always analyzed as such, so it is always good practice to repeat the auxiliary verb
• "When asked" and similar clauses; a full sentence is always better
• Ellipses and all other types of incomplete sentences, including sentences like "I have a suspicion he can be late today"; it is good practice to always add "that"
• Missing articles and determiners when homonymy needs to be parsed
• Postposition participles, such as "the problems discussed"
• Incorrect punctuation, including incorrectly used hyphens (a hyphen used instead of an em dash); an expression with a hyphen will be parsed as a single word

Other possible sources of PROMT errors:

The following errors need to be corrected in the customer profile; the files then need to be re-translated:

• Morphological errors: incorrect morphology in the target may be caused by incorrect morphological attributes in your dictionary; check the attributes using the PROMT Dictionary Editor

• Proper names, brand names, and the like are translated: add them to the DoNotTranslate list

• Incorrect syntax in the target may be caused by incorrect markup parsing rules: check your filter and rule settings

Controlled Authoring

Melissa Biggs, Oracle

“Technical” Challenges for Authoring Tools Adoption

• Diverse authoring tools and styles
• Multiple and wide range of authors/groups in an enterprise
• Lack of process, measurement methodology, and corporate accountability in authoring communities
• Tracking metrics/measures
• Standalone use (lack of architecture to produce an automated process for the full lifecycle -> editing, publishing, translation)

“Cultural” Pitfalls for Authoring Tools Adoption

• Multiple and wide range of authors/groups in an enterprise
• Resistance by authors to a "control" tool
• Lack of interest by content creators, as the full benefits may not be visible to the creator
• Challenge in defining a clear ROI definition
• Standalone tool use (lack of architecture to produce an automated process for the full lifecycle -> editing, publishing, translation)

Case Study (pre-MT)

• Globalization group purchases SW license for authoring tool
• G11n group drives adoption in pubs groups; provides training, support, assistance with rules
• Implemented and mandated for use by 1 publications group using SGML authoring
• Demos, but no traction or acceptance, in 4 additional publications groups + marketing
• No metrics or tracking implemented by pubs group
• Decreased acceptance of use of tool over time

Case Study: The Tool

Supports application of a common style through a rule set, which results in:

• Clean and structured source
• Consistent terminology across the document (less confusion and higher user satisfaction)
• Optimized maintenance of information
• Improved search and retrieval of information

Applying rules via tool helped to create a clean and structured translation source – important for implementing machine translation

Case Study: The Tool

• English documentation processed using the tool = easier and faster translation (for x target languages, 1 ambiguity in the source generates x queries during the translation cycle)
• Reduced translation cycle, faster time-to-market
• Fewer ambiguities in the source => more accurate and consistent translations => higher customer satisfaction

Case Study - Results

• Increased content reuse for both English and translated content
• Limits in ability to scale (increase) content and increase quality
• Editor time not reduced, but less focus on minor, repetitive errors
• Decreasing acceptance of use of tool over time
  – Value proposition not compelling in Pubs
  – Cross-product savings/benefits not visible
  – Publications measurements/metrics not tracked consistently
• Globalization team viewed as an enforcer

Learnings: It's the Culture, not the Tool

• Define the total value proposition + process chain for the tool
  – Include terminology, localization/translation
• Define and administer a Content LifeCycle methodology
  – Include a critical phase for "pre-editing"
• Find the right central ownership for the authoring tool; not a standalone technology/process
• Simultaneous adoption may scale more effectively than group-by-group adoption
• Engage with globalization early
• TRACK & MEASURE
  – Accountability to management: products
  – Continuous scorecard reporting

Controlled Authoring

Jessica Roland, EMC Information Intelligence Group

Controlled Source - Pro

• Acquired Controlled Authoring tool in 2008
• Compared two market leaders
• Influenced by IT peer company references
• 86% of writers have access
• Current focus: spelling, grammar, style

Controlled Source - Pro

• Positive feedback from writers:
  • "I did run the tool, and to my shock and amazement, it found lots of stuff"
  • "And I thought I didn't use passive voice much!"
  • "I'm finding it very helpful!...it has flagged passive constructions that I was too lazy or time-crunched to fix before, as well as a number of other "gotchas" that simply take a little more time to reconsider."

Controlled Source - Pro

• Before and after reports: results and scores
• Measurable improvement in grammar and style
• Need intelligent reuse module for word count reduction
• Lesson learned: get the IR module right away

Controlled Source - Pro

• Careful with changes to legacy content during L10N…$$$
• Process with writers:
  • Check legacy content after last drop or post-release
  • Check discrete new feature content and improve iteratively – it's relatively small
  • Run before/after metrics on the whole book, after the last drop to L10N

Controlled Source - Pro

• MT is only used with documentation
• Observing MT savings increase since tool deployment, even without IR
• MT likes cleaner text
• Greater savings by hour than by word

Controlled Source – Pro: Summary

• Acquired Controlled Authoring tool in 2008
• Positive feedback from writers
• Need intelligent reuse module for word count reduction
• Careful with changes to legacy TM
• Observing MT savings increase since tool deployment

Metrics and recommendations

What is good data for MT?

General content pre-editing tips

Good source = Good machine translation
Flawed source = $%#@!

Recommendations:
• Check your spelling, including upper/lower case
• Check for proper punctuation
• Use diacritics correctly
• Use simple syntactic constructions
• Do not omit syntactic words
• Use conventional abbreviations
• Avoid slang

Data normalization tips

• Identify your data's problematic issues and ways of addressing them, either through pre-editing tools or through your MT engine's pre-processing capabilities

• Decide what needs to be addressed through automated processing and what can be left to post-editors to correct

• Sometimes your preferred formatting and markup can be in conflict with the MT engine's logic – quotes, brackets, and capitalization are not MT's best friends. Be prepared to choose your battles

Result – successful MT deployment

• Automated metric scores, e.g. BLEU/METEOR, will double with an engine trained on clean data and/or good terminology

• Post-editors are able to concentrate on polishing the language rather than dealing with omissions, incorrect terminology, mystery tags

• Time to market can be reduced by 25 to 40 percent
• Translation costs can be reduced by approximately 25 to 40 percent

