Exploding the Myth: the gerund in machine translation
Nora Aranberri
Optional Footer Information Here
Background
• Nora Aranberri
  – PhD student at CTTS (Dublin City University)
  – Funded by Enterprise Ireland and Symantec (Innovation Partnerships Programme)
• Symantec
  – Software publisher
  – Localisation requirements
    • Translation: rule-based machine translation system (Systran)
    • Documentation authoring: controlled language (CL checker: acrocheck™)
  – Project: CL checker rule refinement
The Myth
• Sources: translators, post-editors, scholars
  – Considered a translation issue for MT due to its ambiguity
    • Bernth & McCord, 2000; Bernth & Gdaniec, 2001
  – Addressed by CLs
    • Adriaens & Schreurs, 1992; Wells Akis, 2003; O’Brien, 2003; Roturier, 2004
The gerund is handled badly by MT systems and should be avoided
What is a gerund?
• An -ing word can be a gerund, a participle, or part of a continuous tense; the form is identical in all three
• Examples
  – GERUND: Steps for auditing SQL Server instances.
  – PARTICIPLE: When the job completes, BACKINT saves a copy of the Backup Exec restore logs for auditing purposes.
  – CONTINUOUS TENSE: Server is auditing and logging.
• Conclusion: gerunds and participles can be difficult for MT to differentiate.
Methodology: creating the corpus
• Initial corpus
  – Risk management components texts
  – 494,618 words
  – uncontrolled
• Structure of study
  – Preposition or subordinate conjunction + -ing
• Extraction of relevant segments
  – acrocheck™: CL checker asked to flag the patterns of the structure
    • IN + VBG|NN|JJ “-ing”
  – 1,857 sentences isolated
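The flagging pattern above can be sketched in a few lines. This is only an illustration, not how acrocheck works internally: it assumes the sentences are already tokenised and tagged with Penn Treebank-style tags (as the IN / VBG / NN / JJ labels suggest), and the function name `has_in_plus_ing` and the two sample sentences are hypothetical.

```python
from typing import List, Tuple

# A tagged sentence is a list of (token, part-of-speech tag) pairs.
# Assumption: Penn Treebank-style tags, assigned by some upstream tagger.
Tagged = List[Tuple[str, str]]

ING_TAGS = {"VBG", "NN", "JJ"}  # tags allowed on the -ing word in the pattern


def has_in_plus_ing(sentence: Tagged) -> bool:
    """Return True if an IN-tagged word is directly followed by an
    -ing word tagged VBG, NN or JJ."""
    for (_, tag), (next_tok, next_tag) in zip(sentence, sentence[1:]):
        if tag == "IN" and next_tag in ING_TAGS and next_tok.lower().endswith("ing"):
            return True
    return False


# Hand-tagged versions of two example sentences from the slides.
corpus = [
    [("Steps", "NNS"), ("for", "IN"), ("auditing", "VBG"),
     ("SQL", "NNP"), ("Server", "NNP"), ("instances", "NNS")],
    [("Server", "NNP"), ("is", "VBZ"), ("auditing", "VBG"),
     ("and", "CC"), ("logging", "VBG")],
]

flagged = [s for s in corpus if has_in_plus_ing(s)]
print(len(flagged), "of", len(corpus), "sentences flagged")
```

Only the first sentence is flagged; the continuous-tense sentence has no IN-tagged word before its -ing forms.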
Methodology: translation
• Apply machine translation for each target language
  – MT used: Systran Server 5.05
  – Dictionaries
    • No specific dictionaries created for the project
    • Systran in-built computer science dictionary applied
  – Languages
    • Source language: English
    • Target languages: Spanish, French, German and Japanese
Methodology: evaluation (1)
• Evaluators
– one evaluator per target language only
– native speakers of the target languages
– translators / MA students with experience in MT
• Evaluation format
Methodology: evaluation (2)
• Analysis of the relevant structure only
• Questions:
– Q1: is the structure correct?
– Q2: is the error due to the misinterpretation of the source or because the target is poorly generated?
• Both are “yes/no” questions.
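One way to record and tally the answers to the two questions is sketched below. The `Judgement` record and its field names are hypothetical, and the three sample judgements are made up for illustration. Source and target errors are stored as two separate flags, which is an assumption about how Q2 was recorded.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Judgement:
    structure: str               # e.g. "by + ing"
    correct: bool                # Q1: is the structure correct?
    source_error: bool = False   # Q2: source misinterpreted?
    target_error: bool = False   # Q2: target poorly generated?


def tally(judgements):
    """Count correct/incorrect structures and, for the incorrect ones,
    how many were attributed to the source and to the target."""
    counts = Counter()
    for j in judgements:
        counts["correct" if j.correct else "incorrect"] += 1
        if not j.correct:
            counts["source_error"] += int(j.source_error)
            counts["target_error"] += int(j.target_error)
    return counts


# Made-up sample judgements, for illustration only.
sample = [
    Judgement("by + ing", correct=True),
    Judgement("for + ing", correct=False, target_error=True),
    Judgement("when + ing", correct=False, source_error=True, target_error=True),
]
summary = tally(sample)
print(dict(summary))
```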
Results: prepositions / subordinate conjunctions
Structure          Examples
by + ing                377
for + ing               339
when + ing              256
before + ing            163
after + ing             122
about + ing              96
on + ing                 89
without + ing            75
of + ing                 71
from + ing               68
while + ing              54
in + ing                 36
if + ing                 19
rather than + ing        14
such as + ing            13
TOTAL                  1857
Results: correctness for Spanish
                                   Spanish
Structure          Examples   correct incorrect
by + ing                377       351        26
for + ing               339       243        96
when + ing              256       205        51
before + ing            163       145        18
after + ing             122       107        15
about + ing              96        82        14
on + ing                 89        38        51
without + ing            75        47        28
of + ing                 71        65         6
from + ing               68        30        38
while + ing              54         3        51
in + ing                 36        27         9
if + ing                 19        15         4
rather than + ing        14         0        14
such as + ing            13         9         4
TOTAL                  1857      1393       464
%                              75.01%    24.99%
Results: correctness for French
                                   Spanish             French
Structure          Examples   correct incorrect   correct incorrect
by + ing                377       351        26       358        19
for + ing               339       243        96       284        55
when + ing              256       205        51         2       254
before + ing            163       145        18       146        17
after + ing             122       107        15       117         5
about + ing              96        82        14        82        14
on + ing                 89        38        51        80         9
without + ing            75        47        28        65        10
of + ing                 71        65         6        65         6
from + ing               68        30        38        31        37
while + ing              54         3        51        45         9
in + ing                 36        27         9         9        27
if + ing                 19        15         4        10         9
rather than + ing        14         0        14         0        14
such as + ing            13         9         4         9         4
TOTAL                  1857      1393       464      1341       516
%                              75.01%    24.99%    72.21%    27.79%
Results: correctness for German
                                   Spanish             French              German
Structure          Examples   correct incorrect   correct incorrect   correct incorrect
by + ing                377       351        26       358        19       364        13
for + ing               339       243        96       284        55       262        77
when + ing              256       205        51         2       254       213        43
before + ing            163       145        18       146        17       145        18
after + ing             122       107        15       117         5       114         8
about + ing              96        82        14        82        14        88         8
on + ing                 89        38        51        80         9        58        31
without + ing            75        47        28        65        10        71         4
of + ing                 71        65         6        65         6        60        11
from + ing               68        30        38        31        37        24        44
while + ing              54         3        51        45         9        27        27
in + ing                 36        27         9         9        27        23        13
if + ing                 19        15         4        10         9        17         2
rather than + ing        14         0        14         0        14         0        14
such as + ing            13         9         4         9         4         9         4
TOTAL                  1857      1393       464      1341       516      1514       343
%                              75.01%    24.99%    72.21%    27.79%    81.53%    18.47%
Results: correctness for Japanese
                                   Spanish             French              German             Japanese
Structure          Examples   correct incorrect   correct incorrect   correct incorrect   correct incorrect
by + ing                377       351        26       358        19       364        13       301        76
for + ing               339       243        96       284        55       262        77       224       115
when + ing              256       205        51         2       254       213        43       161        95
before + ing            163       145        18       146        17       145        18       134        29
after + ing             122       107        15       117         5       114         8       108        14
about + ing              96        82        14        82        14        88         8        88         8
on + ing                 89        38        51        80         9        58        31        29        60
without + ing            75        47        28        65        10        71         4        66         9
of + ing                 71        65         6        65         6        60        11        57        14
from + ing               68        30        38        31        37        24        44        33        35
while + ing              54         3        51        45         9        27        27        44        10
in + ing                 36        27         9         9        27        23        13         9        27
if + ing                 19        15         4        10         9        17         2        17         2
rather than + ing        14         0        14         0        14         0        14         1        13
such as + ing            13         9         4         9         4         9         4         8         5
TOTAL                  1857      1393       464      1341       516      1514       343      1303       554
%                              75.01%    24.99%    72.21%    27.79%    81.53%    18.47%    70.17%    29.83%
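The percentage row follows directly from the TOTAL row. As a quick check, the sketch below recomputes the share of correct structures per language from the totals in the table; the variable and function names are illustrative only.

```python
TOTAL_EXAMPLES = 1857

# "correct" counts from the TOTAL row of the table above.
correct_totals = {
    "Spanish": 1393,
    "French": 1341,
    "German": 1514,
    "Japanese": 1303,
}


def rate(correct: int, total: int = TOTAL_EXAMPLES) -> float:
    """Percentage of correct structures, rounded to two decimals."""
    return round(100 * correct / total, 2)


rates = {lang: rate(n) for lang, n in correct_totals.items()}
for lang, pct in rates.items():
    print(f"{lang}: {pct:.2f}% correct, {100 - pct:.2f}% incorrect")
```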
Results: correlation of problematic structures
[Chart: one bar per language (Spanish, French, German, Japanese) for each of the structures for, when, from, on, while, by; vertical axis 0 to 80]
• The most problematic structures seem to strongly correlate across languages
• Top 6 prep/conj account for >65% of errors
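The second bullet can be checked against the correctness table. The sketch below assumes the "top 6" are the six structures shown in the chart (for, when, from, on, while, by) and uses the "incorrect" counts and TOTAL row from that table.

```python
# "incorrect" counts per language for the six structures in the chart.
incorrect = {
    "Spanish":  {"for": 96,  "when": 51,  "from": 38, "on": 51, "while": 51, "by": 26},
    "French":   {"for": 55,  "when": 254, "from": 37, "on": 9,  "while": 9,  "by": 19},
    "German":   {"for": 77,  "when": 43,  "from": 44, "on": 31, "while": 27, "by": 13},
    "Japanese": {"for": 115, "when": 95,  "from": 35, "on": 60, "while": 10, "by": 76},
}

# Total incorrect structures per language (TOTAL row).
total_incorrect = {"Spanish": 464, "French": 516, "German": 343, "Japanese": 554}

# Share of each language's errors that falls on the six structures.
shares = {
    lang: 100 * sum(rows.values()) / total_incorrect[lang]
    for lang, rows in incorrect.items()
}
for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}% of errors")
```

Under that assumption the share is above 65% in every language.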
Analysis and generation errors
Errors by origin: source = Source-error (analysis), target = Target-error (generation)
                                   Spanish             French              German             Japanese
Structure          Examples    source    target    source    target    source    target    source    target
by + ing                377         4        27        10        13         4         9        16        58
for + ing               339        37       120        37        55        33        47        30        82
when + ing              256        13        49         0       256        10        38         3        93
before + ing            163         4        27         4        17         4        14         8        22
after + ing             122         5        12         5         5         1         7         4        11
about + ing              96         7        51        10        13         5         3         4         1
on + ing                 89         3        51         0         9         1        30         2        57
without + ing            75         3        26         2         8         2         2         1         8
of + ing                 71         4         4         3         7         4         8         7        11
from + ing               68         5        36         1        37         1        43         8        33
while + ing              54         2        50         2         8         3        26         0        10
in + ing                 36         5         7         6        27         2        13        12        18
if + ing                 19         1         3         1         9         2         0         0         2
rather than + ing        14         0        14         0        14         0        14         0        13
such as + ing            13         3         8         1         4         2         2         3         2
TOTAL                  1857       106       523        83       514        85       267        98       459
%                               0.60%     0.63%     0.54%     0.74%     0.61%     0.72%     0.60%     0.72%
Source and target error distribution
• Target (generation) errors outnumber source (analysis) errors in every language
• The prepositions/conjunctions with the highest error rates that are common to 3 or 4 target languages account for 43-54% of source errors and 48-69% of target errors (48-59% excluding French)
                           Spanish             French              German             Japanese
Structure            source    target    source    target    source    target    source    target
for + ing                37       120        37        55        33        47        30        82
when + ing               13        49         0       256        10        38         3        93
from + ing                5        36         1        37         1        43         8        33
on + ing                  3        51         0         9         1        30         2        57
SUM                      58       256        38       357        45       158        43       265
Total                   106       523        83       514        85       267        98       459
%                    54.72%    48.95%    45.78%    69.45%    52.94%    59.18%    43.88%    57.73%
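The SUM and % rows can be reproduced from the four structure rows and the Total row. A small sketch follows; the names are illustrative. Note that with conventional rounding the French target share comes out as 69.46 rather than the 69.45 on the slide, which suggests the slide value was truncated.

```python
languages = ["Spanish", "French", "German", "Japanese"]
columns = [f"{lang} {kind}" for lang in languages for kind in ("source", "target")]

# Error counts per column for the four structures in the table above.
rows = {
    "for + ing":  [37, 120, 37, 55, 33, 47, 30, 82],
    "when + ing": [13, 49, 0, 256, 10, 38, 3, 93],
    "from + ing": [5, 36, 1, 37, 1, 43, 8, 33],
    "on + ing":   [3, 51, 0, 9, 1, 30, 2, 57],
}

# Total errors per column across all structures (Total row).
totals = [106, 523, 83, 514, 85, 267, 98, 459]

# Column-wise sums over the four structures, then their share of the totals.
sums = [sum(col) for col in zip(*rows.values())]
shares = [round(100 * s / t, 2) for s, t in zip(sums, totals)]

for col, s, t, pct in zip(columns, sums, totals, shares):
    print(f"{col:<16} {s:>4}/{t:<4} {pct:.2f}%")
```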
Conclusions
• Overall success rate between 70% and 82% across the four target languages
• Target-language generation errors are more frequent than errors due to misinterpretation of the source.
• Great diversity of prepositions/subordinate conjunctions, with widely varying frequencies.
• Strong correlation of results across languages.
Next steps
• Further evaluations to consolidate results
  – 4 evaluators per language
  – Present sentences to the evaluators out of alphabetical order by preposition/conjunction
    • Note the results for the French “when”.
• Make these findings available to the writing teams
• Take our prominent issues
  – Source issues: controlled language or pre-processing
    • Formulate more specific rules in acrocheck to handle the most problematic structures/prepositions and reduce false positives
    • Standardise structures with low frequencies
  – Target issues: post-processing or MT improvements
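Presenting sentences out of preposition order could be as simple as a seeded shuffle, so that every evaluator sees the same randomised order. The sketch below is one possible approach, not the procedure used in the study; `presentation_order` and the sample segment IDs are hypothetical.

```python
import random


def presentation_order(segments, seed=0):
    """Return a shuffled copy of the segments. A fixed seed keeps the
    order reproducible across evaluators; the input list is not modified."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical (structure, segment id) pairs, grouped by structure
# as they would come out of the extraction step.
segments = [
    ("by + ing", 1), ("by + ing", 2),
    ("for + ing", 3), ("for + ing", 4),
    ("when + ing", 5), ("when + ing", 6),
]

order = presentation_order(segments, seed=42)
for structure, seg_id in order:
    print(seg_id, structure)
```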
References
• Adriaens, G. and Schreurs, D. (1992) ‘From COGRAM to ALCOGRAM: Toward a Controlled English Grammar Checker’, 14th International Conference on Computational Linguistics, COLING-92, Nantes, France, 23-28 August, 1992, 595-601.
• Bernth, A. and Gdaniec, C. (2001) ‘MTranslatability’, Machine Translation, 16, 175-218.
• Bernth, A. and McCord, M. (2000) ‘The Effect of Source Analysis on Translation Confidence’, in White, J. S., ed., Envisioning Machine Translation in the Information Future: 4th Conference of the Association for Machine Translation in the Americas, AMTA 2000, Cuernavaca, Mexico, 10-14 October, 2000, Springer: Berlin, 89-99.
• O’Brien, S. (2003) ‘Controlling Controlled English: An Analysis of Several Controlled Language Rule Sets’, in Proceedings of the 4th Controlled Language Applications Workshop (CLAW 2003), Dublin, Ireland, 15-17 May, 2003, 105-114.
• Roturier, J. (2004) ‘Assessing a set of Controlled Language rules: Can they improve the performance of commercial Machine Translation systems?’, in ASLIB Conference Proceedings, Translating and the Computer 26, London, 18-19 November, 2004, 1-14.
• Wells Akis, J. and Sisson, R. (2003) ‘Authoring translation-ready documents: is software the answer?’, in Proceedings of the 21st annual international conference on Documentation, SIGDOC 2003, San Francisco, CA, USA, October 12-15, 2003, 38-44.
Thank you!
e-mail: nora.aranberrimonasterioATdcu.ie