Page 1: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.

Minimum Error Rate Training in Statistical Machine Translation

By: Franz Och, 2003

Presented By: Anna Tinnemore, 2006

Page 2

GOAL

To directly optimize translation quality.

Why? The standard training criterion does not correlate directly with the popular evaluation criteria:

- F-Measure (parsing)
- Mean Average Precision (ranked retrieval)
- BLEU / multi-reference word error rate (statistical machine translation)

Page 3

Problem: errors are classified differently by the statistical training criterion and by the automatic evaluation methods.

Solution (maybe): optimize the model parameters directly for each individual evaluation method.

Page 4

Background

The standard decision rule (pick the highest-probability translation) is optimal only under a “zero-one loss function”.

A different error metric would imply a different optimal decision rule.

Page 5

Background, continued

Problems: finding suitable feature functions (M of them) and parameter values (λ)

MMI (maximum mutual information) training:
- One unique global optimum
- Algorithms are guaranteed to find it
- But does it give optimal translation quality?

Page 6

So what?

- Review of automatic evaluation criteria
- Two training criteria that might help
- A new training algorithm for optimizing an unsmoothed error count
- Och’s approach
- Evaluation of the training criteria

Page 7

Translation quality metrics

- mWER (multi-reference word error rate): edit distance to the closest reference translation
- mPER (multi-reference position-independent error rate): bag-of-words edit distance
- BLEU: geometric mean of n-gram precisions (with a brevity penalty)
- NIST: weighted n-gram precisions
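As a concrete illustration of the first metric, here is a minimal sketch of mWER: word-level edit distance to the closest reference. Normalizing by the closest reference's length is one common choice, not necessarily the paper's exact definition.

```python
def edit_distance(hyp, ref):
    # Word-level Levenshtein distance via dynamic programming.
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def mwer(hyp, references):
    # mWER: edit distance to the *closest* reference, normalized here
    # by that reference's length (one common normalization choice).
    dists = [(edit_distance(hyp, r), len(r)) for r in references]
    d, n = min(dists)
    return d / n
```

A perfect match against any one reference gives an mWER of zero, regardless of how different the other references are.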

Page 8

Training

Minimize the error rate directly.

Problems: the argmax operation in Eq. (6) means there is no known global optimum, and there are many local optima.

Page 9

Smoothed Error Count

This is easier to deal with than the unsmoothed count, but still tricky.

In practice, performance does not change much with smoothing.
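The smoothing idea can be sketched as a softmax-weighted error over the n-best candidates; the sharpness parameter `alpha` here is an assumption standing in for whatever smoothing constant the paper uses.

```python
import math

def smoothed_error(errors, scores, alpha=10.0):
    # Softmax-weighted error over the candidate list: as alpha grows,
    # the weights concentrate on the top-scoring candidate, so this
    # approaches the unsmoothed error count while staying differentiable.
    m = max(alpha * s for s in scores)
    weights = [math.exp(alpha * s - m) for s in scores]  # stable softmax
    z = sum(weights)
    return sum(e * w for e, w in zip(errors, weights)) / z
```

With equal scores the result is just the average error; as one candidate's score pulls ahead, the result converges to that candidate's error alone.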

Page 10

Page 11

Unsmoothed Error Count

Standard approach: Powell’s algorithm with grid-based line optimization
- Fine-grained grid: slow
- Coarse grid: may miss the optimal solution

New approach: exploit the structure of the log-linear model
- Guaranteed to find the optimum along each search direction
- Much faster and more stable

Page 12

New Algorithm

Each candidate translation in C corresponds to a line in γ: f(γ) = t + m·γ (t and m are constants for that candidate).

The maximum score over all candidates is therefore a piecewise linear function of γ.

Page 13

Algorithm: the nitty-gritty

For every sentence f:
- Compute the ordered sequence of linear intervals that make up the upper envelope f(γ; f)
- Compute the change in error count between adjacent intervals

Then merge all the sequences γf and ΔEf, and traverse the merged sequence of boundaries while keeping track of the error count to find the optimal γ.
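The interval computation is essentially an upper-envelope sweep over lines. The following sketch implements that idea under simplified assumptions (one score line and one precomputed error count per candidate); it is an illustration of the technique, not the paper's exact implementation.

```python
def upper_envelope(lines):
    # lines: (slope m, intercept t, error e) per candidate, where the
    # candidate's score along the search direction is t + m * gamma.
    # Returns (gamma_start, error) segments of the upper envelope,
    # left to right; the first segment starts at -infinity.
    lines = sorted(lines, key=lambda l: (l[0], l[1]))
    hull = []  # entries: (m, t, e, start_gamma)
    for m, t, e in lines:
        start = float("-inf")
        keep = True
        while hull:
            m0, t0, e0, s0 = hull[-1]
            if m == m0:
                if t <= t0:          # parallel and never above: drop new line
                    keep = False
                    break
                hull.pop()           # parallel but higher: replaces old line
                continue
            x = (t0 - t) / (m - m0)  # gamma where the new line overtakes
            if x <= s0:
                hull.pop()           # hull top never wins: remove it
            else:
                start = x
                break
        if keep:
            hull.append((m, t, e, start))
    return [(s, e) for m, t, e, s in hull]

def best_gamma(per_sentence_candidates):
    # Merge all interval boundaries and sweep left to right,
    # tracking the total error count across sentences.
    events = []   # (boundary gamma, change in error when crossing it)
    total = 0.0   # total error of the leftmost interval (gamma -> -inf)
    for cands in per_sentence_candidates:
        segs = upper_envelope(cands)
        total += segs[0][1]
        for (_, e_prev), (g, e) in zip(segs, segs[1:]):
            events.append((g, e - e_prev))
    events.sort()
    if not events:
        return 0.0, total
    best_err, best_g = total, events[0][0] - 1.0
    for i, (g, d) in enumerate(events):
        total += d
        if total < best_err:
            nxt = events[i + 1][0] if i + 1 < len(events) else g + 2.0
            best_err, best_g = total, (g + nxt) / 2.0
    return best_g, best_err
```

Because the error count only changes at the interval boundaries, one pass over the merged boundaries finds the exactly optimal γ, with no grid resolution to tune.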

Page 14

Baseline

Same as the alignment template approach; the log-linear model has M = 8 feature functions.

Extract n-best candidate translations from all possible translations.

Wait a minute . . .

Page 15

N-best???

Overfitting? Unseen data?

First, compute an n-best list using “made-up” parameter values, and use this list to train the model for new parameters.

Second, using the new parameters, run a new search, make a new n-best list, and append it to the old n-best list.

Third, use the merged list to train the model for even better parameters.

Page 16

Keep going until the n-best list no longer changes, i.e. all translations the model can produce are already in the list.

Each iteration generates approximately 200 additional translations.

The algorithm takes only 5–7 iterations to converge.
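The iteration just described can be sketched as an outer loop. Here `decode_nbest` and `optimize` are hypothetical stand-ins for the decoder and the error-rate optimizer; only the loop structure follows the procedure above.

```python
def iterative_nbest_training(sources, params, decode_nbest, optimize, max_iter=10):
    # Accumulate each sentence's n-best hypotheses across iterations;
    # stop once a decoding pass adds nothing new to any pool.
    pool = {src: set() for src in sources}
    for _ in range(max_iter):
        grew = False
        for src in sources:
            for hyp in decode_nbest(src, params):
                if hyp not in pool[src]:
                    pool[src].add(hyp)
                    grew = True
        if not grew:              # n-best pool stopped changing: converged
            break
        params = optimize(pool)   # retrain on the merged (appended) lists
    return params
```

Appending rather than replacing the lists is what prevents the optimizer from "forgetting" hypotheses that earlier parameter values preferred.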

Page 17

Additional Sneaky Stuff

Problem with MMI (maximum mutual information) training: the reference sentences have to be part of the n-best list.

Solution: fake reference sentences, of course. Select from the n-best list the sentences with the fewest word errors with respect to the REAL references, and call these “pseudo-references”.
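Selecting pseudo-references can be sketched as picking, from the n-best list, the candidates with the fewest word errors against any real reference; the helper names here are illustrative, not from the paper.

```python
def word_errors(hyp, ref):
    # Word-level Levenshtein distance (one-row dynamic programming).
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (h != r)))  # substitution
        prev = cur
    return prev[-1]

def pseudo_references(nbest, real_refs, k=1):
    # Keep the k candidates closest (in word errors) to any real reference.
    scored = sorted(nbest, key=lambda h: min(word_errors(h.split(), r.split())
                                             for r in real_refs))
    return scored[:k]
```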

Page 18

Experiment

The 2002 TIDES Chinese-English small data track task: news text translated from Chinese to English.

Note: no rule-based components were used to translate numbers, dates, or names.

Page 19

Development Corpus Results

Page 20

Test Corpus Results

Page 21

Conclusions

- Alternative training criteria that directly relate to translation quality
- Unsmoothed and smoothed error counts on the development corpus
- Optimizing the error rate in training yields better results on unseen test data
- Maybe “true” translation quality is also increased; we don’t know, because the evaluation metrics themselves need improvement

Page 22

Future Questions

How many parameters can be reliably estimated using different criteria on development corpora of various sizes?

Does the criterion used make a difference? Which error criterion (smoothed or unsmoothed) should be optimized in training?

Page 23

Boasting

This approach applies to any evaluation metric.

If the evaluation methods ever get better, this algorithm will yield correspondingly better results.

Page 24

Side-stepping

It’s possible that this algorithm could be used to “overfit” the evaluation method, giving falsely inflated scores.

The response: it’s not our problem; the developers of the evaluation methods should design them so this can’t happen.

Page 25

. . . And Around The World

This algorithm has a place wherever evaluation methods are used

It could yield improvements in these other areas as well

Page 26

Questions, observations, accolades . . .

Page 27

My Observations

- The improvements do not seem significant
- This work exposes a problem in the evaluation metrics, but does nothing to solve it
- It seems like a good idea, but leaves many unanswered questions regarding optimal implementation

Page 28

THANK YOU

and Good Night!

