Page 1: Hybridity in MT: Experiments on the Europarl Corpus

Hybridity in MT: Experiments on the Europarl Corpus

Declan Groves
24th May, NCLT Seminar Series 2006

Page 2: Hybridity in MT: Experiments on the Europarl Corpus

Outline

• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
  – Phrasal Extraction
• Experiments:
  – Data Sources Used
  – EBMT vs PBSMT
  – Hybrid System Experiments
    • Improving the EBMT lexicon
    • Making use of merged data sets
• Conclusions
• Future Work

Page 3: Hybridity in MT: Experiments on the Europarl Corpus

Example-Based MT

• As with SMT, makes use of information extracted from a sententially-aligned bilingual corpus. In general:
  – SMT only uses the estimated parameters and throws away the data
  – EBMT makes use of the linguistic units directly

• During translation:
  1. The source side of the bitext is searched for close matches
  2. Source-target sub-sentential links are determined
  3. Relevant target fragments are retrieved and recombined to derive the final translation

Page 4: Hybridity in MT: Experiments on the Europarl Corpus

Example-Based MT: An Example

• Assumes an aligned bilingual corpus of examples against which input text is matched

• Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

Page 5: Hybridity in MT: Experiments on the Europarl Corpus

Example-Based MT: An Example

• Assumes an aligned bilingual corpus of examples against which input text is matched

• Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

Given the corpus:

The shop is open on Monday → Le magasin est ouvert Lundi
John went to the swimming pool → Jean est allé à la piscine
The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie

Page 6: Hybridity in MT: Experiments on the Europarl Corpus

Example-Based MT: An Example

• Identify useful fragments

Given the corpus:

The shop is open on Monday → Le magasin est ouvert Lundi
John went to the swimming pool → Jean est allé à la piscine
The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie

Page 7: Hybridity in MT: Experiments on the Europarl Corpus

Example-Based MT: An Example

• Identify and isolate useful fragments
• Recombination depends on the nature of the examples used

Given the corpus:

The shop is open on Monday → Le magasin est ouvert Lundi
John went to the swimming pool → Jean est allé à la piscine
The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie

We can now translate:

on Monday → Lundi
John went to → Jean est allé à
the baker’s → la boulangerie
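Below is a minimal, hand-wired sketch (not the actual system) of how fragments like these might be matched against new input and recombined in source order; the fragment table is written by hand from the toy corpus above, whereas a real system derives such fragments automatically (e.g. via the marker-based chunking described later).

```python
# Toy sketch of EBMT fragment matching and recombination.
# The fragment table is hand-built from the example corpus above.

FRAGMENTS = {
    "john went to": "Jean est allé à",
    "the baker's": "la boulangerie",
    "on monday": "Lundi",
}

def translate(sentence):
    """Greedily cover the input with the longest known fragments,
    then recombine the target sides in source-sentence order."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):               # longest match first
            span = " ".join(words[i:j])
            if span in FRAGMENTS:
                output.append(FRAGMENTS[span])
                i = j
                break
        else:
            output.append(f"<unknown:{words[i]}>")       # no fragment covers this word
            i += 1
    return " ".join(output)

print(translate("John went to the baker's on Monday"))
# Jean est allé à la boulangerie Lundi
```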

Page 8: Hybridity in MT: Experiments on the Europarl Corpus

Marker-Based EBMT

“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” (Green, 1979)

• Universal psycholinguistic constraint: languages are marked for syntactic structure at the surface level by a closed set of lexemes or morphemes
• Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage

Determiners <DET>

Quantifiers <QUANT>

Prepositions <PREP>

Conjunctions <CONJ>

Wh-Adverbs <WRB>

Possessive Pronouns <POSS>

Personal Pronouns <PRON>

Page 9: Hybridity in MT: Experiments on the Europarl Corpus

Marker-Based EBMT

• Source-target sentence pairs are marked with their Marker categories:

EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
FR: <PRON> vous cliquez <PRON> sur appliquer <PREP> pour visualiser <DET> l’ effet <PREP> de <DET> la sélection

• Aligned source-target chunks are created by segmenting the sentences based on these marker tags, along with cognate and word co-occurrence information:

<PRON> you click apply : <PRON> vous cliquez sur appliquer
<PREP> to view : <PREP> pour visualiser
<DET> the effect : <DET> l’ effet
<PREP> of the selection : <PREP> de la sélection
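As an illustration of the segmentation step, here is a toy sketch that starts a new chunk at every closed-class marker word; MARKER_TAGS is a small hand-written subset of the real marker lexicon, and the cognate/co-occurrence information used to align source and target chunks is omitted.

```python
# Toy sketch of marker-based segmentation: a new chunk begins at each closed-class
# marker word, and every chunk must contain at least one non-marker word.
# MARKER_TAGS is a small hand-written subset of the real marker lexicon.

MARKER_TAGS = {
    "the": "DET", "a": "DET",
    "to": "PREP", "of": "PREP", "on": "PREP",
    "you": "PRON", "it": "PRON",
    "and": "CONJ",
}

def segment(sentence):
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in MARKER_TAGS:
            if has_content:                      # close the previous chunk
                chunks.append(" ".join(current))
                current, has_content = [], False
            current += [f"<{MARKER_TAGS[word]}>", word]
        else:
            current.append(word)
            has_content = True
    if current:
        chunks.append(" ".join(current))
    return chunks

print(segment("you click apply to view the effect of the selection"))
# ['<PRON> you click apply', '<PREP> to view', '<DET> the effect',
#  '<PREP> of <DET> the selection']
```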

Page 10: Hybridity in MT: Experiments on the Europarl Corpus

Marker-Based EBMT

• Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon:

<PREP> to : <PREP> pour
<LEX> view : <LEX> visualiser
<LEX> effect : <LEX> effet
<DET> the : <DET> l’
<PREP> of : <PREP> de

• In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags:

<PRON> click apply : <PRON> cliquez sur appliquer
<PREP> view : <PREP> visualiser
<DET> effect : <DET> effet
<PREP> the selection : <PREP> la sélection

• Any marker tag pair can now be inserted at the appropriate tag location.

• More general examples add flexibility to the matching process and improve coverage (and quality).
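A rough sketch of these two pre-processing steps over chunk pairs in the tagged format above. It assumes each marker word immediately follows its tag; the example data is hand-made, and the <LEX> tagging of lexicon entries is simplified here to a plain dictionary.

```python
# Rough sketch of word-level lexicon extraction and marker-template generalization
# over chunk pairs in the tagged format shown above.

import re

TAG = re.compile(r"<[A-Z]+>")

def content_words(chunk):
    """Words that are neither tags nor the marker words that follow a tag."""
    out, skip = [], False
    for w in chunk.split():
        if TAG.fullmatch(w):
            skip = True            # the next word is the marker word itself
        elif skip:
            skip = False           # drop the marker word
        else:
            out.append(w)
    return out

def extract_lexicon(chunk_pairs):
    """Chunk pairs with exactly one non-marker word on each side yield lexicon entries."""
    lexicon = {}
    for src, tgt in chunk_pairs:
        s, t = content_words(src), content_words(tgt)
        if len(s) == 1 and len(t) == 1:
            lexicon[s[0]] = t[0]
    return lexicon

def generalize(chunk):
    """Drop the chunk-initial marker word but keep its tag, producing a template
    into which any word of the same marker category can later be slotted."""
    words = chunk.split()
    if len(words) >= 2 and TAG.fullmatch(words[0]):
        return " ".join([words[0]] + words[2:])
    return chunk

pairs = [("<PREP> to view", "<PREP> pour visualiser"),
         ("<DET> the effect", "<DET> l' effet")]
print(extract_lexicon(pairs))                  # {'view': 'visualiser', 'effect': 'effet'}
print(generalize("<PRON> you click apply"))    # <PRON> click apply
```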

Page 11: Hybridity in MT: Experiments on the Europarl Corpus

Marker-Based EBMT

• During translation:
  – Resources are searched from maximal context (specific source-target sentence pairs) to minimal context (word-for-word translation)
  – Retrieved example translation candidates are recombined, along with their weights, based on source sentence order
  – The system outputs an n-best list of translations

Page 12: Hybridity in MT: Experiments on the Europarl Corpus

Phrase-Based SMT

• Translation models now estimate both word-to-word and phrasal translation probabilities (allowing, in addition, many-to-one and many-to-many word mappings)

• Phrases incorporate some idea of syntax
  – Able to capture more meaningful relationships between words within phrases

• In order to extract phrases, we can make use of word alignments

Page 13: Hybridity in MT: Experiments on the Europarl Corpus

SMT Phrasal Extraction

• Perform word alignment in both source-target and target-source directions

• Take the intersection of the uni-directional alignments
  – Produces a set of high-confidence word alignments

• Extend the intersection iteratively towards the union by adding adjacent alignments within the alignment space (Och & Ney, 2003; Koehn et al., 2003)

• Extract all possible phrases from the sentence pairs which correspond to these alignments (possibly including full sentences)

• Phrase probabilities can be calculated from relative frequencies
  – Phrases and their probabilities make up the phrase translation table (translation model)
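A compact sketch of the extraction and relative-frequency estimation steps, in the spirit of Koehn et al. (2003). The symmetrization (intersection grown towards the union) is assumed to have already produced the alignment set, and refinements such as unaligned boundary words and target-length limits are omitted.

```python
# Sketch of phrase-pair extraction from a symmetrized word alignment and of
# relative-frequency phrase probabilities. 'alignment' is a set of
# (source_index, target_index) links.

from collections import Counter

def extract_phrases(src_words, tgt_words, alignment, max_len=7):
    """All phrase pairs consistent with the alignment: no link may connect a word
    inside the pair to a word outside it."""
    phrases = []
    for i1 in range(len(src_words)):
        for i2 in range(i1, min(i1 + max_len, len(src_words))):
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # every link landing inside the target span must start inside the source span
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                phrases.append((" ".join(src_words[i1:i2 + 1]),
                                " ".join(tgt_words[j1:j2 + 1])))
    return phrases

def phrase_table(phrase_pairs):
    """Relative-frequency estimate p(target | source) over the extracted pairs."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

src, tgt = "le magasin est ouvert".split(), "the shop is open".split()
links = {(0, 0), (1, 1), (2, 2), (3, 3)}
table = phrase_table(extract_phrases(src, tgt, links))
print(table[("le magasin", "the shop")])   # 1.0
```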

Page 14: Hybridity in MT: Experiments on the Europarl Corpus

Experiments: Data Resources

• Made use of French-English training and testing sets of the Europarl corpus (Koehn, 2005)

• Extracted training data from designated training sets, filtering based on sentence length and relative sentence length.

  # sentence pairs    # words
  78K                 1.49M
  156K                2.98M
  322K                6.12M

• For testing, randomly extracted 5000 sentences from the Europarl common test set
  – Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
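For illustration, a sketch of the kind of length-based filtering mentioned above. The actual thresholds used for these training sets are not stated here, so the values below (maximum 40 words, length ratio at most 1.5) are assumptions for the example only.

```python
# Illustrative sketch of length-based filtering of the training bitext.
# max_len and max_ratio are assumed values, not the ones used for Europarl.

def filter_pairs(pairs, max_len=40, max_ratio=1.5):
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if 0 < ls <= max_len and 0 < lt <= max_len:
            if max(ls, lt) / min(ls, lt) <= max_ratio:  # relative sentence length
                kept.append((src, tgt))
    return kept
```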

Page 15: Hybridity in MT: Experiments on the Europarl Corpus

EBMT vs PBSMT

• Compared the performance of our Marker-Based EBMT system against that of a PBSMT system built using:
  – Pharaoh Phrase-Based Decoder (Koehn, 2003)
  – SRI LM toolkit
  – Refined alignment strategy (Och & Ney, 2003)

• Trained on incremental data sets, tested on 5000 sentence test set

• Performed translation for French-English and English-French

Page 16: Hybridity in MT: Experiments on the Europarl Corpus

EBMT vs PBSMT: French-English

[Charts: BLEU, Precision, Recall and WER scores for EBMT vs. PBSMT (French-English), one chart per training set: 78K, 156K, 322K]

• Doubling the amount of data improves performance across the board for both EBMT and PBSMT

• The PBSMT system clearly outperforms the EBMT system, achieving a BLEU score on average 0.07 higher

• PBSMT achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set)

• Increasing the amount of training data results in:
  – a 3-5% relative BLEU increase for PBSMT
  – a 6.2% to 10.3% relative BLEU improvement for EBMT
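(For orientation, with made-up numbers: a 5% relative BLEU increase corresponds to moving from, say, 0.200 to 0.210 BLEU, since (0.210 − 0.200) / 0.200 = 5%.)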

Page 17: Hybridity in MT: Experiments on the Europarl Corpus

EBMT vs PBSMT: English-French

[Charts: BLEU, Precision, Recall and WER scores for EBMT vs. PBSMT (English-French), one chart per training set: 78K, 156K, 322K]

• PBSMT continues to outperform the EBMT system by some distance
  – e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs. 0.4578 Recall for the 322K data set

• The difference between scores is somewhat smaller for English-French than for French-English
  – The EBMT system's performance is much more consistent across both directions
  – The PBSMT system performs 2% BLEU score worse (10% relative) for English-French than for French-English

• French-English is ‘easier’
  – Fewer agreement errors and problems with boundary friction, e.g. le → the (French-English) vs. the → le, la, les, l’ (English-French)

Page 18: Hybridity in MT: Experiments on the Europarl Corpus

Hybrid System Experiments

• Decided to merge elements of the EBMT marker-based alignments with the PBSMT phrases and words induced via GIZA++

• A number of hybrid systems:
  – LEX-EBMT: replaced the EBMT lexicon with the higher-quality PBSMT word alignments, to lower WER
  – H-EBMT vs H-PBSMT: merged the PBSMT words and phrases with the EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems
  – EBMT-LM and H-EBMT-LM: reranked the output of the EBMT and H-EBMT systems using the PBSMT system's equivalent language model
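A schematic sketch of the reranking idea behind EBMT-LM and H-EBMT-LM: rescore the EBMT n-best list with a target-language model. The lm_logprob function and the interpolation weight are placeholders for illustration, not the actual combination used.

```python
# Schematic sketch of n-best reranking with a target-language model.
# 'lm_logprob' stands in for the PBSMT system's language model (e.g. an n-gram
# model scored with the SRI LM toolkit); the linear interpolation and its weight
# are illustrative assumptions.

def rerank(nbest, lm_logprob, lm_weight=0.5):
    """nbest: list of (translation, ebmt_score) pairs, higher score = better.
    Returns the list re-sorted by a weighted combination of EBMT and LM scores."""
    def combined(item):
        translation, ebmt_score = item
        return (1 - lm_weight) * ebmt_score + lm_weight * lm_logprob(translation)
    return sorted(nbest, key=combined, reverse=True)
```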

Page 19: Hybridity in MT: Experiments on the Europarl Corpus

Hybrid Experiments: French-English

[Chart: BLEU scores for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT (French-English) on the 78K, 156K and 322K training sets]

• Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative increase of 2.9% BLEU)

• Adding the hybrid data improves on both baselines, for EBMT (H-EBMT) and PBSMT (H-PBSMT)
  – The H-PBSMT system trained on 78K and 156K achieves a higher BLEU score than the PBSMT system trained on twice as much data

• The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further

Page 20: Hybridity in MT: Experiments on the Europarl Corpus

Hybrid Experiments: English-French

[Chart: BLEU scores for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT (English-French) on the 78K, 156K and 322K training sets]

• We see similar results for English-French as for French-English

• Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline

• The H-EBMT system performs almost as well as the baseline system trained on over 4 times the amount of data

Page 21: Hybridity in MT: Experiments on the Europarl Corpus

Conclusions

• In Groves & Way (2005), we showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set

• This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system
  – Heterogeneous Europarl data vs. homogeneous Sun data
  – Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone (Sun) vs. 1% on Europarl
  – The EBMT system considered 13 words on average for direct translation

• Significant improvements seen when using the higher-quality lexicon
• Improvements also seen when the LM is introduced

• H-PBSMT system able to outperform baseline PBSMT system

• Further gains to be made from hybrid corpus-based approaches

Page 22: Hybridity in MT: Experiments on the Europarl Corpus

Future Work

• Automatic detection of marker words

• Plan to increase the levels of hybridity:
  – Code a simple EBMT decoder, factoring in the Marker-Based recombination approach along with probabilities (rather than weights)
  – Use exact sentence matching as in EBMT, along with statistical weighting of the knowledge sources
  – Integrate generalized templates into the PBSMT system
  – Use Good-Turing methods to assign probabilities to fuzzy matching
    • Often a fuzzy chunk match may be preferable to a word-for-word translation

• Plan to code a robust, wide-coverage Statistical EBMT system
  – Make use of EBMT principles in a statistically-driven system

