
Parallel Data & Sentence Alignment

Date post: 24-Feb-2016
Page 1: Parallel Data & Sentence Alignment

Parallel Data & Sentence Alignment
Declan Groves, DCU

Page 2: Parallel Data & Sentence Alignment

Parallel Corpus

We saw in week 2 why data-driven Machine Translation (MT), which is based on real-world translation examples and data, is now the most prevalent approach.

3 approaches to data-driven MT:
- Example-Based (EBMT)
- Statistical (SMT)
- Hybrid models (a mix of different approaches, possibly including non-data-driven ones such as rule-based MT) which use some probabilistic processing

All of them need a parallel corpus (or bitext) of aligned sentences.

We can create this resource manually; otherwise, if we have unaligned bilingual texts, we can automate the alignment.

Page 3: Parallel Data & Sentence Alignment

Automatic Alignment (1/2)

Most alignments are one-to-one:

E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships.
F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s’installer dans les Cantons de l’Est.

E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy.
F2: Il n’y a aucune loi pour empêcher cela, c’est de la régie interne.

E3: That is serious.
F3: C’est grave.

Page 5: Parallel Data & Sentence Alignment

Automatic Alignment (2/2)

But not always:

E1: Honourable members opposite scoff at the freeze suggested by this party; to them it is laughable.

F1: Les députés d’en face se moquent du gel qu’a proposé notre parti.
F2: Pour eux, c’est une mesure risible.

Here E1 aligns with F1 and F2 together (a 1-2 alignment).

So some alignments (like this one) have one sentence corresponding to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

Page 6: Parallel Data & Sentence Alignment

Good Language Models Come from Good Corpora (1/2)

Any statistical approach to MT requires the availability of aligned bilingual corpora which are:
- large;
- good-quality;
- representative.

Class Question

Assume the following (tiny) corpus:

Mary and John have two children.
The children that Mary and John have are aged 3 and 4.
John has blue eyes.

Q1: What’s P(have) vs. P(has) in a general corpus? Which is more likely?
Q2: What’s P(have | John) vs. P(has | John) in a general corpus?
Q3: What’s P(have) vs. P(has) in this corpus? What’s their relative probability?
Q4: What’s P(have | John) vs. P(has | John) in this corpus?

* SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Jurafsky & Martin)
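Q3 and Q4 can be checked mechanically. The following is a minimal sketch, assuming plain maximum-likelihood estimates (relative frequencies) with lower-casing and whitespace tokenisation; the helper names `tokenize`, `p` and `p_next` are made up for illustration and are not part of the lecture.

```python
from collections import Counter

corpus = [
    "Mary and John have two children.",
    "The children that Mary and John have are aged 3 and 4.",
    "John has blue eyes.",
]

def tokenize(sentence):
    # Lower-case and drop the sentence-final full stop (the only punctuation here).
    return sentence.lower().rstrip(".").split()

sentences = [tokenize(s) for s in corpus]
unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
total = sum(unigrams.values())

def p(word):
    """Unigram relative frequency P(word) in this corpus."""
    return unigrams[word] / total

def p_next(word, prev):
    """Bigram relative frequency P(word | prev), i.e. count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("have"), p("has"))                            # Q3
print(p_next("have", "john"), p_next("has", "john"))  # Q4
```

In this corpus "have" is twice as frequent as "has", and after "John" the model prefers "have" two to one, which is exactly the kind of skew a tiny, unrepresentative corpus produces.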

Page 9: Parallel Data & Sentence Alignment

Good Language Models Come from Good Corpora (2/2)

Assume a different small corpus:

Am I right, or am I wrong?
Peter and I are seldom wrong.
I am sometimes right.
Sam and I are often mistaken.

Q5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred?

Q6: Try to think of some trigrams (and 4-grams, if you can) that cannot be ‘discovered’ by a bigram model. What you’re looking for here is a phrase where the third (or subsequent) word depends on the first word, which in a bigram model is ‘too far away’...

Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models ...

Page 10: Parallel Data & Sentence Alignment

Bilingual Corpora (1/2)

The previous examples were all monolingual, but the same applies to bilingual corpora.

One issue that we can think about now is what corpora we’re going to extract our probabilistic language (and translation) models from. What sorts of large, good-quality, representative bilingual corpora exist?
- Canadian Hansards
- Proceedings of the Hong Kong parliament
- Dáil Proceedings
- …

That is, while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

Page 11: Parallel Data & Sentence Alignment

Bilingual Corpora (2/2)

W.r.t. ‘representativeness’, consider the following from a 9,000-word experiment (Brown et al., 1990) on the Canadian Hansards ...

French       Probability
???          .808
entendre     .079
entendu      .026
entends      .024
entendons    .013

Q1: For what English word are these possible candidate translations?
Q2: What’s “???”?

Beware of sparse data and unrepresentative corpora! Ditto poor-quality language ... though I’ll come back to this one!

Page 12: Parallel Data & Sentence Alignment

Bilingual Corpora (2/2)

The answer to Q2:

French       Probability
bravo        .808
entendre     .079
entendu      .026
entends      .024
entendons    .013

If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.

Page 13: Parallel Data & Sentence Alignment

Is the World Wide Web a good corpus to use?

Let’s imagine you want to find the correct translation of the French compound (a word that consists of more than one token/stem) “groupe de travail”. For each of the two French nouns here, there are several translations:

groupe: cluster, group, grouping, concern, collective
travail: work, labor, labour

If groupe de travail is not in our dictionary (cf. the proliferation of compounds), but the two component nouns are, then any compositional translation is multiply ambiguous.

Now here’s the trick: let’s search for all 15 (5 x 3) possible pairs on the WWW. We might find:

labour cluster: 2
labour concern: 9
.....
work group: 66593

Page 14: Parallel Data & Sentence Alignment

Is the World Wide Web a good corpus to use?

There are at least two ways in which we could attempt to resolve the potential ambiguity:

- simply take the most frequently occurring term as the translation;
- see which candidate translations surpass some threshold (in terms of relative frequency, say).

cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations.

So, we can use the WWW to create a very large bilingual lexicon ...
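Both disambiguation strategies can be sketched in a few lines. This is only an illustration under stated assumptions: the three hit counts are the ones quoted on the slide, every other candidate is unknown and is treated as 0 here, and the function names `rank` and `above_threshold` as well as the 10% threshold are hypothetical choices.

```python
from itertools import product

groupe = ["cluster", "group", "grouping", "concern", "collective"]
travail = ["work", "labor", "labour"]

# English compounds put the modifier first, so the 'travail' word precedes the 'groupe' word.
candidates = [f"{t} {g}" for g, t in product(groupe, travail)]

# Only the three counts quoted on the slide; all other candidates default to 0 (assumption).
hits = {"labour cluster": 2, "labour concern": 9, "work group": 66593}

def rank(cands, counts):
    """Strategy 1: order candidates by raw frequency, most frequent first."""
    return sorted(cands, key=lambda c: counts.get(c, 0), reverse=True)

def above_threshold(cands, counts, min_share):
    """Strategy 2: keep every candidate whose relative frequency reaches min_share."""
    total = sum(counts.get(c, 0) for c in cands)
    if total == 0:
        return []
    return [c for c in cands if counts.get(c, 0) / total >= min_share]

best = rank(candidates, hits)[0]
kept = above_threshold(candidates, hits, 0.10)
print(len(candidates), best, kept)
```

The threshold variant is the one that would keep both "share price" and "stock price" if the two massively outranked all other candidates.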

Page 15: Parallel Data & Sentence Alignment

Can we use the WWW to extract parallel corpora?

Lots of pages where you can click on a link to get a version of that page in a different language – how to find these?

Query a search engine for a string “English” ‘not too far away’ from the string “Spanish”. This might get you things like:

Click here for English version vs. Click here for Spanish version

But also:

English literature vs. Spanish literature

What then? Need a process to evaluate these candidates. Using something like ‘diff’, you could align each case and manually remove poor candidates:

Page 16: Parallel Data & Sentence Alignment

Can we use the WWW to extract parallel corpora?

Comparison using ‘diff’:

    sunsite{away}691: diff a b
    1c1
    < Mary and John have two children.
    ---
    > Maria und Johannes haben zwei Kinder.
    3c3
    < The children that Mary and John have are aged 3 and 4.
    ---
    > Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.
    5c5
    < John has blue eyes.
    ---
    > Johannes hat blaue Augen.

Or automate the comparison fully using string compare routines measured against some % threshold....

Page 17: Parallel Data & Sentence Alignment

Sentence Alignment

Manual construction of aligned corpora (which are essential for probabilistic techniques) also avoids the considerable problem of trying to align source and target texts.

But what if we already have bilingual corpora which are not aligned?!

Automation!

Page 18: Parallel Data & Sentence Alignment

Automatic Alignment

We’ve just seen one (novel) way of automating this process (using simple string comparison techniques).

What are the problems?

Most alignments are one-to-one ... but some are not, as we saw previously:

Some have one sentence corresponding to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

Let’s look at (some of) the major algorithms for aligning pairs of sentences ...

Page 19: Parallel Data & Sentence Alignment

Gale & Church’s Algorithm (1/4)

Gale & Church (1993): ‘A program for aligning sentences in bilingual corpora’, in Computational Linguistics 19(1):75–102.

All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar with regard to length (i.e. shorter sentences tend to have shorter translations than longer sentences).

G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2 etc.).

Distance measure: P(match | d), where match = 1-1, 0-1 etc., and d = difference in length.

How to calculate text length?
- word tokens;
- characters.

G&C use characters. No language maps character by character to another language, so first calculate the ratio of chars. in L1 to chars. in L2.

e.g. English text = 100,000 chars; Spanish text = 110,000 chars; then scaling ratio = 1.1

Page 20: Parallel Data & Sentence Alignment

Gale & Church’s Algorithm (2/4)

Q: What’s the difference in length between an English text of 50 chars. and a Spanish text of 56 chars. given this scaling ratio (1.1)?

Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take longer texts into account:

$$d = \frac{l_t - l_s \cdot c}{\sqrt{l_m \cdot s^2}}$$

where $l_t$ and $l_s$ are the lengths of the target and source texts respectively, $l_m$ is the average of the two lengths, $c$ is the scaling factor, and $s^2$ is the variance for all values of $d$ in a typical aligned corpus, which G&C suggest is 6.8.
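The question on this slide can be worked through numerically. The sketch below simply evaluates the normalized length difference as defined above for the 50-character English and 56-character Spanish texts; the function name `normalized_length_difference` is made up for illustration.

```python
import math

def normalized_length_difference(l_s, l_t, c, s2=6.8):
    """d = (l_t - l_s * c) / sqrt(l_m * s2), with l_m the mean of the two lengths."""
    l_m = (l_s + l_t) / 2
    return (l_t - l_s * c) / math.sqrt(l_m * s2)

# Scaling ratio from the earlier example: 110,000 Spanish chars per 100,000 English chars.
c = 110_000 / 100_000

raw_difference = 56 - 50 * c                     # about 1 character
d = normalized_length_difference(50, 56, c)      # about 0.053
print(round(raw_difference, 3), round(d, 3))
```

After scaling, the two texts differ by about one character, and the normalized value d is close to 0, which is what a good 1-1 candidate should look like.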

Page 21: Parallel Data & Sentence Alignment

Gale & Church’s Algorithm (3/4)

Terms like P(match | d) are called conditional probabilities.

Q: P(throwing at least 7 with 2 dice)?
Q: P(throwing at least 7 | 1st die thrown was a 6)?

Bayes’ Theorem is used to relate probabilities:

$$P(\text{match} \mid d) = \frac{P(d \mid \text{match}) \cdot P(\text{match})}{P(d)}$$

P(d) is a normalizing constant, so it can be ignored. P(match) is estimated from correctly aligned corpora.

G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to the other possible alignments. P(d | match) has to be estimated (cf. Trujillo, 1999:71).
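The two dice questions make a convenient sanity check for Bayes’ Theorem. The following sketch enumerates all 36 outcomes with exact fractions; the helper `prob` and the event names are illustrative only.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely (first, second) outcomes

def always(roll):
    return True

def at_least_7(roll):
    return roll[0] + roll[1] >= 7

def first_is_6(roll):
    return roll[0] == 6

def prob(event, given=always):
    """P(event | given), computed by counting outcomes."""
    space = [r for r in rolls if given(r)]
    return Fraction(sum(1 for r in space if event(r)), len(space))

p_a = prob(at_least_7)                      # P(at least 7)
p_a_given_b = prob(at_least_7, first_is_6)  # P(at least 7 | first die is 6)

# Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)
bayes = prob(first_is_6, at_least_7) * p_a / prob(first_is_6)
print(p_a, p_a_given_b, bayes)
```

Counting gives 7/12 for the unconditional question and 1 for the conditional one, and the right-hand side of Bayes’ Theorem reproduces the conditional value.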

Page 22: Parallel Data & Sentence Alignment

Gale & Church’s Algorithm (4/4)

Comment:

Aligning complete texts is computationally expensive. Luckily, we can use paragraphs, sections, headings, chapters etc. to identify chunks of text and align these smaller elements. That is, we’re looking for anchors which can identify text chunks in both languages. Other possible anchors are tags, proper names, acronyms, cognates, figures, dates… (Q: can you think of others?)

Results? G&C report error rates of 4.2% on 1316 alignments. Most errors occur in non 1-1 alignments. Selection of the best 80% of alignments reduces the error rate to 0.7%. Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s², but ‘noisy’ texts or texts from very different languages can degrade performance.

Page 23: Parallel Data & Sentence Alignment

Recap

Data-driven (e.g. statistical) MT relies on the availability of parallel data that is:
- Of sufficient quantity
- Of sufficient (i.e. high) quality
- Representative of the language we’re trying to model

Poor quality data = poor quality models
- Too little data: data sparseness & poor coverage
- Too poor quality: bad quality translations
- Not representative: model will make wrong assumptions, produce incorrect translations

Data extracted from the web can be used to create some parallel texts
- Best for dictionary extraction
- Finding corresponding documents can be difficult

Parallel data needs to be aligned
- Document-level alignment
- Sentence-level alignment (manual vs automatic, using rel. sentence length ratio)
- Gale & Church (1993) character-based sentence alignment

Page 24: Parallel Data & Sentence Alignment

Brown et al.’s Algorithm (1/3)

Brown et al. (1991): ‘Aligning sentences in parallel corpora’, in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp.169–176.

Brown et al. measure the length of a text in terms of the number of word tokens.

Their method assumes alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is that which maximizes the output of the HMM

The HMM of Brown et al. models alignment by determining probabilities for eight types of alignments, or beads:

s, t, st, sst, stt, ¶s, ¶t, ¶s ¶t

e.g. sst indicates that two source strings align with one target string (2-1), and ¶s signifies that a source paragraph delimiter matches nothing in the target language (i.e. it is deleted) (1-0).

Page 25: Parallel Data & Sentence Alignment

Brown et al.’s Algorithm (2/3)

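The probability of an s bead emitting a sentence of length $l_s$ is $P(l_s)$.
We let $P(l_s)$ = the probability of finding a sentence of length $l_s$ in the source text.

Similarly, the probability of a t bead emitting a sentence of length $l_t$ is $P(l_t)$.
We let $P(l_t)$ = the probability of finding a sentence of length $l_t$ in the target text.

For st (i.e. a 1-1 alignment), $P(l_s, l_t)$ depends on $P(l_s)$ and $P(l_t \mid l_s)$.
The log of the ratio ($r$) of $l_t$ to $l_s$ is normally distributed with a mean of $\mu$ and variance $\sigma^2$; thus,

if $r = \log(l_t / l_s)$, then

$$P(l_t \mid l_s) = \alpha \exp\left[-\frac{(r - \mu)^2}{2\sigma^2}\right]$$

with $\alpha$ a constant ensuring that, for all values of $l_s$, $\sum_{l_t} P(l_t \mid l_s) = 1$.

[Figure: English text (s) with sentence lengths 17, 25, 12, … set against French text (t) with sentence lengths 19, 20, 8, …, grouped into beads st and stt.]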
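The length model for an st bead can be sketched as follows. This is a rough illustration, not Brown et al.’s implementation: the values of `MU` and `SIGMA2` are placeholders chosen here (the slides do not give the estimates), and the normalizing constant is computed by summing over target lengths from 1 up to an assumed maximum of 81 words.

```python
import math

MU = 0.07      # placeholder mean of r = log(l_t / l_s); not the value from the paper
SIGMA2 = 0.04  # placeholder variance of r; not the value from the paper
MAX_LEN = 81   # assumed maximum sentence length in words

def unnormalized(l_t, l_s):
    """exp(-(r - mu)^2 / (2 sigma^2)) with r = log(l_t / l_s)."""
    r = math.log(l_t / l_s)
    return math.exp(-((r - MU) ** 2) / (2 * SIGMA2))

def p_target_given_source(l_t, l_s):
    """P(l_t | l_s), with alpha chosen so the probabilities over l_t sum to 1."""
    alpha = 1.0 / sum(unnormalized(l, l_s) for l in range(1, MAX_LEN + 1))
    return alpha * unnormalized(l_t, l_s)

# A 17-word source sentence: a 19-word target is far more plausible than a 40-word one.
dist = [p_target_given_source(l, 17) for l in range(1, MAX_LEN + 1)]
print(round(sum(dist), 6))
print(p_target_given_source(19, 17) > p_target_given_source(40, 17))
```

Because the model works on the log of the length ratio, a target twice as long and a target half as long are penalized symmetrically around the mean.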

Page 26: Parallel Data & Sentence Alignment

Brown et al.’s Algorithm (3/3)

For stt or sst beads, $r$ is calculated by adding the lengths of the two target (or two source) strings.

The probability of both sentences is calculated by $P(l_{t_1}) P(l_{t_2})$ and $P(l_{s_1}) P(l_{s_2})$, respectively.

For this method, Brown et al. calculated $\mu$ and $\sigma^2$ based on a sample corpus of 1000 sentences with a maximum length of 81 words.

The method used anchor points (major: paragraph markers; minor: session points, time stamps, names of speakers…) to guide the alignment process:
- only consider sections that contain the same number of minor anchor points

Results: Reported error rates for st beads for English-French are 0.9% (very similar to Gale & Church’s approach).

Page 27: Parallel Data & Sentence Alignment

Kay & Röscheisen’s Algorithm (1/2)

Kay & Röscheisen (1993): “Text-translation alignment”, in Computational Linguistics 19(1):121–142.

Text-based alignment assumes a set of anchors, a parallel text and access to an English-French dictionary.

Two sentences will be aligned iff word translation pairs from the bilingual dictionary are found with sufficient frequency between the two sentences (relative to their size and to other sentences in the corpus).

Compare with length-based algorithms. Disadvantage: it requires a bilingual dictionary. Is one always available?

Uses information on relative sentence position for the English-French Canadian Hansards.

Previously we saw relative sentence length (length-based algorithms).

Page 29: Parallel Data & Sentence Alignment

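Kay & Röscheisen’s Algorithm (2/2)

Assume 10 source and 12 target sentences.

The initial estimate alignment maps the first source sentence to the first target sentence ($s_1$ to $t_1$) and the last source sentence to the last target sentence ($s_{10}$ to $t_{12}$).

Let’s look at a source sentence $s_i$ in between: assume it aligns with its estimated target position +/- 3 sentences, i.e. $s_i$ is estimated to align with one of the target sentences in that window.

Produce a set of word alignments by selecting word pairs in currently aligned sentences.

For example: if file occurs in source sentences $s_i$ and $s_j$, and fichier occurs in target sentences $t_k$ and $t_l$, then the sentence alignments $(s_i, t_k)$ and $(s_j, t_l)$ are made, and the alignments become additional anchors.

Caveat: sentence anchors may not cross, e.g. if we have anchors $(s_i, t_k)$ and $(s_j, t_l)$, then a sentence between $s_i$ and $s_j$ can only align with target sentences between $t_k$ and $t_l$, and with nothing after $t_l$, as this is already aligned with $s_j$.

Results: the algorithm converges after 4/5 iterations, and on the first 1000 sentences all but 7 were aligned correctly, but it has high time/space complexity.

[Figure: two source sentences containing “file” linked to two target sentences containing “fichier”.]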
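The initial estimate can be sketched as a window of candidate target sentences around each source sentence. Note the assumptions: placing the estimated position by linear interpolation between the first and last sentences is a simplification chosen here, the fixed +/- 3 window follows the slide, and `candidate_targets` is a hypothetical helper, not Kay & Röscheisen’s procedure.

```python
def candidate_targets(i, n_s, n_t, window=3):
    """Target sentence numbers (1-based) that source sentence i may initially align with."""
    if i == 1:
        return [1]          # first sentences are anchored to each other
    if i == n_s:
        return [n_t]        # last sentences are anchored to each other
    # Assumed: the estimated position scales linearly between the two anchors.
    expected = round(1 + (i - 1) * (n_t - 1) / (n_s - 1))
    lo = max(1, expected - window)
    hi = min(n_t, expected + window)
    return list(range(lo, hi + 1))

# 10 source and 12 target sentences, as on the slide.
for i in (1, 5, 10):
    print(i, candidate_targets(i, 10, 12))
```

Each new anchor found through word pairs such as file/fichier would shrink these windows on the next iteration, since anchors may not cross.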

Page 30: Parallel Data & Sentence Alignment

Exact Matches

In Translation Memory (TM) (and EBMT) systems we would like to look first for exact matches for input or source language sentences.

Some non-exact matches:

Different spelling:
Change the color of the font.
Change the colour of the font.

Different punctuation:
Open the file and select the text.
Open the file, and select the text.

Different inflection:
Delete the document.
Delete the documents.

Different numbers:
Use version 1.1.
Use version 1.2.

Different formatting (the two segments differ only in text formatting):
Click on OK.
Click on OK.

Page 31: Parallel Data & Sentence Alignment

Fuzzy Matches

Exact matches often do not occur; TM systems then use “fuzzy” matching:

A Fuzzy Match

New input segment: The specified operation failed because it requires the Character to be active.

Stored TM Unit:
EN: The specified language for the Character is not supported on the computer.
FR: La langue spécifiée pour le Compagnon n’est pas prise en charge par cet ordinateur.

Page 35: Parallel Data & Sentence Alignment

Fuzzy Matches

Exact matches often do not occur; TM systems then use “fuzzy” matching:

Multiple Fuzzy Matches in Ranked Order

New input segment: The operation was interrupted because the Character was hidden.

Best Match:
EN: The operation was interrupted because the Listening key was pressed.
FR: L’opération a été interrompue car la touche d’écoute a été enfoncée.

2nd Best Match:
EN: The specified method failed because the Character is hidden.
FR: La méthode spécifiée a échoué car le Compagnon est masqué.

3rd Best Match:
EN: The operation was interrupted by the application.
FR: L’opération a été interrompue par l’application.
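A fuzzy lookup of this kind can be sketched with Python’s standard `difflib` module. This is an assumption-laden illustration: the slides do not say which similarity measure a TM system uses, so word-level `SequenceMatcher.ratio()` stands in for it here, and `fuzzy_score`, `ranked_matches` and the 0.5 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def tokens(segment):
    # Lower-case, drop the final full stop, split on whitespace.
    return segment.lower().rstrip(".").split()

def fuzzy_score(a, b):
    """Word-level similarity in [0, 1]; 1.0 means the token sequences are identical."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()

def ranked_matches(new_segment, memory, threshold=0.5):
    """Return (score, source, target) for stored units above the threshold, best first."""
    scored = [(fuzzy_score(new_segment, src), src, tgt) for src, tgt in memory]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)

memory = [
    ("The specified method failed because the Character is hidden.",
     "La méthode spécifiée a échoué car le Compagnon est masqué."),
    ("The operation was interrupted by the application.",
     "L’opération a été interrompue par l’application."),
    ("The operation was interrupted because the Listening key was pressed.",
     "L’opération a été interrompue car la touche d’écoute a été enfoncée."),
]

new_input = "The operation was interrupted because the Character was hidden."
matches = ranked_matches(new_input, memory)
for score, src, tgt in matches:
    print(f"{score:.2f}  {src}")
```

The top-scoring unit is the one shown as the best match above; the order of the remaining candidates depends on the similarity measure, so it need not coincide with the ranking a particular TM tool produces.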

Page 36: Parallel Data & Sentence Alignment

Subsentential Alignment (1/4)

Words and terms tend to have similar distributions within two texts and can be expected to occur in the same aligned sentences.

Word alignment -> bilingual dictionaries.

e.g. ‘computer’ occurs in 30 English strings and ‘computadora’ in 29 Spanish strings and nowhere else: there is a strong chance they are translations of each other.

We can measure this co-occurrence using Mutual Information:

$$MI(s, t) = \log_2 \frac{P(s, t)}{P(s) \cdot P(t)}$$

e.g. given an aligned corpus of 100 strings, if ‘store’ occurs in 30 aligned English strings, ‘speichern’ occurs in 20 aligned German strings and these alignments coincide in 19 cases, then for this corpus:

$$MI(\text{store}, \text{speichern}) = \log_2 \frac{0.19}{0.30 \times 0.20} \approx 1.66$$
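The store/speichern figures can be plugged into a short function. The sketch assumes the pointwise definition of mutual information with logarithms to base 2, and the name `pointwise_mi` is made up for illustration.

```python
import math

def pointwise_mi(n_total, n_x, n_y, n_xy):
    """log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated from string counts."""
    p_x = n_x / n_total
    p_y = n_y / n_total
    p_xy = n_xy / n_total
    return math.log2(p_xy / (p_x * p_y))

# 100 aligned strings: 'store' in 30, 'speichern' in 20, both together in 19.
mi = pointwise_mi(100, 30, 20, 19)
print(round(mi, 2))

# If the two words co-occurred only as often as chance predicts (30% x 20% = 6 strings),
# the score would be close to 0.
print(round(pointwise_mi(100, 30, 20, 6), 2))
```

A clearly positive score, as here, says the words co-occur about three times more often than independence would predict.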

Page 37: Parallel Data & Sentence Alignment

Subsentential Alignment (2/4)

Mutual information values:
- Positive values indicate that the occurrence of one word strongly predicts that the other word will occur
- A value near 0 indicates that they occur independently
- Negative values predict that if one word occurs the other does not

Problem with MI: it doesn’t provide probabilities (difficult to use in SMT systems).

Trick: use MI to align source-target words, then, if a threshold can be determined empirically, rank them. If a source word $s$ has a higher MI score with target word $t_1$ than with $t_2$, then $t_2$ is deleted as a less likely translation.

Page 38: Parallel Data & Sentence Alignment

Subsentential Alignment (3/4)

Different technique: the translation of a source word can be found at ‘about the same position’ in the target sentence to which its own sentence is aligned; if not, then within a fixed distance.

Algorithm:
1. Estimate word translation and offset (relates to word position) probabilities
2. Find the most probable alignment for the given text.

Naïve alignment: for a source word located 25% of the way through a source text, its translation is located 25% of the way through the target text, +/- 20 words.
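The naïve positional estimate amounts to a small window computation. In this sketch the function name `candidate_positions`, the 0-based word indices and the text lengths in the example are all assumptions for illustration.

```python
def candidate_positions(src_index, src_len, tgt_len, window=20):
    """Range of target word positions where the translation of a source word is sought."""
    relative = src_index / src_len            # how far through the source text we are
    centre = round(relative * tgt_len)        # same relative position in the target text
    lo = max(0, centre - window)
    hi = min(tgt_len - 1, centre + window)
    return range(lo, hi + 1)

# A word 25% of the way through a 1000-word source, with a 1200-word target.
positions = candidate_positions(250, 1000, 1200)
print(positions[0], positions[-1])
```

Only target words inside this window would then be scored with the word translation probabilities from step 1.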

Ignoring function words (e.g. prepositions, determiners etc.) taken from a stoplist, we might induce the following alignments:

A Simple Aligned Text

EN: Start the operating system
ES: Comenzar el sistema operativo

EN: Launch the program via the keyboard
ES: Empezar el programa mediante el teclado

Page 39: Parallel Data & Sentence Alignment

Subsentential Alignment (4/4)

comenzar - start
*sistema - start
*empezar - system
teclado - keyboard

Two are right, two are wrong (marked *) … but their correct translations lie close by.

Assuming enough text, the pairings (sistema, system) and (empezar, launch) can be confirmed as correct alignments. This leaves fewer candidates to be aligned.

For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target.

Similarly, words at the start/end of a source sentence map to words at the start/end of the target sentence.

Other subsentential alignment techniques:
- The distance between occurrences of a source word mirrors the distance between occurrences of its target word.
- Confirm aligned word pairs, using a third corpus as reference.
