Parallel Data & Sentence Alignment

Parallel Data & Sentence Alignment Declan Groves, DCU

[email protected]

Parallel CorpusSeen why from week 2 that data-driven Machine Translation (MT), based on using real-world translation examples/data, is now the most prevelant approach

3 approaches to data-driven MT:Example Based (EBMT)Statistical Based (SMT)Hybrid models (a mix of different approaches; may include non-data driven approaches such as rule-based) which use some probabilistic processing

All need a parallel copus (or bitext) of aligned sentences

Can create the resource manually, otherwise if we have un-aligned bilinugal texts, we can automate the alignment

Automatic Alignment (1/2)Most alignments are one-to-one:

E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships.F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s’installer dans les Cantons de l’Est.

E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy.F2: Il n’ya aucune loi pour empêcher cela, c’est de la r´egie interne.

E3: That is serious.F3: C’est grave.

Automatic Alignment (2/2)But not always:

E1: Honourary members opposite scoff at the freeze suggested by this party; to them it is laughable.

F1: Les deputés d’en face se moquent du gel que a propose notre parti.F2: Pour eux, c’est une mesure risible.

Automatic Alignment (2/2)But not always:

So some (like this) have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

E1: Honourary members opposite scoff at the freeze suggested by this party; to them it is laughable.

F1: Les deputés d’en face se moquent du gel que a propose notre parti.F2: Pour eux, c’est une mesure risible.

E1F1F2{

Any statistical approach to MT requires the availability of aligned bilingual corpora which are:

large;good-quality;representative

Class QuestionAssmume the following (tiny) corpus:

Q1: what’s P(have) vs. P(has) in a general particular corpus? Which is more likely?Q2: what’s P(have | John) vs. P(has | John) in a general corpus?Q3: what’s P(have) vs. P(has) in this corpus? What’s their relative probability?Q4: what’s P(have | John) vs. P(has | John) in this corpus?

* SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Jurafsky & Martin)

Good Language Models Come from Good Copora (1/2)

Mary and John have two children.The children that Mary and John have are aged 3 and 4.John has blue eyes.


Assume a different small corpus:

Q5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred?

Am I right, or am I wrong?Peter and I are seldom wrong.I am sometimes right.Sam and I are often mistaken.








Q6: Try to think of some trigrams (and 4-grams, if you can) that cannot be ‘discovered’ by a bigram model? What you’re looking for here is a phrase where the third (or subsequent) word depends on the first word, which in a bigram model is ‘too far away’...

Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models ...


Bilingual Corpora (1/2)Previous examples, all monolingual, but the same applies w.r.t. bilingual corpora.

One issue that we can think about now is what corpora we’re going to extract our probabilistic language (and translation) models from. What sorts of large, good quality, representative bilingual corpora exist?

Canadian HansardsProceedings of the Hong Kong parliamentDáil Proceedings…

i.e. while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

Bilingual Corpora (1/2)W.r.t. ‘representativeness’, consider the following from a 9000 word experiment (Brown et al., 1990) on the Canadian Hansards ...

Q1: For what English word are these possible candidate translations?Q2: What’s “???”:

Beware of sparse data and unrepresentative corpora!!Ditto poor quality language ... though I’ll come back to this one!

French Probability??? .808

entendre .079

entendu .026

entends .024

entendons .013

Bilingual Corpora (1/2)W.r.t. ‘representativeness’, consider the following from a 9000 word experiment (Brown et al., 1990) on the Canadian Hansards ...

Q1: For what English word are these possible candidate translations?Q2: What’s “???”:

Beware of sparse data and unrepresentative corpora!!Ditto poor quality language ... though I’ll come back to this one!

If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.

French Probabilitybravo .808entendre .079

entendu .026

entends .024

entendons .013

Is the World Wide Web a good corpus to use?Let’s imagine you want to find the correct translation of the French compound (a word that consists of more than one token/stem) “groupe de travail”. For each of the two French nouns here, there are several translations:

groupe: cluster, group, grouping, concern, collectivetravail: work, labor, labour

If groupe de travail is not in our dictionary (cf. proliferation of compounds), but the two composite nouns are, then any compositional translation is multiply ambiguous.

Now here’s the trick: let’s search for all 15 possible pairs on the WWW. We might find:

labour cluster: 2labour concern: 9.....work group: 66593

Is the World Wide Web a good corpus to use?

There are at least two ways in which we could attempt to resolve the potential ambiguity:

simply take the most often occurring (frequency) term as the translation;see which candidate translations surpass some threshold (in terms of relative frequency, say)

cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations.

So, we can use the WWW to create a very large bilingual lexicon ...

Can we use the WWW to extract parallel corpora?

Lots of pages where you can click on a link to get a version of that page in a different language – how to find these?

Query a search engine for a string “English” ‘not too far away’ from the string “Spanish”. This might get you things like:

Click here for English version vs. Click here for Spanish version

But also:

English literature vs. Spanish literature

What then? Need a process to evaluate these candidates. Using something like ‘diff’, you could align each case and manually remove poor candidates:

Can we use the WWW to extract parallel corpora?Comparison using ‘diff’

Or automate the comparison fully using string compare routines measured against some % threshold....

sunsite{away}691: diff a b1c1< Mary and John have two children.---> Maria und Johannes haben zwei Kinder.3c3< The children that Mary and John have are aged 3 and 4.---> Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.5c5< John has blue eyes.---> Johannes hat blaue Augen.

Sentence AlignmentManual construction of aligned corpora (which are essential for probabilistic techniques) also avoids the considerable problem of trying to align source and target texts.

But what if we already have bilingual corpora which are not aligned?!

Automation!

Automatic AlignmentWe’ve just seen one (novel) way of automating this process (using simple string comparison techniques).

What are the problems?Most alignments are one-to-one ... but some are not, as we saw previously:

Some have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences.

Let’s look at (some of) the major algorithms for aligning pairs of sentences ...

Gale & Church’s Algorithm (1/4)Gale & Church (1993): ‘A program for aligning sentences in bilingual corpora’, in Computational Linguistics 19(1):75–102.

All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar with regards to length (i.e. shorter sentences tend to have shorter translations then longer sentences).G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2 etc).Distance measure: P(match | d), where match = 1-1, 0-1 etc., and d = difference in lengthHow to calculate text length?

word tokens;characters.

G&C use characters. No language maps character by character to another language, so first calculate the ratio of chars. in L1 to chars. in L2.

e.g. English text = 100,000 chars; Spanish text = 110,000, then scaling ratio = 1.1

Gale & Church’s Algorithm (2/4)Q: What’s the difference in length between an English text of 50 chars. and a Spanish text of 56 chars given this scaling ratio (1.1)?

Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take into account longer texts:

2..sllcl

m

st d

where lt and ls are the lengths of the target and source texts respectively, and lm is the average of the two lengths, c is the scaling factor, and s2 is the variance for all values d of in a typical aligned corpus, which G&C suggest is 6.8.

Gale & Church’s Algorithm (3/4)Terms like P(match | d) are called conditional probabilitiesQ: P(throwing at least 7 with 2 die)?Q: P(throwing at least 7 | 1st dice thrown was a 6)?

Bayes’ Theorem is used to to relate probabilities:

P(match | d) = P(d|match).P(match)P(d)

P(d) is a normalizing constant, so can be ignored. P(match) is estimated from correctly aligned corpora

G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to other possible alignments. P(d| match) has to be estimated (cf. Trujillo, 1999:71).

Gale & Church’s Algorithm (4/4)Comment:

Aligning complete texts is computationally expensive. Luckily, we can use paragraphs, sections,headings, chapters etc to identify chunks of texts and align these smaller elements. That is, we’re looking for anchors which can identify text chunks in both languages. Other things can be tags, proper names, acronyms, cognates, figures, dates…(Q: can you think of others?)

Results? G&C report error rates of 4.2% on 1316 alignments. Most errors occur in non 1-1 alignments. Selection of the best 80% alignments reduces the error rate to 0.7%. Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s2, but ‘noisy’ texts or from very different languages can degrade performance.

RecapData-driven (e.g. statistical) MT relies on the availability of parallel data that is:

Of sufficient quantityOf sufficient (i.e. high) qualityRepresentative of the language we’re trying to model

Poor quality data = poor quality modelsToo little data: data sparseness & poor coverageToo poor quality: bad quality translationsNot representative: model will make wrong assumptions, produce incorrect translations

Data extracted from the web can be used to create some parallel textsBest for dictionary extractionFinding corresponding documents can be difficult

Parallel data needs to be alignedDocument-level alignmentSentence-level alignment (manual vs automatic, using rel. sentence length ratio)Gale & Church (1993) character-based sentence alignment

Brown et al.’s Algorithm (1/3)Brown et al. (1991): ‘Aligning sentences in parallel corpora’, in Proceedings of 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp.169–176.

Brown et al. measure the length of a text in terms of the number of word tokens.

Their method assumes alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is that which maximizes the output of the HMM

The HMM of Brown et al. models alignment by determining probabilities for eight types of alignments, or beads:

s, t, st, sst, stt, ¶s, ¶t, ¶s ¶t

i.e. sst indicates that two source strings align with one target string (2-1), and ¶s signifies that a source paragraph delimiter matches nothing in the target language (i.e. it is deleted) (1-0).

Brown et al.’s Algorithm (2/3)

The probability of s bead emitting a sentence of length ls = P(ls)We allow P(ls) = probability of finding sentence of length l in source text

Similarly, probability of a t bead emitted sentence of length lt = P(lt)We allow P(lt) = probability of finding sentence of length l in target text

For st (i.e. a 1-1 alignment), P(ls,lt) depends on P(ls) and P(lt | ls)Log of ratio (r) of lt to ls is normally distributed with a mean of and variance thus,

if , then , with a constant ensuring that for all values of , 1

English text (s)17s25s12s…

French text (t)19t20t8t…

st

stt

Brown et al.’s Algorithm (3/3)For stt or sst beads, r is calculated by adding the lengths of the two source (or two target) strings.

The probability of both sentences is calculated by and , respectively

For this method, Brown el al. calculated and based on a sample corpus of 1000 sentences with a maximum length of 81 words.

The method used anchor points (major: paragraph markers; minor: session points, time stamps, names of speakers…) to guide the alignment process

only consider sections that contain the same number of minor anchor points

Results: Reported error rates for st beads for English-French are 0.9% (very similar to Gale & Church’s approach)

Kay & Röscheisen’s Algorithm (1/2)Kay & Röscheisen’s (1993): “Text-translation alignment” in Computational Linguistics 19(1):121-142

Text-based alignment assumes set of anchors, parallel text and access to an English-French dictionary

Two sentences will be aligned iff word translation pairs from bilingual dictionary are found with sufficient frequency between two sentences (relative to size and to other sentences in corpus)

Compare with length based algorithms. Disadvantages: requires bilingual dictionary – always available?

Uses information on relative sentence position for English-French Canadian Hansards.

Previously seen relative sentence length (length-based algorithms)

Kay & Röscheisen’s Algorithm (2/2)Assume 10 source and 12 target sentences.Initial estimate alignment maps and Let’s look at : Assume this aligned with +/- 3 sentences

i.e. is estimated to align with

Kay & Röscheisen’s Algorithm (2/2)Assume 10 source and 12 target sentences.Initial estimate alignment maps and Let’s look at : Assume this aligned with +/- 3 sentences

i.e. is estimated to align with Produce a set of word alignments by selecting word pairs in currently aligned sentences.

For example: if file occurs in and and fichier occurs in and , then sentence alignments and are made and the alignments become additional anchors.

Caveat: sentence anchors may not crosse.g. if we have and then can only align with or but nothing after as this is already aligned with

Results: algorithm converges after 4/5 iterations and on first 1000 sentences all but 7 were aligned correctly, but high time/space complexity

… file …

… file …

… fichier …

… fichier …

Exact MatchesIn Translation Memory (TM) (and EBMT systems) we would like to first look for exact matches for input or source language sentencesSome non-exact matches:

Different spelling: Change the color of the font.Change the color of the font.

Different punctuation: Open the file and select the text.Open the file, and select the text.

Different inflection: Delete the document.Delete the documents.

Different numbers: Use version 1.1.Use version 1.2.

Different formatting: Click on OK.Click on OK.

Fuzzy MatchesExact matches often do not occur; TM systems then use “fuzzy” matching:

A Fuzzy Match

New input segment: The specified operation failed because it requires the Character to be active.

Stored TM Unit: EN: The specified language for the Character is not supported on the computer

FR: La langue spécifiée pour le Compagnon n’est pas prise en charge par cet ordinateur.


A Fuzzy Match

New input segment: The specified operation failed because it requires the Character to be active.

Stored TM Unit: EN: The specified language for the Character is not supported on the computer

FR: La langue spécifiée pour le Compagnon n’est pas prise en charge par cet ordinateur.


Multiple Fuzzy Matches in Ranked Order

New input segment: The operation was interrupted because the Character was hidden.

Best Match: EN: The operation was interrupted because the Listening key was pressed.

FR: L’opération a été interrompue car la touche d’écout a été enfoncée.











2nd Best Match: EN: The specified method failed because the Character is hidden.

FR: La méthode spécifiée a échoué cat le Compagnon est masqué

3nd Best Match: EN: The operation was interrupted by the application.

FR: L’opération a été interrompue par l’application

Subsentential Alignment (1/4)Words and terms tend to have similar distributions within two texts and can be expected to occur in the same aligned sentences.

Word alignment -> bilingual dictionaries.

e.g. ‘computer’ occurs in 30 English strings and ‘computadora’ in 29 Spanish strings and nowhere else: strong chance they are translations of each other.Can measure this co-occurrence using Mutual Information:

e.g. given an aligned corpus of 100 strings, if ‘store’ occurs in 30 aligned English strings, ‘speichern’ occurs in 20 aligned German sentences and these alignments coincide in 19 cases, then for this corpus:

Subsentential Alignment (2/4)Mutual information values:

positive values indicate the occurrence of one word strongly predicts that the other word will occurA value near 0 indicates that they occur independentlyNegative values predict that if one word occurs the other does not

Problem with MI: doesn’t provide probabilities (difficult to use in SMT systems)

Trick: use MI to align source-target words, then if a threshold can be determined empirically, rank them. If a has a higher MI score with then with then is deleted as a less like translation.

Subsentential Alignment (3/4)Different technique: translation of source word can be found in ‘about the same position’ as the target sentence to which it is aligned, if not then within a fixed distance.

Algorithm:1. Estimate word translation and offset (relates to word position) probabilities2. Find the most probable alignment for the given text.

Naïve alignment: for a source word located 25% of the way through a source text, its translation is located 25% of the way +/- 20 words.

Ignoring function words (e.g. prepositions, determiners etc.) taken from a stoplist, we might induce the following alignments:

A Simple Aligned Text

EN: Start the operating system ES: Comenzar el sistema operativoEN: Launch the program via the ES: Empezar el programa mediante keyboard el telcando

Subsentential Alignment (4/4)

comenzar start *sistema start *empezat system telcado keyboard

Two are right, two are wrong… but their correct translations lie close byAssuming enough text, we assume the pairings (sistema,system) and (empezar,launch) can be confirmed as correct alignments. This leaves fewer candidates to be aligned.

For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target.

Similarly, words at the start/end of a source sentence map to words at the start/end of the target sentence.

Other subsentential alignment techniques:Distance between occurrences of source word mirrors distance between occurrence of a target word.

Confirm aligned word pairs, using a third corpus as reference.

A Simple Aligned Text

EN: Start the operating system ES: Comenzar el sistema operativoEN: Launch the program via the ES: Empezar el programa mediante keyboard el telcando

Date post:	24-Feb-2016
Category:	Documents
Upload:	cecil
View:	27 times
Download:	0 times

Parallel Data & Sentence Alignment

Documents