Date post: 22-Dec-2015

The Web as a Parallel Corpus

- Parallel corpora are useful
  - Training data for statistical MT
  - Lexical correspondences for cross-lingual IR
- Early work: Hansards
  - Canadian parliamentary proceedings
  - French/English only
- Still, most resources are in formal newspaper style only

Harvesting parallel text from the web

- STRAND: use similar structure to find likely translations
- Using similar content to find translations
- Applying methods to the Internet Archive, dramatically increasing quantity

STRAND

Structural Translation Recognition Acquiring Natural Data

Architecture
- Location of possible translations
- Generation of candidate translations
- Filtering of candidates based on structure

Search for language names in anchor text (anchor: “English” OR anchor: “French”)
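
The slide shows a search-engine query over anchor text. As a rough local equivalent, the sketch below scans one page for links whose anchor text names a language. The class and function names are hypothetical, and the substring test is an assumption about what counts as a match.

```python
from html.parser import HTMLParser


class LanguageAnchorFinder(HTMLParser):
    """Collect (href, text) for <a> elements whose text names a language."""

    def __init__(self, language_names):
        super().__init__()
        self.language_names = [name.lower() for name in language_names]
        self.matches = []   # (href, anchor text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Assumption: a case-insensitive substring test is enough.
            if any(name in text.lower() for name in self.language_names):
                self.matches.append((self._href, text))
            self._href = None


def language_anchors(html, language_names=("English", "French")):
    """Return links on the page that may point at another language version."""
    finder = LanguageAnchorFinder(language_names)
    finder.feed(html)
    finder.close()
    return finder.matches


page = ('<p><a href="index_fr.html">French</a> '
        '<a href="about.html">About</a> '
        '<a href="index_en.html">English version</a></p>')
print(language_anchors(page))
```

A page that links to both an "English" and a "French" anchor is a likely parent of a translation pair; a page with only one such link may itself be one side of a pair.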

Structural Filtering

- Linearize HTML and discard content
- Run through transducer to produce:
  - [START element-label]
  - [END element-label]
  - [CHUNK length]
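
A minimal sketch of the linearization step, using Python's standard `html.parser`. The token spelling follows the slide. Measuring chunk length in characters after trimming whitespace is an assumption, as are the names `Linearizer` and `linearize`.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Turn an HTML page into a flat list of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START {tag}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END {tag}]")

    def handle_data(self, data):
        # Keep only the length of the text; the content itself is discarded.
        length = len(data.strip())
        if length:
            self.tokens.append(f"[CHUNK {length}]")


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(linearize("<body><h1>Hello</h1><p>Bonjour le monde</p></body>"))
```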

Align sequences using dynamic programming
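
The slide does not say which dynamic program is used. The sketch below assumes a longest-common-subsequence alignment over the linearized tokens, where markup tokens must be identical and any two CHUNK tokens may pair up regardless of length. The function names are hypothetical.

```python
def tokens_match(a, b):
    """Markup tokens must be identical; any two CHUNK tokens may align."""
    if a.startswith("[CHUNK") and b.startswith("[CHUNK"):
        return True
    return a == b


def align(seq_a, seq_b):
    """Return aligned (a, b) pairs; None marks a token with no partner."""
    n, m = len(seq_a), len(seq_b)
    # lcs[i][j] = most matches achievable between seq_a[i:] and seq_b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tokens_match(seq_a[i], seq_b[j]):
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if tokens_match(seq_a[i], seq_b[j]):
            pairs.append((seq_a[i], seq_b[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((seq_a[i], None))
            i += 1
        else:
            pairs.append((None, seq_b[j]))
            j += 1
    pairs.extend((tok, None) for tok in seq_a[i:])
    pairs.extend((None, tok) for tok in seq_b[j:])
    return pairs


english = ["[START p]", "[CHUNK 5]", "[END p]"]
french = ["[START div]", "[START p]", "[CHUNK 7]", "[END p]", "[END div]"]
for pair in align(english, french):
    print(pair)
```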

Scalar values

- dp: difference in the number of structural items that have no match
- n: number of aligned non-markup chunks of different lengths
- r: correlation of chunk lengths
- p: significance level of the correlation
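
One way these values might be computed from an alignment, given as a list of (token, token) pairs in which None marks a token with no partner. The sketch reads dp as the percentage of alignment entries left unmatched, which is an interpretation rather than something the slide states. p is omitted because it needs a t-distribution routine outside the standard library. All function names are hypothetical.

```python
import math


def chunk_length(token):
    """Extract 12 from '[CHUNK 12]'."""
    return int(token[len("[CHUNK "):-1])


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    if len(xs) < 2:
        return None
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def structural_scores(pairs):
    """Return (dp, n, r) for a list of aligned token pairs."""
    unmatched = sum(1 for a, b in pairs if a is None or b is None)
    # Assumption: dp is the share of unmatched entries, as a percentage.
    dp = 100.0 * unmatched / len(pairs) if pairs else 0.0

    lengths = [
        (chunk_length(a), chunk_length(b))
        for a, b in pairs
        if a is not None and b is not None
        and a.startswith("[CHUNK") and b.startswith("[CHUNK")
    ]
    n = sum(1 for x, y in lengths if x != y)
    r = pearson([x for x, _ in lengths], [y for _, y in lengths])
    return dp, n, r


example = [
    ("[START p]", "[START p]"),
    ("[CHUNK 10]", "[CHUNK 12]"),
    ("[END p]", "[END p]"),
    ("[CHUNK 20]", "[CHUNK 24]"),
    ("[CHUNK 5]", "[CHUNK 5]"),
    ("[START b]", None),
]
print(structural_scores(example))
```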

Evaluation

- Human judgments on 326 English-French paired pages
- Using manually set thresholds on dp and n
  - 100% precision
  - 68.6% recall
- Similar results on English/Chinese and English/Spanish
- Typically throws out 1/3 of the data
- Using machine learning: recall 84%, precision 96%
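
To make the thresholding idea concrete, here is a small sketch. The threshold values and the scored pairs are invented for illustration and are not the ones used in the evaluation. It accepts a page pair when both dp and n are small, then scores the decisions against human labels.

```python
def is_translation_pair(dp, n, max_dp=20.0, max_n=5):
    """Accept a candidate pair when both scores fall under the thresholds.

    max_dp and max_n are hypothetical values, not those from the evaluation.
    """
    return dp <= max_dp and n <= max_n


def precision_recall(predicted, gold):
    """Precision and recall of boolean predictions against boolean labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: (dp, n, human judgment). Invented for illustration only.
scored = [(5.0, 1, True), (12.0, 3, True), (35.0, 2, True), (60.0, 9, False)]
predicted = [is_translation_pair(dp, n) for dp, n, _ in scored]
gold = [label for _, _, label in scored]
print(precision_recall(predicted, gold))
```

In the toy data the strict thresholds reject one true translation pair, so precision stays perfect while recall drops. This is the same trade-off the manual thresholds show above.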

Drawbacks of structural matching

- Not all translations have similar structures
- Not all texts use HTML markup

Content-based matching

- Seed: bilingual lexicon
- Link: a pair (x, y) where x is in L1 and y is in L2
- Probability that x is a translation of y is given by the bilingual lexicon
- Want the most probable link sequence that could account for a pair of texts
  - Product of the probabilities of the links
- Best set of links found using Maximum Weighted Bipartite Matching
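
A toy illustration of choosing the link set with the highest product of probabilities. It brute-forces every assignment with `itertools.permutations`, which is only workable for a handful of words; a real system would use a proper weighted matching algorithm. The lexicon format, the `floor` probability for pairs missing from the lexicon, and the rule that every word of the shorter text gets a link are all assumptions.

```python
import math
from itertools import permutations


def best_links(words_a, words_b, lexicon, floor=1e-6):
    """Return (links, log_probability) for the most probable link set.

    lexicon maps (word_a, word_b) to a translation probability.
    """
    if len(words_a) > len(words_b):
        # Work from the shorter side, then flip the pairs back.
        flipped = {(y, x): p for (x, y), p in lexicon.items()}
        links, score = best_links(words_b, words_a, flipped, floor)
        return [(x, y) for y, x in links], score

    best, best_score = [], -math.inf
    for targets in permutations(words_b, len(words_a)):
        links = list(zip(words_a, targets))
        # Sum of logs is the log of the product of link probabilities.
        score = sum(math.log(lexicon.get(link, floor)) for link in links)
        if score > best_score:
            best, best_score = links, score
    return best, best_score


lexicon = {
    ("house", "maison"): 0.9,
    ("house", "blanche"): 0.01,
    ("white", "blanche"): 0.8,
    ("white", "maison"): 0.05,
}
links, log_p = best_links(["white", "house"], ["maison", "blanche"], lexicon)
print(links, math.exp(log_p))
```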

Cross-language similarity score: tsim

Computed on first 500 words of a document for efficiency
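
The slides do not give the formula for tsim. The sketch below assumes it is the fraction of links in a best matching that join two words, with every unmatched word counted as a link to nothing. It also assumes all lexicon links are equally likely, so the best matching is simply a maximum-cardinality one, found here with augmenting paths. The function name and the set-based lexicon format are hypothetical.

```python
def tsim(words_a, words_b, lexicon, limit=500):
    """Share of links in a best matching that join two words.

    lexicon is a set of (word_a, word_b) translation pairs.
    Only the first `limit` words of each text are used.
    """
    a, b = words_a[:limit], words_b[:limit]
    # candidates[i] = positions in b that the lexicon allows for a[i]
    candidates = [[j for j, y in enumerate(b) if (x, y) in lexicon] for x in a]
    owner = [None] * len(b)  # owner[j] = position in a linked to b[j]

    def try_link(i, seen):
        """Try to link a[i], re-routing earlier links if needed."""
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if owner[j] is None or try_link(owner[j], seen):
                owner[j] = i
                return True
        return False

    two_word = sum(1 for i in range(len(a)) if try_link(i, set()))
    total = len(a) + len(b) - two_word  # unmatched words link to nothing
    return two_word / total if total else 0.0


lexicon = {("the", "le"), ("the", "la"), ("cat", "chat"), ("sleeps", "dort")}
french = "le chat dort".split()
print(tsim("the cat sleeps".split(), french, lexicon))
print(tsim("the dog barks".split(), french, lexicon))
```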

Experiment

- Dictionary
  - English/French dictionary: 34,808 entries
  - Dictionary of English/French cognates: 35,513 pairs
  - Additional web pairs: 11,264 from Bible
  - Final lexicon: 132,155 pairs
- Trained threshold for tsim on 32 pairs from STRAND test set
- STRAND (manual): F-measure of .81
- tsim: F-measure of .88
- Combined model: F-measure of .977
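
For reference, a short sketch of how an F-measure follows from precision and recall. It assumes the balanced form (harmonic mean); the slides do not say which weighting was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the manually set thresholds.
print(round(f_measure(1.0, 0.686), 2))
```

Under that assumption, 100% precision with 68.6% recall works out to about .81. This agrees with the figure listed for STRAND (manual), though the slides do not state that both numbers come from the same run.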

