Page 1: Eliciting a corpus of word-aligned phrases for MT

Eliciting a corpus of word-aligned phrases for MT

Lori Levin, Alon Lavie, Erik Peterson

Language Technologies Institute

Carnegie Mellon University

Page 2: Eliciting a corpus of word-aligned phrases for MT

Introduction

• Problem: Building Machine Translation systems for languages with scarce resources:
  – Not enough data for Statistical MT and Example-Based MT
  – Not enough human linguistic expertise for writing rules

• Approach:
  – Elicit high-quality, word-aligned data from bilingual speakers
  – Learn transfer rules from the elicited data

Page 3: Eliciting a corpus of word-aligned phrases for MT

Modules of the AVENUE/MilliRADD rule learning system and MT system

(Diagram components:)

• Learning Module
• Transfer Rules, e.g.:
  {PP,4894}
  ;;Score:0.0470
  PP::PP [NP POSTP] -> [PREP NP]
  ((X2::Y1) (X1::Y2))
• Translation Lexicon
• Run Time Transfer System
• Lattice Decoder
• English Language Model
• Word-to-Word Translation Probabilities
• Word-aligned elicited data

Page 4: Eliciting a corpus of word-aligned phrases for MT

Outline

• Demo of elicitation interface

• Description of elicitation corpus

• Overview of automated rule learning

Page 5: Eliciting a corpus of word-aligned phrases for MT

Demo of Elicitation Tool

• Speaker needs to be bilingual and literate: no other knowledge necessary

• Mappings between words and phrases: many-to-many, one-to-none, many-to-none, etc.

• Create phrasal mappings

• Fonts and character sets:
  – Including Hindi, Chinese, and Arabic

• Add morpheme boundaries to target language

• Add alternate translations

• Notes and context
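The flexible mappings the tool supports (many-to-many, one-to-none, many-to-none) can be pictured with a small sketch. This is an illustrative data structure, not the actual AVENUE tool's representation; all names are hypothetical.

```python
# Sketch: one elicited, word-aligned phrase pair.  Alignment links are
# (source_index, target_index) pairs, so many-to-many mappings are just
# multiple links sharing an index, and an unlinked word ("one-to-none")
# simply appears in no link.
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    source: list                       # source-language tokens
    target: list                       # target-language tokens
    links: set = field(default_factory=set)

    def targets_of(self, i):
        """Target indices linked to source word i (empty if aligned to none)."""
        return sorted(t for s, t in self.links if s == i)


# English "the big apple" / Hebrew "ha-tapuax ha-gadol": the English
# article links to both Hebrew definite-marked words (one-to-many).
pair = AlignedPair(
    source=["the", "big", "apple"],
    target=["ha-tapuax", "ha-gadol"],
    links={(0, 0), (0, 1), (2, 0), (1, 1)},
)
print(pair.targets_of(0))  # "the" links to both target words: [0, 1]
```

Storing links as index pairs rather than a one-to-one map is what lets a single representation cover all the mapping types listed above.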

Page 6: Eliciting a corpus of word-aligned phrases for MT

English-Chinese Example

Page 7: Eliciting a corpus of word-aligned phrases for MT

English-Hindi Example

Page 8: Eliciting a corpus of word-aligned phrases for MT

Spanish-Mapudungun Example

Page 9: Eliciting a corpus of word-aligned phrases for MT

English-Arabic Example

Page 10: Eliciting a corpus of word-aligned phrases for MT

Testing of Elicitation Tool

• DARPA Hindi Surprise Language Exercise

• Around 10 Hindi speakers

• Around 17,000 phrases translated and aligned
  – Elicitation corpus
  – NPs and PPs from the treebanked Brown Corpus

Page 11: Eliciting a corpus of word-aligned phrases for MT

Elicitation Corpus: Basic Principles

• Minimal pairs

• Syntactic compositionality

• Special semantic/pragmatic constructions

• Navigation based on language typology and universals

• Challenges

Page 12: Eliciting a corpus of word-aligned phrases for MT

Elicitation Corpus: Minimal Pairs

• Eng: I fell.
  Sp: Caí
  M: Tranün

• Eng: You (John) fell.
  Sp: Tú (Juan) caíste
  M: Eymi tranimi (Kuan)

• Eng: You (Mary) fell.
  Sp: Tú (María) caíste
  M: Eymi tranimi (Maria)

• Eng: I am falling.
  Sp: Estoy cayendo
  M: Tranmeken

• Eng: You (John) are falling.
  Sp: Tú (Juan) estás cayendo
  M: Eimi (Kuan) tranmekeymi

Mapudungun: spoken by around one million people in Chile and Argentina.

Page 13: Eliciting a corpus of word-aligned phrases for MT

Using feature vectors to detect minimal pairs

• np1: (subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
  cl1: (subj np1).intr-ag.past.complete
  – Eng: You (John) fell.  Sp: Tú (Juan) caíste  M: Eymi tranimi (Kuan)

• np1: (subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
  cl1: (subj np1).intr-ag.past.complete
  – Eng: You (Mary) fell.  Sp: Tú (María) caíste  M: Eymi tranimi (Maria)

Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
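The minimal-pair idea above can be made concrete: two sentences form a minimal pair when their feature vectors differ in exactly one feature. A small sketch, with the helper names invented for illustration (the feature strings follow the slide):

```python
# Sketch: detect minimal pairs by comparing dotted feature strings.
def features(vec):
    """Split a dotted feature string into a set of atomic features."""
    return set(vec.split("."))


def is_minimal_pair(vec_a, vec_b):
    """True iff the two vectors differ in exactly one feature.

    Swapping one feature for another (e.g. masc -> fem) leaves two
    elements in the symmetric difference of the feature sets.
    """
    return len(features(vec_a) ^ features(vec_b)) == 2


john = "pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien"
mary = "pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien"
print(is_minimal_pair(john, mary))  # differ only in masc/fem -> True
```

The same check run over a whole paradigm would flag every sentence pair that isolates a single grammatical contrast for the bilingual informant.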

Page 14: Eliciting a corpus of word-aligned phrases for MT

Syntactic Compositionality

– The tree
– The tree fell.
– I think that the tree fell.

• We learn rules for smaller phrases
  – E.g., NP

• Their root nodes become non-terminals in the rules for larger phrases.
  – E.g., S containing an NP

• Meaning of a phrase is predictable from the meanings of the parts.

Page 15: Eliciting a corpus of word-aligned phrases for MT

Special Semantic and Pragmatic Constructions

• Meaning may not be compositional
  – Not predictable from the meanings of the parts

• May not follow the normal rules of grammar
  – Suggestion: Why not go?

• Word-for-word translation may not work

• Tend to be sources of MT mismatches
  – Comparative:
    • English: Hotel A is [closer than Hotel B]
    • Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
                Hotel A TOP Hotel B than close is
    • “Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.

Page 16: Eliciting a corpus of word-aligned phrases for MT

Examples of Semantic/Pragmatic Categories

• Speech Acts: requests, suggestions, etc.

• Comparatives and Equatives

• Modality: possibility, probability, ability, obligation, uncertainty, evidentiality

• Correlatives: (the more the merrier)

• Causatives

• Etc.

Page 17: Eliciting a corpus of word-aligned phrases for MT

A Challenge: Combinatorics

– Person (1, 2, 3, 4)
– Number (sg, pl, du, paucal)
– Gender/Noun Class (?)
– Animacy (animate/inanimate)
– Definiteness (definite/indefinite)
– Proximity (near, far, very far, etc.)
– Inclusion/exclusion

• Multiply with tenses and aspects (complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.)

• Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.

• (Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not the object outranks the subject in person or animacy.)

Page 18: Eliciting a corpus of word-aligned phrases for MT

Solutions to Combinatorics

• Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.

• Use known universals to eliminate features: e.g., Languages without plurals don’t have duals.
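Both solutions fit naturally together: enumerate the cross-product of feature values, then prune it with implicational universals before generating sentences. A minimal sketch, with an illustrative (not the project's actual) feature inventory:

```python
# Sketch: generate a paradigm of feature vectors as a cross-product,
# then prune with a known universal: a language without plurals
# cannot have duals either.
from itertools import product

PERSON = ["1", "2", "3"]
NUMBER = ["sg", "pl", "du"]
TENSE = ["past", "present", "future"]


def allowed(person, number, tense, has_plural=False):
    """Filter feature vectors using implicational universals."""
    # no plural in the language => drop plural AND dual vectors
    if not has_plural and number in ("pl", "du"):
        return False
    return True


# For a hypothetical language known to lack a plural:
paradigm = [fv for fv in product(PERSON, NUMBER, TENSE) if allowed(*fv)]
print(len(paradigm))  # 3 persons x 1 number x 3 tenses = 9 (down from 27)
```

Each surviving feature vector would then drive automatic generation of one elicitation sentence, keeping the corpus size proportional to what the language actually distinguishes.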

Page 19: Eliciting a corpus of word-aligned phrases for MT

Other Challenges of Computer-Based Elicitation

• Inconsistency of human translation and alignment

• Bias toward word order of the elicitation language
  – Need to provide discourse context for given and new information

• How to elicit things that aren’t grammaticalized in the elicitation language:
  – Evidential: I see that it is raining / Apparently it is raining / It must be raining.
  – Context: You are inside the house. Your friend comes in wet.

Page 20: Eliciting a corpus of word-aligned phrases for MT

Transfer Rule Formalism

Type information

Part-of-speech/constituent information

Alignments

x-side constraints

y-side constraints

xy-constraints,

e.g. ((Y1 AGR) = (X1 AGR))

;SL: the man, TL: der Mann

NP::NP [DET N] -> [DET N]
((X1::Y1)
 (X2::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X2 AGR) = *3-SING)
 ((X2 COUNT) = +)
 ((Y1 AGR) = *3-SING)
 ((Y1 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y1 GENDER)))
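The x-side constraints in this formalism can be read as equations over feature structures: the rule fires only if every equation holds for the source constituents. A toy checker, with names invented for illustration (this is not the AVENUE implementation):

```python
# Sketch: evaluate the x-side constraints of an NP rule against the
# feature structures of a matched source phrase ("the man").
def satisfies(fs, constraints):
    """fs: {'X1': {'AGR': '3-SING', ...}, ...}
    constraints: list of ((variable, feature), required_value) pairs."""
    for (var, feat), value in constraints:
        if fs.get(var, {}).get(feat) != value:
            return False
    return True


# Feature structures for DET="the", N="man" (illustrative values):
x_side = {
    "X1": {"AGR": "3-SING", "DEF": "DEF"},
    "X2": {"AGR": "3-SING", "COUNT": "+"},
}

# The rule's x-side constraints, transcribed from the slide:
rule_x_constraints = [
    (("X1", "AGR"), "3-SING"),
    (("X1", "DEF"), "DEF"),
    (("X2", "AGR"), "3-SING"),
    (("X2", "COUNT"), "+"),
]

print(satisfies(x_side, rule_x_constraints))  # "the man" passes -> True
```

The y-side and xy-constraints work the same way, except that equations like `(Y2 GENDER) = (Y1 GENDER)` equate two paths instead of fixing a value, which is what propagates German gender agreement into the output.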

Page 21: Eliciting a corpus of word-aligned phrases for MT

Rule Learning - Overview

• Goal: Acquire Syntactic Transfer Rules

• Use available knowledge from the source side (grammatical structure)

• Three steps:

1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure

2. Compositionality: use previously learned rules to add hierarchical structure

3. Seeded Version Space Learning: refine rules by learning appropriate feature constraints

Page 22: Eliciting a corpus of word-aligned phrases for MT

Flat Seed Rule Generation

Learning Example: NP

Eng: the big apple

Heb: ha-tapuax ha-gadol

Generated Seed Rule:

NP::NP [ART ADJ N] -> [ART N ART ADJ]

((X1::Y1)

(X1::Y3)

(X2::Y4)

(X3::Y2))
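The seed rule above falls out mechanically from the elicited word alignments: each aligned index pair becomes an `Xi::Yj` link under a flat constituent sequence. A sketch of that step (illustrative code, not the project's learner):

```python
# Sketch: flat seed generation from POS-tagged source/target strings
# plus elicited word alignments (0-based (source, target) index pairs).
def flat_seed(label, src_pos, tgt_pos, word_links):
    """Emit a flat transfer rule in the slide's notation."""
    lhs = f"{label}::{label} [{' '.join(src_pos)}] -> [{' '.join(tgt_pos)}]"
    # each word link (s, t) becomes an alignment (X{s+1}::Y{t+1})
    aligns = "".join(f"(X{s + 1}::Y{t + 1})" for s, t in sorted(word_links))
    return f"{lhs}\n({aligns})"


# Eng "the big apple" / Heb "ha- tapuax ha- gadol", with "the" linked
# to both Hebrew definite articles:
rule = flat_seed(
    "NP",
    ["ART", "ADJ", "N"],
    ["ART", "N", "ART", "ADJ"],
    {(0, 0), (0, 2), (1, 3), (2, 1)},
)
print(rule)
# NP::NP [ART ADJ N] -> [ART N ART ADJ]
# ((X1::Y1)(X1::Y3)(X2::Y4)(X3::Y2))
```

Note the rule is deliberately flat and maximally specific at this stage; the next two learning steps add hierarchy and generalize it.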

Page 23: Eliciting a corpus of word-aligned phrases for MT

Compositionality

Initial Flat Rules:

S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))

Generated Compositional Rule:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
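The core of the compositionality step is span substitution: wherever a previously learned NP rule's source side matches a sub-span of the flat S rule, that span collapses to the non-terminal NP. A minimal sketch of just that substitution (illustrative, not the actual learner, which must also remap the alignments):

```python
# Sketch: collapse matching POS sub-sequences to a learned non-terminal.
def compose(flat_seq, sub_seq, label):
    """Replace every occurrence of sub_seq in flat_seq with label."""
    out, i = [], 0
    while i < len(flat_seq):
        if flat_seq[i:i + len(sub_seq)] == sub_seq:
            out.append(label)          # whole span becomes one non-terminal
            i += len(sub_seq)
        else:
            out.append(flat_seq[i])
            i += 1
    return out


s_src = ["ART", "ADJ", "N", "V", "ART", "N"]
s_src = compose(s_src, ["ART", "ADJ", "N"], "NP")  # first learned NP rule
s_src = compose(s_src, ["ART", "N"], "NP")         # second learned NP rule
print(s_src)  # ['NP', 'V', 'NP']
```

Applying the same substitution to the target side, then carrying the sub-rules' alignments along, yields the compositional `S::S [NP V NP] -> [NP V P NP]` rule shown above.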

Page 24: Eliciting a corpus of word-aligned phrases for MT

Version Space Learning

Input: Rules and their Example Sets

S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))

NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))

Output: Rules with Feature Constraints:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
 ((X1 NUM) = (X2 NUM))
 ((Y1 NUM) = (Y2 NUM))
 ((X1 NUM) = (Y1 NUM)))
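The intuition behind the learned constraints can be sketched simply: a candidate equation such as `(X1 NUM) = (X2 NUM)` is kept only if it holds in every example attached to the rule. This toy check is illustrative only; the real system searches a seeded version space over such constraints rather than testing them one by one.

```python
# Sketch: keep an agreement constraint only if it holds in all examples.
def holds_everywhere(examples, path_a, path_b):
    """True iff feature path_a equals path_b in every example."""
    return all(ex[path_a] == ex[path_b] for ex in examples)


# Feature values observed in two (hypothetical) examples of the S rule:
examples = [
    {("X1", "NUM"): "sg", ("X2", "NUM"): "sg", ("Y1", "NUM"): "sg"},
    {("X1", "NUM"): "pl", ("X2", "NUM"): "pl", ("Y1", "NUM"): "pl"},
]

print(holds_everywhere(examples, ("X1", "NUM"), ("X2", "NUM")))  # True
print(holds_everywhere(examples, ("X1", "NUM"), ("Y1", "NUM")))  # True
```

Constraints surviving this test for subject-verb and source-target number agreement are exactly the three added to the S rule above.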

Page 25: Eliciting a corpus of word-aligned phrases for MT

Examples of Learned Rules

{NP,14244}

;;Score:0.0429

NP::NP [N] -> [DET N]

(

(X1::Y2)

)

{NP,14434}

;;Score:0.0040

NP::NP [ADJ CONJ ADJ N] ->

[ADJ CONJ ADJ N]

(

(X1::Y1) (X2::Y2)

(X3::Y3) (X4::Y4)

)

{PP,4894}

;;Score:0.0470

PP::PP [NP POSTP] -> [PREP NP]

(

(X2::Y1) (X1::Y2)

)

Page 26: Eliciting a corpus of word-aligned phrases for MT

Manual Transfer Rules: Example

;; PASSIVE OF SIMPLE PAST (NO AUX) WITH LIGHT VERB
;; passive of 43 (7b)
{VP,28}
VP::VP : [V V V] -> [Aux V]
( (X1::Y2)
  ((x1 form) = root)
  ((x2 type) =c light)
  ((x2 form) = part)
  ((x2 aspect) = perf)
  ((x3 lexwx) = 'jAnA')
  ((x3 form) = part)
  ((x3 aspect) = perf)
  (x0 = x1)
  ((y1 lex) = be)
  ((y1 tense) = past)
  ((y1 agr num) = (x3 agr num))
  ((y1 agr pers) = (x3 agr pers))
  ((y2 form) = part))

Page 27: Eliciting a corpus of word-aligned phrases for MT

Manual Transfer Rules: Example

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
;     ==> a chapter of life
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
( (X1::Y2)
  (X2::Y1)
  ; ((x2 lexwx) = 'kA')
)

{NP,13}
NP::NP : [NP1] -> [NP1]
( (X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
( (X1::Y2)
  (X2::Y1))

Hindi source tree:
[NP [PP [NP [N1 [N jIvana]]] [P ke]] [NP1 [Adj eka] [N aXyAya]]]

English target tree:
[NP [NP1 [Adj one] [N chapter]] [PP [P of] [NP [N1 [N life]]]]]

