Language Divergences and Solutions Advanced Machine Translation Seminar Alison Alvarez.

Post on 02-Apr-2015


Language Divergences and Solutions

Advanced Machine Translation Seminar

Alison Alvarez

Overview

Introduction
Morphology Primer
Translation Mismatches
    Types
    Solutions
Translation Divergences
    Types
    Solutions
Different MT Systems
Generation-Heavy Machine Translation
DUSTer

Source ≠ Target

Languages don’t encode the same information in the same way
Makes MT complicated
Keeps all of us employed

Morphology in a Nutshell

Morphemes are word parts
    Work +er
    Iki +ta +ku +na +ku +na +ri +ma +shi +ta

Types of Morphemes
    Derivational: makes a new word
    Inflectional: adds information to an existing word

Morphology in a Nutshell

Analytic/Isolating
    Little or no inflectional morphology; separate words
    Vietnamese, Chinese
    I was made to go

Synthetic
    Lots of inflectional morphology
    Fusional vs. Agglutinating
    Romance Languages, Finnish, Japanese, Mapudungun
    Ika (to go) +se (to make/let) +rare (passive) +ta (past tense)
    He need +s (3rd person singular) it.
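To make the segmentation above concrete, here is a minimal sketch that splits a "+"-segmented word and looks up each morpheme. The `GLOSSES` table and the `gloss` helper are hypothetical names introduced only for illustration; the entries are copied from the Japanese example on the slide.

```python
# Hypothetical gloss table; entries are copied from the slide example above.
GLOSSES = {
    "ika": "to go",
    "se": "to make/let",
    "rare": "passive",
    "ta": "past tense",
}

def gloss(segmented: str) -> list[tuple[str, str]]:
    """Split a '+'-segmented word and pair each morpheme with its gloss."""
    morphemes = [m.strip() for m in segmented.split("+") if m.strip()]
    # Unknown morphemes are marked with "?" rather than guessed.
    return [(m, GLOSSES.get(m, "?")) for m in morphemes]

for morpheme, meaning in gloss("ika +se +rare +ta"):
    print(f"{morpheme:5} {meaning}")
```

Four morphemes in one Japanese word line up with the five separate English words "I was made to go", which is the analytic/synthetic contrast in miniature.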

Translation Differences

Types

Translation Mismatches
    Different information from source to target

Translation Divergences
    Same information from source to target, but the meaning is distributed differently in each language

Translation Mismatches

“…the information that is conveyed is different in the source and target languages”

Types:
    Lexical level
    Typological level

Lexical Mismatches

A lexical item in one language may have more distinctions than in another

English "brother" corresponds to two Japanese words:
    otouto: younger brother
    兄さん (ani-san): older brother

Typological Mismatches

Mismatch between languages with different levels of grammaticalization

One language may be more structurally complex

Source marking, Obligatory Subject

Typological Mismatches

Source: Quechua vs. English
    (they say) s/he was singing --> takisharansi
    taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative)

Obligatory Arguments: English vs. Japanese
    Kusuri wo Nonda --> (I, you, etc.) took medicine.
    Makasemasu! --> (I’ll) leave (it) to (you)

Translation Mismatch Solutions

More information --> Less information (easy)
Less information --> More information (hard)
    Context clues
    Language Models
    Generalization
    Formal representations
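The easy and hard directions can be illustrated with the "brother" example from the Lexical Mismatches slide. This is only a sketch: the `JA_BROTHER` table and both function names are assumptions made for illustration, and a real system would draw the missing age information from context clues or a language model rather than from an explicit argument.

```python
from typing import Optional

# Toy lexicon built from the slide: Japanese splits "brother" by relative age.
JA_BROTHER = {"younger": "otouto", "older": "ani-san"}

def japanese_to_english(word: str) -> str:
    """More information --> less information: drop the age distinction."""
    if word in JA_BROTHER.values():
        return "brother"
    raise KeyError(word)

def brother_to_japanese(relative_age: Optional[str]) -> list[str]:
    """Less information --> more information: the age must come from somewhere.

    Without it the mismatch cannot be resolved here, so every candidate
    is returned and the choice is left to a later step.
    """
    if relative_age in JA_BROTHER:
        return [JA_BROTHER[relative_age]]
    return sorted(JA_BROTHER.values())

print(japanese_to_english("otouto"))   # one possible answer
print(brother_to_japanese(None))       # ambiguous: both candidates survive
print(brother_to_japanese("older"))    # resolved once age is known
```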

Translation Divergences

“…the same information is conveyed in source and target texts”

Divergences are quite common
    They occur in about one out of every three sentences in the TREC El Norte Newspaper corpus (Spanish-English)

Sentences can have multiple kinds of divergences

Translation Divergence Types

Categorial Divergence
Conflational Divergence
Structural Divergence
Head Swapping Divergence
Thematic Divergence

Categorial Divergence

Translation that uses different parts of speech

Tener hambre (have hunger) --> be hungry

Noun --> adjective

Conflational Divergence

The translation of two words using a single word that combines their meaning
Can also be called a lexical gap
    X stab Z --> X dar puñaladas a Z (X give stabs to Z)
    glastuinbouw --> cultivation under glass

Structural Divergence

A difference in the realization of incorporated arguments

PP to Object
    X entrar en Y (X enter in Y) --> X enter Y
    X ask for a referendum --> X pedir un referendum (ask-for a referendum)

Head Swapping Divergence

Involves the demotion of a head verb and the promotion of a modifier verb to head position

Yo entro en el cuarto corriendo
    [S [NP [N Yo]] [VP [V entro] [PP en el cuarto] [VP corriendo]]]

I ran into the room.
    [S [NP [N I]] [VP [V ran] [PP into the room]]]

Thematic Divergence

This divergence occurs when sentence arguments switch argument roles from one language to another

X gustar a Y (X please to Y) --> Y like X
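The role switch can be sketched as a small mapping over a clause record. The dictionary keys (`verb`, `subject`, `indirect_object`, `object`) and the function name are assumptions chosen for this illustration; they are not taken from any of the systems discussed later.

```python
def gustar_to_like(clause: dict) -> dict:
    """Map 'X gustar a Y' onto 'Y like X' by swapping the argument roles."""
    if clause["verb"] != "gustar":
        raise ValueError("this sketch only covers 'gustar'")
    return {
        "verb": "like",
        "subject": clause["indirect_object"],  # Y becomes the subject
        "object": clause["subject"],           # X becomes the object
    }

spanish = {"verb": "gustar", "subject": "X", "indirect_object": "Y"}
english = gustar_to_like(spanish)
print(f'{english["subject"]} {english["verb"]} {english["object"]}')  # Y like X
```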

Divergence Solutions and Statistical/EBMT Systems

Not really addressed explicitly in SMT
Covered in EBMT only if it is covered extensively in the data

Divergence Solutions and Transfer Systems

Hand-written transfer rules
Automatic extraction of transfer rules from bi-texts
Problematic with multiple divergences

Divergence Solutions and Interlingua Systems

Mel’čuk’s Deep Syntactic Structure
Jackendoff’s Lexical Semantic Structure
Both require “explicit symmetric knowledge” from both source and target language
Expensive

Divergence Solutions and Interlingua Systems

John swam across a river

Juan cruza el río nadando

[event CAUSE JOHN

[event GO JOHN [path ACROSS JOHN [position AT JOHN RIVER]]]

[manner SWIM+INGLY]]
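One way to hold the bracketed structure above in a program is as nested tuples. The sketch below is an assumption about representation only (the tuple layout and the helper names `render` and `primitives` are invented here); it does not reproduce how any interlingua system actually stores its structures.

```python
# Nested (type, primitive, *arguments) tuples mirroring the slide's brackets.
LCS = ("event", "CAUSE", "JOHN",
       ("event", "GO", "JOHN",
        ("path", "ACROSS", "JOHN",
         ("position", "AT", "JOHN", "RIVER"))),
       ("manner", "SWIM+INGLY"))

def render(node) -> str:
    """Turn the nested tuples back into the bracketed notation."""
    if isinstance(node, str):
        return node
    kind, *rest = node
    return "[" + " ".join([kind] + [render(r) for r in rest]) + "]"

def primitives(node) -> list[str]:
    """Collect the primitive of every sub-structure, outermost first."""
    if isinstance(node, str):
        return []
    found = [node[1]]
    for child in node[2:]:
        found.extend(primitives(child))
    return found

print(render(LCS))
print(primitives(LCS))
```

The same structure stands behind both sentences on the slide; the English and Spanish generators differ only in which pieces they pack into the main verb.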

Generation-Heavy MT

Built to address language divergences
Designed for source-poor/target-rich translation
Non-Interlingual
Non-Transfer
Uses symbolic overgeneration to account for different translation divergences

Generation-Heavy MT

Source language
    syntactic parser
    translation lexicon

Target language
    lexical semantics, categorial variations & subcategorization frames for overgeneration
    Statistical language model

GHMT System

Analysis Stage

Independent of Target Language
Creates a deep syntactic dependency
    Only argument structure, top-level conceptual nodes & thematic-role information
Should normalize over syntactic & morphological phenomena

Translation Stage

Converts SL lexemes to TL lexemes
Maintains dependency structure

Analysis/Translation Stage

GIVE (v) [cause go]
    agent: I
    theme: STAB (n)
    goal: JOHN

Generation Stage

Lexical & Structural Selection
    Conversion to a thematic dependency
        Uses syntactic-thematic linking map
        “loose” linking
    Structural expansion
        Addresses conflation & head-swapped divergences
    Turn thematic dependency to TL syntactic dependency
        Addresses categorial divergence
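As a rough illustration of structural expansion, the sketch below overgenerates English realizations for the GIVE / STAB / JOHN dependency from the Analysis/Translation Stage slide. Everything here is a toy assumption: `CATVAR_N_TO_V`, `LIGHT_VERBS`, `PAST`, and `expand` are hypothetical stand-ins, not the resources or interfaces of the actual GHMT system.

```python
# Hypothetical categorial-variation table (noun -> related verb); a stand-in
# for the target-language resource the slides mention, not its real format.
CATVAR_N_TO_V = {"STAB": "stab"}
LIGHT_VERBS = {"GIVE"}
PAST = {"stab": "stabbed", "give": "gave"}  # toy morphology for this example

def expand(dep: dict) -> list[str]:
    """Overgenerate English realizations of a thematic dependency."""
    agent, theme, goal = dep["agent"], dep["theme"], dep["goal"]
    head = dep["head"].lower()
    # Literal realization: keep the source-like structure.
    candidates = [f"{agent} {PAST[head]} a {theme.lower()} to {goal}"]
    # Conflated realization: light verb + noun theme becomes a single verb.
    if dep["head"] in LIGHT_VERBS and theme in CATVAR_N_TO_V:
        verb = CATVAR_N_TO_V[theme]
        candidates.append(f"{agent} {PAST[verb]} {goal}")
    return candidates

dep = {"head": "GIVE", "agent": "I", "theme": "STAB", "goal": "John"}
print(expand(dep))  # ['I gave a stab to John', 'I stabbed John']
```

The point of overgenerating is that no rule has to decide which form is right; both survive and the choice is deferred to a statistical ranking step.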

Generation Stage: Structural Expansion

Generation Stage

Linearization Step
    Creates a word lattice to encode different possible realizations
    Implemented using oxyGen engine
    Sentences ranked & extracted
        Nitrogen’s statistical extractor
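The ranking idea can be sketched with a tiny bigram language model. This is not oxyGen or Nitrogen: the corpus is made up, the candidates are plain strings rather than a word lattice, and add-one smoothing with a per-transition average is simply one convenient scoring choice for the example.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for target-language training text.
CORPUS = [
    "<s> i stabbed john </s>",
    "<s> he stabbed the man </s>",
    "<s> i gave a book to john </s>",
]

unigrams = Counter()
bigrams = Counter()
for line in CORPUS:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
VOCAB = len(unigrams)

def score(sentence: str) -> float:
    """Average add-one-smoothed bigram log-probability per transition."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(
        math.log((bigrams[p] + 1) / (unigrams[p[0]] + VOCAB)) for p in pairs
    )
    return total / len(pairs)

def rank(candidates: list[str]) -> list[str]:
    """Order overgenerated candidates from most to least fluent."""
    return sorted(candidates, key=score, reverse=True)

print(rank(["I gave a stab to John", "I stabbed John"]))
```

With this toy corpus the conflated form "I stabbed John" scores higher than the literal "I gave a stab to John", because its bigrams were all seen in training.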

Generation Stage

GHMT Results

4 of 5 Spanish-English divergences “can be generated using structural expansion & categorial variations”

The remaining 1 out of 5 needed more world knowledge or idiom handling

SL syntactic parser can still be hard to come by

Divergences and DUSTer

Helps to overcome divergences for word alignment & improve coder agreement

Changes an English sentence structure to resemble another language

More accurate alignment and projection of dependency trees without training on dependency tree data

DUSTer

Motivation for the development of automatic correction of divergences

1. “Every Language Pair has translation divergences that are easy to recognize”

2. “Knowing what they are and how to accommodate them provides the basis for refined word level alignment”

3. “Refined word-level” alignment results in improved projection of structural information from English to another language

DUSTer

Bi-text parsed on English side only
“Linguistically Motivated” & common search terms
Conducted on Spanish & Arabic (and later Chinese & Hindi)
Uses all of the divergences mentioned before, plus a “light verb” divergence
    try --> put to trying (poner a prueba)

DUSTer Rule Development Methods

Identify canonical transformations for each divergence type
Categorize English sentences into divergence type or “none”
Apply appropriate transformations
Humans align E, E’, and the foreign language

DUSTer Rules

# "kill" => "LightVB kill(N)" (LightVB = light verb)
# Presumably, this will work for "kill" => "give death to"
# "borrow" => "take lent (thing) to"
# "hurt" => "make harm to"
# "fear" => "have fear of"
# "desire" => "have interest in"
# "rest" => "have repose on"
# "envy" => "have envy of"
type1.B.X [English{2 1 3} Spanish{2 1 3 4 5} ]
[ Verb<1,i,CatVar:V_N> [ Noun<2,j,Subj> ] [ Noun<3,k,Obj> ] ]
<-->
[ LightVB<1,Verb> [ Noun<2,j,Subj> ] [ Noun<3,i,Obj> ] [ Oblique<4,Pred,Prep> [ Noun<5,k,PObj> ] ] ]
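The effect of the light-verb rule above can be approximated on flat subject-verb-object triples. This is a simplified sketch under stated assumptions: the real rule operates on parsed structure with indexed nodes, while `LIGHT_VERB_EXPANSIONS` and `to_e_prime` are hypothetical names, with the expansions copied from the rule's comments.

```python
# Expansions copied from the comments in the rule above.
LIGHT_VERB_EXPANSIONS = {
    "hurt": ("make", "harm", "to"),
    "fear": ("have", "fear", "of"),
    "desire": ("have", "interest", "in"),
    "rest": ("have", "repose", "on"),
    "envy": ("have", "envy", "of"),
}

def to_e_prime(subject: str, verb: str, obj: str) -> list[str]:
    """Rewrite 'Subj Verb Obj' as 'Subj LightVB Noun Prep Obj' when a rule applies."""
    if verb not in LIGHT_VERB_EXPANSIONS:
        return [subject, verb, obj]  # divergence type "none": leave unchanged
    light_verb, noun, prep = LIGHT_VERB_EXPANSIONS[verb]
    return [subject, light_verb, noun, prep, obj]

print(" ".join(to_e_prime("I", "fear", "dogs")))  # I have fear of dogs
print(" ".join(to_e_prime("I", "see", "dogs")))   # I see dogs
```

The rewritten English (E’) has five positions, matching the Spanish{2 1 3 4 5} ordering in the rule, so each English token now has a natural one-to-one alignment partner on the Spanish side.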

DUSTer Results

Conclusion

Divergences are common
They are not handled well by most MT systems
GHMT can account for divergences, but still needs development
DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge

The End

Questions?

References

Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, pp. 597--633, 1994.

Dorr, Bonnie J. and Nizar Habash, "Interlingua Approximation: A Generation-Heavy Approach," In Proceedings of Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1--6, 2002.

Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker, "Concept Based Lexical Selection," Proceedings of the AAAI-94 fall symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21--30, 1994.

Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash, "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31--43, 2002.

Habash, Nizar and Bonnie J. Dorr, "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation," In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84--93, 2002.

Haspelmath, Martin, Understanding Morphology, Oxford University Press, 2002.

Kameyama, Megumi, Ryo Ochitani, and Stanley Peters, "Resolving Translation Mismatches With Information Flow," Annual Meeting of the Association for Computational Linguistics, 1991.