CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 15: Language Divergence) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 8th Feb, 2011
Transcript
Slide 1
CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 15: Language Divergence)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
8th Feb, 2011
Slide 2
Key difference between statistical/ML-based NLP and knowledge-based/linguistics-based NLP
Stat NLP: speed and robustness are the main concerns
KB NLP: phenomena based
  Example: boys, toys, toes. To get the root, remove "s"
  How about foxes, boxes, ladies?
  Understand phenomena: go deeper
  Slower processing
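The contrast on this slide can be sketched in a few lines of Python. The function names and the small rule list are illustrative assumptions, not part of the lecture: `strip_s` is the naive "remove s" rule, and `strip_plural` adds two spelling rules that cover the -es and -ies plurals mentioned above.

```python
def strip_s(word):
    """Naive rule from the slide: drop a final 's'."""
    return word[:-1] if word.endswith("s") else word


def strip_plural(word):
    """Slightly deeper rule set (illustrative and far from complete)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # ladies -> lady
    if word.endswith(("xes", "ses", "zes", "ches", "shes")):
        return word[:-2]                # foxes -> fox, boxes -> box
    return strip_s(word)                # boys -> boy, toes -> toe


# Compare the two rules on the words from the slide.
for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
    print(w, strip_s(w), strip_plural(w))
```

The naive rule is fast but turns "foxes" into "foxe" and "ladies" into "ladie". The second function handles those cases at the cost of more rules, and it still gets words such as "houses" wrong, which is the sense in which going deeper into the phenomena means slower processing.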
Slide 3
Perspective on Statistical MT
What is a good translation?
  Faithful to source
  Fluent in target
[Figure: fluency and faithfulness]
Slide 4
Word-alignment example
English: Ram (1) has (2) an (3) apple (4)
Hindi (word-by-word gloss): Ram (1) of (2) near (3) an (4) apple (5) is (6)
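One way to record such an alignment as data is sketched below. The word positions are the ones numbered on the slide, but the links themselves are an assumption made for illustration (in particular, linking "has" to "of", "near" and "is"); the transcript does not state them.

```python
# Positions are 1-based, as numbered on the slide.
english = ["Ram", "has", "an", "apple"]
hindi_gloss = ["Ram", "of", "near", "an", "apple", "is"]

# Assumed links (English position -> Hindi positions); not given in the transcript.
alignment = {1: [1], 2: [2, 3, 6], 3: [4], 4: [5]}


def aligned_words(alignment, src, tgt):
    """Return (source word, [target words]) pairs for each alignment link."""
    return [
        (src[i - 1], [tgt[j - 1] for j in js])
        for i, js in sorted(alignment.items())
    ]


pairs = aligned_words(alignment, english, hindi_gloss)
for src_word, tgt_words in pairs:
    print(src_word, "->", " ".join(tgt_words))
```

Under these assumed links, a single English word maps to a discontinuous group of Hindi words, which is the kind of one-to-many link a word-alignment model has to allow.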
Slide 5
Kinds of MT Systems (point of entry from source to the target text)
Slide 6
Why is MT difficult? Classical NLP problems
Ambiguity
  Lexical: Went to the bank to withdraw money
  Structural: Saw the boy with a telescope
Ellipsis: I wanted a book and John a pen
Co-reference
  Anaphora: John said he likes music
  Hypernymic: John's house is a robust structure
Slide 7
Why is MT difficult? Language Divergence
  Lexico-Semantic Divergence
  Structural Divergence
Slide 8
Language Divergence (English-Hindi: Noun to Adjective)
E: The demands on sportsmen today can lead to burnout at an early age.
(noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)
Gloss of the Hindi: Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.
Slide 9
Language Divergence (English-Hindi: Noun to Verb)
E: Every concert they gave us was a sell-out.
(an event for which all the tickets have been sold)
Gloss of the Hindi: Their every concert-of all ticket sell-past-passive-plural (were sold out).
Slide 10
Language Divergence (English-Hindi: Adjective to Adverb)
E: The children were watching in wide-eyed amazement.
(with eyes fully open because of fear, great surprise, etc.)
Gloss of the Hindi: Children amazement-with eyes opening widely seeing were.
Slide 11
Language Divergence (English-Hindi: Adjective to Verb)
E: He was in a bad mood at breakfast and wasn't very communicative.
(able and willing to talk and give information to other people)
Gloss of the Hindi: Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).
Slide 12
Language Divergence (English-Hindi: Preposition to Adverb)
E: It gets cooler toward evening.
(near a point in time)
Gloss of the Hindi: Evening happening-happening (reduplication; typical Indian language phenomenon) cold increase-goes (verb compound; polar vector).
Slide 13
Language Divergence (English-Hindi: idiomatic usage)
E: Given her interest in children, teaching seems the right job for her.
(when you consider something)
Gloss of the Hindi: Children-towards her interest having seen, teaching for her appropriate seems.
Slide 14
Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!)
Not only for languages from distant families, but also within close cousins
(simple present) He goes.
(universal truth) The earth revolves round the sun.
Slide 15
Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)
(historical truth) Krushna says to Arjuna ...
(quoting) Damle says, ...
Slide 16
Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)
(immediate past) When did you come? Just now (I came).
(certainty in future) He is in for a thrashing.
(assurance) I will see you tomorrow.
Slide 17
Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
Conflational divergence
  E: stab; H: churaa se maaranaa (knife-with hit)
  S: Utrymningsplan; E: escape plan
Structural divergence
  E: SVO; H: SOV
Categorial divergence
  Change is in POS category (many examples discussed)
Head swapping divergence
  E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
Lexical divergence
  E: advise; H: paraamarsh denaa (advice give): noun incorporation, a very common Indian language phenomenon
Slide 18
Language Divergence Theory: Syntactic Divergences
Constituent order divergence
  E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, (India-of PM, Singh)
Adjunction divergence
  E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-stranding divergence
  E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with)
Null subject divergence
  E: I will go; H: jaauMgaa (subject dropped)
Pleonastic divergence
  E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")
Slide 19
Entropy considerations
Work of Chirag and Venkatesh, ongoing
Slide 20
Language Typology
Slide 21
Slide 22
Parallel Corpora (English, Hindi, Marathi)
English: Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
English: Until the war of 1982, the rainy, windswept Falkland Islands were a forgotten remnant of the old British Empire.
English: Spanish rule was administered from a distance, leaving the various regions to develop separately from the capital, Caracas, which was founded by Diego de Losada in 1567.
Entropy Evaluation
The phrase table gives a probability distribution over the possible translations for each source phrase.
We use the probability of the source phrase itself to get a distribution for the entire phrase table.
Entropy is evaluated as per the standard formula.
  Hindi-Marathi phrase table entropy: 9.671
  Hindi-English phrase table entropy: 9.770
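A minimal sketch of this computation follows, assuming the joint distribution is p(s, t) = p(s) * p(t | s) and the "standard formula" is Shannon entropy, H = -sum p(s, t) log2 p(s, t). The log base is not stated on the slide, so base 2 is an assumption, and the toy phrases and probabilities are invented purely for illustration.

```python
import math


def phrase_table_entropy(source_prob, table):
    """Entropy (in bits) of the joint distribution p(s, t) = p(s) * p(t | s).

    source_prob: dict mapping source phrase -> p(s), summing to 1
    table: dict mapping source phrase -> {target phrase: p(t | s)}
    """
    entropy = 0.0
    for s, translations in table.items():
        for t, p_t_given_s in translations.items():
            p = source_prob[s] * p_t_given_s
            if p > 0:                      # 0 * log 0 is taken as 0
                entropy -= p * math.log2(p)
    return entropy


# Toy numbers, invented for illustration only.
source_prob = {"ghar": 0.5, "paanii": 0.5}
table = {
    "ghar": {"house": 0.5, "home": 0.5},
    "paanii": {"water": 1.0},
}
entropy = phrase_table_entropy(source_prob, table)
print(entropy)
```

On this reading, a lower entropy means the table concentrates its probability mass on fewer translation choices, which fits the slightly lower value reported for Hindi-Marathi than for Hindi-English.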
Slide 25
Handling Divergence through Indicative Translation (Microsoft
Techvista Award, Ananthakrishnan 2007)
Slide 26
Indicative Translation: what and why?
Native-speaker-acceptable translation is not possible, especially considering English-Hindi (Indian languages) divergence
Compromises
  human-aided translation (post-editing)
  narrow domain (weather reports)
  rough translation
Indicative MT
  Goal: understandable rather than perfect output
  Purpose: assimilation rather than dissemination (translation on the web)
Slide 27
Divergence between English and Hindi
Divergence: differences in lexical and syntactic choices that languages make in expressing ideas
MaTra:
  Structural transfer: SVO to SOV; post-modifiers to pre-modifiers
  Lexical transfer: WSD + lexicon lookup; inflections; case-markers
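The two structural-transfer steps can be illustrated with a toy sketch. This is not MaTra's implementation: the function names are hypothetical, and the clause is split into subject, verb and object by hand, where a real system would obtain these chunks from a parser.

```python
def svo_to_sov(subject, verb, obj):
    """Reorder a pre-chunked English clause into Hindi-like SOV order."""
    return subject + obj + verb


def premodify(head, postmodifier):
    """Move a post-modifier (e.g. 'of India') in front of its head."""
    return postmodifier + head


# Chunks are supplied by hand for illustration.
clause = svo_to_sov(["Ram"], ["eats"], ["an", "apple"])
phrase = premodify(["Prime", "Minister"], ["India", "of"])
print(" ".join(clause))
print(" ".join(phrase))
```

The second call mirrors the "Prime Minister of India" example from the head-swapping slide, where the Hindi order is glossed as "India-of Prime Minister". Lexical transfer (word choice, inflections, case-markers) would then apply to the reordered words.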
Slide 28
Divergence between Natural and Indicative Hindi: some examples
E: We eat the rotten canteen food every night.
H:
I:
E: The batsman who had been scoring heavily against them has to be removed early.
H:
I:
Slide 29
Categorial divergence
E: I am feeling hungry
H:
I:
n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3
Slide 30
Relation between words in noun-noun compounds
E: The ten best Aamir Khan performances
H:
I:
n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2
Slide 31
Lexical divergence
E: Food, clothing and shelter are a man's basic needs.
H:
I:
n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7
Slide 32
Pleonastic divergence
E: It is raining
H:
I:
n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2
E: There was a great king
H:
I:
Slide 33
Stylistic differences
E: The Lok Sabha has 545 members.
H:
I:
n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4
Other differences: word order, sentence length
Slide 34
Transliteration and WSD errors
E: I purchased a bat.
H:
I:
n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1
Slide 35
Divergence/problem      Average BLEU precision   Translation acceptable?
Categorial              0                        Yes
Noun-noun compounds     0.38                     Yes
Lexical                 0.6                      Yes
Transliteration         0.27                     Yes
Pleonastic              0.68                     No
Stylistic               0.35                     No
WSD error               0.27                     No
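The averages in this table appear to be the plain arithmetic mean of the four n-gram precisions listed on the preceding slides. That is an inference from the numbers, not something the slides state, and it differs from the full BLEU score, which uses a geometric mean and a brevity penalty. The sketch below recomputes the averages from the match counts on the slides; the transliteration and WSD rows share the counts from the "Transliteration and WSD errors" slide.

```python
from fractions import Fraction

# (matches, total) for unigrams up to 4-grams, copied from the slides.
ngram_matches = {
    "Categorial": [(0, 6), (0, 5), (0, 4), (0, 3)],
    "Noun-noun compounds": [(5, 5), (2, 4), (0, 3), (0, 2)],
    "Lexical": [(8, 10), (6, 9), (4, 8), (3, 7)],
    "Pleonastic": [(4, 5), (3, 4), (2, 3), (1, 2)],
    "Stylistic": [(5, 7), (3, 6), (1, 5), (0, 4)],
    "Transliteration / WSD error": [(3, 4), (1, 3), (0, 2), (0, 1)],
}


def average_precision(counts):
    """Arithmetic mean of the n-gram precisions (n = 1..4)."""
    return float(sum(Fraction(m, t) for m, t in counts) / len(counts))


averages = {name: average_precision(c) for name, c in ngram_matches.items()}
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

The point of the table is then easy to see: the categorial example scores 0 yet is an acceptable translation, while the pleonastic example scores highest and is not acceptable, so n-gram overlap alone does not track acceptability for these divergences.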
Slide 36
Advantages of a hybrid rule-based + SMT system
What SMT brings to the table
  If data is available, then no need for linguistic resources
  Quick adaptation to new domains (tourism, health) and new language pairs (English-Gujarati/Marathi)
  See improvements by adding data
What rule-based systems bring to the table
  Capture a small set of systematic differences well: SVO to SOV (do we need to learn this?)
  Better handle on correcting specific cases
Slide 37
Preprocessing rules + SMT for English-Indian language MT
Lack of linguistic resources for Indian languages
Lots of resources available for English
Morphology is rich for Indian languages
Wider systematic syntactic differences between English and Indian languages
Slide 38
Placed within the Vauquois Triangle
Slide 39
Previous work on factored MT
Slide 40
Previous work
Niessen and Ney {ney:04} show that the use of morpho-syntactic information drastically reduces the need for bilingual training data
Popovic and Ney {ney:06} report the use of morphological and syntactic restructuring information for Spanish-English and Serbian-English translation
Slide 41
Previous work (contd.)
Koehn and Hoang {koehn:07} propose factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model
Experiments in translating from English to German, Spanish, and Czech, including the use of morphological factors
Slide 43
Previous work (contd.)
Avramidis and Koehn {koehn:08} report work on translating from poor to rich morphology, namely, English to Greek and Czech translation
Factored models with case and verb conjugation related factors determined by heuristics on parse trees
Used only on the source side, and not on the target side
Slide 44
Previous work (contd.)
Melamed {melamed:04} proposes methods based on tree-to-tree mappings
Imamura et al. {imamura:05} present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation
Slide 45
Previous work (contd.)
Target language does not have parsing/clause-detecting tools
Niessen and Ney {ney:04}: reorder the source language data prior to the SMT training and decoding cycles (German-English SMT)
Popovic and Ney {ney:06}: simple local transformation rules for Spanish-English and Serbian-English translation
Collins et al. {collins:05}: German clause restructuring to improve German-English SMT
Wang et al. {wang:07}: similar work for Chinese-English SMT
Ananthakrishnan and Bhattacharyya {anand:08}: syntactic reordering and morphological suffix separation for English-Hindi SMT