CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence)...

Date post: 23-Dec-2015
Upload: alan-melton
CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence). Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 8th Feb, 2011
Transcript
  • Slide 1
  • CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence). Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 8th Feb, 2011
  • Slide 2
  • Key difference between statistical/ML-based NLP and knowledge-based/linguistics-based NLP. Stat NLP: speed and robustness are the main concerns. KB NLP: phenomena-based. Example: boys, toys, toes; to get the root, remove "s". How about foxes, boxes, ladies? Understand the phenomena: go deeper, at the cost of slower processing.
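The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `naive_root` and `rule_based_root` and the particular suffix rules are assumptions made for this sketch, not a full morphological analyser.

```python
def naive_root(word):
    """Shallow shortcut from the slide: to get the root, remove a final 's'."""
    return word[:-1] if word.endswith("s") else word


def rule_based_root(word):
    """Phenomenon-aware version: check the more specific plural patterns first.

    The rule set is deliberately tiny and hypothetical; it only covers the
    slide's examples (plurals in -ies, and -es after x/ch/sh).
    """
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # ladies -> lady
    if word.endswith(("xes", "ches", "shes")):
        return word[:-2]                # foxes -> fox, boxes -> box
    if word.endswith("s"):
        return word[:-1]                # boys -> boy, toes -> toe
    return word


if __name__ == "__main__":
    for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
        print(w, naive_root(w), rule_based_root(w))
```

The naive rule is fast but yields "foxe" and "ladie"; the rule-based version gets these right only because someone studied the phenomenon and encoded it, which is the trade-off the slide points at.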
  • Slide 3
  • Perspective on Statistical MT. What is a good translation? One that is faithful to the source and fluent in the target. (Figure labels: fluency, faithfulness.)
  • Slide 4
  • Word-alignment example. English: Ram (1) has (2) an (3) apple (4). Hindi, word-by-word gloss: Ram (1) of (2) near (3) an (4) apple (5) is (6).
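A word alignment such as the one on this slide is commonly stored as a mapping from source positions to target positions. The sketch below uses the slide's English words and Hindi gloss; the specific links (in particular "has" aligning to "of", "near" and "is") are an assumed reading of the example, and the names `alignment` and `aligned_gloss` are hypothetical.

```python
english = ["Ram", "has", "an", "apple"]
hindi_gloss = ["Ram", "of", "near", "an", "apple", "is"]

# Assumed alignment, 1-based as on the slide: English position -> Hindi positions.
# "has" is taken to correspond to the possessive construction "of near ... is".
alignment = {1: [1], 2: [2, 3, 6], 3: [4], 4: [5]}


def aligned_gloss(english_position):
    """Return the gloss words linked to the English word at the given 1-based position."""
    return [hindi_gloss[j - 1] for j in alignment[english_position]]


if __name__ == "__main__":
    for i, word in enumerate(english, start=1):
        print(word, "->", aligned_gloss(i))
```

Under this reading one English word maps to three non-adjacent target words, which is the kind of one-to-many link that makes alignment harder than word-for-word substitution.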
  • Slide 5
  • Kinds of MT Systems (point of entry from source to the target text)
  • Slide 6
  • Why is MT difficult? Classical NLP problems. Ambiguity: lexical (Went to the bank to withdraw money); structural (Saw the boy with a telescope). Ellipsis: I wanted a book and John a pen. Co-reference: anaphora (John said he likes music); hypernymic (John's house is a robust structure).
  • Slide 7
  • Why is MT difficult? Language divergence: lexico-semantic divergence and structural divergence.
  • Slide 8
  • Language Divergence (English-Hindi: Noun to Adjective). E: The demands on sportsmen today can lead to burnout at an early age. (burnout, noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard.) Gloss of the Hindi: Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.
  • Slide 9
  • Language Divergence (English-Hindi: Noun to Verb). E: Every concert they gave us was a sell-out. (sell-out: an event for which all the tickets have been sold.) Gloss of the Hindi: Their every concert-of all ticket sell-past-passive-plural (were sold out).
  • Slide 10
  • Language Divergence (English-Hindi: Adjective to Adverb). E: The children were watching in wide-eyed amazement. (wide-eyed: with eyes fully open because of fear, great surprise, etc.) Gloss of the Hindi: Children amazement-with eyes opening widely seeing were.
  • Slide 11
  • Language Divergence (English-Hindi: Adjective to Verb). E: He was in a bad mood at breakfast and wasn't very communicative. (communicative: able and willing to talk and give information to other people.) Gloss of the Hindi: Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).
  • Slide 12
  • Language Divergence (English-Hindi: Preposition to Adverb). E: It gets cooler toward evening. (toward: near a point in time.) Gloss of the Hindi: Evening happening-happening (reduplication; typical Indian language phenomenon) cold increase-goes (verb compound; polar vector).
  • Slide 13
  • Language Divergence (English-Hindi: idiomatic usage). E: Given her interest in children, teaching seems the right job for her. (given: when you consider something.) Gloss of the Hindi: Children-towards her interest having seen, teaching for her appropriate seems.
  • Slide 14
  • Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!). Not only for languages from distant families, but also within close cousins. Simple present: He goes. Universal truth: The earth revolves round the sun.
  • Slide 15
  • Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!). Historical truth: Krushna says to Arjuna ... Quoting: Damle says, ...
  • Slide 16
  • Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!). Immediate past: When did you come? Just now (I came). Certainty in future: He is in for a thrashing. Assurance: I will see you tomorrow.
  • Slide 17
  • Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002). Conflational divergence: E: stab; H: churaa se maaranaa (knife-with hit); S: Utrymningsplan; E: escape plan. Structural divergence: E: SVO; H: SOV. Categorial divergence: the change is in the POS category (many examples discussed). Head-swapping divergence: E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister). Lexical divergence: E: advise; H: paraamarsh denaa (advice give); noun incorporation is a very common Indian language phenomenon.
  • Slide 18
  • Language Divergence Theory: Syntactic Divergences. Constituent order divergence: E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, (India-of PM, Singh). Adjunction divergence: E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come). Preposition-stranding divergence: E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with). Null subject divergence: E: I will go; H: jaauMgaa (subject dropped). Pleonastic divergence: E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it").
  • Slide 19
  • Entropy considerations (ongoing work of Chirag and Venkatesh)
  • Slide 20
  • Language Typology
  • Slide 21
  • Slide 22
  • Parallel Corpora (English, Hindi, Marathi). English side: (1) Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India. (2) Until the war of 1982, the rainy, windswept Falkland Islands were a forgotten remnant of the old British Empire. (3) Spanish rule was administered from a distance, leaving the various regions to develop separately from the capital, Caracas, which was founded by Diego de Losada in 1567.
  • Slide 23
  • Phrase Table Entries. Hindi-English phrase table entries (source ||| target ||| probability): [...] ||| a ||| 0.1; [...] ||| afford ||| 0.1; [...] ||| offer ||| 0.5; [...] ||| offers ||| 0.3. Contribution to entropy = 0.507. Hindi-Marathi phrase table entries: [...] ||| [...] ||| 0.05; [...] ||| [...] ||| 0.2; [...] ||| [...] ||| 0.05; [...] ||| [...] ||| 0.6; [...] ||| [...] ||| 0.1. Contribution to entropy = 0.503.
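The two "contribution to entropy" figures can be checked with a short Python sketch. The slide does not state the logarithm base; base 10 is an assumption adopted here because it reproduces both quoted values. The function name `translation_entropy` is hypothetical.

```python
import math


def translation_entropy(probs, base=10):
    """Entropy of one source phrase's translation distribution.

    base=10 is an assumption: it is the base under which the slide's
    figures (0.507 and 0.503) come out.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Translation probabilities as listed on the slide.
hindi_english = [0.1, 0.1, 0.5, 0.3]
hindi_marathi = [0.05, 0.2, 0.05, 0.6, 0.1]

if __name__ == "__main__":
    print(round(translation_entropy(hindi_english), 3))
    print(round(translation_entropy(hindi_marathi), 3))
```

With base-2 logarithms the same distributions would give larger numbers (in bits), so the base matters when comparing against the slide.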
  • Slide 24
  • Entropy Evaluation. The phrase table gives a probability distribution over the possible translations for each source phrase. We use the probability of the source phrase itself to get a distribution for the entire phrase table. Entropy is evaluated as per the standard formula. Hindi-Marathi phrase table entropy: 9.671. Hindi-English phrase table entropy: 9.770.
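One plausible reading of this procedure, written out as a formula. The symbols are assumptions introduced here: s ranges over source phrases, t over their translations, P(s) is the source-phrase probability and P(t | s) is the phrase-table translation probability. The slide does not spell out the exact computation, so this is a sketch of the standard joint entropy rather than the authors' confirmed method.

```latex
P(s, t) = P(s)\,P(t \mid s), \qquad
H = -\sum_{s}\sum_{t} P(s, t)\,\log P(s, t)
  = -\sum_{s} P(s)\,\log P(s) \;+\; \sum_{s} P(s)\,H(T \mid s),
\quad\text{where}\quad
H(T \mid s) = -\sum_{t} P(t \mid s)\,\log P(t \mid s).
```

Under this reading, the per-phrase quantity H(T | s) is what the phrase table entries contribute before weighting by P(s).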
  • Slide 25
  • Handling Divergence through Indicative Translation (Microsoft Techvista Award, Ananthakrishnan 2007)
  • Slide 26
  • Indicative Translation: what and why? A translation acceptable to a native speaker is not possible, especially considering English-Hindi (Indian languages) divergence. Compromises: human-aided translation (post-editing); narrow domain (weather reports); rough translation. Indicative MT. Goal: understandable rather than perfect output. Purpose: assimilation rather than dissemination (translation on the web).
  • Slide 27
  • Divergence between English and Hindi. Divergence: differences in the lexical and syntactic choices that languages make in expressing ideas. MaTra: structural transfer (SVO to SOV; post-modifiers to pre-modifiers); lexical transfer (WSD + lexicon lookup; inflections; case-markers).
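The two structural transfer steps named here can be illustrated with a toy Python sketch. It assumes the input has already been chunked and labelled with grammatical roles; the labels, the `ROLE_ORDER` table and both function names are hypothetical, and nothing here is taken from the actual MaTra implementation.

```python
# Target order of clause-level roles for an SOV language (assumed labels).
ROLE_ORDER = {"S": 0, "O": 1, "V": 2}


def svo_to_sov(chunks):
    """Reorder (text, role) chunks from subject-verb-object to subject-object-verb.

    sorted() is stable, so chunks that share a role keep their relative order.
    """
    return [text for text, _role in sorted(chunks, key=lambda c: ROLE_ORDER[c[1]])]


def post_to_pre_modifier(head, marker, modifier):
    """Turn 'head marker modifier' into 'modifier-marker head'."""
    return f"{modifier}-{marker} {head}"


if __name__ == "__main__":
    print(svo_to_sov([("Ram", "S"), ("eats", "V"), ("an apple", "O")]))
    print(post_to_pre_modifier("Prime Minister", "of", "India"))
```

The second function mirrors the gloss style used in the lecture, where "Prime Minister of India" becomes "India-of Prime Minister".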
  • Slide 28
  • Divergence between Natural and Indicative Hindi: some examples. E: We eat the rotten canteen food every night. H: [...] I: [...] E: The batsman who had been scoring heavily against them has to be removed early. H: [...] I: [...]
  • Slide 29
  • Categorial divergence. E: I am feeling hungry. H: [...] I: [...] n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3.
  • Slide 30
  • Relation between words in noun-noun compounds. E: The ten best Aamir Khan performances. H: [...] I: [...] n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2.
  • Slide 31
  • Lexical divergence. E: Food, clothing and shelter are a man's basic needs. H: [...] I: [...] n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7.
  • Slide 32
  • Pleonastic divergence. E: It is raining. H: [...] I: [...] n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2. E: There was a great king. H: [...] I: [...]
  • Slide 33
  • Stylistic differences. E: The Lok Sabha has 545 members. H: [...] I: [...] n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4. Other differences: word order, sentence length.
  • Slide 34
  • Transliteration and WSD errors. E: I purchased a bat. H: [...] I: [...] n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1.
  • Slide 35
  • Average BLEU precision and acceptability by divergence/problem:
      Divergence/problem      Average BLEU precision   Translation acceptable?
      Categorial              0                        Yes
      Noun-noun compounds     0.38                     Yes
      Lexical                 0.6                      Yes
      Transliteration         0.27                     Yes
      Pleonastic              0.68                     No
      Stylistic               0.35                     No
      WSD error               0.27                     No
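The n-gram match counts on the preceding slides and the averages in this table can be tied together with a small Python sketch. Taking the arithmetic mean of the four n-gram precisions reproduces the table's figures to two decimals; that the slides use exactly this averaging is an inference from the numbers, and it differs from standard BLEU, which combines the precisions with a geometric mean and a brevity penalty. The function names and the dictionary `slide_counts` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_matches(candidate, reference, n):
    """Clipped n-gram matches as (matched, total n-grams in the candidate)."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())


def average_precision(match_counts):
    """Arithmetic mean of matched/total over the supplied n-gram orders."""
    return sum(m / t for m, t in match_counts) / len(match_counts)


# (matched, total) for unigrams, bigrams, trigrams, 4-grams, copied from the slides.
slide_counts = {
    "Categorial": [(0, 6), (0, 5), (0, 4), (0, 3)],
    "Noun-noun compounds": [(5, 5), (2, 4), (0, 3), (0, 2)],
    "Lexical": [(8, 10), (6, 9), (4, 8), (3, 7)],
    "Pleonastic": [(4, 5), (3, 4), (2, 3), (1, 2)],
    "Stylistic": [(5, 7), (3, 6), (1, 5), (0, 4)],
    "Transliteration and WSD errors": [(3, 4), (1, 3), (0, 2), (0, 1)],
}

if __name__ == "__main__":
    for name, counts in slide_counts.items():
        print(f"{name}: {average_precision(counts):.3f}")
```

Read against the table, the scores and the acceptability judgements do not line up: the categorial example scores 0 yet is acceptable, while the pleonastic example scores highest and is not.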
  • Slide 36
  • Advantages of a hybrid rule-based + SMT system. What SMT brings to the table: if data is available, there is no need for linguistic resources; quick adaptation to new domains (tourism, health) and new language pairs (English-Gujarati/Marathi); improvements can be seen by adding data. What rule-based systems bring to the table: they capture a small set of systematic differences well, e.g. SVO to SOV (do we need to learn this?); a better handle on correcting specific cases.
  • Slide 37
  • Preprocessing rules + SMT for English-Indian language MT. Lack of linguistic resources for Indian languages; lots of resources available for English; morphology is rich for Indian languages; wider systematic syntactic differences between English and Indian languages.
  • Slide 38
  • Placed within the Vauquois Triangle
  • Slide 39
  • Previous work on factored MT
  • Slide 40
  • Previous work. Niessen and Ney {ney:04} show that the use of morpho-syntactic information drastically reduces the need for bilingual training data. Popovic and Ney {ney:06} report the use of morphological and syntactic restructuring information for Spanish-English and Serbian-English translation.
  • Slide 41
  • Previous work (contd). Koehn and Hoang {koehn:07} propose factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model. Experiments in translating from English to German, Spanish, and Czech, including the use of morphological factors.
  • Slide 43
  • Previous work (contd). Avramidis and Koehn {koehn:08} report work on translating from poor to rich morphology, namely English to Greek and Czech translation. Factored models with case and verb conjugation related factors determined by heuristics on parse trees. Used only on the source side, and not on the target side.
  • Slide 44
  • Previous work (contd). Melamed {melamed:04} proposes methods based on tree-to-tree mappings. Imamura et al. {imamura:05} present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation.
  • Slide 45
  • Previous work (contd). Target language does not have parsing/clause-detecting tools. Niessen and Ney {ney:04}: reorder the source language data prior to the SMT training and decoding cycles (German-English SMT). Popovic and Ney {ney:06}: simple local transformation rules for Spanish-English and Serbian-English translation. Collins et al. {collins:05}: German clause restructuring to improve German-English SMT. Wang et al. {wang:07}: similar work for Chinese-English SMT. Ananthakrishnan and Bhattacharyya {anand:08}: syntactic reordering and morphological suffix separation for English-Hindi SMT.
