Information Dynamics in Language : English-Hindi
Anusaaraka
Akshar Bharati
LTRC, IIIT, Hyderabad
(20-04-09)
Outline
Anusaaraka – What is it?
Anusaaraka – How does it work?
Information dynamics in language
English – From Paninian perspective
Machine Translation
Anusaaraka – An alternative approach
Anusaaraka Goals
Anusaaraka Philosophy
Summary – What Anusaaraka is
What is Anusaaraka?
Software that translates English text to Hindi.
Fusion of traditional Indian shastras and modern technology.
Collaborative endeavour of CIF, IIIT – Hyderabad and University of Hyderabad (Department of Sanskrit Studies).
How does Anusaaraka work?
User types in the text that needs to be translated.
Machine gives output (i.e. translation).
Option to view step-by-step translation.
Information Dynamics in Language (1/4)
Languages encode information
cuuhe maarate haiM kutte
rats kill dogs
Hindi sentence is ambiguous
Possible interpretations :
Dogs kill rats
Rats kill dogs
However,
Information Dynamics in Language (2/4)
Ambiguity in Hindi is resolved if,
cuuhe maarate haiM kuttoM ko
rats kill dogs acc
English encodes this information in word positions; Hindi encodes it in morphemes
Languages encode information differently
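The contrast above can be sketched in a few lines of Python. This is only an illustration: the function names, the hand-listed verb forms, and the three-word English pattern are assumptions made for the example, not part of Anusaaraka.

```python
def roles_by_position(tokens):
    """English-style reading: subject-verb-object order carries the roles."""
    subject, verb, obj = tokens  # toy assumption: exactly three words
    return {"agent": subject, "action": verb, "patient": obj}


def roles_by_marker(tokens, acc_marker="ko"):
    """Hindi-style reading: the noun followed by the accusative marker is the
    patient, wherever it stands in the sentence."""
    verb_words = {"maarate", "haiM"}  # toy assumption: verb forms listed by hand
    marker_at = tokens.index(acc_marker)
    patient = tokens[marker_at - 1]
    nouns = [t for t in tokens if t not in verb_words and t != acc_marker]
    agent = next(n for n in nouns if n != patient)
    return {"agent": agent, "patient": patient}
```

Swapping the nouns in the English word list swaps the roles, while reordering the Hindi words leaves the roles unchanged as long as ko stays attached to the same noun.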
Information Dynamics in Language (3/4)
English pronouns: he, she, it
Hindi: vaha
He is going to Delhi : vaha dillii jaa rahaa hai
She is going to Delhi : vaha dillii jaa rahii hai
It broke : vaha TuuTaa ??
Information does not always map fully from one language into another. Conceptual worlds may be different.
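A minimal sketch of why this mapping cannot be undone: three English pronouns collapse onto one Hindi form, so going back from Hindi yields a set of candidates rather than a single answer. The table and function name are illustrative assumptions.

```python
# Many-to-one mapping taken from the pronoun example above.
EN_TO_HI = {"he": "vaha", "she": "vaha", "it": "vaha"}


def back_translate(hindi_pronoun):
    """Return every English pronoun that could have produced the Hindi form."""
    return sorted(en for en, hi in EN_TO_HI.items() if hi == hindi_pronoun)
```

Here `back_translate("vaha")` returns all three English pronouns; the gender that English puts on the pronoun has to be recovered from elsewhere, for example from rahaa versus rahii on the verb.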
Information Dynamics in Language (4/4)
This chair has been sat on
This chair has been used for sitting
X sat on this chair, and it is known
Language encodes information partially
English from Paninian View Point
An Example:
Panini's 'sutra'
सुप्तिङन्तं पदम्
states
प्रातिपदिक + सुप् = सुबन्त पद (nominal base + nominal inflection = nominal word form)
धातु + तिङ् = तिङन्त (verb root + verbal inflections = finite verb form)
Therefore, take the following Hindi example:
राम फल खाता है : Ram eats fruits
राम + ० फल + ० खा + ता_है
राम ने फल खाया : Ram ate a fruit
राम + ने फल + ० खा + या
English from Paninian View Point
Interrogatives in English
To whom did you give the book?
who+to_m do+past you+0 give the book+0
Alternatively
Who did you give the book to?
who do+past you+0 give the book+0 to
Notions of NP and PP are essential to explain English structures, where NP = प्रातिपदिक and PP = सुबन्त (पद)
to + who = सुबन्त
Translation
Translation involves transfer of information from one language to another
This generates tension between
Faithfulness to the source
Readability (naturalness) in the target
Translators normally sacrifice faithfulness in favour of readability
Machine Translation
Challenges and Problems
Language codes information only partially
Tension between BREVITY and PRECISION
Brevity wins, leading to inherent ambiguity at different levels
Structural Ambiguity
Time flies like an arrow
Possible parses
1. Time flies like an arrow (time goes fast)
2. Time flies like an arrow ("time flies", a kind of fly, have a liking for an arrow)
3. Time flies like an arrow (time the flies the way you would time an arrow)
4. Time flies like an arrow (time those flies that are like an arrow)
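The readings listed above can be counted mechanically with a chart parser. The sketch below is a toy CKY-style counter; the lexicon, the seven grammar rules, and the decision to accept both S and a bare VP (imperative) as a complete sentence are assumptions made for this example, not Anusaaraka's grammar.

```python
from collections import defaultdict

# Toy lexicon: each word lists every category it may take (assumed for the demo).
LEXICON = {
    "time": {"N", "NP", "V"},
    "flies": {"N", "NP", "V", "VP"},
    "like": {"V", "P"},
    "an": {"Det"},
    "arrow": {"N"},
}

# Binary rules: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),    # compound noun, e.g. "time flies"
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(words, roots=("S", "VP")):
    """Count distinct parse trees over the whole sentence (CKY with counts)."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))  # (start, end) -> category -> count
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1)][category] = 1
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            for split in range(start + 1, end):
                left, right = chart[(start, split)], chart[(split, end)]
                for parent, a, b in RULES:
                    if a in left and b in right:
                        chart[(start, end)][parent] += left[a] * right[b]
    return sum(chart[(0, n)][root] for root in roots)
```

Under this toy grammar the sentence receives two declarative analyses (readings with "time" or "time flies" as the subject) and two imperative ones, so the count depends entirely on the rules chosen; a wider grammar would report more.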
Lexical Ambiguity (1/3)
Can be
Complete
bank (river): bank, banks / banked, banks, banking
bank (money): bank, banks / banked, banks, banking
Partial
lie (not speak truth): lie, lied, lying
lie (rest horizontally): lie, lay, lying
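One way to make the complete/partial distinction concrete is to compare the sets of word forms of two senses. This is a sketch under the assumption that each sense is represented simply by its set of surface forms; the function names are illustrative.

```python
def ambiguity_kind(forms_a, forms_b):
    """Classify two senses by how much of their paradigms coincide."""
    if not forms_a & forms_b:
        return "none"
    return "complete" if forms_a == forms_b else "partial"


def ambiguous_forms(forms_a, forms_b):
    """Surface forms that could belong to either sense."""
    return sorted(forms_a & forms_b)


# Paradigms taken from the examples above.
BANK_RIVER = {"bank", "banks", "banked", "banking"}
BANK_MONEY = {"bank", "banks", "banked", "banking"}
LIE_UNTRUTH = {"lie", "lied", "lying"}
LIE_RECLINE = {"lie", "lay", "lying"}
```

For the two senses of "lie", only the shared forms remain ambiguous; a form such as "lied" or "lay" already identifies the sense on its own.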
Lexical Ambiguity (2/3)
Shelve
1. Shelve the books
Put the books on the shelf
2. The Institute has shelved the idea at least until next year
Postponed the idea until next year
Function words (1/2)
He bought a shirt with tiny collars.
usane chote kOlaroM vaalii kamiiza khariidii
He washed a shirt with soap.
usane saabuna se kamiiza dhoii
PP attachment governs the choice of postposition in Hindi
Function words (2/2)
Ram is sitting in the garden
raama bagiice meM baiThaa hai
Ram is running in the garden
raama bagiice meM dODza rahaa hai
The verb root governs the choice of TAM (tense, aspect, modality) marker
Information Flow and Ambiguity
1. He scratched a figure on the rock (engrave)
2. She scratched the figure on the rock (scrape)
Human beings use
World knowledge
Context
Cultural knowledge and
Language conventions
to resolve ambiguities. Can we provide all this knowledge to the machine?
Machine Translation: Current Trends
Techniques being used: Statistical
Statistical methods: Inherent limitation
Can never give a 100% reliable system
End user can never be sure about the correctness.
Current MT systems cannot serve users who want to ACCESS a text in other languages
Anusaaraka
An incremental machine translation system
Layered output
First layer: a language ACCESSOR
Successive layers: closer and closer to MT
What is an Accessor ?
Gist Terminal is a concrete example of a SCRIPT ACCESSOR
(Developed by IIT Kanpur, and marketed by C-DAC)
One can access any text in any Indian script through enhanced Devanagari script.
Salient Features
Faithful representation
Reversibility
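Reversibility can be illustrated with a toy script converter. The sketch assumes the parallel layout of the Unicode Devanagari (U+0900) and Telugu (U+0C00) blocks and simply shifts code points between them; it is not how Gist Terminal works, and the function names are made up for the example.

```python
DEVANAGARI_START = 0x0900  # first code point of the Unicode Devanagari block
TELUGU_START = 0x0C00      # first code point of the Unicode Telugu block
BLOCK_SIZE = 0x80          # both blocks span 128 code points


def shift_script(text, src_start, dst_start):
    """Move every character of the source block to the same offset in the
    target block; characters outside the source block are left untouched."""
    out = []
    for ch in text:
        code = ord(ch)
        if src_start <= code < src_start + BLOCK_SIZE:
            out.append(chr(code - src_start + dst_start))
        else:
            out.append(ch)
    return "".join(out)


def telugu_to_devanagari(text):
    return shift_script(text, TELUGU_START, DEVANAGARI_START)


def devanagari_to_telugu(text):
    return shift_script(text, DEVANAGARI_START, TELUGU_START)
```

Because the shift loses nothing, converting a Telugu word to Devanagari and back returns the original string. A real accessor would need hand-tuned tables for letters and punctuation that do not line up across scripts, which is presumably why the Devanagari has to be "enhanced".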
Anusaaraka tries to generalise and apply this philosophy to the problem of language conversion, which is several orders of magnitude more complex
Languages Differ
Script (For written language)
Vocabulary
Grammar
These differences can be considered as a measure of language distance
Language Distance
Script              Vocabulary          Grammar
Urdu -> Hindi
Telugu -> Hindi     Telugu -> Hindi
English -> Hindi    English -> Hindi    English -> Hindi
Anusaaraka follows the approach of gradually reducing the distance
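The distance table can be written down as data, with the conversion steps for a language pair read off in a fixed order. Treating each differing layer as one unit of distance is a simplifying assumption of this sketch, as are the names used.

```python
# Layers in the order in which the distance is reduced.
LAYERS = ("script", "vocabulary", "grammar")

# Which layers differ for each language pair, as given in the table above.
DIFFERS_IN = {
    ("Urdu", "Hindi"): {"script"},
    ("Telugu", "Hindi"): {"script", "vocabulary"},
    ("English", "Hindi"): {"script", "vocabulary", "grammar"},
}


def conversion_steps(source, target="Hindi"):
    """Layers that must be bridged, in processing order."""
    gaps = DIFFERS_IN[(source, target)]
    return [layer for layer in LAYERS if layer in gaps]


def language_distance(source, target="Hindi"):
    """Toy distance: the number of layers in which the two languages differ."""
    return len(conversion_steps(source, target))
```

Read this way, Urdu to Hindi needs only the script step, while English to Hindi needs all three.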
Anusaaraka Solutions
Transliteration
Padasutra for Vocabulary substitution
WSD for word level ambiguity
Transfer Grammar
Transfer Grammar
Eng : This chair has been sat on
Transli : दिस चेयर हैज़ बीन सैट ऑन
Lexical substitution : यह कुर्सी बैठा जा चुका है पर
Transfer grammar : इस कुर्सी पर बैठा जा चुका है
Padasutra (1/2)
Get the core meaning of a polysemous word
State it in a formulaic form
This appears in the first
Write notes to show the relatedness of various senses
The user can refer to it if required
Padasutra (2/2)
English verb 'have'
1. She has tea in the morning
vaha(nom) subaha caaya piitii hai (drink)
2. She has bread in the morning
vaha(nom) subaha breda khaatii hai (eat)
3. She has fever
usako bukhaara hai (be)
4. She has my book
usake paasa merii pustaka hai (possess+be)
5. She has three children
usake tiina bacce haiM
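The five uses of 'have' above suggest that the Hindi rendering can be chosen from the kind of object the verb takes. The sketch below encodes that as a lookup; the semantic class labels, the table names, and the gloss for the last sense are assumptions introduced for illustration, not Anusaaraka's actual rules.

```python
# Hypothetical semantic classes for the object nouns in the examples above.
OBJECT_CLASS = {
    "tea": "beverage",
    "bread": "food",
    "fever": "illness",
    "book": "thing",
    "children": "kin",
}

# Sense gloss and Hindi sentence pattern for each class, following the examples.
HAVE_SENSES = {
    "beverage": ("drink", "vaha ... piitii hai"),
    "food": ("eat", "vaha ... khaatii hai"),
    "illness": ("be", "usako ... hai"),
    "thing": ("possess+be", "usake paasa ... hai"),
    "kin": ("be", "usake ... haiM"),  # gloss not given in the source; assumed
}


def sense_of_have(object_noun):
    """Pick a sense of 'have' from its object; None when no rule applies."""
    object_class = OBJECT_CLASS.get(object_noun)
    return HAVE_SENSES.get(object_class)
```

When no rule matches, the function returns None, which is the point at which a reader would fall back on the core-meaning formula and its notes.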
Word Sense Disambiguation (WSD)
WSD is
Automatically selecting the appropriate sense in a given context
Requires linguistic Resources and Tools
Linguistic resources : dictionaries, thesauri, hand-crafted rules, etc.
Linguistic tools : POS tagger, parser, MWE identifier, etc.
WSD : Possible Solutions
Two major approaches
Manually crafted rules
Costly
Fragile
Machine Learning/Statistical
Anusaaraka Solution to WSD (1/4)
Major bottleneck
Requires large number of disambiguation rules
Anusaaraka combines statistically generated rules with manually created rules
WSD rules can be revised/added over a period of time
Simplify the method for the above
Involve large number of people to prepare rules
Anusaaraka uses CLIPS, an expert system shell, for developing rules
Anusaaraka Solution (2/4)
Divide the problem into small bite-sized pieces, each with a generous time allowance
Bite size – 4 pages
Time – Two years
Relevant in Indian conditions as we have large manpower
Anusaaraka Solution (3/4)
Use manually crafted rules for WSD
Which means developing WSD rules for approximately 10,000 words
Handling multi-word expressions (MWEs) numbering in lakhs
Anusaaraka Solution (4/4)
Use the Cambridge Advanced Learner's Dictionary for distributing words to the rule developers
The dictionary has approximately 1600 pages
Allot 4 pages to one person
1600 / 4 = 400 rule developers
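The allocation arithmetic can be checked in a few lines. The figures come from the slides (about 1600 pages, 4 pages per person, roughly 10,000 words needing rules); the even spread of those words across developers is an assumption of this sketch.

```python
import math

DICTIONARY_PAGES = 1600       # approximate page count quoted above
PAGES_PER_PERSON = 4          # bite size per rule developer
WORDS_NEEDING_RULES = 10_000  # approximate number of words needing WSD rules


def developers_needed(pages, pages_per_person):
    """Number of people required when each takes a fixed number of pages."""
    return math.ceil(pages / pages_per_person)


def words_per_developer(words, developers):
    """Average load, assuming the words are spread evenly (an assumption)."""
    return words / developers
```

With these figures the plan calls for 400 developers, each responsible for about 25 words over the two-year period.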
Anusaaraka Goals
Provide an open source usable system to the users
The system should facilitate accessing another language
Show the usability of the Indian traditional grammar system in the modern context
Facilitate users to become developers
Anusaaraka Philosophy
No Loss of Information
No effort should go to waste
Users contribute towards the development
Anusaaraka is an application of concepts from Panini's Ashtadhyayi to contemporary problems
• pravritti nimitta
• sannidhi (proximity)
• yogyataa (qualification)
• aakaaMkshaa (expectation)
• kaarakas (role-relations)
• etc
Anusaaraka is
A tool for overcoming language barriers
An application of concepts from Panini's Ashtadhyayi to contemporary problems
An exploration of the information dynamics in language
A better approach for building Machine Translation systems
A workbench for NLP students
An opportunity for the masses to be IT contributors rather than mere IT consumers
Paninian Grammar Inspired Information Dynamics
[Diagram: three aspects, each described by Basics, Core Idea and Concrete Examples, linked by arrows labelled "inspired"]
a. Scientific Aspect
Labels: Syntax; Vocabulary; English Grammar from Paninian view point; WSD (Incommensurability)
b. Engineering Aspect
Basics: a) Evolutionary Approach b) Graceful Degradation c) Providing Practical Alternatives
Core Idea: Layered Output; Smart User Interface
c. Social Aspect
Basics: Gita: a) yajna b) tena tyaktena bhunjithaa; Temple of Learning
Core Idea: Open Source; users are not mere consumers but can also participate in the development; bringing out hidden talents
Concrete Examples: Anusaaraka cum Machine Translation System