Resources for South Asian Languages
Localization
• What is it: ‘the process of enabling computing experience in local culture and language’ (S. Hussain, Durrani & Gul 2005: 3)
• What do you need for Localization:
• Linguistic description and documentation
• Unicode and font development
• Keyboarding layouts
• Voice-to-text applications
• Local research capacity and resources
Unicode
• An international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.
• Unicode for Meitei Mayek: see next
Meitei Mayek (MM) Unicode Letters and Codes
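As a rough illustration of "each letter is assigned a unique numeric value", the sketch below prints the code point and standard character name for a few letters. It assumes the script's main Unicode block is U+ABC0 to U+ABFF (the Unicode Standard spells the script "Meetei Mayek"). The helper names `describe` and `in_meetei_mayek` are made up for this example.

```python
import unicodedata

# Assumed block range for the script, per the Unicode code charts.
MEETEI_MAYEK = range(0xABC0, 0xAC00)


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"


def in_meetei_mayek(ch: str) -> bool:
    """True if the character's code point falls inside the assumed block."""
    return ord(ch) in MEETEI_MAYEK


# Print the first few letters of the block with their code points.
for cp in range(0xABC0, 0xABC5):
    print(describe(chr(cp)))

# The same code point is stored identically on every platform once encoded.
print(chr(0xABC0).encode("utf-8"))
```

The numeric value stays fixed regardless of font or keyboard layout, which is why font and keyboard development can proceed independently once the encoding exists.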
Speech to Text
http://tech.firstpost.com/news-analysis/google-translate-gets-voice-recognition-for-hindi-and-seven-other-indian-languages-225225.html
Pan Asian Networking (PAN) Localization Project
Sinhala
1. Linguistic resources
a. a part-of-speech (POS) tag set
b. a 500,000-word tagged Sinhala corpus
c. a 10-million-word contemporary Sinhala corpus
d. a trilingual Sinhala-Tamil-English dictionary
e. a 100,000-word English-Sinhala parallel corpus
f. a 1,000-word Sinhala WordNet
2. Language tools
a. text-to-speech
b. screen reader
c. an optical character recognizer (OCR)
d. utilities for encoding conversion
e. spell checker
f. localized versions of Windows operating systems
Kolhapur Corpus of Indian English (KCIE)
WordNet
• What is it: https://wordnet.princeton.edu/
• Indo-WordNet
• Eighteen languages of India, including (alphabetically) Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Panjabi, Sanskrit, Tamil, Telugu, and Urdu.
Synset
• UNIVERSAL (synset with indigenous lexemes for a concept in all languages, e.g. sun, moon, star)
• PANINDIAN (synset with indigenous lexemes in all Indian languages but no equivalent in English, e.g. pāpaḍ [pa:pɔɽ] ‘thin cake of dried ground pulses variously spiced’)
• IN-FAMILY (synset found in a particular language family, e.g. kal/ḷ- ‘toddy, fermented juice from the flower of the palmyra tree’ in Dravidian, found in Tamil kaḷ, Telugu kallu, Kannada and Malayalam kaḷ, Tulu kali)
• LANGUAGE SPECIFIC (synset unique to a particular language, e.g. bihu ‘a kind of group dance of Assam’ in Assamese)
• RARE (specific technical terms, e.g. modem)
• SYNTHESIZED (synset that is created in a language due to influence of another language, e.g. pizza)
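One way to picture these categories is as a label attached to each synset record. The sketch below is a hypothetical data structure, not the Indo-WordNet schema: the class names `SynsetCategory` and `Synset` and the field names are assumptions made for illustration. The example entry reuses the Dravidian ‘toddy’ forms given above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SynsetCategory(Enum):
    """The six coverage categories listed above (names are illustrative)."""
    UNIVERSAL = "universal"
    PAN_INDIAN = "pan-indian"
    IN_FAMILY = "in-family"
    LANGUAGE_SPECIFIC = "language-specific"
    RARE = "rare"
    SYNTHESIZED = "synthesized"


@dataclass
class Synset:
    """A concept, its category, and its lexemes keyed by language."""
    gloss: str
    category: SynsetCategory
    lexemes: Dict[str, List[str]] = field(default_factory=dict)

    def languages(self) -> List[str]:
        """Languages that have at least one lexeme for this concept."""
        return sorted(lang for lang, words in self.lexemes.items() if words)


# Hypothetical record built from the IN-FAMILY example in the list above.
toddy = Synset(
    gloss="toddy, fermented juice from the flower of the palmyra tree",
    category=SynsetCategory.IN_FAMILY,
    lexemes={
        "Tamil": ["kaḷ"],
        "Telugu": ["kallu"],
        "Kannada": ["kaḷ"],
        "Malayalam": ["kaḷ"],
        "Tulu": ["kali"],
    },
)

print(toddy.category.name, toddy.languages())
```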
Hyderabad, India
NLP-MT Lab: what they are working on
Sub-areas of NLP
• syntax and parsing
• semantics and word sense disambiguation
• discourse and tree banking
• machine translation
• creation of linguistic resources
Sampark
• Sampark is a multipart machine translation system developed through the combined efforts of 11 institutions under the umbrella of the consortium project “Indian Language to Indian Language Machine Translation” (ILMT), funded by the TDIL program of the Dept. of IT, Govt. of India. The ILMT project has developed language technology for 9 Indian languages, resulting in MT for 18 language pairs, counting each direction separately: 14 translation directions pairing Hindi in both directions with Urdu / Punjabi / Telugu / Bengali / Tamil / Marathi / Kannada, and 4 pairing Tamil in both directions with Malayalam / Telugu.
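The arithmetic behind "9 languages, 18 language pairs" can be checked with a short sketch that enumerates each translation direction from the lists above. The function name `both_directions` is made up for this example. The code only counts pairs and says nothing about how Sampark itself is organized.

```python
# Partner languages as listed in the Sampark description above.
hindi_partners = ["Urdu", "Punjabi", "Telugu", "Bengali", "Tamil", "Marathi", "Kannada"]
tamil_partners = ["Malayalam", "Telugu"]


def both_directions(hub, partners):
    """Return (source, target) tuples for hub<->partner in both directions."""
    pairs = []
    for partner in partners:
        pairs.append((hub, partner))
        pairs.append((partner, hub))
    return pairs


directions = both_directions("Hindi", hindi_partners) + both_directions("Tamil", tamil_partners)

# Every language that appears as a source or target of some direction.
languages = sorted({lang for pair in directions for lang in pair})

print(len(directions), "directions over", len(languages), "languages")
```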
Creating Treebanks
What to look for in a useful treebank
Tagging and tagsets
What you are doing with annotation
• Creating a careful selection of analyses for common constructions
• Could be used to model a grammar for the language
• Then extract the grammar from a corpus
• Extracted grammars have better coverage and include statistical information
• Extracted grammars are noisier and lack rich features.
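To make "extract the grammar from a corpus" concrete, here is a minimal sketch that reads bracketed trees, counts the phrase-structure rules they contain, and turns the counts into relative frequencies. The two-sentence English treebank, the tag labels, and the function names `parse` and `collect_rules` are toy assumptions. A real treebank for a South Asian language would use its own tagset and annotation scheme.

```python
from collections import Counter


def parse(text):
    """Parse one bracketed tree into (label, children); leaves are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def collect_rules(tree, counts):
    """Count one rule (lhs -> rhs) per internal node, recursively."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            collect_rules(child, counts)


# Toy treebank (hypothetical data, for illustration only).
toy_treebank = [
    "(S (NP (N dogs)) (VP (V bark)))",
    "(S (NP (N cats)) (VP (V sleep)))",
]

counts = Counter()
for line in toy_treebank:
    collect_rules(parse(line), counts)

# Relative-frequency estimate: count(lhs -> rhs) / count(lhs).
lhs_totals = Counter()
for (lhs, _), n in counts.items():
    lhs_totals[lhs] += n
probs = {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  p={p:.2f}")
```

The counts are the "statistical information" an extracted grammar carries. An annotation error in the treebank would appear here as a spurious low-frequency rule, which is one source of the noise mentioned above.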
What resources are