Resources for South Asian Languages - UNT Digital Library/67531/metadc...

Resources for South Asian Languages

Localization

• What is it: ‘the process of enabling computing experience in local culture and language’ (S. Hussain, Durrani & Gul 2005: 3)

• What do you need for localization:

• Linguistic description and documentation

• Unicode and font development

• Keyboarding layouts

• Voice-to-text applications

• Local research capacity and resources

Unicode

• An international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.

• Unicode for Meitei Mayek: see next

Meitei Mayek (MM) Unicode letters and codes
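As a small illustration of "each letter is assigned a unique numeric value", the Python sketch below prints the code points of the first few characters in the Meetei Mayek block, which the Unicode Standard places at U+ABC0 to U+ABFF. The helper names `describe` and `in_meetei_mayek` are made up for this example, and the character names come from whatever Unicode database ships with the Python build in use.

```python
import unicodedata

# Meetei Mayek block in the Unicode Standard: U+ABC0..U+ABFF.
BLOCK_START, BLOCK_END = 0xABC0, 0xABFF


def describe(ch):
    """Return (code point label, character name) for a single character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


def in_meetei_mayek(ch):
    """True if the character's code point falls inside the Meetei Mayek block."""
    return BLOCK_START <= ord(ch) <= BLOCK_END


# List the first five code points of the block with their names.
for cp in range(BLOCK_START, BLOCK_START + 5):
    ch = chr(cp)
    label, name = describe(ch)
    print(label, ch, name)
```

The same numeric value identifies the letter on any platform; only the font used to draw it differs.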

Speech to Text

http://tech.firstpost.com/news-analysis/google-translate-gets-voice-recognition-for-hindi-and-seven-other-indian-languages-225225.html

Pan Asian Networking (PAN) Localization Project

Sinhala

1. Linguistic resources

a. a part-of-speech (POS) tag set

b. a 500,000-word tagged Sinhala corpus

c. a 10-million-word contemporary Sinhala corpus

d. a trilingual Sinhala-Tamil-English dictionary

e. a 100,000-word English-Sinhala parallel corpus

f. a 1,000-word Sinhala WordNet

2. Language tools

a. text-to-speech

b. screen reader

c. an optical character recognizer (OCR)

d. utilities for encoding conversion

e. spell checker

f. localized versions of Windows operating systems

Kolhapur Corpus of Indian English (KCIE)

WordNet

• What is it: https://wordnet.princeton.edu/

• Indo-WordNet

• Eighteen languages of India, including (alphabetically) Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Panjabi, Sanskrit, Tamil, Telugu, and Urdu.

Synset

• UNIVERSAL: synset with indigenous lexemes for a concept in all languages (e.g. sun, moon, star).

• PANINDIAN: synset with indigenous lexemes in all Indian languages but no equivalent in English (e.g. pāpaḍ [pa:pɔɽ] ‘thin cake of dried ground pulses variously spiced’).

• IN-FAMILY: synset found in a particular language family (e.g. kal/ḷ- ‘toddy, fermented juice from the flower of the palmyra tree’ in Dravidian, found in Tamil kaḷ, Telugu kallu, Kannada and Malayalam kaḷ, Tulu kali).

• LANGUAGE SPECIFIC: synset unique to a particular language (e.g. bihu ‘a kind of group dance of Assam’ in Assamese).

• RARE: specific technical terms (e.g. modem).

• SYNTHESIZED: synset that is created in a language due to the influence of another language (e.g. pizza).
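The six synset categories listed above can be sketched as a small data model. This is only an illustration: the class and field names (`SynsetCategory`, `Synset`, `lexemes`) are hypothetical and do not reflect the actual Indo-WordNet storage format. The two example entries reuse the lexemes given on the slide.

```python
from dataclasses import dataclass, field
from enum import Enum


class SynsetCategory(Enum):
    """The six categories named on the slide (names are illustrative)."""
    UNIVERSAL = "universal"
    PANINDIAN = "pan-Indian"
    IN_FAMILY = "in-family"
    LANGUAGE_SPECIFIC = "language-specific"
    RARE = "rare"
    SYNTHESIZED = "synthesized"


@dataclass
class Synset:
    gloss: str
    category: SynsetCategory
    lexemes: dict = field(default_factory=dict)  # language name -> lexeme


EXAMPLES = [
    Synset(
        "toddy, fermented juice from the flower of the palmyra tree",
        SynsetCategory.IN_FAMILY,
        {"Tamil": "kaḷ", "Telugu": "kallu", "Kannada": "kaḷ",
         "Malayalam": "kaḷ", "Tulu": "kali"},
    ),
    Synset(
        "a kind of group dance of Assam",
        SynsetCategory.LANGUAGE_SPECIFIC,
        {"Assamese": "bihu"},
    ),
]


def by_category(synsets, category):
    """Return the synsets that carry the given category label."""
    return [s for s in synsets if s.category is category]


for s in by_category(EXAMPLES, SynsetCategory.IN_FAMILY):
    print(s.gloss, "->", sorted(s.lexemes))
```

Keeping the category on the synset rather than on individual lexemes reflects the idea that the category describes how widely the concept is lexicalized, not any single word.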

Hyderabad, India

NLP-MT Lab: what they are working on

Sub-areas of NLP

• syntax and parsing

• semantics and word sense disambiguation

• discourse and treebanking

• machine translation

• creation of linguistic resources

Sampark

• Sampark is a multipart machine translation system developed with the combined efforts of 11 institutions under the umbrella of the consortium project “Indian Language to Indian Language Machine Translation” (ILMT), funded by the TDIL program of the Dept. of IT, Govt. of India. The ILMT project has developed language technology for 9 Indian languages, resulting in MT for 18 language pairs. These are 14 pairs (both directions) between Hindi and each of Urdu, Punjabi, Telugu, Bengali, Tamil, Marathi, and Kannada, and 4 pairs (both directions) between Tamil and each of Malayalam and Telugu.
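The pair arithmetic can be checked with a few lines of Python. This sketch assumes that each "bidirectional" entry counts as two directed pairs (source to target and back); the helper name `both_directions` is made up for the example.

```python
# Partner languages as listed on the slide.
HINDI_PARTNERS = ["Urdu", "Punjabi", "Telugu", "Bengali", "Tamil", "Marathi", "Kannada"]
TAMIL_PARTNERS = ["Malayalam", "Telugu"]


def both_directions(hub, partners):
    """Return directed (source, target) pairs between hub and each partner."""
    pairs = []
    for partner in partners:
        pairs.append((hub, partner))
        pairs.append((partner, hub))
    return pairs


hindi_pairs = both_directions("Hindi", HINDI_PARTNERS)
tamil_pairs = both_directions("Tamil", TAMIL_PARTNERS)
all_pairs = hindi_pairs + tamil_pairs

# Every language that appears as a source or a target.
languages = {lang for pair in all_pairs for lang in pair}

print(len(hindi_pairs), len(tamil_pairs), len(all_pairs), len(languages))
# prints: 14 4 18 9
```

Under that reading, 7 Hindi partners and 2 Tamil partners give 14 + 4 = 18 directed pairs over 9 distinct languages, matching the figures on the slide.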

Creating Treebanks

What to look for in a useful treebank

Tagging and tagsets

What you are doing with annotation

• Creating a careful selection of analyses for common constructions

• Could be used to model a grammar for the language

• Then extract the grammar from a corpus

• Extracted grammars have better coverage and include statistical information

• Extracted grammars are noisier and lack rich features.
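To make "extract the grammar from a corpus" concrete, here is a minimal sketch that reads bracketed trees, counts the rewrite rules they contain, and turns the counts into relative frequencies (one common way to obtain the "statistical information" mentioned above). The three-sentence English treebank and all function names are invented for illustration; a real treebank would use its own tagset and file format.

```python
from collections import Counter, defaultdict


def parse(text):
    """Parse a bracketed tree such as '(S (NP she) (VP sleeps))'
    into nested (label, children) tuples; leaves stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def productions(tree):
    """Yield one (lhs, rhs) rule for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def extract_grammar(trees):
    """Map each rule to its relative frequency among rules with the same lhs."""
    counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Toy treebank, invented for this example.
TREEBANK = [
    "(S (NP (N dogs)) (VP (V bark)))",
    "(S (NP (N cats)) (VP (V sleep)))",
    "(S (NP (D the) (N dog)) (VP (V sleeps)))",
]

grammar = extract_grammar(parse(t) for t in TREEBANK)
for (lhs, rhs), prob in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

The sketch also shows why extracted grammars are noisy: every analysis in the treebank, including annotation slips, becomes a rule with some probability, and nothing beyond the node labels is recorded.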

What resources are

