Resources for South Asian Languages - UNT Digital Library/67531/metadc...a. text-to-speech b. screen...

Date post: 12-Oct-2020
Resources for South Asian Languages
Transcript
Page 1: Resources for South Asian Languages - UNT Digital Library/67531/metadc...a. text-to-speech b. screen reader c. an optical character recognizer (OCR) d. utilities for encoding conversion

Resources for South Asian Languages


Localization

• What is it: ‘the process of enabling computing experience in local culture and language’ (S. Hussain, Durrani & Gul 2005: 3)

• What do you need for localization:

• Linguistic description and documentation

• Unicode and font development

• Keyboarding layouts

• Voice-to-text applications

• Local research capacity and resources


Unicode

• An international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.

• Unicode for Meitei Mayek: see the next slide


Meitei Mayek (MM) Unicode letters and code points
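
A code chart of this kind can be regenerated programmatically. The following is a minimal Python sketch that walks the Meetei Mayek block (U+ABC0 to U+ABFF; the Unicode Standard spells the script name "Meetei") and lists each assigned character with its code point and name. It assumes a Python build whose `unicodedata` tables include this block; the helper name `describe` is illustrative and not part of the original slides.

```python
import unicodedata

# Meetei Mayek block in the Unicode Standard.
BLOCK_START, BLOCK_END = 0xABC0, 0xABFF

def describe(code_point):
    """Return (U+XXXX label, character, Unicode name) for one code point."""
    char = chr(code_point)
    name = unicodedata.name(char, "")  # empty string if the code point is unassigned
    return f"U+{code_point:04X}", char, name

# One row per code point in the block, then keep only the assigned ones.
chart = [describe(cp) for cp in range(BLOCK_START, BLOCK_END + 1)]
assigned = [row for row in chart if row[2]]

# Print the first few rows of the chart.
for label, char, name in assigned[:5]:
    print(label, char, name)
```

Because every character has one fixed numeric value, the same table comes out on any platform, which is the property the Unicode slide describes.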


http://tech.firstpost.com/news-analysis/google-translate-gets-voice-recognition-for-hindi-and-seven-other-indian-languages-225225.html

Speech to Text


Pan Asian Networking (PAN) Localization Project


Sinhala

1. Linguistic resources

a. part-of-speech (POS) tag set

b. a 500,000-word tagged Sinhala corpus

c. a 10-million-word contemporary Sinhala corpus

d. a trilingual Sinhala-Tamil-English dictionary

e. a 100,000-word English-Sinhala parallel corpus

f. a 1,000-word Sinhala WordNet

2. Language tools

a. text-to-speech

b. screen reader

c. an optical character recognizer (OCR)

d. utilities for encoding conversion

e. spell checker

f. localized versions of Windows operating systems


Kolhapur Corpus of Indian English (KCIE)


WordNet

• What is it: https://wordnet.princeton.edu/

• Indo-WordNet

• Eighteen languages of India, including (alphabetically) Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Panjabi, Sanskrit, Tamil, Telugu, and Urdu.


Synset

UNIVERSAL (synset with indigenous lexemes for a concept in all languages, e.g. sun, moon, star),

PANINDIAN (synset with indigenous lexemes in all Indian languages but no equivalent in English, e.g. pāpaḍ [pa:pɔɽ] ‘thin cake of dried ground pulses variously spiced’),

IN-FAMILY (synset found in a particular language family, e.g. kal/ḷ- ‘toddy, fermented juice from the flower of the palmyra tree’ in Dravidian, found in Tamil kaḷ, Telugu kallu, Kannada and Malayalam kaḷ, Tulu kali),

LANGUAGE SPECIFIC (synset unique to a particular language, e.g. bihu ‘a kind of group dance of Assam’ in Assamese),

RARE (specific technical terms, e.g. modem), and

SYNTHESIZED (synset that is created in a language due to influence of another language, e.g. pizza).
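
The six categories above can be treated as a small data model. The following Python sketch is one possible representation, not the Indo-WordNet schema: the names `SynsetCategory`, `Synset`, and `languages_covering` are hypothetical, and the two sample entries are taken from the examples on this slide.

```python
from dataclasses import dataclass
from enum import Enum

class SynsetCategory(Enum):
    """The six synset categories listed on the slide."""
    UNIVERSAL = "lexemes for the concept in all languages"
    PANINDIAN = "lexemes in all Indian languages, no English equivalent"
    IN_FAMILY = "found in one language family"
    LANGUAGE_SPECIFIC = "unique to one language"
    RARE = "specific technical terms"
    SYNTHESIZED = "created under the influence of another language"

@dataclass
class Synset:
    gloss: str                # meaning shared by all member lexemes
    category: SynsetCategory
    lexemes: dict             # language name -> lexeme (illustrative layout)

examples = [
    Synset(
        "toddy, fermented juice from the flower of the palmyra tree",
        SynsetCategory.IN_FAMILY,
        {"Tamil": "kaḷ", "Telugu": "kallu", "Kannada": "kaḷ",
         "Malayalam": "kaḷ", "Tulu": "kali"},
    ),
    Synset(
        "a kind of group dance of Assam",
        SynsetCategory.LANGUAGE_SPECIFIC,
        {"Assamese": "bihu"},
    ),
]

def languages_covering(synsets, category):
    """Sorted list of languages that have a lexeme in any synset of this category."""
    return sorted({lang for s in synsets if s.category is category
                   for lang in s.lexemes})

print(languages_covering(examples, SynsetCategory.IN_FAMILY))
```

Tagging each synset with its category makes it straightforward to ask which concepts can be linked across all languages and which need language-specific handling.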


Hyderabad, India


NLP-MT Lab – what they are working on

Sub-areas of NLP

• syntax and parsing

• semantics and word sense disambiguation

• discourse and treebanking

• machine translation

• creation of linguistic resources


Sampark

• Sampark is a multipart machine translation system developed through the combined efforts of 11 institutions under the umbrella of the consortium project “Indian Language to Indian Language Machine Translation” (ILMT), funded by the TDIL program of the Dept. of IT, Govt. of India. The ILMT project has developed language technology for 9 Indian languages, resulting in MT for 18 language pairs: 14 covering both directions between Hindi and each of Urdu / Punjabi / Telugu / Bengali / Tamil / Marathi / Kannada, and 4 covering both directions between Tamil and each of Malayalam / Telugu.
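
The counts on this slide can be checked by enumerating the translation directions. The following Python sketch does only that arithmetic; it is not part of Sampark, and the names `directed_pairs`, `HINDI_PARTNERS`, and `TAMIL_PARTNERS` are illustrative.

```python
# Languages paired with each hub language, as listed on the slide.
HINDI_PARTNERS = ["Urdu", "Punjabi", "Telugu", "Bengali", "Tamil", "Marathi", "Kannada"]
TAMIL_PARTNERS = ["Malayalam", "Telugu"]

def directed_pairs(hub, partners):
    """Return (source, target) pairs in both directions between hub and each partner."""
    pairs = []
    for partner in partners:
        pairs.append((hub, partner))
        pairs.append((partner, hub))
    return pairs

pairs = directed_pairs("Hindi", HINDI_PARTNERS) + directed_pairs("Tamil", TAMIL_PARTNERS)
languages = sorted({lang for pair in pairs for lang in pair})

print(len(pairs))      # 7 * 2 + 2 * 2 = 18 translation directions
print(len(languages))  # 9 distinct languages
```

The 18 "language pairs" therefore count each translation direction separately: 14 involving Hindi and 4 involving Tamil.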


Creating Treebanks


What to look for in a useful treebank


Tagging and tagsets


What you are doing with annotation

• Creating a careful selection of analyses for common constructions

• Could be used to model a grammar for the language

• Then extract the grammar from a corpus

• Extracted grammars have better coverage and include statistical information

• Extracted grammars are noisier and lack rich features.
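
To make the idea of extracting a grammar from an annotated corpus concrete, here is a minimal Python sketch that reads bracketed trees, counts the rewrite rules they contain, and turns the counts into relative frequencies (a simple probabilistic context-free grammar). The two-sentence English treebank, the label set, and the function names are all hypothetical toy choices for illustration; a real treebank would use its own annotation scheme and far more data.

```python
import re
from collections import Counter

def parse_tree(text):
    """Parse one bracketed tree, e.g. "(S (NP (N dogs)) (VP (V bark)))",
    into nested (label, children) tuples; leaf words stay plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def productions(tree):
    """Yield every rule (lhs, rhs) used in the tree, top-down."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def extract_pcfg(treebank):
    """Map each rule to count(rule) / count(all rules with the same left-hand side)."""
    counts = Counter(p for text in treebank for p in productions(parse_tree(text)))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical toy treebank.
TREEBANK = [
    "(S (NP (N dogs)) (VP (V bark)))",
    "(S (NP (N cats)) (VP (V chase) (NP (N dogs))))",
]

grammar = extract_pcfg(TREEBANK)
for (lhs, rhs), prob in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

The rule probabilities are the "statistical information" an extracted grammar carries. The sketch also shows the trade-off: every annotation error in the trees becomes a rule, and the rules have only the plain category labels the annotation provides.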


What resources are

