EACL 2003, Budapest : April 12 – 17, 2003
Computational Linguistics for South Asian Languages
Expanding Synergies with Europe
CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES
Dr. B. Mallikarjun
Central Institute of Indian Languages
Mysore 570 006, INDIA
[email protected]/faculty/mallikarjun.html
www.ciilcorpora.net
1. Current status of corpora – major Indian languages
2. Current status of corpora - minor Indian languages
3. Importance of minor languages corpora
4. Objectives
5. Categorization of minor languages for corpora building
6. Minor languages: A sample
7. Issues in corpora building
8. Corpus processing tools – a. Basic b. Advanced
9. Conclusion and a mission
India has 1652 mother tongues belonging to 4 language families. The Constitution of India, in its 8th Schedule, has recognized 18 languages spoken by 96.29% of the population.
Assamese: 2,622,836
Bengali: 3,535,863
Gujarati: (not given)
Hindi: 3,003,004
Kannada: 2,239,537
Kashmiri: 2,266,588
Konkani: (not given)
Malayalam: 2,349,526
Manipuri: (not given)
Marathi: 2,213,241
Nepali: (not given)
Oriya: 2,727,670
Punjabi: 1,966,260
Sanskrit: (not given)
Sindhi: (not given)
Tamil: 3,381,525
Telugu: 3,967,926
Urdu: 1,64,125
* Different quantum.
* Comparable quality.
* Quantum and coverage are inadequate for wider NLP activities.
* The corpora need to be augmented with wider coverage.
* Attempts at enhancement face some problems that need an immediate solution.
* 1634 are minor languages spoken by 3.71% of the population.
* Indo-Aryan and Dravidian language families have both major and minor languages.
* Almost all the languages of the other two families, Munda and Tibeto-Burman, are “minor” languages.
* Text corpora building has not taken place in these languages.
Minor languages hardly attract the attention of the policy makers anywhere in the world.
They are endangered in the Indian social, educational and linguistic contexts.
Linguists take great interest in studying the richness of languages and try to save endangered languages from extinction.
Minor languages hardly attract technological research or become a source for it.
Technology has made it possible to empower all languages, whether major or minor.
Creating corpora in minor languages, especially those that have little or no written literature, has certain critical advantages for linguistic computing.
Experimentation with corpora designs and standards is more easily done in these languages because of the manageable quantum of data.
Archiving and cross-linguistic comparison, within a language family and across language families. Use language technology for the preservation and continued use of these languages.
Fine-tune language analysis where a grammatical analysis is available. Where no such analysis is available, use the machine-readable form of the texts to produce as precise an analysis of the language as possible. Also use some of the minor-language corpora for machine translation purposes.
Speech corpora have even more significance in minor languages, since most of them exist in spoken form and many are yet to be rendered into written form.
Indigenous knowledge systems: Most of the minor languages are repositories of cultural heritage and a treasure house of indigenous knowledge systems. Once this knowledge is available in machine-readable form, it can be made available to the universal knowledge base by using UNL.
Minor languages can be classified into 3 groups on the basis of the issues to be tackled while building corpora.
First category: Languages other than the 18 major languages that have a good amount of literary and other texts and are also used in wider domains, such as Bodo, Kurukh, Maithili, Santhali, Tripuri, etc.
Second category: Languages with a limited quantity of written texts that are not widely used in domains such as education and administration, such as Kodava, Tulu, etc.
Third category: Languages available only in spoken form and yet to be rendered into written form, such as Toda, Kota, Yerava, etc.
Name | Lg. family | Text | Script | No. of speakers
Maithili | Indo-Aryan | Yes | Devanagari | 77,66,597
Kodava or Coorgi | Dravidian | Very less | Kannada | 97,011
Yerava | Dravidian | Indigenous Knowledge System | No script | 13,689
These languages are representative of the ground linguistic reality in India.
Issue | Major language | Minor language
Text | Sampling: domains, period | All available text / all transcribed speech (Maithili, Kodava and Yerava)
Technical: keyboard, input and storage | Standard software based on the grammar of the concerned script and UNICODE for Kannada: 1, 2, 3, 4. | Incompatibility: the adopted software does not accommodate all the features of Maithili, Kodava and Yerava
Frequency count of words and syllables:
The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications are made before they are used.
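A frequency count of this kind can be sketched in a few lines of Python. This is only an illustration, not the facility referred to above: it assumes whitespace-separated Unicode text, and `split_aksharas` is a hypothetical helper that approximates a syllable by the orthographic unit (consonant cluster plus vowel signs), which is a simplification for Devanagari and Kannada script. Punctuation and joiner characters are not handled.

```python
import unicodedata
from collections import Counter

# Virama (halant) signs that bind consonants into one cluster.
VIRAMAS = {"\u094D", "\u0CCD"}  # Devanagari, Kannada


def split_aksharas(word):
    """Split a word into rough orthographic syllables (aksharas)."""
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        after_virama = bool(units) and units[-1][-1] in VIRAMAS
        if units and (is_mark or after_virama):
            units[-1] += ch   # vowel sign, modifier, or next conjunct member
        else:
            units.append(ch)  # start a new unit
    return units


def frequency_counts(text):
    """Return (word_counts, syllable_counts) for whitespace-separated text."""
    words = text.split()
    word_counts = Counter(words)
    syllable_counts = Counter(a for w in words for a in split_aksharas(w))
    return word_counts, syllable_counts


# Toy input, not taken from any of the corpora discussed here.
word_counts, syllable_counts = frequency_counts("काका कमल काका")
print(word_counts.most_common(1))
print(syllable_counts.most_common(1))
```

A language without a script of its own, such as Yerava, would need the same counts over whatever transcription is adopted, with a splitting rule suited to that transcription.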
Comparison of Maithili, Kodava and Yerava Corpora
Statistical distribution | Maithili | Kodava | Yerava
Corpus size | 328146 | 9432 | 3881
Word types | 51902 | 6050 | 3030
Most frequent syllable | ka | ra | ru
Average word length % | 3.52 | 5.70 | 3.10
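The corpus size, word types and average word length rows of such a comparison could be computed roughly as follows. This is a minimal sketch under stated assumptions: tokens are split on whitespace, and word length is measured in characters, since the slides do not say which unit was used. The function name `corpus_statistics` is hypothetical.

```python
def corpus_statistics(text):
    """Compute simple corpus statistics for whitespace-separated text."""
    tokens = text.split()
    if not tokens:
        return {"corpus_size": 0, "word_types": 0, "average_word_length": 0.0}
    total_length = sum(len(token) for token in tokens)
    return {
        "corpus_size": len(tokens),        # running words (tokens)
        "word_types": len(set(tokens)),    # distinct word forms
        "average_word_length": round(total_length / len(tokens), 2),
    }


# Toy romanized input, used only to show the shape of the result.
print(corpus_statistics("ka ra ka ru"))
```

Running the same function over each corpus gives figures that can be compared side by side.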
Statistical distribution | Maithili | Hindi (CIIL) | Hindi (Naiduniya) | Hindi (India Today) | Hindi (Premchand)
Corpus size | 328146 | 2327129 | 3140729 | 1566779 | 671171
Word types | 51902 | 189860 | 71953 | 47640 | 24745
Most frequent syllable | ka | ka | ka | ka | ka
Average word length % | 3.52 | 4.96 | 4.96 | 4.71 | 4.36
Statistical distribution | Kodagu | Yerava | Kannada | Malayalam
Corpus size | 9432 | 3881 | 1977987 | 2119935
Word types | 6050 | 3030 | 346850 | 526802
Average word length % | 5.70 | 3.10 | 8.68 | 10.25
Average sentence length % | 4.64 | 4.36 | 8.42 | 6.93
Most frequent syllable | ra | ru | ru | ra
1. Key Word in Context
2. Search by required word
3. Sorting and indexing
The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications can be made before they are used.
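The three facilities listed above can be illustrated with a short Python sketch. It is not the CIIL tool; the function names `kwic` and `build_index` and the `width` parameter are assumptions made for the example, and the input is a plain list of tokens.

```python
def kwic(tokens, keyword, width=3):
    """Key Word in Context: return (left, keyword, right) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    # Sort the concordance lines by their right-hand context.
    return sorted(lines, key=lambda line: line[2])


def build_index(tokens):
    """Map each word form to the list of positions where it occurs."""
    index = {}
    for position, token in enumerate(tokens):
        index.setdefault(token, []).append(position)
    return index


# Placeholder tokens; any tokenized corpus text can be used instead.
tokens = "a b a c a d".split()
for left, key, right in kwic(tokens, "a", width=1):
    print(f"{left:>5} [{key}] {right}")
print(build_index(tokens)["a"])
```

Sorting by the right-hand context is one common choice; sorting by the left context or by position works the same way with a different key.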
1. Part-of-speech tagging
2. Morphological analyzer
1. The non-availability of a standard basic tag set is one of the major drawbacks.
2. Each institution or group of scholars uses its own notation: CLAWS, Research institution in IT, CIIL (Maj. lg.), CIIL (Min. lg.)
3. The tagging tools being developed, even for major languages, are at different stages of development.
4. The POS tagging tool developed for Hindi can be tried out first on Maithili to test its viability. Hindi itself does not yet have a fully working POS tagging tool.
5. Because the data in Kodava and Yerava are limited, manual tagging is preferred.
The morphological analyzers designed for the minor languages of India should be sensitive enough to take care of their specific features.
1. Tagged lexicon
2. Rules to cover the processes of:
Inflection: suffixing is normally based on the word ending
Derivation: both prefixing and suffixing are possible, depending on the lexical item
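How a tagged lexicon and ending-sensitive suffix rules fit together can be shown with a toy analyzer. This is only a sketch: the two lexicon entries and three rules are simplified romanized Kannada chosen for illustration (they are not from the talk), the names `LEXICON`, `SUFFIX_RULES` and `analyze` are hypothetical, and derivation, sandhi and the minor languages themselves are not covered.

```python
# Tagged lexicon: stem -> part-of-speech tag (illustrative entries only).
LEXICON = {"mane": "N", "mara": "N"}  # 'house', 'tree'

# Inflection rules: (suffix, required stem ending, feature label).
SUFFIX_RULES = [
    ("galu", "", "PLURAL"),  # no restriction on the stem ending here
    ("ge", "e", "DATIVE"),   # stems ending in -e
    ("kke", "a", "DATIVE"),  # stems ending in -a
]


def analyze(word):
    """Return every (stem, tag, feature) analysis licensed by the rules."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], "BARE"))
    for suffix, ending, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The suffix is accepted only if the stem is known and its
            # ending matches what the rule requires.
            if stem in LEXICON and stem.endswith(ending):
                analyses.append((stem, LEXICON[stem], feature))
    return analyses


print(analyze("manege"))
print(analyze("marakke"))
print(analyze("marage"))
```

The last call returns no analysis because the suffix does not match the stem ending, which is the kind of word-ending condition the inflection rule describes.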
The Yerava word ‘-ati’ has three meanings, ‘to sweep’, ‘wind blow’ and ‘bottom’, and the right one has to be chosen depending on the context. In such cases the morphological analyzer needs a semantic tool.
The Kodava word bappe means ‘I am coming’, but when it is used in the context of leave-taking it means ‘I am leaving.’ Cultural nuances of leave-taking do not allow one to use the word poope ‘going or leaving’, because it would only mean that the person is saying the ultimate good-bye to this world. The meaning of such words can be judged only with knowledge of the culture represented by the language.
Ambiguities are seen in three senses: word sense, pronoun sense and structural sense. Word sense ambiguities arise from words having multiple meanings, which are found in all languages. As for the second, pronominal and adjectival anaphora are also ambiguous. For English, disambiguation tools have been developed. Since the inception of lexical databases such as WordNet and EuroWordNet, researchers seem to have overcome the ambiguity problem to a certain extent.
In the case of Indian languages, however, in the absence of such a sensitive tool, one has to disambiguate manually, even in the case of major languages.
Minor languages need better linguistic analysis to arrive at tangible and usable disambiguation procedures.
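As an illustration of what a simple disambiguation procedure might look like, the sketch below picks a sense by counting overlaps between the context and a set of cue words per sense, in the spirit of gloss-overlap methods. The three senses are those given earlier for the Yerava word; the cue words are invented English glosses used purely as placeholders, and `SENSES` and `disambiguate` are hypothetical names.

```python
# Sense inventory: sense gloss -> cue words (placeholder English glosses).
SENSES = {
    "to sweep": {"floor", "broom", "house", "clean"},
    "wind blow": {"wind", "storm", "rain", "tree"},
    "bottom": {"pot", "well", "under", "below"},
}


def disambiguate(context_glosses, senses=SENSES):
    """Pick the sense whose cue words overlap most with the context."""
    context = set(context_glosses)
    scores = {sense: len(cues & context) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlap at all, leave the decision to a human annotator.
    return best if scores[best] > 0 else None


print(disambiguate(["broom", "house"]))
print(disambiguate(["river"]))
```

Building the cue sets is exactly where the better linguistic analysis mentioned above would be needed; the code only shows how such knowledge would be applied once it exists.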
India abounds in many endangered languages. Technology can actually help maintain a language.
Technology should immediately take into account the concerns of minority languages. In particular, the technologies built for the major languages of the region should accommodate the needs of the minor languages too.
Corpora building in minor languages poses new challenges and calls for novel ways to accommodate and adequately describe the distinctive features of these languages.
Comparison of corpora studies within a family of languages, across families of languages and at the international level will help bring out a standard module for developing corpora.
Thank You
8.1 Kannada Code Chart
Aesthetics
Literature: Novel, Short Story, Essays, Criticism, Humour, Children's Literature, Biographies & Autobiographies, Travelogues, Letters/Diaries/Speeches, Plays, Science Fiction, Folk Tales, Text Books (School)
Fine Arts: Music, Dance/Impersonations, Drawing, Sculpture, Musical Instruments, Hobbies
Natural, Physical And Professional Sciences: Botany, Zoology, Geology, Geography, Bio Chemistry, Micro Biology, Physics, Chemistry, Mathematics, Statistics, Computer Sciences, Astronomy, Text Book (Science), Medicine, Ayurveda, Homeopathy, Yoga, Naturopathy, Engineering, Architecture, Oceanology, Agriculture, Veterinary, Film Technology, Photography, Marine Biology, Fisheries, Textile Technology
Social Sciences: Sociology, Linguistics, Psychology, Anthropology, History, Archeology, Epigraphy, Political Science, Home Science, Library Science, Religion, Philosophy, Economics, Logic, Journalism, Folklore/Mythology, Public Administration, Law, Business Management, Education, Text Books-Social Science, Demography, Astrology, Criminology, Physical Education / Sports, Health and Family Welfare, Forestry, Sexology, Culture & Anthropology
Commerce: Banking, Accountancy, Industry & Handicrafts, Finance, Textile Technology
Official And Media Languages: Mass Media, Legislative, Administrative
Translated Material: Literature, Scientific, Legal, Administration, Translated Psychology