
Bengali Part-Of-Speech Tagging


A

PROJECT REPORT

ON

PART-OF-SPEECH TAGGING

FOR BENGALI

IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF

MASTER OF COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

ASSAM UNIVERSITY, SILCHAR

2016

Submitted by:

DEEPANKAR DAS

Roll: 101614 No.: 22220380

Under the Guidance of

PROF. BIPUL SYAM PURKAYASTHA

HEAD OF DEPARTMENT, PROFESSOR

DEPARTMENT OF COMPUTER SCIENCE

ASSAM UNIVERSITY, SILCHAR-788011



CERTIFICATE

This is to certify that Deepankar Das, bearing Roll: 101614 No: 22220380, has carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under my supervision in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science of Assam University, Silchar. He has worked sincerely in preparing this project. He has fulfilled all the requirements laid down in the regulations of the MSc (2 years) 4th Semester Examination (Paper MS-405) of the Department of Computer Science, Assam University, Silchar, for the session 2015-2016.

Date: Signature of the Guide

Place: (PROF. BIPUL SYAM PURKAYASTHA)

Supervisor, Professor

Department of Computer Science

Assam University, Silchar



CERTIFICATE

This is to certify that Deepankar Das, bearing Roll: 101614 No: 22220380, has carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under my supervision in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science of Assam University, Silchar. He has worked sincerely in preparing this project. He has fulfilled all the requirements laid down in the regulations of the MSc (2 years) 4th Semester Examination (Paper MS-405) of the Department of Computer Science, Assam University, Silchar, for the session 2015-2016.

Date: Signature of the HOD

Place: (PROF. BIPUL SYAM PURKAYASTHA)

HOD, Professor





DECLARATION

I, Deepankar Das, student of 4th semester (MSc 2 years), Department of Computer

Science do hereby solemnly declare that I have duly worked on my project

entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under the supervision of Prof. Bipul

Syam Purkayastha, Professor, Department of Computer Science, Assam

University, Silchar.

Date: Signature

Place: ( Deepankar Das )

MSc 4th Semester

Roll: 101614 No.: 22220380

Regn. No.: 02-110018703 of 2011-12





ACKNOWLEDGEMENT

At the very outset, I take the privilege to convey my gratitude to those persons whose co-operation, suggestions and heartfelt support helped me to complete this project successfully.

I take immense pleasure in expressing my sincere thanks and profound gratitude to my respected guide, Prof. Bipul Syam Purkayastha, Head of the Department of Computer Science, Assam University, Silchar, for his excellent and able guidance, and for the valuable suggestions and encouragement he rendered towards completing this project.

I am also indebted to my family members, friends and well-wishers, who encouraged me to do this work with vigor and seriousness.

Last but not least, I would like to acknowledge the cooperation I received from the entire staff of our department, and to thank all those who directly or indirectly extended their helping hands and moral support during this project.

( Deepankar Das )



Table of Contents

Chapter 1 Introduction
1.1 NLP
1.2 Applications of NLP
1.3 POS Tagging
1.4 The POS Tagging Problem
1.5 Applications of POS Tagging
1.6 Motivation
1.7 Goals of Our Work
1.8 Organization of the Report

Chapter 2 Prior Work
2.1 Prior Work in POS Tagging
2.2 Linguistic Taggers
2.3 POS Tagging Approaches
2.4 Indian Language POS Taggers

Chapter 3 Foundational Consideration
3.1 Corpora Collection
3.2 The Tagset

Chapter 4 Tagging with Rule Based Approach
4.1 Rule Based Approach
4.2 Our Approach

Chapter 5 Experimental Result & Discussion
5.1 Tools Used
5.2 Graphical User Interface
5.3 Experimental Result
5.4 Result Discussion

Chapter 6 Conclusion & Future Direction
6.1 Conclusion
6.2 Future Work

References



Abstract

Part-of-Speech (POS) tagging is the process of assigning the appropriate part of speech or lexical category to each word in a natural language sentence. It is an important part of Natural Language Processing (NLP) and is useful for most NLP applications. It is often the first stage of natural language processing, after which further processing such as chunking and parsing is done.

POS tagging is considered one of the basic, necessary tools of NLP. Its simplified form is commonly taught to school-age children as the identification of words as nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, etc. Development of any Indian language POS tagger will influence several pipelined modules of a natural language understanding system, including Information Extraction (IE), Information Retrieval (IR), Machine Translation (MT), Partial Parsing (PP) and Word Sense Disambiguation (WSD). Our objective in this work is to develop an effective POS tagger for the Bengali language. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.

Bengali is the main language spoken in Bangladesh, the second most commonly

spoken language in India, and the seventh most commonly spoken language in the world with

nearly 230 million total speakers (189 million native speakers). Natural language processing

of Bengali is in its infancy. POS tagging of Bengali is a necessary component for most NLP

applications of Bengali.

The developed system was tested on a set of experimental data, and the results were analysed. The system achieves an accuracy of over 74.50%. The performance can be improved by increasing the size of the lexicon.



CHAPTER 1

Introduction



1.1 NLP

The goal of natural language processing (NLP) is to build computational models of natural language for its analysis and generation. First, there is the technological motivation of building intelligent computer systems such as machine translation systems, natural language interfaces to databases, man-machine interfaces to computers in general, speech understanding systems, text analysis and understanding systems, computer-aided instruction systems, and systems that read and understand printed or handwritten text. Second, there is a cognitive and linguistic motivation: to gain a better insight into how humans communicate using natural language (NL).

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages; it began as a branch of artificial intelligence. In theory, natural language processing is a very attractive method of human-computer interaction. Natural language understanding is sometimes referred to as an AI-complete problem because it seems to require extensive knowledge about the outside world and the ability to manipulate it. NLP is a collection of techniques used to extract grammatical structure and meaning from input in order to perform a useful task; natural language generation, in turn, builds output based on the rules of the target language and the task at hand. NLP is useful in tutoring systems, duplicate detection, computer-supported instruction and database interfaces, as it provides a pathway for increased interactivity and productivity.

The tools of work in NLP are grammar formalisms, algorithms and data structures, formalisms for representing world knowledge, reasoning mechanisms, etc. Many of these have been taken from, and inherit results from, Computer Science, Artificial Intelligence, Linguistics, Logic and Philosophy.

1.2 Applications of NLP

Automatic summarization: Produce a readable summary of a chunk of text. Often used to

provide summaries of text of a known type such as articles in the financial section of a

newspaper.

Machine translation: Automatically translate text from one human language to another. This

is one of the most difficult problems, and is a member of a class of problems colloquially


termed "AI-complete", i.e. requiring all of the different types of knowledge that humans

possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

Morphological segmentation: Separate words into individual morphemes and identify the

class of the morphemes. The difficulty of this task depends greatly on the complexity of the

morphology (i.e. the structure of words) of the language being considered. English has fairly

simple morphology, especially inflectional morphology, and thus it is often possible to ignore

this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened,

opening") as separate words. In languages such as Turkish, however, such an approach is not

possible, as each dictionary entry has thousands of possible word forms. The same holds for

Manipuri, which is a highly agglutinative Indian language.

Named entity recognition (NER): Given a stream of text, determine which items in the text

map to proper names, such as people or places, and what the type of each such name is (e.g.

person, location, organization). Note that, although capitalization can aid in recognizing

named entities in languages such as English, this information cannot aid in determining the

type of named entity, and in any case is often inaccurate or insufficient. For example, the first

word of a sentence is also capitalized, and named entities often span several words, only

some of which are capitalized. Furthermore, many other languages in non-Western scripts

(e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with

capitalization may not consistently use it to distinguish names. For example, German

capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do

not capitalize names that serve as adjectives.

Natural language generation: Convert information from computer databases into readable

human language.

Natural language understanding: Convert chunks of text into more formal representations

such as first-order logic structures that are easier for computer programs to manipulate.

Natural language understanding involves identifying the intended meaning from among the multiple possible meanings that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introducing and creating a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, free of confusion with implicit assumptions such as the closed world assumption (CWA) vs. the open world assumption, or subjective Yes/No vs. objective True/False, is expected to form the basis of semantics formalization.

Optical character recognition (OCR): Given an image representing printed text, determine

the corresponding text.

Part-of-speech tagging (POST): Given a sentence, determine the part of speech for each

word. Many words, especially common ones, can serve as multiple parts of speech. For

example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set"

can be a noun, verb or adjective; and "out" can be any of at least five different parts of

speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language when spoken, and tonal distinctions are not readily conveyed by its orthography.

Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar

for natural languages is ambiguous and typical sentences have multiple possible analyses. In

fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses

(most of which will seem completely nonsensical to a human).

Question answering: Given a human-language question, determine its answer. Typical

questions have a specific right answer (such as "What is the capital of Canada?"), but

sometimes open-ended questions are also considered (such as "What is the meaning of

life?"). Recent works have looked at even more complex questions.

Relationship extraction: Given a chunk of text, identify the relationships among named

entities (e.g. who is the wife of whom).

Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of

text, find the sentence boundaries. Sentence boundaries are often marked by periods or other

punctuation marks, but these same characters can serve other purposes (e.g. marking

abbreviations).

Sentiment analysis: Extract subjective information usually from a set of documents, often

using online reviews to determine "polarity" about specific objects. It is especially useful for

identifying trends of public opinion in the social media, for the purpose of marketing.


Speech recognition: Given a sound clip of a person or people speaking, determine the textual

representation of the speech. This is the opposite of text to speech and is one of the extremely

difficult problems colloquially termed "AI-complete" (see above). In natural speech there are

hardly any pauses between successive words, and thus speech segmentation is a necessary

subtask of speech recognition (see below). Note also that in most spoken languages, the

sounds representing successive letters blend into each other in a process termed

coarticulation, so the conversion of the analog signal to discrete characters can be a very

difficult process.

Speech segmentation: Given a sound clip of a person or people speaking, separate it into

words. A subtask of speech recognition and typically grouped with it.

Topic segmentation and recognition: Given a chunk of text, separate it into segments each of

which is devoted to a topic, and identify the topic of the segment.

Word segmentation: Separate a chunk of continuous text into separate words. For a language

like English, this is fairly trivial, since words are usually separated by spaces. However, some

written languages like Chinese, Japanese and Thai do not mark word boundaries in such a

fashion, and in those languages text segmentation is a significant task requiring knowledge of

the vocabulary and morphology of words in the language.

Word sense disambiguation: Many words have more than one meaning; we have to select the

meaning which makes the most sense in context. For this problem, we are typically given a

list of words and associated word senses, e.g. from a dictionary or from an online resource

such as WordNet.

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

Information retrieval (IR): This is concerned with storing, searching and retrieving

information. It is a separate field within computer science (closer to databases), but IR relies

on some NLP methods (for example, stemming). Some current research and applications seek

to bridge the gap between IR and NLP.

Information extraction (IE): This is concerned in general with the extraction of semantic

information from text. This covers tasks such as named entity recognition, coreference

resolution, relationship extraction, etc.
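As a small illustration of the sentence breaking task listed above, the following Python sketch splits text at sentence-final punctuation while skipping a few known abbreviations. The abbreviation list and the function name `split_sentences` are assumptions made for this illustration and are not part of the system described in this report. The Bengali danda (।) is treated as a sentence terminator alongside the period, question mark and exclamation mark.

```python
# Hypothetical abbreviation list; a practical system would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Characters treated as sentence terminators (the last one is the Bengali danda).
TERMINATORS = ".?!।"


def split_sentences(text):
    """Split text into sentences at terminators, ignoring known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in TERMINATORS
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a terminator
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Das wrote the report. It covers tagging."))
```

In this sketch the period after "Dr" does not end a sentence because the token is in the abbreviation list, which is exactly the ambiguity the task description mentions.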


1.3 POS Tagging

Part-of-Speech (POS) tagging is the process of automatically annotating lexical categories: it assigns an appropriate part-of-speech tag to each word in a sentence of a natural language. The development of an automatic POS tagger requires either a comprehensive set of linguistically motivated rules or a large annotated corpus. Such rules and corpora, however, have been developed for only a few languages, such as English. POS taggers for Indian languages are not readily available because such rules and large annotated corpora are lacking.

A part of speech is a grammatical category; the categories commonly include nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections. Parts of speech can be divided into two broad categories: closed classes and open classes. Closed classes are those that have relatively fixed membership. For example, pronouns form a closed class because there is a fixed set of them in English; new pronouns are rarely added. Nouns, by contrast, form an open class because new nouns are continually added in every language.

The linguistic approach, the classical approach to POS tagging, was first explored in the mid-sixties and seventies (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971). People manually engineered rules for tagging. The most representative of these pioneering taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the Brown Corpus. The development of ENGTWOL (an English tagger based on the constraint grammar architecture) can be considered the most important in this direction (Karlsson et al., 1995). These taggers typically use rule-based models manually written by linguists. The advantage of this model is that the rules are written from a linguistic point of view and can be made to capture complex kinds of information. This allows the construction of an extremely accurate system. But handling all the rules is not easy and requires expertise. The context frame rules have to be developed by language experts, so developing a rule-based POS tagger is costly and difficult. Further, with rule-based POS tagging, transferring the tagger to another language means starting from scratch again.
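To give a flavour of what hand-written context frame rules look like, the following Python sketch tags one English sentence using a toy lexicon and four rules. The lexicon, the tag names (V, N, DET, PREP, ADV, ADJ) and the rules themselves are assumptions made only for this illustration; they are not the rules of TAGGIT, ENGTWOL or the system developed in this report, and they would not generalize beyond toy examples.

```python
# Toy lexicon (an assumption): each word maps to its set of possible tags.
LEXICON = {
    "keep": {"V", "N"},
    "the": {"DET"},
    "book": {"N", "V"},
    "on": {"PREP", "ADV", "ADJ"},
    "top": {"ADJ", "N"},
    "shelf": {"N"},
}


def candidates_for(word):
    """Possible tags for a word; unknown words default to noun."""
    return LEXICON.get(word.lower(), {"N"})


def rule_based_tag(words):
    """Tag words left to right using hand-written context rules."""
    tags = []
    for i, word in enumerate(words):
        cands = candidates_for(word)
        prev_tag = tags[-1] if tags else None
        next_cands = candidates_for(words[i + 1]) if i + 1 < len(words) else set()
        if len(cands) == 1:
            tag = next(iter(cands))            # unambiguous word
        elif i == 0 and "V" in cands:
            tag = "V"                          # rule 1: sentence-initial imperative verb
        elif prev_tag == "DET" and "ADJ" in cands and next_cands == {"N"}:
            tag = "ADJ"                        # rule 2: DET _ N  => adjective
        elif prev_tag == "DET" and "N" in cands:
            tag = "N"                          # rule 3: DET _    => noun
        elif "PREP" in cands and next_cands == {"DET"}:
            tag = "PREP"                       # rule 4: _ DET    => preposition
        else:
            tag = sorted(cands)[0]             # fallback: arbitrary but deterministic
        tags.append(tag)
    return list(zip(words, tags))


print(rule_based_tag("Keep the book on the top shelf".split()))
```

Even this tiny example shows why the approach is costly: every rule encodes a linguistic judgement, rule order matters, and none of it carries over to another language.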

On the other hand, recent machine learning techniques make use of annotated corpora to acquire high-level language knowledge for different tasks, including POS tagging. This knowledge is estimated from corpora in which the words are usually tagged with the correct part-of-speech labels. Machine learning based tagging techniques facilitate the development of taggers in a shorter time, and these techniques can be transferred for use with corpora of other languages. Several machine learning algorithms have been developed for the POS disambiguation task. These algorithms range from instance-based learning to several graphical models. The knowledge acquired may be in the form of rules, decision trees, probability distributions, etc. The encoded knowledge in stochastic methods may or may not have a direct linguistic interpretation. But typically such taggers need to be trained with a substantial amount of annotated data to achieve high accuracy. Though significant amounts of annotated corpora are often not available for most languages, it is easier to obtain large volumes of un-annotated text for most of them. The implication is that one may explore the power of semi-supervised and unsupervised learning mechanisms to obtain a POS tagger.
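The simplest way to see how knowledge can be estimated from an annotated corpus is a most-frequent-tag baseline: count how often each word carries each tag and always predict the commonest one. The Python sketch below is only such a baseline, with a three-sentence toy corpus and function names invented for this illustration; it is not one of the models discussed in this report.

```python
from collections import Counter, defaultdict

# Toy annotated corpus (an assumption): sentences as lists of (word, tag) pairs.
TAGGED_CORPUS = [
    [("the", "DET"), ("book", "N")],
    [("book", "V"), ("a", "DET"), ("flight", "N")],
    [("read", "V"), ("the", "DET"), ("book", "N")],
]


def train_unigram_tagger(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(words, model, default_tag="N"):
    """Tag each word with its most frequent tag; unseen words get default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


model = train_unigram_tagger(TAGGED_CORPUS)
print(tag(["book", "the", "hotel"], model))
```

Because "book" occurs twice as a noun and once as a verb in the toy corpus, the baseline always tags it as a noun, which shows both how little effort such a tagger takes and why context-aware models need more annotated data to do better.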

Our interest is in developing a tagger for the Bengali language. Annotated corpora are not readily available for this language, but the language is morphologically rich. The use of morphological features of a word, as well as word suffixes, can enable us to develop a POS tagger with limited resources. In the present work, these morphological features (affixes) have been incorporated in different machine learning models (Maximum Entropy, Conditional Random Field, etc.) to perform the POS tagging task. This approach can be generalized for use with any morphologically rich language in a resource-poor scenario.
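As an illustration of how affixes can be turned into features for a learning model, the following Python sketch extracts prefixes and suffixes of up to three characters from a word. The function name `affix_features`, the feature names and the maximum length are assumptions made for this sketch, not the feature set actually used in this work. Note that Python slices strings by Unicode code point, so for Bengali a "character" here is a code point (a consonant or a vowel sign), not a full written syllable.

```python
def affix_features(word, max_len=3):
    """Return a feature dictionary holding the word and its short affixes."""
    features = {"word": word}
    for n in range(1, max_len + 1):
        if len(word) > n:  # skip affixes that would cover the whole word
            features[f"prefix_{n}"] = word[:n]
            features[f"suffix_{n}"] = word[-n:]
    return features


# English example: the suffix "ing" is a typical cue for a verb form.
print(affix_features("tagging"))

# Bengali example: the last two code points of this word are the ending "রা".
print(affix_features("ছেলেরা")["suffix_2"])
```

A feature dictionary of this shape can be passed to most feature-based classifiers, so the same extraction step would serve a Maximum Entropy model or a Conditional Random Field.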

The development of a tagger requires either an exhaustive set of linguistic rules or a large amount of annotated text. However, no tagged corpus was available to us for this task, so we had to start by creating tagged resources for Bengali. Manual part-of-speech tagging is a time-consuming and difficult process. We therefore tried to work with methods in which a small amount of tagged resources can be used to carry out the part-of-speech tagging task effectively.

1.4 The Part-of-Speech Tagging Problem

Natural languages are ambiguous in nature. Ambiguity appears at different levels of the

natural language processing (NLP) task. Many words take multiple part of speech tags. The

correct tag depends on the context.

Consider, for instance, the following English and Bengali sentences:

1. Keep the book on the top shelf.

2. সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে

The sentences contain a lot of POS ambiguity, which should be resolved before the sentences can be understood. For instance, in example sentence 1, the words “keep” and “book” can each be a noun or a verb; “on” can be a preposition, an adverb or an adjective; finally, “top” can be either an adjective or a noun. Similarly, in Bengali example sentence 2, the word “তারা” can be either a noun or a pronoun; “দিয়ে” can be either a verb or a postposition; and “করে” can be a noun, a verb or a postposition. In most cases POS ambiguity can be resolved by examining the context of the surrounding words. Figure 1 shows a detailed analysis of the POS ambiguity of an English sentence considering only the basic 8 tags. A box with a single line indicates the correct tag for a word where no ambiguity exists, i.e. only one tag is possible for the word. In contrast, the boxes with a double line indicate the correct POS tag of a word from a set of possible tags.

Figure 1: POS ambiguity of an English sentence with eight basic tags.

Figure 2: POS ambiguity of a Bengali sentence with tagset of experiment.

Figure 2 illustrates the details of the ambiguity classes for the Bengali sentence as per the tagset used for our experiment. As we are using a fine-grained tagset compared to the basic 8 tags, the number of possible tags for a word increases. POS tagging is the task of assigning the appropriate grammatical tag to each word of an input text in its context of appearance. Essentially, the POS tagging task resolves ambiguity by selecting the correct tag from the set of possible tags for a word in a sentence.

[Figure 2 content: the sentence “সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে” with candidate tags drawn from N, PR, V and PSP.]
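To make the size of the disambiguation problem concrete, the following Python sketch takes English example sentence 1, looks up the set of possible tags of each word, and counts how many complete tag sequences exist. The tag sets for “keep”, “book”, “on” and “top” follow the discussion above; treating “the” and “shelf” as unambiguous, and the tag names themselves, are assumptions made for this illustration.

```python
from itertools import product
from math import prod

# Possible tags per word. "the" and "shelf" are assumed unambiguous here.
AMBIGUITY_CLASSES = {
    "keep": {"N", "V"},
    "the": {"DET"},
    "book": {"N", "V"},
    "on": {"PREP", "ADV", "ADJ"},
    "top": {"ADJ", "N"},
    "shelf": {"N"},
}


def candidate_tags(words):
    """Return the sorted list of possible tags for each word."""
    return [sorted(AMBIGUITY_CLASSES[w.lower()]) for w in words]


def count_taggings(words):
    """Number of distinct tag sequences the tagger must choose among."""
    return prod(len(tags) for tags in candidate_tags(words))


def all_taggings(words):
    """Enumerate every tag sequence (feasible only for short sentences)."""
    return list(product(*candidate_tags(words)))


sentence = "Keep the book on the top shelf".split()
print(count_taggings(sentence))  # 2 * 1 * 2 * 3 * 1 * 2 * 1 = 24
```

Only one of these 24 sequences is correct for the sentence, and the count grows multiplicatively with sentence length and with the granularity of the tagset, which is why a finer tagset makes the task harder.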



1.5 Applications of POS Tagging

The POS disambiguation task is useful in several natural language processing tasks. It is often the first stage of natural language understanding, after which further processing, e.g. chunking and parsing, is done. Part-of-speech tagging is of interest for a number of applications, including speech synthesis and recognition, machine translation and lexicography.

Most of the natural language understanding systems are formed by a set of pipelined

modules; each of them is specific to a particular level of analysis of the natural language text.

Development of a POS tagger influences several pipelined modules of the natural language

understanding task. As POS tagging is the first step towards natural language understanding, it

is important to achieve a high level of accuracy which otherwise may hamper further stages

of the natural language understanding. In the following, we briefly discuss some of the above

applications of POS tagging.

Speech synthesis and recognition: Part of speech gives a significant amount of information about a word and its neighbours, which can be useful in a language model for speech recognition (Heeman et al., 1997). The part of speech of a word also tells us something about how the word is pronounced, depending on its grammatical category (for example, the noun is pronounced OBject and the verb obJECT).

Information retrieval and extraction: By augmenting a query given to a retrieval system with POS information, more refined information extraction is possible. For example, if a person wants to search for documents containing “book” as a noun, adding the POS information will eliminate irrelevant documents that contain “book” only as a verb. Also, patterns used for information extraction from text often use POS references.

Machine translation: The probability of translating a word in the source language into a word in the target language is effectively dependent on the POS category of the source word.

As mentioned earlier, POS tagging has been used in several other applications, such as a preprocessor for high-level syntactic processing (noun phrase chunking), lexicography, stylometry and word sense disambiguation. These applications are discussed in some detail in (Church, 1988; Ramshaw and Marcus, 1995; Wilks and Stevenson, 1998).
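The retrieval example given above, searching for “book” as a noun, can be sketched in a few lines of Python. The two tagged documents, the tag names and the function `search` are hypothetical and serve only to show how a POS constraint narrows a query; a real retrieval system would use an index rather than a linear scan.

```python
# Hypothetical collection: each document is a list of (word, tag) pairs,
# as a POS tagger might produce.
TAGGED_DOCS = {
    "doc1": [("I", "PR"), ("read", "V"), ("the", "DET"), ("book", "N")],
    "doc2": [("Please", "ADV"), ("book", "V"), ("a", "DET"), ("flight", "N")],
}


def search(tagged_docs, term, pos=None):
    """Return ids of documents containing term, optionally only with tag pos."""
    hits = []
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            if word.lower() == term.lower() and (pos is None or tag == pos):
                hits.append(doc_id)
                break  # one match per document is enough
    return hits


print(search(TAGGED_DOCS, "book"))       # ['doc1', 'doc2']
print(search(TAGGED_DOCS, "book", "N"))  # ['doc1']
```

Without the POS constraint both documents match; with it, the document that uses “book” only as a verb is filtered out.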



1.6 Motivation

A lot of work has been done in part of speech tagging of several languages, such as English.

While some work has been done on the part of speech tagging of different Indian languages

(Ray et al., 2003; Shrivastav et al., 2006; Arulmozhi et al., 2006; Singh et al., 2006; Dalal et

al., 2007), the effort is still in its infancy. Very little work has been done previously with part

of speech tagging of Bengali. Bengali is the main language spoken in Bangladesh, the second

most commonly spoken language in India, and the seventh most commonly spoken language

in the world.

Apart from being required for further language analysis, Bengali POS tagging is of interest because of a number of applications, such as speech synthesis and recognition. Part of speech gives a significant amount of information about a word and its neighbours, which can be useful in a language model for different speech and natural language processing applications. Development of a Bengali POS tagger will also influence several pipelined modules of a natural language understanding system, including information extraction and retrieval, machine translation, partial parsing and word sense disambiguation. Existing POS tagging techniques show that developing a reasonably accurate POS tagger requires either an exhaustive set of linguistic rules or a large amount of annotated text. We have the following observations.

i. POS tagging has a wide range of applications.

ii. Reputed companies such as Google and Microsoft are concentrating on NLP applications, so POS tagging has gained importance.

iii. Part-of-speech tagging using a rule-based approach is a challenging task. Part-of-speech tagging resolves ambiguities.

Therefore, there is a pressing need to develop an automatic part-of-speech tagger for Bengali. With this motivation, the major goals of this report have been set.

1.7 Goals of Our Work

The primary goal of this work is to develop a reasonably accurate part-of-speech tagger for Bengali. To address this broad objective, we identify the following goals:

We wish to investigate different machine learning algorithms to develop a part-of-speech tagger for Bengali.

Bengali is a morphologically rich language. We wish to use the morphological features of a word, as well as word suffixes, to enable us to develop a POS tagger with limited resources.

As stemming is one of the pre-processing steps in developing an effective POS tagger, we wish to stem a few Bengali text documents.

1.8 Organization of the Report

The rest of this report is organized into chapters as follows:

Chapter 2 provides a review of previous work on POS tagging. A comparative review is not attempted in this chapter, because it would be extremely difficult given the large number of publications in this area and the variety of theories and techniques used by researchers over the years. Instead, a brief review organized by the different techniques used for POS tagging is presented. This chapter also discusses English language POS taggers and Indian language POS taggers.

Chapter 3 supplies information about several important issues related to POS tagging that can greatly influence the performance of taggers, namely the corpora and the Bengali tagset.

Chapter 4 describes the developed system and the way it was developed. The system architecture is also shown in this chapter.

Chapter 5 presents the experimental results and discusses them.

Chapter 6 presents the general conclusion: a summary of the work and its contributions is outlined, along with a discussion of the scope for future research.



CHAPTER 2

Prior Work



2.1 Prior Work in POS Tagging

The area of automated part-of-speech tagging has been enriched over the last few decades by contributions from several researchers. Since its inception in the mid-sixties and seventies (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971), many new concepts have been introduced to improve the efficiency of taggers and to construct POS taggers for several languages. Initially, people manually engineered rules for tagging. Linguistic taggers incorporate knowledge as a set of rules or constraints written by linguists. More recently, several statistical or probabilistic models have been used for the POS tagging task to provide transportable, adaptive taggers. Several sophisticated machine learning algorithms have been developed that acquire more robust information. In general, all statistical models rely on manually POS-labelled corpora to learn the underlying language model, and such corpora are difficult to acquire for a new language. Finally, current research combines several sources of information (linguistic, statistical and automatically learned).

This chapter provides a brief review of the prior work in POS tagging. For the sake of conciseness, we do not aim to give a comprehensive review of the related work. Instead, we briefly review the different techniques used in POS tagging. We then review Indian language POS taggers in more detail.

2.2 Linguistic Taggers

Automated part-of-speech tagging was initially explored in the mid-sixties and seventies, when people manually engineered rules for tagging. The most representative of these pioneering taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the Brown Corpus. Since then, a lot of effort has been devoted to improving the quality of the tagging process in terms of accuracy and efficiency.

Recent linguistic taggers incorporate knowledge as a set of rules or constraints written by linguists. The current models are expressive and accurate, and they are used in very efficient disambiguation algorithms. The linguistic rules range from a few hundred to several thousand, and they usually require years of labour. The development of ENGTWOL (an English tagger based on the constraint grammar architecture) can be considered the most important work in this direction. The constraint grammar formalism has also been applied to other languages, such as Turkish.



The accuracy reported by the first rule-based linguistic English tagger was slightly below 80%. A Constraint Grammar for English tagging (Samuelsson and Voutilainen, 1997) achieves a recall of 99.5% with a very high precision of around 97%. The advantages of linguistic taggers are that the models are written from a linguistic point of view and explicitly describe linguistic phenomena, and that the models may contain many complex kinds of information. Both properties allow the construction of extremely accurate systems. However, the linguistic models are developed by introspection (sometimes with the aid of reference corpora), which makes it particularly costly to obtain a good language model. Transporting the model to another language would require starting over again.

2.3 POS Tagging Approaches

POS taggers are broadly classified into three categories: rule based, empirical based and hybrid based. In the rule based approach, hand-written rules are used to resolve tag ambiguity. Empirical POS taggers are further classified into example based and stochastic taggers. Stochastic taggers are either HMM based, choosing the tag sequence that maximizes the product of word likelihood and tag sequence probability, or cue based, using decision trees or maximum entropy models to combine probabilistic features. Stochastic taggers are further classified into supervised and unsupervised taggers, and each of these is categorized into different groups based on the particular algorithm used. Fig. 2.3 shows the classification of part-of-speech tagging approaches.

2.3.1 Rule Based POS tagging

The rule based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words. These rules are often known as context frame rules. For example, a context frame rule might say something like: “If an ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as an Adjective”. One of the first and most widely used English POS taggers that employs rule based algorithms is Brill's tagger. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to bring this list down to a single part of speech for each word. The ENGTWOL tagger is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms.
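The two-stage architecture and the quoted context frame rule can be sketched in a few lines of Python. This is only an illustrative sketch: the toy lexicon, the tag names (DET, NOUN, ADJ, VERB, UNK) and the function names are assumptions made for this example and do not come from any of the taggers discussed.

```python
# Stage 1: a toy dictionary mapping each word to its candidate tags.
# The entries are made up for illustration only.
LEXICON = {
    "the": {"DET"},
    "light": {"NOUN", "ADJ", "VERB"},
    "box": {"NOUN"},
}


def candidate_tags(word):
    """Dictionary look-up; unknown words get an empty candidate set."""
    return set(LEXICON.get(word.lower(), set()))


def apply_context_frame_rule(words, candidates):
    """Stage 2: resolve ambiguous/unknown words with one context frame rule.

    Rule: if word X is ambiguous or unknown, the previous word is a
    determiner and the next word is a noun, tag X as an adjective.
    """
    tags = []
    for i, cands in enumerate(candidates):
        if len(cands) == 1:
            tags.append(next(iter(cands)))
            continue
        prev_is_det = i > 0 and candidates[i - 1] == {"DET"}
        next_is_noun = i + 1 < len(words) and candidates[i + 1] == {"NOUN"}
        if prev_is_det and next_is_noun:
            tags.append("ADJ")
        else:
            tags.append("UNK")  # left unresolved by this single rule
    return tags


def tag(sentence):
    words = sentence.split()
    candidates = [candidate_tags(w) for w in words]
    return list(zip(words, apply_context_frame_rule(words, candidates)))
```

Calling `tag("the light box")` resolves the three-way ambiguity of "light" to ADJ. A real rule based tagger would apply many such rules in sequence rather than a single one.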

Fig.2.3 : Classification of POS tagging

2.3.2 Empirical Based POS tagging

The relative failure of rule based approaches, the increasing availability of machine-readable text, and the increase in hardware capability (CPU, memory, disk space) at decreasing cost are some of the reasons why researchers prefer corpus based POS tagging. The empirical approach to part-of-speech tagging is further divided into two categories: the example based approach and the stochastic approach. The literature shows that the majority of the POS taggers developed belong to the empirical approach.



2.3.2(a) Example Based POS tagging

Example based approaches depend on a tagged corpus, on which a machine learning technique is trained. In example based morphosyntactic tagging, the problem is formulated as a classification task. The features usually include the POS of neighbouring tokens, their orthographic forms, and sometimes fixed-width affixes of the word forms.

2.3.2(b) Stochastic based POS tagging

The stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. A stochastic approach requires a sufficiently large corpus and calculates the frequency, probability or other statistics of each word in the corpus. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language. The use of probabilities in tagging is quite old: probabilities were first used in 1965, a complete probabilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). Supervised and unsupervised are the two broad categories of the stochastic approach.
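The word frequency idea described above can be illustrated with a minimal most-frequent-tag baseline. The toy corpus, the tag names and the default tag for unseen words are assumptions chosen for this sketch only.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Count word-tag pairs and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_words(words, model, default_tag="N"):
    """Tag each word independently; unseen words get the default tag."""
    return [(w, model.get(w, default_tag)) for w in words]


# Toy annotated corpus (hypothetical data).
TOY_CORPUS = [
    [("the", "DET"), ("can", "N"), ("fell", "V")],
    [("she", "PR"), ("can", "V"), ("run", "V")],
    [("he", "PR"), ("can", "V"), ("swim", "V")],
]
model = train_most_frequent_tag(TOY_CORPUS)
```

Because each word is tagged in isolation, `tag_words(["the", "can"], model)` yields the sequence DET V, which illustrates the weakness noted above: the output can be a tag sequence that the grammar of the language does not allow.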

Supervised POS tagging: The supervised POS tagging models require pre-tagged corpora, which are used for training to learn information about the tagset, word-tag frequencies, rule sets etc. The performance of the models generally increases with the size of this corpus. Two familiar examples of supervised POS taggers are the Hidden Markov Model and Support Vector Machines.

Hidden Markov Model (HMM) based POS tagging: An alternative to the word frequency approach is the n-gram approach, which calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes. These are known as the unigram, bigram and trigram models. The most common algorithm for implementing an n-gram approach for tagging new text is the HMM's Viterbi algorithm. The Viterbi algorithm is a search algorithm that avoids the exponential expansion of a breadth-first search by trimming the search tree at each level using the best 'm' Maximum Likelihood Estimates (MLE), where 'm' represents the number of tags of the following word. For a given sentence or word sequence, HMM taggers choose the tag sequence that maximizes formula (1):

$$P(\text{word} \mid \text{tag}) \times P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (1)$$

A bigram HMM tagger of this kind chooses the tag $t_i$ for word $w_i$ that is most probable given the previous tag $t_{i-1}$ and the current word $w_i$:

$$t_i = \arg\max_{j} P(t_j \mid t_{i-1}, w_i) \qquad (2)$$
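As a rough illustration of Viterbi decoding for the bigram case of formula (1), the following sketch works in log space and keeps, for every position and tag, only the best-scoring path. The probability tables are toy values invented for this example, and the small `floor` value used for unseen events is a simplifying assumption standing in for a proper smoothing method.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence under a bigram HMM.

    start_p[t]    : P(t at the start of the sentence)
    trans_p[s][t] : P(t | previous tag s)
    emit_p[t][w]  : P(w | t)
    Missing entries fall back to `floor` instead of zero.
    """
    if not words:
        return []
    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best = [{}]
    for t in tags:
        score = math.log(start_p.get(t, floor)) + math.log(emit_p[t].get(words[0], floor))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = math.log(emit_p[t].get(words[i], floor))
            score, prev = max(
                (best[i - 1][s][0] + math.log(trans_p[s].get(t, floor)) + emit, s)
                for s in tags
            )
            best[i][t] = (score, prev)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy model (hypothetical numbers).
TAGS = ["DET", "N", "V"]
START_P = {"DET": 0.6, "N": 0.3, "V": 0.1}
TRANS_P = {
    "DET": {"N": 0.9, "V": 0.1},
    "N": {"V": 0.6, "N": 0.3, "DET": 0.1},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "N": {"can": 0.5, "fish": 0.5},
    "V": {"can": 0.5, "fish": 0.5},
}
```

In this toy model "can" and "fish" are equally likely as noun or verb, so only the transition probabilities decide: `viterbi(["the", "can", "fish"], TAGS, START_P, TRANS_P, EMIT_P)` gives DET, N, V.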

Support Vector Machines (SVM): SVM is a machine learning algorithm for binary classification, which has been successfully applied to a number of practical problems, including NLP. Let $\{(x_1, y_1), \dots, (x_N, y_N)\}$ be the set of N training examples, where each instance $x_i$ is a vector in $\mathbb{R}^N$ and $y_i \in \{-1, +1\}$ is the class label. In its basic form, an SVM learns a linear hyperplane that separates the set of positive examples from the set of negative examples with maximal margin (the margin is defined as the distance of the hyperplane to the nearest of the positive and negative examples). This learning bias has proved to have good properties in terms of generalization bounds for the induced classifiers. The SVMTool is intended to comply with all the requirements of modern NLP technology, by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator.
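For reference, the maximal-margin hyperplane described above is commonly written as the following optimization problem. This is the standard hard-margin formulation; the symbols $\mathbf{w}$ (normal vector of the hyperplane) and $b$ (offset) are introduced here for illustration and do not appear in the text above.

```latex
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, N
```

Under this scaling the distance from the hyperplane to the nearest training example is $1 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ maximizes the margin.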

Unsupervised POS Tagging: Unlike the supervised models, the unsupervised POS

tagging models do not require a pre-tagged corpus. Instead, they use advanced computational

methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules

etc. Based on the information, they either calculate the probabilistic information needed by

the stochastic taggers or induce the contextual rules needed by rule-based systems or

transformation based systems.

Transformation-based POS tagging: In general, the supervised tagging approach requires large pre-annotated corpora for training, which are difficult to obtain in most cases. Recently, however, a good amount of work has been done to induce the transformation rules automatically. One approach to automatic rule induction is to run an untagged text through a tagging model to get an initial output. A human then goes through the output of this first phase and corrects any erroneously tagged words by hand. The corrected text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. Several iterations of this process are sometimes necessary before the tagging model achieves acceptable performance. The transformation based approach is similar to the rule based approach in the sense that it depends on a set of rules for tagging.
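The step of learning correction rules by comparing the two sets of data can be sketched as follows. This is a deliberately reduced illustration, not Brill's full algorithm: it assumes a single rule template ("change tag A to B when the previous tag is C"), learns only one rule per call, and treats the text as one flat tag sequence. All names are hypothetical.

```python
from collections import Counter


def learn_best_rule(initial_tags, corrected_tags):
    """Return the rule (A, B, C) with the highest net gain, or None.

    Net gain = errors the rule would fix minus correct tags it would break.
    """
    gains = Counter()
    positions = range(1, len(initial_tags))
    # Errors that each candidate rule would fix.
    for i in positions:
        current, gold, prev = initial_tags[i], corrected_tags[i], initial_tags[i - 1]
        if current != gold:
            gains[(current, gold, prev)] += 1
    # Correct tags that the same rule would wrongly change.
    for i in positions:
        current, gold, prev = initial_tags[i], corrected_tags[i], initial_tags[i - 1]
        if current == gold:
            for (a, b, c) in list(gains):
                if a == current and c == prev:
                    gains[(a, b, c)] -= 1
    if not gains:
        return None
    rule, gain = gains.most_common(1)[0]
    return rule if gain > 0 else None


def apply_rule(tags, rule):
    """Apply 'change A to B when the previous tag is C' to a tag sequence."""
    a, b, c = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == c:
            out[i] = b
    return out
```

A full transformation-based learner would repeat this step, applying the learned rule and searching for the next one, until no rule yields a positive gain.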

2.3.3 Hybrid Based Tagger

A hybrid approach combines the features of both the rule based and the stochastic approaches. Like rule based systems, hybrid taggers use rules to specify tags. Like stochastic systems, they use machine learning to induce rules automatically from a tagged training corpus. The transformation-based learning (TBL) tagger, or Brill tagger, shares features of the hybrid approach. This approach inherits the advantages and disadvantages of both the rule based and the stochastic approaches.

2.4 Indian Language POS Taggers

There has been a lot of interest in Indian language POS tagging in recent years. POS tagging is one of the basic steps in many language processing tasks, so it is important to build good POS taggers for these languages. However, we found that very little work has been done on Bengali POS tagging, and only a very limited amount of resources is available. The oldest work on Indian language POS tagging that we found is by Bharati et al. (1995). They presented a framework for Indian languages in which POS tagging is implicit and is merged with the parsing problem in their work on the computational Paninian parser.

For Bengali, Dandapat et al. (2007) studied the possibility of developing a tagger using HMM and Maximum Entropy (ME) models. They used a morphological analyzer to compensate for the shortage of annotated corpora. With these two models they implemented a supervised tagger and a semi-supervised tagger and reported an accuracy of around 88% for the two approaches. Ekbal et al. (2007) annotated a news corpus and developed an SVM based tagger. They reported an accuracy of 86.84% for their tagger.



An attempt at Hindi POS disambiguation was made by Ray et al. (2003). The part-of-speech tagging problem was solved as an essential requirement for local word grouping. Lexical sequence constraints were used to assign the correct POS labels for Hindi. A morphological analyzer was used to find the possible POS of every word in a sentence.

A rule based POS tagger for Tamil (Arulmozhi et al., 2004) has been developed using a combination of lexical rules and context sensitive rules. It uses a very coarse-grained tagset of only 12 tags. The authors reported an accuracy of 83.6% using only lexical rules and 88.6% after applying the context sensitive rules. The reported accuracies were measured on a very small reference set of 1000 words.

Shrivastav et al. (2006) presented a CRF based statistical tagger for Hindi. They used 24 different features (lexical features and spelling features) to generate the model parameters. They experimented on a corpus of around 12,000 tokens annotated with a tagset of size 23. The reported accuracy was 88.95% with 4-fold cross validation.

Smriti et al. (2006) describe a technique for morphology-based POS tagging in a limited resource scenario. The system uses a decision tree based learning algorithm (CN2). They used a stemmer, a morphological analyzer and a verb group analyzer to assign morphotactic tags to all the words, which identify the Ambiguity Scheme and Unknown Words. Further, a manually annotated corpus was used to generate If-Then rules that assign the correct POS tag for each ambiguity scheme and for unknown words. A tagset of 23 tags was used for the experiment. An accuracy of 93.5% was reported with 4-fold cross validation on modestly sized corpora (around 16,000 words).

In 2006, two machine learning contests on part-of-speech tagging and chunking for Indian languages were organized to provide a platform for researchers to work on a common problem. Both contests were conducted for three Indian languages: Hindi, Bengali and Telugu. All the languages used a common tagset of 27 tags. The results of the contests give an overall picture of Indian language POS tagging. The first contest was conducted by the NLP Association of India (NLPAI) and IIIT-Hyderabad in the summer of 2006.



CHAPTER 3

Foundational Considerations



In this chapter we discuss several important issues related to the POS tagging problem that can greatly influence the performance of a tagger. An important issue of POS tagging is collecting and annotating corpora. Most statistical techniques rely on some amount of annotated data to learn the underlying language model. The size of the corpus and the amount of corpus ambiguity have a direct influence on the performance of a tagger. Finally, there are several other issues, e.g. how to handle unknown words and which smoothing techniques to use, that contribute to the performance of a tagger.

In the following sections, we discuss two important issues related to POS tagging. The first section discusses the process of corpora collection. In the second section we present the tagset used for our experiments.

3.1. Corpora Collection

The compilation of raw text corpora is no longer a big problem, since nowadays most documents are written in a machine-readable format and are available on the web. However, collecting raw corpora is somewhat more difficult for Bengali (and possibly for other Indian languages) than for English and other European languages, because many different encoding standards are in use. Also, the number of Bengali documents available on the web is comparatively limited.

Raw corpora do not carry much linguistic information. Corpora acquire higher linguistic value when they are annotated, that is, when some amount of linguistic information (part-of-speech tags, semantic labels, syntactic analysis, named entities etc.) is embedded in them. Although many corpora (both raw and annotated) are available for English and other European languages, we had no tagged data for Bengali with which to start the POS tagging task. The raw corpus developed at TDIL was available to us. We used a portion of the TDIL corpus to develop the annotated data for the experiments.

3.2. The Tagset

With respect to the tagset, the main feature that concerns us is its granularity, which is directly related to the size of the tagset. If the tagset is coarse, the tagging accuracy will be much higher, since only the important distinctions are considered, and classification may be easier both for human annotators and for the machine. However, some important information may be missed because of the coarse-grained tagset. On the other hand, a too fine-grained tagset may enrich the supplied information, but the performance of the automatic POS tagger may decrease. A much richer model must be designed to capture the encoded information when using a fine-grained tagset, and hence it is more difficult to learn.

So, when designing a tagset for the POS disambiguation task, some issues need to be considered. These include the type of application (some applications may require more complex information, whereas category information alone may be sufficient for some tasks) and the tagging technique to be used (rule based, which can adopt large tagsets very well, or supervised/unsupervised learning). Further, a large annotated corpus is usually required for statistical POS taggers. A too fine-grained tagset might be difficult for human annotators to use during the development of a large annotated corpus. Hence, the availability of resources needs to be considered during the design of a tagset.

The Bureau of Indian Standards (BIS) has recommended the use of a common tagset for the part-of-speech annotation of Indian languages. The tagset, which incorporates the advice of experts and stakeholders in the area of natural language processing and language technology for Indian languages, is to be followed in annotation tasks for Indian languages after August 2010.

The BIS tagset has a total of 38 annotation-level tags, which are common to all the Indian languages covered under this tagset. We use the eight (8) basic parts of speech, i.e. Noun, Pronoun, Verb, Adjective, Adverb, Preposition, Conjunction and Interjection, along with Residuals and Quantifiers from the BIS tagset.

The table below describes the individual tags used in our experiments, with examples:


Category                     Annotation TAG    Examples
Noun                         N                 িীপঙ্কর, রাম, লযাম, দিল্লী etc.
Pronoun                      PR                ক্ষস, দতদি, তা, দযদি, আদম, তুদম, আমরা, তারা etc.
Verb                         V                 কদর, করাম, খাওো, ে, ক্ষদখ etc.
Adjective                    JJ                খারাপ, ভাবা, েড়, ক্ষছটা etc.
Adverb                       RB                অদিকতর, অিবূর, এতটা etc.
Preposition / Postposition   PSP               ক্ষেবক, হইবত, উপবর, দভতর etc.
Conjunction                  CC                এেং, দকন্তু, অেচ, অেো
Interjection                 INJ               প্লীজ, িন্নোি, সােিাি, হাাঁ etc.
Residuals                    RD                । , ? “” ‘ ‘
Quantifiers                  QT                প্রেম, ১, ২ etc.

Table 3.2 : The tagset for Bengali with 10 tags



CHAPTER 4

Tagging with Rule Based

Approach



Since only a small labeled training set is available to us for Bengali POS tagging, we adopt a rule based approach. In the first section we describe the rule based approach to POS tagging. The second section is devoted to our particular approach to Bengali POS tagging using the rule based approach.

4.1. Rule Based Approach

The rule based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to each word in a sentence. These rules are often known as context frame rules. Most rule based taggers have a two-stage architecture. The first stage is simply a dictionary look-up procedure, which returns a set of potential tags and appropriate syntactic features for each word. The second stage uses a set of hand-written rules to discard contextually illegitimate tags and obtain a single best POS for each word. A context frame rule might say something like: “If the current word is a postposition, then there is a high probability that the previous word is a noun.” For example, in the sentence “ক্ষস লদিির উপর পাের ছুবর মার।” the word “লদিির” has the noun-adjective {N, JJ} ambiguity. Because the following word “উপর” is a postposition, the rule resolves this ambiguity in favour of the noun.

In addition to contextual information, many taggers use morphological information to

help in the disambiguation process. An example of a rule that makes use of morphological

information is: IF word ends with –“ইরেছি / ছিলাম ” and preceding word is a verb THEN

label it a verb (V).
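The two kinds of rule described above, a contextual rule and a morphological rule, could be expressed along the following lines. This is a minimal sketch and not the code of the developed system: the function names, the representation of candidate tags as Python sets, and the example words are assumptions made for illustration. Only the suffix "ছিলাম" is taken from the rule quoted in the text.

```python
# Verb suffixes for the morphological rule; extend the tuple as needed.
VERB_SUFFIXES = ("ছিলাম",)


def apply_postposition_rule(tokens, candidates):
    """Contextual rule: if the current word is a postposition (PSP) and the
    previous word is ambiguous with N among its candidates, choose N."""
    resolved = [set(c) for c in candidates]
    for i in range(1, len(tokens)):
        previous = resolved[i - 1]
        if resolved[i] == {"PSP"} and len(previous) > 1 and "N" in previous:
            resolved[i - 1] = {"N"}
    return resolved


def apply_suffix_rule(tokens, candidates):
    """Morphological rule: if a word ends with a listed verb suffix and the
    preceding word is a verb, label the word as a verb (V)."""
    resolved = [set(c) for c in candidates]
    for i in range(1, len(tokens)):
        if tokens[i].endswith(VERB_SUFFIXES) and resolved[i - 1] == {"V"}:
            resolved[i] = {"V"}
    return resolved
```

For example, with tokens `["X", "উপর"]` and candidate sets `[{"N", "JJ"}, {"PSP"}]`, where "X" stands for any word with the noun-adjective ambiguity, `apply_postposition_rule` returns `[{"N"}, {"PSP"}]`.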

Speed is an advantage of rule based taggers and, unlike stochastic taggers, they are deterministic. However, considerable effort is required to write the disambiguation rules. A rule based tagger is also language dependent, i.e. it is usable for only one language; using it for another language requires rewriting most of the program.

4.2. Our Approach

4.2.1 System Flow Diagram

This section is concerned with how the processing tasks are designed. Here we are concerned with the following:

Which modules need to be designed?

How are they interconnected?


[Flow diagram: Start → Show the GUI → Accept Bengali language input → Divide the sentence into tokens → Tokens with suffix / affix? (Yes: split tokens into their stems by stemming; No: continue) → Assign the tags to tokens in the Tagger → Find ambiguous words → Assign the tags to ambiguous words using POS tagging rules → View the result → Stop]

Fig 4.2.1: Flow diagram


Fig 4.2.1 shows a diagrammatic representation of the flow of data through the system. It consists of the following components/modules:

GUI (Graphical User Interface): The interface through which the user communicates with the back-end files. The interface should be simple in appearance and easy to maintain.

Tokenizer : This module generates the tokens of the given input sentence. It also

calls the other modules when required. The tokens of the sentence are basically stored

in a String array for further processing.

Stemming : The stemming module reduces a word to its stem, i.e. its root. Stemming is an important and common requirement of many Natural Language Processing tasks. Word stemming is useful for indexing and search systems, and indexing and searching are key concepts in text mining applications and IR systems. It has also been used to improve the performance of spelling checkers where morphological analysis would be computationally expensive. A stemmer can also reduce the size of a dictionary, which is the main reason for using a stemmer in spelling checker applications on mobile and other handheld devices.

Tagging : The tagging module assigns tags to tokens. It also searches for ambiguous words and marks them with special symbols according to their type. Words that are not present in the lexicon are treated as unknown. Ambiguous words are those that act as a noun or an adjective, or as an adjective or an adverb, depending on the context.

Resolving Ambiguity : The ambiguity which is identified in the tagging module is resolved

using the Bengali grammar rules.

Displaying results : This module displays the final result. The tokens, i.e. the words in the sentence, are shown with their corresponding parts of speech.
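The tokenizer, stemming and tagging modules listed above can be pictured with a small sketch. The system itself was developed with the JDK and NetBeans (see Chapter 5); the Python code below is only an illustration of the data flow, and its suffix list, lexicon entries, punctuation set and output labels are hypothetical stand-ins for the project's actual resources.

```python
import re

# Hypothetical resources for illustration only.
SUFFIXES = sorted(["দের", "গুলো", "কে", "রা", "টি", "র"], key=len, reverse=True)
LEXICON = {
    "ছেলে": {"N"},
    "ভাল": {"N", "JJ"},  # an ambiguous entry: noun or adjective
    "আমি": {"PR"},
}
PUNCTUATION = "।,?!"


def tokenize(sentence):
    """Split on whitespace and make each punctuation mark its own token."""
    spaced = re.sub("([" + re.escape(PUNCTUATION) + "])", r" \1 ", sentence)
    return spaced.split()


def stem(token):
    """Strip the longest matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def look_up(token):
    """Return a tag, or a marker for ambiguous and unknown words."""
    if len(token) == 1 and token in PUNCTUATION:
        return "RD"
    tags = LEXICON.get(token) or LEXICON.get(stem(token))
    if tags is None:
        return "UNKNOWN"
    if len(tags) > 1:
        return "AMBIGUOUS:" + "/".join(sorted(tags))
    return next(iter(tags))
```

Words marked as ambiguous at this point would then be passed on to the ambiguity resolution rules, while unknown words remain unknown, as described for the tagging module.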



CHAPTER 5

Experimental Result &

Discussion



5.1 Tools Used

Software: A few freely available software tools were used in the development of the project work, as listed below:

- jdk 1.7.0_05

- NetBeans IDE 7.1.1: NetBeans IDE lets you quickly and easily develop Java desktop, mobile and web applications. It can be downloaded directly from https://netbeans.org/downloads/

Fig 5.1.1 NetBeans IDE

- Notepad: Notepad is a simple text editor for Microsoft Windows and a basic text editing program that can be used to create documents. It has been included in all versions of Microsoft Windows since Windows 1.0 in 1985, so there is no need to download it. It is a common text-only (plain text) editor, and the resulting file is typically saved with the .txt extension. It looks like a simple application, but it has a great impact on software development: it can be used to write programs in languages such as C, C++, Java, HTML and many more, saved with different extensions.

Fig 5.1.2 Notepad



Hardware: We designed and developed the whole system on an ACCR notebook with the following specification:

Processor: Intel(R) Pentium(R) CPU 2030M @ 2.50GHz

RAM : 4.00 GB

HDD : 500 GB

Although this machine is adequate for development, it performs poorly when handling large amounts of data: the larger the data, the slower the system responds. This is due to the processor; with an Intel Core i3 or better, the speed would improve.

5.2 Graphical User Interface

Snapshot1: This is the welcome screen of our project. Click the Proceed button to go

to the Tagging section.

Fig 5.2.1: Welcome Screen


Snapshot 2: Here we first enter the Bengali sentence to be tagged in the text field provided and then press the TAG button. The RESET button clears all text from the text field.

Fig 5.2.2 : The Tagging Menu



5.3 Experimental Results

The system has been tested with a set of data. The input text was taken from the corpus discussed in Chapter 3. Only four results are shown here, in the following snapshots.

Result I:

Result II:

Result III:

Result IV:

5.4 Result Discussion

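Accuracy of the tagger is computed as the ratio of the number of words correctly tagged by the system to the total number of tested words:

$$\text{Accuracy} = \frac{\text{number of correctly tagged words}}{\text{total number of tested words}} \times 100\%$$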

The following observations were made while testing the system.

Test      No. of tested words    Accuracy
Test 1    150                    67%
Test 2    400                    71%
Test 3    800                    78%
Test 4    1200                   82%

The overall accuracy of the system was computed by taking the mean of the four test results, which gives 74.50%.
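The arithmetic behind the overall figure can be checked with a short script. The function names are chosen for this sketch. The word-weighted figure is an additional view that is not reported above; because the per-test accuracies are rounded percentages, it is only approximate.

```python
# (number of tested words, reported accuracy in %) for Tests 1 to 4.
RESULTS = [(150, 67.0), (400, 71.0), (800, 78.0), (1200, 82.0)]


def accuracy(correct, total):
    """Accuracy in percent: correctly tagged words over tested words."""
    return 100.0 * correct / total


def mean_accuracy(results):
    """Unweighted mean of the per-test accuracies, as used in the report."""
    return sum(acc for _, acc in results) / len(results)


def word_weighted_accuracy(results):
    """Mean weighted by the number of tested words in each test."""
    total_words = sum(n for n, _ in results)
    return sum(n * acc for n, acc in results) / total_words


print(mean_accuracy(RESULTS))                      # 74.5
print(round(word_weighted_accuracy(RESULTS), 2))   # 78.14
```

The unweighted mean gives each test equal influence regardless of its size, whereas the weighted variant lets the larger tests dominate, which is why the two figures differ.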


CHAPTER 6

Conclusion & Future

Works



6.1 Conclusion

Part-of-speech tagging plays an important role in various speech and language processing applications in NLP. Since many reputed companies such as Google and Microsoft are concentrating on natural language processing applications, it has gained importance. Currently, many tools are available for part-of-speech tagging. In this report, our effort was a computational linguistic analysis of the Bengali language through the development of a tagging system, and we achieved an accuracy of 74.50%. The results show that the performance of the tagger depends on the size of the lexicon and corpus. The performance can be improved by increasing the size of the lexicon.

6.2 Future Work

Work remains to be done in several directions. Though we attained an accuracy of 74.50% for known words, there is still room to enhance the performance of the tagger. This can be achieved by extending the tagset and enlarging the lexicon so that the tagger can classify the text less ambiguously. Our results can also be compared with those achieved by other Indian language tagging systems.



References

Church K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text.

Proceedings of the second conference on Applied Natural Language Processing.

Austin, Texas, 136-143.

Ramshaw L. A. and Marcus M. P. 1995. Text chunking using transformation-based learning.

In Proc. Third Workshop on Very Large Corpora. ACL, 1995

Wilks Y. and Stevenson M. 1997. Combining Independent Knowledge Sources for Word Sense Disambiguation. In Proceedings of the Third Conference on Recent Advances in Natural Language Processing (RANLP-97), Bulgaria. 1-7.

Heeman, P. A. and J. F. Allen. 1997. Incorporating POS tagging into language modelling. In

Proceedings of the 5th European Conference on Speech Communication and

Technology (Eurospeech), Rhodes, Greece.

Ray P. R., Harish V., Basu A. and Sarkar S., 2003. Part of Speech Tagging and Local Word

Grouping Techniques for Natural Language Processing. In Proceedings 1st

International Conference on Natural Language Processing

Shrivastav M., Melz R., Singh S., Gupta K. and Bhattacharyya P., 2006. Conditional Random Field Based POS Tagger for Hindi. In Proceedings of the MSPIL, Bombay. 63-68.

Dandapat, S., Sarkar, S., Basu, A. (2007) “Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario”. In: Association for Computational Linguistics, pp 221-224.

Bharati, A., Chaitanya V., Sangal R., (1995) “Natural Language Processing - A Paninian Perspective”. Prentice-Hall India, New Delhi.

Arulmozhi P., Rao R. K. and Sobha L., 2006. A Hybrid POS Tagger for a Relatively Free

Word Order Language. In Proceedings of the Modeling and Shallow Parsing of

Indian Language (MSPIL), Bombay. 79-85.



Singh S., Gupta K., Shrivastav M. and Bhattacharyya P. 2006. Morphological Richness Offsets Resource Demand – Experience in constructing a POS Tagger for Hindi. In Proceedings of COLING/ACL 06. 779-786.

Dalal, K. Nagaraj, U. Swant, S. Shelke and P. Bhattacharyya. 2007. Building Feature Rich

POS Tagger for Morphologically Rich Languages: Experience in Hindi. In

Proceedings of ICON, India.

Greene B. B. and Rubin G. M., 1971. Automatic grammatical tagging of English. Technical

Report, Department of Linguistics, Brown University.

Samuelsson C., Voutilainen A. 1997. Comparing a linguistic and a stochastic tagger. In

Proceedings of the eighth conference on European chapter of the Association for

Computational Linguistics (EACL), Madrid, Spain. 246-253

Ekbal, A., Bandyopadhyay, S., (2007) ”Lexicon Development and POS tagging using A

Tagged for Marathi Text” 2014 in proceeding of: International Journal of Computer

Science and Information Technologies, Vol.5 (2),2014,1322-1326.



APPENDIX

CD
