Bengali Part-Of-Speech Tagging
A
PROJECT REPORT
ON
PART-OF-SPEECH TAGGING
FOR BENGALI
IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF
MASTER OF COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
ASSAM UNIVERSITY, SILCHAR
2016
Submitted by:
DEEPANKAR DAS
Roll: 101614 No.: 22220380
Under the Guidance of
PROF. BIPUL SYAM PURKAYASTHA
HEAD OF DEPARTMENT, PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
ASSAM UNIVERSITY, SILCHAR-788011
CERTIFICATE
This is to certify that Deepankar Das, bearing Roll: 101614 No: 22220380, has
carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR
BENGALI” under my supervision in partial fulfillment of the requirements for the
award of the degree of Master of Science in Computer Science of Assam University,
Silchar. He has worked sincerely in preparing this project. He has fulfilled
all the requirements laid down in the regulations of the MSc (2 years) 4th
Semester Examination (Paper MS-405) of the Department of Computer Science,
Assam University, Silchar, for the session 2015-2016.
Date: Signature of the Guide
Place: (PROF. BIPUL SYAM PURKAYASTHA)
Supervisor, Professor
Department of Computer Science
Assam University, Silchar
CERTIFICATE
This is to certify that Deepankar Das, bearing Roll: 101614 No: 22220380, has
carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR
BENGALI” under my supervision in partial fulfillment of the requirements for the
award of the degree of Master of Science in Computer Science of Assam University,
Silchar. He has worked sincerely in preparing this project. He has fulfilled
all the requirements laid down in the regulations of the MSc (2 years) 4th
Semester Examination (Paper MS-405) of the Department of Computer Science,
Assam University, Silchar, for the session 2015-2016.
Date: Signature of the HOD
Place: (PROF. BIPUL SYAM PURKAYASTHA)
HOD, Professor
Department of Computer Science
Assam University, Silchar
DECLARATION
I, Deepankar Das, student of 4th semester (MSc 2 years), Department of Computer
Science do hereby solemnly declare that I have duly worked on my project
entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under the supervision of Prof. Bipul
Syam Purkayastha, Professor, Department of Computer Science, Assam
University, Silchar.
Date: Signature
Place: ( Deepankar Das )
MSc 4th Semester
Roll: 101614 No.: 22220380
Regn. No.: 02-110018703 of 2011-12
Department of Computer Science
Assam University, Silchar
ACKNOWLEDGEMENT
At the very outset, I take the privilege to convey my gratitude to those
persons whose co-operation, suggestions and heartfelt support helped
me to accomplish this project successfully.
I take immense pleasure in expressing my sincere thanks and profound
gratitude to my respected guide Prof. Bipul Syam Purkayastha, Head
of the Department of Computer Science, Assam University, Silchar, for
his excellent and able guidance, valuable suggestions and the
encouragement he rendered for completing this project.
I am also indebted to my family members, friends and well-wishers who
encouraged me to do this work with vigor and seriousness.
Last but not least, I would like to acknowledge the cooperation I
received from the entire staff of our department, and I thank all those
who directly or indirectly extended their helping hands and moral
support during the making of this project.
( Deepankar Das )
Table of Contents
Chapters Title Page No
Chapter 1 Introduction 1
1.1 NLP 2
1.2 Applications of NLP 2
1.3 POS Tagging 6
1.4 The POS Tagging Problem 7
1.5 Applications of POS Tagging 9
1.6 Motivations 10
1.7 Goals of Our Work 10
1.8 Organization of the report 11
Chapter 2 Prior Work 12
2.1 Prior Work in POS Tagging 13
2.2 Linguistics Taggers 13
2.3 POS Tagging Approaches 14
2.4 Indian Language POS Taggers 18
Chapter 3 Foundational Consideration 20
3.1 Corpora Collection 21
3.2 The Tagset 21
Chapter 4 Tagging with Rule Based Approach 24
4.1 Rule Based Approach 25
4.2 Our Approach 25
Chapter 5 Experimental Result & Discussion 28
5.1 Tools Used 29
5.2 Graphical User Interface 30
5.3 Experimental Result 31
5.4 Result Discussion 32
Chapter 6 Conclusion & Future Direction 33
6.1 Conclusion 34
6.2 Future Work 34
References 35
Abstract
Part-of-Speech (POS) tagging is the process of assigning the appropriate part of
speech or lexical category to each word in a natural language sentence. Part-of-speech
tagging is an important part of Natural Language Processing (NLP) and is useful for most
NLP applications. It is often the first stage of natural language processing, after which
further processing such as chunking and parsing is done.
POS tagging is considered one of the basic and necessary tools of NLP. Its simplified form is
commonly taught to school-age children as the identification of words as nouns, pronouns,
verbs, adjectives, adverbs, prepositions, conjunctions, interjections, etc. Development of any
Indian language POS tagger will influence several pipelined modules of a natural language
understanding system, including Information Extraction (IE), Information Retrieval (IR),
Machine Translation (MT), Partial Parsing (PP) and Word Sense Disambiguation (WSD).
Our objective in this work is to develop an effective POS tagger for the Bengali language. Once
performed by hand, POS tagging is now done in the context of computational
linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech,
with a set of descriptive tags. POS tagging algorithms fall into two distinct
groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used
English POS taggers, employs rule-based algorithms.
Bengali is the main language spoken in Bangladesh, the second most commonly
spoken language in India, and the seventh most commonly spoken language in the world, with
nearly 230 million total speakers (189 million native speakers). Natural language processing
of Bengali is in its infancy. POS tagging of Bengali is a necessary component for most NLP
applications of Bengali.
The developed system was tested on a set of experimental data and the results were
analysed. The system achieves an accuracy of over 74.50%. Performance can be improved by
increasing the size of the lexicon.
CHAPTER 1
Introduction
1.1 NLP
The goal of natural language processing (NLP) is to build computational models of natural
language for its analysis and generation. First, there is the technological motivation of building
intelligent computer systems such as machine translation systems, natural language interfaces
to databases, man-machine interfaces to computers in general, speech understanding systems,
text analysis and understanding systems, computer aided instruction systems, and systems that
read and understand printed or handwritten text. Second, there is a cognitive and linguistic
motivation to gain better insight into how humans communicate using natural language
(NL).
Natural language processing is a field of computer science and linguistics
concerned with the interactions between computers and human (natural) languages; it began
as a branch of artificial intelligence. In theory, natural language processing is a very attractive
method of human-computer interaction. Natural language understanding is sometimes
referred to as an AI-complete problem because it seems to require extensive knowledge
about the outside world and the ability to manipulate it. NLP is
a collection of techniques used to extract grammatical structure and meaning from input in
order to perform a useful task. In turn, natural language generation builds output based on
the rules of the target language and the task at hand. NLP is useful in tutoring systems,
duplicate detection, computer supported instruction and database interfaces, as it
provides a pathway for increased interactivity and productivity.
The tools of work in NLP are grammar formalisms, algorithms and data structures,
formalisms for representing world knowledge, reasoning mechanisms, etc. Many of these have
been taken from, and inherit results from, Computer Science, Artificial Intelligence,
Linguistics, Logic, and Philosophy.
1.2 Applications of NLP
Automatic summarization: Produce a readable summary of a chunk of text. Often used to
provide summaries of text of a known type such as articles in the financial section of a
newspaper.
Machine translation: Automatically translate text from one human language to another. This
is one of the most difficult problems, and is a member of a class of problems colloquially
termed "AI-complete", i.e. requiring all of the different types of knowledge that humans
possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.
Morphological segmentation: Separate words into individual morphemes and identify the
class of the morphemes. The difficulty of this task depends greatly on the complexity of the
morphology (i.e. the structure of words) of the language being considered. English has fairly
simple morphology, especially inflectional morphology, and thus it is often possible to ignore
this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened,
opening") as separate words. In languages such as Turkish, however, such an approach is not
possible, as each dictionary entry has thousands of possible word forms. The same holds for
Manipuri, a highly agglutinative Indian language.
Named entity recognition (NER): Given a stream of text, determine which items in the text
map to proper names, such as people or places, and what the type of each such name is (e.g.
person, location, organization). Note that, although capitalization can aid in recognizing
named entities in languages such as English, this information cannot aid in determining the
type of named entity, and in any case is often inaccurate or insufficient. For example, the first
word of a sentence is also capitalized, and named entities often span several words, only
some of which are capitalized. Furthermore, many other languages in non-Western scripts
(e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with
capitalization may not consistently use it to distinguish names. For example, German
capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do
not capitalize names that serve as adjectives.
Natural language generation: Convert information from computer databases into readable
human language.
Natural language understanding: Convert chunks of text into more formal representations
such as first-order logic structures that are easier for computer programs to manipulate.
Natural language understanding involves identifying the intended meaning from among the
multiple possible meanings that can be derived from a natural language expression, which
usually takes the form of organized notations of natural language concepts. Introducing and
creating a language metamodel and ontology are efficient but empirical solutions. An
explicit formalization of natural language semantics, free of confusion with implicit
assumptions such as the closed world assumption (CWA) vs. the open world assumption, or
subjective Yes/No vs. objective True/False, is expected to form the basis of
semantics formalization.
Optical character recognition (OCR): Given an image representing printed text, determine
the corresponding text.
Part-of-speech tagging (POST): Given a sentence, determine the part of speech for each
word. Many words, especially common ones, can serve as multiple parts of speech. For
example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set"
can be a noun, verb or adjective; and "out" can be any of at least five different parts of
speech. Some languages have more such ambiguity than others. Languages with little
inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is
also prone to it: it is a tonal language, and the tonal distinctions made in speech are
not readily conveyed by the orthography used to convey the intended
meaning.
Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar
for natural languages is ambiguous and typical sentences have multiple possible analyses. In
fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses
(most of which will seem completely nonsensical to a human).
Question answering: Given a human-language question, determine its answer. Typical
questions have a specific right answer (such as "What is the capital of Canada?"), but
sometimes open-ended questions are also considered (such as "What is the meaning of
life?"). Recent works have looked at even more complex questions.
Relationship extraction: Given a chunk of text, identify the relationships among named
entities (e.g. who is the wife of whom).
Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of
text, find the sentence boundaries. Sentence boundaries are often marked by periods or other
punctuation marks, but these same characters can serve other purposes (e.g. marking
abbreviations).
Sentiment analysis: Extract subjective information usually from a set of documents, often
using online reviews to determine "polarity" about specific objects. It is especially useful for
identifying trends of public opinion in the social media, for the purpose of marketing.
Speech recognition: Given a sound clip of a person or people speaking, determine the textual
representation of the speech. This is the opposite of text to speech and is one of the extremely
difficult problems colloquially termed "AI-complete" (see above). In natural speech there are
hardly any pauses between successive words, and thus speech segmentation is a necessary
subtask of speech recognition (see below). Note also that in most spoken languages, the
sounds representing successive letters blend into each other in a process termed
coarticulation, so the conversion of the analog signal to discrete characters can be a very
difficult process.
Speech segmentation: Given a sound clip of a person or people speaking, separate it into
words. A subtask of speech recognition and typically grouped with it.
Topic segmentation and recognition: Given a chunk of text, separate it into segments each of
which is devoted to a topic, and identify the topic of the segment.
Word segmentation: Separate a chunk of continuous text into separate words. For a language
like English, this is fairly trivial, since words are usually separated by spaces. However, some
written languages like Chinese, Japanese and Thai do not mark word boundaries in such a
fashion, and in those languages text segmentation is a significant task requiring knowledge of
the vocabulary and morphology of words in the language.
Word sense disambiguation: Many words have more than one meaning; we have to select the
meaning which makes the most sense in context. For this problem, we are typically given a
list of words and associated word senses, e.g. from a dictionary or from an online resource
such as WordNet.
In some cases, sets of related tasks are grouped into subfields of NLP that
are often considered separately from NLP as a whole. Examples include:
Information retrieval (IR): This is concerned with storing, searching and retrieving
information. It is a separate field within computer science (closer to databases), but IR relies
on some NLP methods (for example, stemming). Some current research and applications seek
to bridge the gap between IR and NLP.
Information extraction (IE): This is concerned in general with the extraction of semantic
information from text. This covers tasks such as named entity recognition, coreference
resolution, relationship extraction, etc.
1.3 POS Tagging
Part-of-Speech (POS) tagging is the process of automatic annotation of lexical categories.
Part-of-speech tagging assigns an appropriate part-of-speech tag to each word in a sentence
of a natural language. The development of an automatic POS tagger requires either a
comprehensive set of linguistically motivated rules or a large annotated corpus. However, such
rules and corpora have been developed only for a few languages, such as English.
POS taggers for Indian languages are not readily available due to the lack of such
rules and large annotated corpora.
A part-of-speech is a grammatical category commonly including nouns, pronouns,
verbs, adjectives, adverbs, prepositions, conjunctions, interjections. Parts of speech can be
divided into two broad categories: closed classes and open classes. Closed classes are those
that have relatively fixed membership. For example, pronouns are categorized in closed class
because there is a fixed set of them in English; new pronouns are rarely added. But nouns are
in open class because new nouns are continually added in every language.
The linguistic approach is the classical approach to POS tagging. It was first
explored in the 1960s and 1970s (Harris, 1962; Klein and Simmons, 1963; Greene
and Rubin, 1971), when rules for tagging were engineered by hand. The most representative of
these pioneering taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial
tagging of the Brown Corpus. The development of ENGTWOL (an English tagger based on
the constraint grammar architecture) can be considered the most important work in this direction (Karlsson
et al., 1995). These taggers typically use rule-based models manually written by linguists.
The advantage of this model is that the rules are written from a linguistic point of view and
can be made to capture complex kinds of information. This allows the construction of an
extremely accurate system. But handling all the rules is not easy and requires expertise. The
context frame rules have to be developed by language experts, so it is costly and difficult to
develop a rule-based POS tagger. Further, if one uses rule-based POS tagging, transferring
the tagger to another language means starting from scratch again.
On the other hand, recent machine learning techniques make use of annotated
corpora to acquire high-level language knowledge for different tasks, including POS tagging.
This knowledge is estimated from corpora in which the words are usually tagged with the correct
part-of-speech labels. Machine learning based tagging techniques facilitate the
development of taggers in a shorter time, and these techniques can be transferred for use with
corpora of other languages. Several machine learning algorithms have been developed for the
POS disambiguation task. These algorithms range from instance-based learning to several
graphical models. The knowledge acquired may be in the form of rules, decision trees,
probability distributions, etc. The encoded knowledge in stochastic methods may or may not
have a direct linguistic interpretation. But typically such taggers need to be trained with a
large amount of annotated data to achieve high accuracy. Though significant amounts of
annotated corpus are often not available for most languages, it is easier to obtain large
volumes of un-annotated corpus for most languages. The implication is that one may
explore the power of semi-supervised and unsupervised learning mechanisms to obtain a POS
tagger.
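To make the idea of estimating tagging knowledge from an annotated corpus concrete, the following is a minimal sketch of a most-frequent-tag (unigram) baseline. It is only an illustration and not the method developed in this report; the toy corpus, the tag names (DET, N, V, ADJ) and the function names are assumptions introduced for the example.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Count how often each word carries each tag in the annotated corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the most frequent tag observed for each word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="N"):
    """Tag each word with its most frequent tag; unseen words get default_tag."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


# Hypothetical toy corpus; a real tagger needs far more annotated data.
toy_corpus = [
    [("the", "DET"), ("book", "N"), ("is", "V"), ("old", "ADJ")],
    [("read", "V"), ("the", "DET"), ("book", "N")],
    [("book", "V"), ("the", "DET"), ("flight", "N")],
]

model = train_unigram_tagger(toy_corpus)
tagged = tag(["Book", "the", "ticket"], model)
print(tagged)
```

Because "book" is seen twice as a noun and once as a verb, the baseline always tags it as a noun, even in an imperative such as "Book the ticket". This is exactly the kind of error that context-sensitive models or rules are meant to correct.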
Our interest is in developing a tagger for the Bengali language. Annotated corpora are
not readily available for this language, but the language is morphologically rich. The use of
the morphological features of a word, as well as word suffixes, can enable us to develop a POS
tagger with limited resources. In the present work, these morphological features (affixes)
have been incorporated in different machine learning models (Maximum Entropy,
Conditional Random Field, etc.) to perform the POS tagging task. This approach can be
generalized for use with any morphologically rich language in a poor-resource scenario.
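As an illustration of how affix information might be turned into features for such models, the sketch below extracts fixed-length prefixes and suffixes from a word. The function name, the feature names and the maximum affix length of 3 are assumptions made for this example; they are not taken from the system described in this report.

```python
def affix_features(word, max_len=3):
    """Return prefix and suffix features of length 1..max_len for a word.

    Affixes are measured in Unicode code points. For Bengali text this means
    a dependent vowel sign counts as its own unit, which is a simplification.
    """
    features = {}
    for n in range(1, max_len + 1):
        if len(word) > n:  # skip affixes that would cover the whole word
            features[f"prefix_{n}"] = word[:n]
            features[f"suffix_{n}"] = word[-n:]
    return features


feats = affix_features("walking")
print(feats)
```

A feature dictionary of this kind can be combined with the word itself and its neighbours and passed to a classifier, so that an unseen word still receives evidence from its ending.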
The development of a tagger requires either an exhaustive set of linguistic
rules or a large amount of annotated text. However, no tagged corpus was available to us for
use in this task, so we had to start by creating tagged resources for Bengali. Manual
part-of-speech tagging is a time-consuming and difficult process. We therefore tried to work with
methods in which a small amount of tagged resources can be used to carry out the
part-of-speech tagging task effectively.
1.4 The Part-of-Speech Tagging Problem
Natural languages are ambiguous in nature. Ambiguity appears at different levels of the
natural language processing (NLP) task. Many words take multiple part of speech tags. The
correct tag depends on the context.
Consider, for instance, the following English and Bengali sentences:
1. Keep the book on the top shelf.
2. সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে
The sentences contain considerable POS ambiguity, which must be resolved before the
sentences can be understood. For instance, in example sentence 1, the words “keep” and
“book” can each be a noun or a verb; “on” can be a preposition, an adverb or an adjective; and
“top” can be either an adjective or a noun. Similarly, in Bengali example sentence 2, the
word “তারা” can be either a noun or a pronoun; “দিয়ে” can be either a verb or a postposition; and
“করে” can be a noun, a verb or a postposition. In most cases POS ambiguity can be
resolved by examining the surrounding words. Figure 1 shows a detailed
analysis of the POS ambiguity of an English sentence considering only the basic 8 tags. A
box with a single line indicates the correct tag for a word where no ambiguity exists,
i.e. only one tag is possible for the word. In contrast, a box with a double line indicates
the correct POS tag of a word from a set of possible tags.
Figure 1: POS ambiguity of an English sentence with eight basic tags.
Figure 2: POS ambiguity of a Bengali sentence with the tagset used in the experiment. The sentence shown is “সকালে তারা ক্ষেতে লাঙল দিয়ে কাজ করে”, with candidate tags N, PR, V and PSP.
Figure 2 illustrates the ambiguity classes for the Bengali sentence under the
tagset used in our experiment. Because we use a fine-grained tagset compared with the basic 8
tags, the number of possible tags for a word increases. POS tagging is the task of assigning
an appropriate grammatical tag to each word of an input text in its context of appearance.
Essentially, the POS tagging task resolves ambiguity by selecting the correct tag from the set
of possible tags for a word in a sentence.
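The notion of selecting one tag from a word's set of possible tags can be sketched in a few lines of code. The example below stores an ambiguity class for each word of the English sentence discussed above and applies a handful of hand-written context rules. The lexicon, the tag names (including DET for the article) and the rules themselves are hypothetical and chosen only to resolve this one sentence; they are not the rule set of the system described in this report.

```python
# Hypothetical toy lexicon: each word maps to its ambiguity class,
# i.e. the set of tags it can take.
LEXICON = {
    "keep": {"N", "V"},
    "the": {"DET"},
    "book": {"N", "V"},
    "on": {"PREP", "ADV", "ADJ"},
    "top": {"ADJ", "N"},
    "shelf": {"N"},
}


def candidates(word):
    """Look up the ambiguity class; unknown words default to noun."""
    return LEXICON.get(word.lower(), {"N"})


def disambiguate(words):
    """Pick one tag per word using a few illustrative context rules."""
    tags = []
    for i, word in enumerate(words):
        cands = candidates(word)
        prev_tag = tags[i - 1] if i > 0 else None
        next_cands = candidates(words[i + 1]) if i + 1 < len(words) else set()
        if len(cands) == 1:
            chosen = next(iter(cands))      # unambiguous word
        elif prev_tag is None and "V" in cands:
            chosen = "V"                    # sentence-initial imperative
        elif prev_tag == "DET" and "ADJ" in cands and "N" in next_cands:
            chosen = "ADJ"                  # determiner + modifier + noun
        elif prev_tag == "DET" and "N" in cands:
            chosen = "N"                    # determiner + noun
        elif "PREP" in cands and "DET" in next_cands:
            chosen = "PREP"                 # preposition before a noun phrase
        else:
            chosen = sorted(cands)[0]       # arbitrary but deterministic fallback
        tags.append(chosen)
    return list(zip(words, tags))


result = disambiguate("Keep the book on the top shelf".split())
print(result)
```

Each rule looks only at the previous tag and the ambiguity class of the next word, which reflects the observation above that most ambiguity can be resolved from the surrounding words.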
1.5 Applications of POS Tagging
The POS disambiguation task is useful in several natural language processing tasks. It is often the
first stage of natural language understanding, after which further processing, e.g.
chunking and parsing, is done. Part-of-speech tagging is of interest for a number of
applications, including speech synthesis and recognition, machine translation and
lexicography.
Most natural language understanding systems are formed by a set of pipelined
modules, each of them specific to a particular level of analysis of the natural language text.
Development of a POS tagger influences several pipelined modules of the natural language
understanding task. As POS tagging is the first step towards natural language understanding, it
is important to achieve a high level of accuracy; otherwise, errors may hamper further stages
of natural language understanding. In the following, we briefly discuss some of the above
applications of POS tagging.
Speech synthesis and recognition: The part of speech gives a significant amount of information
about a word and its neighbours, which can be useful in a language model for speech
recognition (Heeman et al., 1997). The part of speech of a word also tells us something about how
the word is pronounced, depending on its grammatical category (for example, the noun is pronounced
OBject and the verb obJECT).
Information retrieval and extraction: By augmenting a query given to a retrieval
system with POS information, more refined information extraction is possible. For
example, if a person wants to search for documents containing “book” as a noun, adding
the POS information will eliminate irrelevant documents with only “book” as a verb.
Also, patterns used for information extraction from text often use POS references.
Machine translation: The probability of translating a word in the source
language into a word in the target language is effectively dependent on the
POS category of the source word.
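The retrieval example above, searching for “book” as a noun, can be sketched as a filter over POS-tagged documents. The document names, the tagged contents and the tag names below are hypothetical and serve only to illustrate the idea.

```python
def contains_word_with_tag(tagged_doc, word, tag):
    """Return True if the document contains `word` tagged as `tag`."""
    return any(w.lower() == word and t == tag for w, t in tagged_doc)


# Hypothetical documents, already tagged by some POS tagger.
docs = {
    "doc1": [("the", "DET"), ("book", "N"), ("is", "V"), ("old", "ADJ")],
    "doc2": [("please", "ADV"), ("book", "V"), ("a", "DET"), ("flight", "N")],
}

# A plain keyword query for "book" would match both documents;
# adding the POS constraint keeps only the noun usage.
hits = [name for name, doc in docs.items()
        if contains_word_with_tag(doc, "book", "N")]
print(hits)
```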
As mentioned earlier, POS tagging has been used in several other applications, such as a preprocessor
for higher-level syntactic processing (noun phrase chunking), lexicography, stylometry, and word
sense disambiguation. These applications are discussed in some detail in (Church, 1988;
Ramshaw and Marcus, 1995; Wilks and Stevenson, 1998).
1.6 Motivation
A lot of work has been done in part of speech tagging of several languages, such as English.
While some work has been done on the part of speech tagging of different Indian languages
(Ray et al., 2003; Shrivastav et al., 2006; Arulmozhi et al., 2006; Singh et al., 2006; Dalal et
al., 2007), the effort is still in its infancy. Very little work has been done previously with part
of speech tagging of Bengali. Bengali is the main language spoken in Bangladesh, the second
most commonly spoken language in India, and the seventh most commonly spoken language
in the world.
Apart from being required for further language analysis, Bengali POS tagging is of
interest for a number of applications such as speech synthesis and recognition. The part of speech
gives a significant amount of information about a word and its neighbours, which can be
useful in a language model for different speech and natural language processing applications.
Development of a Bengali POS tagger will also influence several pipelined modules of a
natural language understanding system, including information extraction and retrieval,
machine translation, partial parsing and word sense disambiguation. Existing POS
tagging techniques show that the development of a POS tagger with reasonably good accuracy
requires either an exhaustive set of linguistic rules or a large amount of annotated
text. We have the following observations.
i. POS tagging has a wide range of applications.
ii. Major companies such as Google and Microsoft are focusing on NLP
applications, so POS tagging has gained importance.
iii. Part-of-speech tagging using a rule-based approach is a challenging task. POS
tagging resolves ambiguities.
Therefore, there is a pressing need to develop an automatic part-of-speech tagger for
Bengali. With this motivation, the major goals of this report have been set.
1.7 Goals of Our Work
The primary goal of this work is to develop a part-of-speech tagger for Bengali with
reasonably good accuracy. To address this broad objective, we identify the following goals:
We wish to investigate different machine learning algorithms to develop a
part-of-speech tagger for Bengali.
Bengali is a morphologically rich language. We wish to use the morphological
features of a word, as well as word suffixes, to enable us to develop a POS tagger with
limited resources.
As stemming is one of the pre-processing steps in developing an effective POS tagger,
we wish to stem a few Bengali text documents.
1.8 Organization of the Report
The rest of this report is organized into chapters as follows:
Chapter 2 provides a review of the previous work on POS tagging. A comparative review
of the work is not given in this chapter because such an attempt is extremely difficult, due
to the large number of publications in this area and the many theories
and techniques used by researchers over the years. Instead, a brief review of the work
based on the different techniques used for POS tagging is presented. This chapter also
presents a discussion of English language POS taggers and Indian language POS
taggers.
Chapter 3 provides information about several important issues related to POS
tagging which can greatly influence the performance of the taggers, i.e. the corpora and the
Bengali tagset.
Chapter 4 provides information about the developed system and the way the system was
developed. The system architecture is also shown in this chapter.
Chapter 5 provides the experimental results and a discussion of those
results.
Chapter 6 presents the general conclusion. A summary of the work and its contributions are
outlined, along with a discussion of the scope for future research work.
CHAPTER 2
Prior Work
2.1 Prior Work in POS Tagging
The area of automated part-of-speech tagging has been enriched over the last few decades by
contributions from several researchers. Since its inception in the 1960s and 1970s
(Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971), many new concepts have
been introduced to improve the efficiency of taggers and to construct POS taggers for
several languages. Initially, people manually engineered rules for tagging. Linguistic taggers
incorporate the knowledge as a set of rules or constraints written by linguists. More recently,
several statistical or probabilistic models have been used for the POS tagging task to
provide transportable, adaptive taggers. Several sophisticated machine learning algorithms
have been developed that acquire more robust information. In general, all the statistical
models rely on manually POS-labelled corpora to learn the underlying language model, and such
corpora are difficult to acquire for a new language. Finally, combinations of several sources
of information (linguistic, statistical and automatically learned) are used in current
research.
This chapter provides a brief review of the prior work in POS tagging. For the sake of
conciseness, we do not aim to give a comprehensive review of the related work. Instead,
we provide a brief review of the different techniques used in POS tagging. Further, we
review the Indian language POS taggers in more detail.
2.2 Linguistic Taggers
Automated part-of-speech tagging was initially explored in the 1960s and 1970s, when
people manually engineered rules for tagging. The most representative of these pioneering
taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the
Brown Corpus. Since then, a lot of effort has been devoted to improving the quality
of the tagging process in terms of accuracy and efficiency.
Recent linguistic taggers incorporate the knowledge as a set of rules or constraints
written by linguists. The current models are expressive and accurate, and they are used in very
efficient disambiguation algorithms. The linguistic rules range from a few hundred to several
thousand, and they usually require years of labour. The development of ENGTWOL (an
English tagger based on the constraint grammar architecture) can be considered the most
important work in this direction. The constraint grammar formalism has also been applied to
other languages, such as Turkish.
The accuracy reported by the first rule-based linguistic English tagger was slightly
below 80%. Samuelsson and Voutilainen (1997) present a Constraint Grammar for English
tagging which achieves a recall of 99.5% with a very high precision of around 97%. The
advantages of linguistic taggers are that the models are written from a linguistic point of
view and explicitly describe linguistic phenomena, and that the models may contain many
complex kinds of information. Both properties allow the construction of extremely accurate
systems. However, the linguistic models are developed by introspection (sometimes with the
aid of reference corpora), which makes it particularly costly to obtain a good language model.
Transporting the model to another language would require starting over again.
2.3 POS Tagging Approaches
POS taggers are broadly classified into three categories: rule based, empirical based
and hybrid. In the rule based approach, hand-written rules are used to resolve
tag ambiguity. The empirical POS taggers are further classified into example based and
stochastic based taggers. Stochastic taggers are either HMM based, choosing the tag
sequence which maximizes the product of word likelihood and tag sequence probability, or
cue-based, using decision trees or maximum entropy models to combine probabilistic
features. The stochastic taggers are further classified into supervised and unsupervised
taggers. Each of these supervised and unsupervised taggers is categorized into different
groups based on the particular algorithm used. Fig. 2.3 shows the classification of
part-of-speech tagging approaches.
2.3.1 Rule Based POS tagging
The rule based POS tagging models apply a set of hand-written rules and use
contextual information to assign POS tags to words. These rules are often known as context
frame rules. For example, a context frame rule might say something like: “If an
ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as
an Adjective”. One of the first and most widely used English POS taggers that employs rule
based algorithms is Brill's tagger. The earliest algorithms for automatically assigning
parts of speech were based on a two-stage architecture. The first stage used a dictionary to
assign each word a list of potential parts of speech. The second stage used large lists of
hand-written disambiguation rules to bring this list down to a single part of speech for each
word. The ENGTWOL tagger is based on the same two-stage architecture, although both the lexicon
and the disambiguation rules are much more sophisticated than in the early algorithms.
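The two-stage architecture described above can be illustrated with a minimal sketch. The lexicon, the tag names and the function names below are hypothetical and chosen only for illustration; the single rule implemented is the context frame rule quoted in the text.

```python
# Stage 1 data: a toy lexicon (hypothetical) mapping words to their possible tags.
LEXICON = {
    "the": {"DET"},
    "light": {"NOUN", "ADJ", "VERB"},
    "house": {"NOUN"},
}

def lookup(tokens):
    """Stage 1: return the set of candidate tags for every token."""
    return [set(LEXICON.get(tok.lower(), {"UNK"})) for tok in tokens]

def disambiguate(candidates):
    """Stage 2: apply one context frame rule.

    If an ambiguous or unknown word is preceded by a determiner and
    followed by a noun, tag it as an adjective.
    """
    resolved = [set(c) for c in candidates]
    for i in range(1, len(resolved) - 1):
        ambiguous = len(resolved[i]) > 1 or resolved[i] == {"UNK"}
        if ambiguous and resolved[i - 1] == {"DET"} and resolved[i + 1] == {"NOUN"}:
            resolved[i] = {"ADJ"}
    return resolved

def tag(tokens):
    """Run both stages and pair each token with its remaining tags."""
    return list(zip(tokens, disambiguate(lookup(tokens))))

result = tag(["the", "light", "house"])
```

A real rule based tagger would hold many such rules and a far larger lexicon; the sketch only shows how the dictionary stage and the rule stage fit together.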
Fig.2.3 : Classification of POS tagging
2.3.2 Empirical Based POS tagging
The relative failure of rule-based approaches, the increasing availability of machine
readable text and the increase in capability of hardware (CPU, memory, disk space) with
decreasing cost are some of the reasons why researchers prefer corpus based POS tagging. The
empirical approach to part-of-speech tagging is further divided into two categories: the
example based approach and the stochastic based approach. The literature shows that the
majority of the developed POS taggers belong to the empirical approach.
2.3.2(a) Example Based POS tagging
Example based approaches depend on a tagged corpus, from which a model is
trained using a machine learning technique. In example based
morphosyntactic tagging, the problem is formulated as a classification task. The
features usually include the POS of neighbouring tokens, their orthographic forms and
sometimes fixed-width affixes of the word forms.
2.3.2(b) Stochastic based POS tagging
The stochastic approach finds the most frequently used tag for a specific word in
the annotated training data and uses this information to tag that word in unannotated text.
A stochastic approach requires a sufficiently large corpus and calculates the frequency,
probability or other statistics of each word in the corpus. The problem with this approach
is that it can come up with sequences of tags for sentences that are not acceptable according
to the grammar rules of a language. The use of probabilities in tagging is quite old:
probabilities were first used in 1965, a complete probabilistic tagger with Viterbi decoding
was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s
(Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). Supervised and unsupervised
are the two broad categories of the stochastic approach.
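The word-frequency idea described above can be sketched as follows. This is a minimal illustration under assumed data: the tiny English training set, the tag names and the default tag for unseen words are all hypothetical.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Count how often each word carries each tag and keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_with_most_frequent(model, tokens, default_tag="N"):
    """Tag each token with its most frequent training tag (default for unseen words)."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]

# Hypothetical annotated training data.
training = [
    [("time", "N"), ("flies", "V")],
    [("fruit", "N"), ("flies", "N"), ("like", "V"), ("bananas", "N")],
    [("he", "PR"), ("flies", "V")],
]
model = train_most_frequent_tag(training)
tagged = tag_with_most_frequent(model, ["fruit", "flies", "swiftly"])
```

Because each word is tagged independently of its neighbours, "flies" receives the tag V even after "fruit", which illustrates how this approach can produce tag sequences that the grammar would not accept.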
Supervised POS tagging: The supervised POS tagging models require pre-tagged
corpora, which are used during training to learn information about the tagset, word-tag
frequencies, rule sets etc. The performance of the models generally increases with the
size of this corpus. Two familiar examples of supervised POS taggers are those based on
the Hidden Markov Model and on Support Vector Machines.
Hidden Markov Model (HMM) based POS tagging: An alternative to the
word frequency approach is the n-gram approach, which calculates the
probability of a given sequence of tags. It determines the best tag for a word
by calculating the probability that it occurs with the n previous tags, where the
value of n is set to 1, 2 or 3 for practical purposes. These are known as the
unigram, bigram and trigram models. The most common algorithm for
implementing an n-gram approach for tagging new text is the Viterbi algorithm
for HMMs. The Viterbi algorithm is a search algorithm that
avoids the exponential expansion of a breadth first search by trimming the
search tree at each level using the best 'm' Maximum Likelihood Estimates
(MLE), where 'm' represents the number of tags of the following word. For a
given sentence or word sequence, HMM taggers choose the tag sequence that
maximizes the product in formula 1:
\[ P(\text{word} \mid \text{tag}) \times P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (1) \]
A bigram HMM tagger of this kind chooses the tag t_i for word w_i that is most
probable given the previous tag t_{i-1} and the current word w_i:
\[ t_i = \arg\max_{j} P(t_j \mid t_{i-1}, w_i) \qquad (2) \]
Support Vector Machines (SVM): SVM is a machine learning algorithm for
binary classification, which has been successfully applied to a number of
practical problems, including NLP. Let {(x1, y1), ..., (xN, yN)} be the set of N
training examples, where each instance xi is a vector in R^N and yi ∈ {−1, +1}
is the class label. In its basic form, an SVM learns a linear hyperplane that
separates the set of positive examples from the set of negative examples with
maximal margin (the margin is defined as the distance of the hyperplane to the
nearest of the positive and negative examples). This learning bias has proved
to have good properties in terms of generalization bounds for the induced classifiers.
The SVMTool is intended to comply with all the requirements of modern
NLP technology by combining simplicity, flexibility, robustness, portability
and efficiency with state-of-the-art accuracy. This is achieved by working in
the Support Vector Machines (SVM) learning framework, and by offering
NLP researchers a highly customizable sequential tagger generator.
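For the bigram HMM tagger described earlier in this section, a minimal Viterbi sketch is given below. It maximizes the product of formula 1 over a whole sentence with n = 1. The add-alpha smoothing, the "<s>" start symbol, the toy English training data and all function names are assumptions made for illustration and are not taken from any particular tagger.

```python
from collections import Counter, defaultdict

def train_bigram_hmm(tagged_sentences):
    """Collect counts for P(tag | previous tag) and P(word | tag)."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    return transitions, emissions

def prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed relative frequency (the smoothing is an assumption)."""
    return (counter[key] + alpha) / (sum(counter.values()) + alpha * n_outcomes)

def viterbi(tokens, transitions, emissions):
    """Return the most probable tag sequence for tokens under the bigram HMM."""
    if not tokens:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len({w for c in emissions.values() for w in c}) + 1  # +1 for unseen words
    # best[i][t] = (probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (prob(transitions["<s>"], t, n_tags)
                 * prob(emissions[t], tokens[0], n_words), None) for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            emit = prob(emissions[t], tokens[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] * prob(transitions[p], t, n_tags) * emit, p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Hypothetical annotated training data.
training = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
]
transitions, emissions = train_bigram_hmm(training)
predicted = viterbi(["the", "cat", "barks"], transitions, emissions)
```

Only one score per tag is kept at each position, so the work grows with the sentence length times the square of the number of tags rather than with the number of possible tag sequences.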
Unsupervised POS Tagging: Unlike the supervised models, the unsupervised POS
tagging models do not require a pre-tagged corpus. Instead, they use advanced computational
methods such as the Baum-Welch algorithm to automatically induce tagsets, transformation rules
etc. Based on this information, they either calculate the probabilistic information needed by
the stochastic taggers or induce the contextual rules needed by rule-based or
transformation-based systems.
Transformation-based POS tagging: In general, the supervised tagging
approach requires large pre-annotated corpora for training, which
are difficult to obtain in most cases. Recently, however, a good amount of work has been
done on automatically inducing the transformation rules. One approach to
automatic rule induction is to run an untagged text through a tagging model
and get the initial output. A human then goes through the output of this first
phase and corrects any erroneously tagged words by hand. This tagged text is
then submitted to the tagger, which learns correction rules by comparing the
two sets of data. Several iterations of this process are sometimes necessary
before the tagging model achieves considerable performance. The
transformation based approach is similar to the rule based approach in the
sense that it depends on a set of rules for tagging.
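The idea of learning correction rules by comparing the initial output with the hand-corrected text can be sketched as follows. This is a simplified illustration and not Brill's full algorithm: it assumes a single rule template ("change tag A to tag B when the previous tag is C"), and the tag sequences and function names are hypothetical.

```python
from collections import Counter

def learn_best_rule(current_tags, gold_tags):
    """Return the rule (from_tag, to_tag, prev_tag) with the highest net gain, or None."""
    gain = Counter()
    for i in range(1, len(gold_tags)):
        if current_tags[i] != gold_tags[i]:
            # This rule would correct the error at position i.
            gain[(current_tags[i], gold_tags[i], current_tags[i - 1])] += 1
    for from_tag, to_tag, prev_tag in list(gain):
        for i in range(1, len(gold_tags)):
            # Subtract the places where the rule would break a correct tag.
            if (current_tags[i] == from_tag and current_tags[i - 1] == prev_tag
                    and gold_tags[i] == from_tag):
                gain[(from_tag, to_tag, prev_tag)] -= 1
    if not gain:
        return None
    rule, score = gain.most_common(1)[0]
    return rule if score > 0 else None

def apply_rule(tags, rule):
    """Apply one transformation everywhere its context matches."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def learn_rules(initial_tags, gold_tags, max_rules=10):
    """Greedily learn an ordered list of correction rules."""
    rules, current = [], list(initial_tags)
    for _ in range(max_rules):
        rule = learn_best_rule(current, gold_tags)
        if rule is None:
            break
        rules.append(rule)
        current = apply_rule(current, rule)
    return rules, current

# Hypothetical tags for "the can is full they can swim": initial output vs. corrected text.
initial = ["DET", "V", "V", "JJ", "PR", "V", "V"]
gold = ["DET", "N", "V", "JJ", "PR", "V", "V"]
rules, final = learn_rules(initial, gold)
```

Here the hand-corrected text plays the role of `gold`, and the learned rule list can later be applied, in order, to the initial tagging of new text.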
2.3.3 Hybrid Based Tagger
A hybrid approach combines the features of both the rule based and the stochastic
approaches. Like rule based systems, hybrid taggers use rules to specify tags. Like stochastic
systems, they use machine learning to induce rules automatically from a tagged training corpus.
The transformation-based learning (TBL) tagger, or Brill tagger, shares features of the hybrid
approach. This approach inherits the advantages and disadvantages of both the rule based and
the stochastic approaches.
2.4 Indian Language POS Taggers
There has been a lot of interest in Indian language POS tagging in recent years. POS tagging
is one of the basic steps in many language processing tasks, so it is important to build good
POS taggers for these languages. However, very little work has been done
on Bengali POS tagging, and only a very limited amount of resources is available.
The oldest work on Indian language POS tagging that we found is by Bharati et al. (1995).
They presented a framework for Indian languages where POS tagging is implicit and
is merged with the parsing problem in their work on the computational Paninian parser.
For Bengali, Dandapat et al. (2007) studied the possibility of developing a tagger
using HMM and Maximum Entropy (ME) models. They used a morphological analyzer
to compensate for the shortage of annotated corpus. With these two models they implemented
a supervised tagger and a semi-supervised tagger, and reported an accuracy of around 88% for
the two approaches. Ekbal et al. (2007) annotated a news corpus and developed an SVM based
tagger. They reported an accuracy of 86.84% for their tagger.
An attempt at Hindi POS disambiguation was made by Ray et al. (2003). The
part-of-speech tagging problem was solved as an essential requirement for local word
grouping. Lexical sequence constraints were used to assign the correct POS labels for Hindi.
A morphological analyzer was used to find the possible POS of every word in a sentence.
A rule based POS tagger for Tamil (Arulmozhi et al., 2004) has been developed using a
combination of lexical rules and context sensitive rules. They used a very coarse-grained
tagset of only 12 tags. They reported an accuracy of 83.6% using only lexical rules and
88.6% after applying the context sensitive rules. The accuracies reported in the work were
measured on a very small reference set of 1000 words.
Shrivastav et al. (2006) presented a CRF based statistical tagger for
Hindi. They used 24 different features (lexical features and spelling features) to generate the
model parameters. They experimented on a corpus of around 12,000 tokens annotated
with a tagset of size 23. The reported accuracy was 88.95% with 4-fold cross validation.
Smriti et al. (2006) describe a technique for morphology-based
POS tagging in a limited resource scenario. The system uses a decision tree based
learning algorithm (CN2). They used a stemmer, a morphological analyzer and a verb group
analyzer to assign morphotactic tags to all the words, which identify the Ambiguity
Scheme and Unknown Words. Further, a manually annotated corpus was used to generate
If-Then rules to assign the correct POS tags for each ambiguity scheme and for unknown words.
A tagset of 23 tags was used for the experiment. An accuracy of 93.5% was reported with
4-fold cross validation on a modestly sized corpus (around 16,000 words).
In 2006, two machine learning contests were organized on part-of-speech tagging and
chunking for Indian Languages for providing a platform for researchers to work on a
common problem. Both the contests were conducted for three different Indian languages:
Hindi, Bengali and Telugu. All the languages used a common tagset of 27 tags. The results of
the contests give an overall picture of the Indian language POS tagging. The first contest was
conducted by NLP Association of India (NLPAI) and IIIT-Hyderabad in the summer of 2006.
CHAPTER 3
Foundational Considerations
In this chapter we discuss several important issues related to the POS tagging problem which
can greatly influence the performance of a tagger. One important issue in POS tagging is
collecting and annotating corpora. Most of the statistical techniques rely on some amount of
annotated data to learn the underlying language model. The size of the corpus and the amount
of corpus ambiguity have a direct influence on the performance of a tagger. Finally, there are
several other issues, e.g. how to handle unknown words and which smoothing technique to use,
that contribute to the performance of a tagger.
In the following sections, we discuss two important issues related to POS tagging.
The first section discusses the process of corpora collection. In the second section we
present the tagset used for our experiment.
3.1. Corpora Collection
In general, the compilation of raw text corpora is no longer a big problem, since nowadays
most documents are written in a machine readable format and are available on the web.
Collecting raw corpora is somewhat more difficult for Bengali (and possibly for other Indian
languages) than for English and other European languages. This is because
many different encoding standards are in use. Also, the number of Bengali
documents available on the web is comparatively limited.
Raw corpora do not carry much linguistic information. Corpora acquire higher
linguistic value when they are annotated, that is, when some amount of linguistic information
(part-of-speech tags, semantic labels, syntactic analysis, named entities etc.) is embedded
in them. Although many corpora (both raw and annotated) are available for English and other
European languages, we had no tagged data for Bengali with which to start the POS tagging
task. The raw corpus developed at TDIL was available to us. We used a portion of the TDIL
corpus to develop the annotated data for the experiments.
3.2. The Tagset
With respect to the tagset, the main feature that concerns us is its granularity, which is
directly related to the size of the tagset. If the tagset is coarse, the tagging accuracy will
tend to be higher, since only the important distinctions are considered, and the classification
is easier for both human annotators and the machine. However, some important
information may be missed because of the coarse-grained tagset. On the other hand, a too
fine-grained tagset may enrich the supplied information, but the performance of the automatic
POS tagger may decrease. A much richer model is required to capture the encoded
information when using a fine-grained tagset, and hence it is more difficult to learn.
So, when designing a tagset for the POS disambiguation task, some
issues need to be considered. These include the type of application (some
applications may require more complex information, whereas category information alone may
be sufficient for others) and the tagging technique to be used (rule based, which can adopt
large tagsets very well, or supervised/unsupervised learning). Further, a large annotated
corpus is usually required for statistical POS taggers. A too fine-grained tagset might be
difficult for human annotators to use during the development of a large annotated corpus.
Hence, the availability of resources needs to be considered during the design of a tagset.
The Bureau of Indian Standards (BIS) has recommended the use of a common
tagset for the part-of-speech annotation of Indian languages. The tagset, which incorporates
the advice of experts and stakeholders in the area of natural language processing and
language technology for Indian languages, is to be followed in annotation tasks for
Indian languages after August 2010.
The BIS tagset has a total of 38 annotation level tags which are common to all the
Indian languages covered by this tagset. We use the eight (8) basic part-of-speech
tags, i.e. Noun, Pronoun, Verb, Adjective, Adverb, Preposition, Conjunction and Interjection,
along with Residuals and Quantifiers from the BIS tagset.
The table below lists the individual tags used in our experiments, with examples:
Category                     Annotation Tag   Examples
Noun                         N                িীপঙ্কর , রাম, লযাম , দিল্লী etc
Pronoun                      PR               ক্ষস, দতদি,তা, দযদি, আদম, তুদম , আমরা, তারা etc
Verb                         V                কদর, করাম, খাওো, ে, ক্ষদখ etc
Adjective                    JJ               খারাপ, ভাবা, েড়, ক্ষছটা etc
Adverb                       RB               অদিকতর, অিবূর, এতটা, etc
Preposition / Postposition   PSP              ক্ষেবক, হইবত, উপবর, দভতর etc
Conjunction                  CC               এেং, দকন্তু , অেচ, অেো
Interjection                 INJ              প্লীজ,িন্নোি,সােিাি, হাাঁ, etc
Residuals                    RD               । , , , ?, “” , ‘ ‘ ,
Quantifiers                  QT               প্রেম , ,১,২.etc
Table 3.2 : The tagset for Bengali with 10 tags
CHAPTER 4
Tagging with Rule Based
Approach
The first section describes the rule based approach to POS tagging, which we adopt because
only a small labelled training set is available to us for Bengali. The second section is
devoted to our particular approach to Bengali POS tagging using the rule based approach.
4.1. Rule Based Approach
The rule based POS tagging models apply a set of hand-written rules and use contextual
information to assign POS tags to each word in a sentence. These rules are often known as
context frame rules. Most rule based taggers have a two-stage architecture. The first
stage is simply a dictionary look-up procedure, which returns a set of potential tags and
appropriate syntactic features for each word. The second stage uses a set of hand-written rules
to discard contextually illegitimate tags and obtain a single best POS for each word. A context
frame rule might say something like: “If the current word is a postposition, then there is a
high probability that the previous word is a noun.” For example, in the sentence “ক্ষস লদিির উপর পাের ছুবর
মার।” the noun-adjective {N, JJ} ambiguity is present in the word “লদিির”. Since the next
word is a postposition, the rule resolves this ambiguity in favour of the noun tag.
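The postposition rule quoted above can be sketched as follows. This is only an illustration: the function name is hypothetical, and the tokens "w1", "w2" and "w4" are placeholders rather than real Bengali words, so that the example depends only on the candidate tag sets.

```python
def resolve_before_postposition(candidates):
    """candidates: list of (token, set_of_possible_tags) pairs.

    If the current word is a postposition (PSP) and the previous word is
    ambiguous between noun (N) and adjective (JJ), keep only the noun tag.
    """
    resolved = [(tok, set(tags)) for tok, tags in candidates]
    for i in range(1, len(resolved)):
        prev_tok, prev_tags = resolved[i - 1]
        if resolved[i][1] == {"PSP"} and prev_tags == {"N", "JJ"}:
            resolved[i - 1] = (prev_tok, {"N"})
    return resolved

# Placeholder tokens; only "উপর" is taken from the example sentence.
sentence = [("w1", {"PR"}), ("w2", {"N", "JJ"}), ("উপর", {"PSP"}), ("w4", {"N"})]
resolved = resolve_before_postposition(sentence)
```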
In addition to contextual information, many taggers use morphological information to
help in the disambiguation process. An example of a rule that makes use of morphological
information is: IF a word ends with –“ইরেছি / ছিলাম ” and the preceding word is a verb THEN
label it as a verb (V).
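A morphological rule of this kind can be sketched as below. The suffix list contains only one of the endings quoted above and is illustrative, the function name is hypothetical, "w1" is a placeholder token, and `None` is used here as an assumed marker for a word that has not yet received a tag.

```python
VERB_SUFFIXES = ("ছিলাম",)  # illustrative; a real rule set would list more endings

def apply_suffix_rule(tagged, suffixes=VERB_SUFFIXES):
    """tagged: list of (token, tag_or_None) pairs; None marks an untagged word.

    IF a word ends with one of the suffixes and the preceding word is a verb
    THEN label it as a verb (V).
    """
    out = list(tagged)
    for i in range(1, len(out)):
        token, tag = out[i]
        if tag is None and token.endswith(suffixes) and out[i - 1][1] == "V":
            out[i] = (token, "V")
    return out

# "w1" stands for any word already tagged as a verb.
labelled = apply_suffix_rule([("w1", "V"), ("করছিলাম", None)])
```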
Speed is an advantage of rule based taggers and, unlike stochastic taggers, they are
deterministic. Their main cost is the considerable effort required to write the
disambiguation rules. Also, a rule based tagger is language dependent, i.e. it is usable for
only one language. Using it for another language requires rewriting most of the program.
4.2. Our Approach
4.2.1 System Flow Diagram
This section is concerned with how the processing tasks are designed. We address
the following questions:
What are the modules that need to be designed?
How are they interconnected?
[Flow diagram: Start → Show the GUI → Accept Bengali language input → Divide the sentence
into tokens → Tokens with suffix / affix? (Yes: split tokens into their stems by stemming;
No: continue) → Assign the tags to tokens in the tagger → Find ambiguous words → Assign the
tags to ambiguous words using POS tagging rules → View the result → Stop]
Fig 4.2.1: Flow diagram
Fig 4.2.1 shows a diagrammatic representation of the flow of data through the
system. It consists of the following components/modules:
GUI (Graphical User Interface): The interface through which the user communicates with
the back-end files. The interface should be simple in appearance and easy to maintain.
Tokenizer: This module generates the tokens of the given input sentence. It also
calls the other modules when required. The tokens of the sentence are stored
in a String array for further processing.
Stemming: The stemming module reduces a word to its stem, i.e. root. Stemming is an
important and common requirement of many Natural Language Processing
tasks. Word stemming is useful for indexing and search systems, and indexing and
searching are key concepts of text mining applications and IR systems. It has also
been used to improve the performance of spelling checkers where morphological
analysis would be computationally expensive. A stemmer can also reduce the size of a
dictionary, which is the main reason to use a stemmer in spelling checker applications
on mobile and other handheld devices.
Tagging: The tagging module assigns tags to tokens. It also searches for ambiguous
words and marks them with special symbols according to their type. Words
which are not present in the lexicon are treated as unknown.
The ambiguous words are those which can act as a noun and an adjective, or as an adjective
and an adverb, depending on the context.
Resolving Ambiguity: The ambiguity identified in the tagging module is resolved
using Bengali grammar rules.
Displaying results: This module displays the final result. The tokens, i.e. the
words in the sentences, are shown with their corresponding parts of speech.
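The modules listed above can be sketched end to end as follows. This is a minimal illustration and not the system's actual implementation: the miniature lexicon, the suffix list, the "UNK" and "AMBIG:" markers and all function names are assumptions, and ambiguity resolution and the GUI are left out.

```python
import re

# Hypothetical miniature lexicon: stem -> set of possible tags.
LEXICON = {
    "আমি": {"PR"},
    "রাম": {"N"},
    "বই": {"N"},
    "ভাল": {"N", "JJ"},
    "।": {"RD"},
}
# Hypothetical suffix list, longest suffix first.
SUFFIXES = ("দের", "ের", "কে", "রা")

def tokenize(sentence):
    """Split on whitespace and separate the danda and common punctuation marks."""
    return re.findall(r"[^\s।,?!]+|[।,?!]", sentence)

def stem(token):
    """Strip the first matching suffix, keeping at least two characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token

def tag_token(token):
    """Look up the stem; mark unknown and ambiguous words for later processing."""
    tags = LEXICON.get(stem(token))
    if tags is None:
        return "UNK"
    if len(tags) > 1:
        return "AMBIG:" + "/".join(sorted(tags))
    return next(iter(tags))

def tag_sentence(sentence):
    return [(tok, tag_token(tok)) for tok in tokenize(sentence)]

tagged = tag_sentence("আমি রামকে বই দিলাম।")
```

Words marked "AMBIG:" would then be passed to the rule based disambiguation step, and words marked "UNK" correspond to tokens that are missing from the lexicon.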
CHAPTER 5
Experimental Result &
Discussion
5.1 Tools Used
Software: A few software tools were used in the development of the project, as listed
below:
JDK 1.7.0_05
NetBeans IDE 7.1.1: NetBeans IDE lets you quickly and easily develop Java desktop, mobile
and web applications. It can be downloaded directly from https://netbeans.org/downloads/
Fig 5.1.1 NetBeans IDE
Notepad: Notepad is a simple plain-text editor for Microsoft Windows that can be used to
create documents. It has been included in all versions of Microsoft Windows since
Windows 1.0 in 1985, so there is no need to download it. The resulting file is typically
saved with the .txt extension. Although it looks like a simple application, it is useful in
software development: it can be used to write source code in languages such as C, C++,
Java, HTML and many more, saved with the appropriate extensions.
Fig 5.1.2 Notepad
Hardware: We designed and developed the whole system on an ACCR Notebook with the
following specification:
Processor: Intel(R) Pentium(R) CPU 2030M @ 2.50GHz
RAM : 4.00 GB
HDD : 500 GB
Although this system is adequate for development, it is slow when handling large amounts of
data: the larger the data, the slower the system responds. This is mainly due to the
processor; with an Intel Core i3 or better, the speed would improve.
5.2 Graphical User Interface
Snapshot 1: This is the welcome screen of our project. Click the Proceed button to go
to the Tagging section.
Fig 5.2.1: Welcome Screen
Snapshot 2: Here we first enter the Bengali sentence to be tagged in the
blank text field and then press the TAG button. The RESET button
removes all the text from the text field.
Fig 5.2.2 : The Tagging Menu
5.3 Experimental Results
The system has been tested with a set of data. The input text is taken from the corpus
discussed in Chapter 3. Only four results are shown in the following
snapshots.
Result I:
Result II:
Result III:
Result IV:
5.4 Result Discussion
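Accuracy of the tagger is computed as the ratio of the number of words correctly tagged by
the system to the total number of tested words:
\[ \text{Accuracy} = \frac{\text{number of correctly tagged words}}{\text{total number of tested words}} \times 100\% \]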
The following observations were made while testing the system.
Test      No. of tested words   Accuracy
Test 1    150                   67%
Test 2    400                   71%
Test 3    800                   78%
Test 4    1200                  82%
The overall accuracy of the system was computed by taking the mean of the four test
results, which gives an overall accuracy of 74.50%.
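The accuracy computation described above can be reproduced with a short sketch. The per-test figures are those reported in the table; the function and variable names are hypothetical. The size-weighted mean is included only for comparison: it is an estimate derived from the rounded percentages and is not a figure reported in this work.

```python
# (number of tested words, reported accuracy in percent) for Tests 1 to 4
tests = [(150, 67.0), (400, 71.0), (800, 78.0), (1200, 82.0)]

def accuracy(correct, total):
    """Percentage of words tagged correctly."""
    return 100.0 * correct / total

# Overall accuracy as used in this report: the unweighted mean of the four tests.
mean_accuracy = sum(acc for _, acc in tests) / len(tests)

# For comparison only: a mean weighted by the number of tested words.
weighted_accuracy = sum(n * acc for n, acc in tests) / sum(n for n, _ in tests)
```

The unweighted mean gives every test the same influence regardless of its size, whereas the weighted mean gives more influence to the larger tests.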
CHAPTER 6
Conclusion & Future
Works
6.1 Conclusion
Part-of-speech tagging plays an important role in various speech and language
processing applications in NLP. Since many reputed companies such as Google and
Microsoft are concentrating on natural language processing applications, it has gained
importance. Currently, many tools are available for the task of part-of-speech tagging. In
this report, our effort was a computational linguistic analysis of the Bengali language by
developing a tagging system, with which we achieved an overall accuracy of 74.50%. The results
show that the performance of the tagger depends upon the size of the lexicon and corpus. The
performance can be increased by increasing the size of the lexicon.
6.2 Future Work
Future work remains to be done in several directions. Though we attained an accuracy of
74.50% for known words, enhancing the performance of the tagger is still an open area.
This can be achieved by extending the tagset and enlarging the lexicon so that the
tagger classifies the text with less ambiguity. Our results can also be compared with
the results achieved by other Indian language tagging systems.
References
Church K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text.
In Proceedings of the Second Conference on Applied Natural Language Processing,
Austin, Texas. 136-143.
Ramshaw L. A. and Marcus M. P. 1995. Text chunking using transformation-based learning.
In Proceedings of the Third Workshop on Very Large Corpora, ACL.
Wilks Y. and Stevenson M. 1997. Combining Independent Knowledge Sources for Word
Sense Disambiguation. In Proceedings of the Third Conference on Recent Advances
in Natural Language Processing (RANLP-97), Bulgaria. 1-7.
Heeman P. A. and Allen J. F. 1997. Incorporating POS tagging into language modelling. In
Proceedings of the 5th European Conference on Speech Communication and
Technology (Eurospeech), Rhodes, Greece.
Ray P. R., Harish V., Basu A. and Sarkar S. 2003. Part of Speech Tagging and Local Word
Grouping Techniques for Natural Language Processing. In Proceedings of the 1st
International Conference on Natural Language Processing.
Shrivastav M., Melz R., Singh S., Gupta K. and Bhattacharyya P. 2006. Conditional
Random Field Based POS Tagger for Hindi. In Proceedings of the MSPIL, Bombay.
63-68.
Dandapat S., Sarkar S. and Basu A. 2007. Automatic Part-of-Speech Tagging for Bengali: An
Approach for Morphologically Rich Languages in a Poor Resource Scenario. In
Association for Computational Linguistics. 221-224.
Bharati A., Chaitanya V. and Sangal R. 1995. Natural Language Processing: A Paninian
Perspective. Prentice-Hall India, New Delhi.
Arulmozhi P., Rao R. K. and Sobha L. 2006. A Hybrid POS Tagger for a Relatively Free
Word Order Language. In Proceedings of the Modeling and Shallow Parsing of
Indian Languages (MSPIL), Bombay. 79-85.
Singh S., Gupta K., Shrivastav M. and Bhattacharyya V. 2006. Morphological Richness
Offset Resource Demand: Experience in Constructing a POS Tagger for Hindi. In
Proceedings of COLING/ACL 06. 779-786.
Dalal, K. Nagaraj, U. Swant, S. Shelke and P. Bhattacharyya. 2007. Building Feature Rich
POS Tagger for Morphologically Rich Languages: Experience in Hindi. In
Proceedings of ICON, India.
Greene B. B. and Rubin G. M. 1971. Automatic grammatical tagging of English. Technical
Report, Department of Linguistics, Brown University.
Samuelsson C. and Voutilainen A. 1997. Comparing a linguistic and a stochastic tagger. In
Proceedings of the Eighth Conference of the European Chapter of the Association for
Computational Linguistics (EACL), Madrid, Spain. 246-253.
Ekbal A. and Bandyopadhyay S. 2007. Lexicon Development and POS tagging using A
Tagged for Marathi Text. 2014, in International Journal of Computer
Science and Information Technologies, Vol. 5 (2), 2014, 1322-1326.
APPENDIX
CD