8/7/2019 Final 20 Dec Reprt..(2)
http://slidepdf.com/reader/full/final-20-dec-reprt2 1/53
JAVA API FOR NLP TOOLS
Bachelor of Technology
Computer Science and Engineering
Submitted By:-
RAJSHREE GUPTA (0709110075)
RICHA PANDEY (0709110079)
SURABHI VERMA (0709110104)
SWETA SINGH (0709110105)
Department of Computer Science and Engineering
JSS Academy Of Technical Education,
Noida
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by
another person nor material which to a substantial extent has been accepted for
the award of any other degree or diploma of the university or other institute of
higher learning, except where due acknowledgment has been made in the text.
Signature: Signature:
Name :Rajshree Gupta Name :Richa Pandey
Roll No.:0709110075 Roll No.:0709110079
Date : Date :
Signature: Signature:
Name :Surabhi Verma Name :Sweta Singh
Roll No.: 0709110104 Roll No.: 0709110105
Date : Date :
CERTIFICATE
This is to certify that Project Report entitled “Java API for NLP tools” which is submitted
by Rajshree Gupta, Richa Pandey, Surabhi Verma and Sweta Singh in partial fulfillment
of the requirement for the award of degree B. Tech. in Department of Computer Science
& Engineering of U. P. Technical University, is a record of the candidates’ own work
carried out by them under my supervision. The matter embodied in this thesis is original
and has not been submitted for the award of any other degree.
Supervisor
Mrs. Seema Shukla
Asst. Professor
Department of Computer Science & Engineering.
Date
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B. Tech. project undertaken
during our B. Tech. final year. We owe a special debt of gratitude to Professor (Mrs.) Seema
Shukla, Department of Computer Science & Engineering, JSS Academy of Technical Education,
Noida, for her constant support and guidance throughout the course of our work. Her sincerity,
thoroughness and perseverance have been a constant source of inspiration for us. It is only
because of her cognizant efforts that our endeavors have seen the light of day.
We would also not like to miss the opportunity to acknowledge the contribution of all the
faculty members of the department for their kind assistance and cooperation during the
development of our project. Last but not least, we acknowledge our friends for their
contribution to the completion of the project.
Signature: Signature:
Name :Rajshree Gupta Name :Richa Pandey
Roll No.:0709110075 Roll No.:0709110079
Date : Date :
Signature: Signature:
Name :Surabhi Verma Name :Sweta Singh
Roll No.: 0709110104 Roll No.: 0709110105
Date : Date :
ABSTRACT
Many application areas require text preprocessing. Java API for NLP Tools is an API that
analyzes, understands, and helps in processing the languages that humans use naturally. The
proposed API provides programmers with a platform for preprocessing natural text.

The proposed API supports a variety of NLP tools: the Tokenizer splits a stream of text into
tokens; the Stop Word Remover eliminates stop words from the text; the Word Frequency
Counter counts the frequency of words in the input file; the Stemmer reduces inflectional
forms and derivationally related forms of a word to a common base form; the N-Gram
Identifier identifies subsequences of n words from a given sequence; the Multiword Extractor
identifies multiword sets from a corpus; and the Word Sense Disambiguator identifies the
correct meaning of an ambiguous word.

Java API for NLP Tools tries to eliminate the major deficiencies in the available tools,
although with some constraints, as mentioned in the subsequent chapters.
TABLE OF CONTENTS

DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Problem Introduction
1.1.1 Motivation
1.1.2 Applications of Java API for NLP Tools
1.1.3 Objective
1.1.4 Scope of the Project
1.2 Related Previous Work
1.2.1 NLP Tools
1.3 Organization of the Report
CHAPTER 2 LITERATURE SURVEY
2.1 Natural Language Processing
2.2 Tokenizer
2.2.1 Need of Tokenizer
2.2.2 Existing Approaches with Recurring Problems
2.3 Stop Words
2.3.1 Significance of Stop Word List
2.3.2 Problems Encountered in Search Engines
2.4 Word Frequency Counter
CHAPTER 3 TOKENIZER TOOL
3.1 Problem Identification and Elimination
3.2 Algorithm and Flowchart
3.3 Class Description
3.3.1 Attributes
3.3.2 Constructors
3.3.3 Methods
3.4 Assumptions and Dependencies
3.5 Constraints
3.6 Result
CHAPTER 4 STOP WORD REMOVER
4.1 Algorithm and Flowchart
4.2 Class Description
4.2.1 Attributes
4.2.2 Constructors
4.2.3 Methods
4.3 Assumptions and Dependencies
4.4 Constraints
4.5 Result
CHAPTER 5 WORD FREQUENCY COUNTER
5.1 Algorithm and Flowchart
5.2 Class Description
5.2.1 Attributes
5.2.2 Constructors
5.2.3 Methods
CHAPTER 6 CONCLUSION
6.1 Agenda for Next Semester
APPENDIX A: LIST OF STOPWORDS
REFERENCES
LIST OF FIGURES

Fig. 3.1 Flowchart for Tokenizer
Fig. 3.2 Class Diagram for class Tokenizer
Fig. 3.3 Snapshot of User Interface
Fig. 3.4 Snapshot of Tokenizer Tool Interface
Fig. 3.5 Test Case 1 Snapshot
Fig. 3.6 Test Case 2 Snapshot: Browsing an Input File
Fig. 3.7 Test Case 2 Snapshot: Selecting an Output File
Fig. 4.1 Flowchart for Stop Word Removal
Fig. 4.2 Class Diagram for class StopWordRemover
Fig. 5.1 Flowchart for Word Frequency Counter
Fig. 5.2 Class Diagram for class WordFrequencyCounter
LIST OF TABLES

Table 1.1 Table representing key application areas
CHAPTER 1
INTRODUCTION
NLP tools are useful in many areas of computational linguistics and information retrieval,
such as automated morphological analysis, stylistic or mathematical analysis of a body of
language, automatic text summarization, and automatic text categorization, all of which
require texts to be pre-processed. However, certain linguistic problems exist in almost every
tool, no matter what its ultimate use.
Besides this, existing APIs are too complex to use and do not contain all of the proposed
tools.
1.1 Problem Introduction
For the purpose of preprocessing natural language, the task of identifying classes, their
attributes, methods and other object-oriented features will be an umbrella activity carried
out for all the NLP tools supported by the proposed Java API.
1.1.1 Motivation
Natural languages are very complex, and natural language processing is a very attractive
method of human-computer interaction. The goal is to design and build APIs that will
analyze, understand, and help in processing the languages that humans use naturally, so that
eventually one will be able to address a computer as though one were addressing another
person [1, 3].

There are many applications, such as automatic text summarization and automatic text
categorization, that require texts to be pre-processed. They also require tasks such as stop
word removal and stemming. Although many APIs do exist for NLP tools, most of them are
complex to use and do not contain all of the proposed tools. The proposed work is focused on
developing an efficient Java API for NLP tools that processes texts and makes their
information accessible to computer applications.
1.1.2 Applications of JAVA API for NLP tools
Different NLP tools are used for different purposes. Some of the applications are:
• Automatic summarization
• Machine Translation
• Morphological Segmentation
• Natural Language Generation
• Information Retrieval
• Information Extraction
• Question Answering System
1.1.3 Objective
The primary objective of the project is to provide programmers of NLP based applications
with an easy to use API for NLP tools such as the following:
• Tokenizer
• Stop word remover
• Frequency counter of words
• N-gram identification
• Stemmer
• Multiword Extractor
• Disambiguator
1.1.4 Scope of the Project
The following tasks will be carried out to achieve the objectives stated above:
Designing of classes
Identifying classes, their attributes, methods and other object-oriented features will be an
umbrella activity carried out for all the NLP tools supported by the proposed Java API.
Design and development of the following tools
• Tokenizer
Tokenization is the process of demarcating and possibly classifying sections of a string of
input characters [1]. The Tokenizer splits text into simple tokens based on delimiters that
may be specified either at the time of creation or on a per-token basis. An effort will be
made to eliminate as many problems as possible, such as tokenizing an abbreviation like
A.K.Sharma as a single token.
• Stop Word Remover
Stop words is the name given to words which are filtered out prior to, or after, processing
of natural language data (text) [2]. Some examples of stop words are to, for, and of. To
remove stop words, a database will be maintained which can be manipulated by the user.
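The removal step itself can be sketched as follows. This is only an illustrative sketch: the actual tool reads a user-editable stop word list from a database, whereas here a small hard-coded set and the class and method names are assumptions, not the API's real signatures.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: a small hard-coded set stands in for the
// user-maintained stop word database described in the text.
public class StopWordSketch {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("to", "for", "of", "the", "a", "an"));

    // Returns the input with stop words filtered out, preserving word order.
    public static String removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(w -> !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }
}
```

For instance, "flights to the city of London" would reduce to "flights city london".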
• Word frequency counter
The aim is to develop an efficient algorithm to count the occurrences of unique words in a
corpus and store them in a database or a text file.
• N-Gram Identification [1]
An n-gram is a subsequence of n words from a given sequence. An n-gram of size 1 is
referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a
"trigram"; and size 4 or more is simply called an "n-gram". The effort will be to design an
appropriate algorithm to identify n-grams with a user-defined value of n. Even if that is not
achieved, at least trigrams will be identified.
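Word-level n-gram extraction with a user-defined n can be sketched as below. The class and method names are illustrative, not part of the proposed API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of word-level n-gram extraction for a user-defined n.
public class NGramSketch {
    // Returns every contiguous subsequence of n words from the text.
    public static List<String> ngrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }
}
```

With n = 2, "the quick brown fox" yields the bigrams "the quick", "quick brown", "brown fox".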
• Stemmer
The goal of a stemmer is to reduce inflectional forms and derivationally related forms of
a word to a common base form. Stemming usually refers to a crude heuristic process that
chops off the ends of words and often includes the removal of derivational affixes [2].
The subtasks to be carried out for this tool are:
Study of existing stemming algorithms and their drawbacks.
Implementation of a stemming algorithm, which is a crude heuristic process.
Development of a statistical stemmer.
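The "crude heuristic" style of stemming mentioned above can be sketched as a simple suffix stripper. This is only a toy illustration: a real stemmer such as Porter's algorithm applies ordered rules with measure conditions, and the suffix list and names here are assumptions.

```java
// Toy sketch of heuristic stemming: strip a few common English suffixes.
// The suffix list is illustrative; a production stemmer is far more careful.
public class StemmerSketch {
    private static final String[] SUFFIXES = {"ational", "ization", "ing", "ed", "es", "s"};

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip when enough of the word remains to be a plausible stem.
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

For example, "jumping" becomes "jump" and "cats" becomes "cat", while very short words such as "is" are left untouched.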
The following tools will also be added to the API if time permits
• Multiword Extractor
The Multiword Extractor will identify multiword sets on the basis of an existing corpus and
store them in a database or text file that can be used for future reference to extract
multiwords from another text/corpus.
The subtasks to be carried out for this tool are
Corpora collection.
Developing algorithm for statistical extraction of multiwords from corpora.
• Word Sense Disambiguator[4]
This tool is used to identify the correct meaning of an ambiguous word.
The subtasks to be carried out for this tool are
Study of WordNet, a lexical dictionary of English words.
Exploration of methods for word sense disambiguation without the explicit creation of
lexicon sets or a database.
Study and exploration of the tools to be used for POS tagging.
• Design and Development of a GUI
A GUI will be developed which consists of a work area from where a user can browse the
entire system and select any file, or input any random text, that he wishes to
classify/process. The selected file/text may then be processed after the user clicks on the
button provided for any of the NLP tools supported by the developed Java API.
• Testing & Evaluation of results achieved
The project will be tested against a variety of test data of varying length. After the
evaluation of the output thus produced, the accuracy of the project will be stated.
1.2 Related Previous Work
The following Table 1.1 shows some of the commonly used tools in the main key areas [3]:

Table 1.1 Table representing key application areas

KEY APPLICATION AREAS                       NLP TOOLS COMMONLY USED
Machine Learning and Data Mining            Weka (implements algorithms such as Naive
                                            Bayes and Support Vector Machines),
                                            Apache Lucene Mahout
Information Extraction                      Mallet, MinorThird, GATE
Text Classification                         NLTK, LingPipe
Named Entity Analysis &                     OpenNLP (uses the maxent machine
Co-reference Analysis                       learning package)
Approximate String Matching                 SecondString, Simmetrics, LingPipe
WordNet Interfaces                          Java WordNet Library (JWNL),
                                            MIT Java Wordnet Interface (JWI)
Question Answering                          OpenEphyra
Speech Recognition & OCR Error Correction   OpenFST
Parallel Language Parsing                   Dan Bikel's Multilingual Parser (English,
                                            Arabic, Chinese & soon Korean)
1.2.1 NLP tools
LingPipe - A suite of Java tools for linguistic processing of text, including entity
extraction, part-of-speech (POS) tagging, clustering, and classification. It is known for
its speed, stability, and scalability. One of its best features is the extensive collection
of well-written tutorials to help you get started. LingPipe [1][6] is released under a
royalty-free commercial license that includes the source code, but it is not technically
open source.
Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and
part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic
natural language parsers, both highly optimized PCFG and lexicalized dependency parsers,
and a lexicalized PCFG parser. It is released under a full GNU GPL license.
OpenFST - A package for manipulating weighted finite-state automata. These are often used
to represent probabilistic models, and are used to model text for speech recognition, OCR
error correction, machine translation, and a variety of other tasks.
NLTK - The Natural Language Toolkit is a tool for teaching and researching classification,
clustering, part-of-speech tagging, parsing, and more. It contains a set of tutorials and
data sets for experimentation. It is written by Steven Bird of the University of Melbourne.
Dan Bikel's Multilingual Statistical Parser - A parallel statistical parsing engine for
English, Arabic, Chinese, and soon Korean.
1.3 Organization of the Report
This report consists of six chapters dealing with the process of designing Java API for NLP
tools. The first two chapters deal with the introduction of the project and the background
research carried out in order to provide a better understanding of the topic. Then in the
subsequent chapters, further details about the methods and a description of each tool along
with the test cases are defined. The approaches used to design each class of the API are
also specified. The various algorithms used are stated along with the flow diagrams and
class descriptions. The last chapter concludes the report by summarizing the work that has
been done till now.
CHAPTER 2
LITERATURE SURVEY
‘Naturally occurring texts’ can be of any language, mode, or genre. The texts can be oral
or written. The only requirement is that they be in a language used by humans to
communicate with one another. Also, the text being analyzed should not be specifically
constructed for the purpose of the analysis; rather, the text should be gathered from
actual usage.
‘Human-like language processing’[1] reveals that NLP is considered a discipline within
Artificial Intelligence (AI). And while the full lineage of NLP does depend on a number of
other disciplines, since NLP strives for human-like performance, it is appropriate to
consider it an AI discipline.
2.1 Natural Language Processing
Natural Language Processing (NLP) is the computerized approach to analyzing text that is
based on both a set of theories and a set of technologies. The definition we offer is: Natural
Language Processing is a theoretically motivated range of computational techniques for
analyzing and representing naturally occurring texts at one or more levels of linguistic
analysis for the purpose of achieving human-like language processing for a range of tasks
or applications.
2.2 Tokenizer
Tokenization, that is, the identification of each “atomic” unit, is the very first
operation to be performed in document processing; nevertheless, it is often overlooked
because of its supposedly basic nature. Despite the apparent simplicity of the issue at
stake, no readily available solution or standard exists for character stream tokenization.
In NLP, tokenization can be defined as the task of splitting a stream of characters into
words. However, it is very often associated with lower- or upper-level processes [1]. Even
if there exists a tendency to gather both tasks under the vague label “pre-processing”,
tokenization nevertheless differs from preliminary “cleaning procedures”. Conversely,
additional preprocessing is often associated with the segmentation into word units:
acronym and abbreviation recognition, hyphenation checking, number standardization, etc.
Some tokenizers even include the delimitation of textual units such as sentences,
paragraphs, notes, and so on.
2.2.1 Need of tokenization
The major question of the tokenization phase is: what are the correct tokens to use? At
first it looks fairly trivial: chop on whitespace and throw away punctuation characters.
This is a starting point, but even for English there are a number of tricky cases. For
example, consider the various uses of apostrophes for possession and contraction in the
sentence: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
For O'Neill, which of the following is the desired tokenization?
neill
oneill
o’neill
o’ neill
o neill ?
And for aren't, is it:
aren’t
arent
are n’t
aren t ?
A simple strategy is to just split on all non-alphanumeric characters, but while o and
neill look okay, aren and t look intuitively bad. These issues of tokenization are
language-specific, and thus require the language of the document to be known. Language
identification [2] based on classifiers that use short character subsequences as features
is highly effective; most languages have distinctive signature patterns.
For most languages and particular domains within them there are unusual specific tokens
that we wish to recognize as terms, such as the programming languages C++ and C#,
aircraft names like B-52, or a T.V. show name such as M*A*S*H. Computer technology
has introduced new types of character sequences that a tokenizer should probably tokenize
as a single token, including email addresses ([email protected]), web URLs
(http://stuff.big.com/new/specials.html), numeric IP addresses (142.32.48.231), package
tracking numbers (1Z9999W99845399981), and more. In English, hyphenation is used for
various purposes, ranging from splitting up vowels in words (co-education), to joining
nouns as names (Hewlett-Packard), to a copyediting device to show word grouping (the
hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example
should be regarded as one token, the last should be separated into words, and that the
middle case is unclear.
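The naive strategy discussed above, splitting on every non-alphanumeric character, can be sketched in a few lines; the example shows exactly the intuitively bad result for contractions. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the naive strategy: split on every non-alphanumeric character.
// Note how "aren't" falls apart into the tokens "aren" and "t".
public class NaiveSplit {
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^A-Za-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Running it on "O'Neill's stories aren't amusing." produces the tokens O, Neill, s, stories, aren, t, amusing.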
2.2.2 Existing approaches with recurring problems[7]
Handling hyphens automatically can thus be complex: it can either be done as a
classification problem, or more commonly by some heuristic rules, such as allowing short
hyphenated prefixes on words, but not longer hyphenated forms.
Conceptually, splitting on white space can also split what should be regarded as a single
token. This occurs most commonly with names (San Francisco, Los Angeles) but also with
borrowed foreign phrases (au fait) and compounds that are sometimes written as a single
word and sometimes space-separated (such as white space vs. whitespace).
Other cases with internal spaces that we might wish to regard as a single token include
phone numbers ((800) 234-2333) and dates (Mar 11, 1983). Splitting tokens on spaces can
cause bad retrieval results, for example, if a search for York University mainly returns
documents containing New York University.
The problems of hyphens and non-separating whitespace can even interact. Advertisements
for air fares frequently contain items like San Francisco-Los Angeles, where simply doing
whitespace splitting would give unfortunate results. One effective strategy in practice,
which is used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis, is to
encourage users to enter hyphens wherever they may be possible; whenever there is a
hyphenated form, the system will generalize the query to cover all three of the one-word,
hyphenated, and two-word forms, so that a query for over-eager will search for over-eager
OR "over eager" OR overeager.
However, this strategy depends on user training, since if you query using any of the other
forms, you get no generalization. Since there are multiple possible segmentations of
character sequences, all such methods make mistakes sometimes, and so you are never
guaranteed a consistent unique tokenization. The other approach is to abandon word-based
indexing and to do all indexing via just short subsequences of characters (character
n-grams), regardless of whether particular sequences cross word boundaries or not.
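The query generalization described above (over-eager expanding to three forms) can be sketched as a small string transformation. The class name and the output query syntax are illustrative assumptions, not how Westlaw or Lexis-Nexis actually format queries.

```java
// Sketch of hyphenated-query generalization: a hyphenated term is expanded
// to cover the hyphenated, two-word (phrase), and one-word forms.
public class HyphenQuery {
    public static String expand(String term) {
        if (!term.contains("-")) {
            return term; // nothing to generalize
        }
        String oneWord = term.replace("-", "");
        String twoWords = term.replace("-", " ");
        return term + " OR \"" + twoWords + "\" OR " + oneWord;
    }
}
```

So expand("over-eager") yields the disjunction over-eager OR "over eager" OR overeager, matching the example in the text.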
2.3 Stopwords
Stop words is the name given to words which are filtered out prior to, or after, processing
of natural language data (text) [5]. Hans Peter Luhn, one of the pioneers in information
retrieval, is credited with coining the phrase and using the concept in his design. Stop
word removal is controlled by human input and is not automated. Sometimes, some extremely
common words which would appear to be of little value in helping select documents matching
a user need are excluded from the vocabulary entirely. These words are called stop words.
2.3.1 Significance of Stop Word list
The general strategy for determining a stop list is to sort the terms by collection
frequency (the total number of times each term appears in the document collection), and
then to take the most frequent terms, often hand-filtered for their semantic content
relative to the domain of the documents being indexed, as a stop list, the members of
which are then discarded during indexing. Using a stop list significantly reduces the
number of postings that a system has to store.
The general trend in IR systems [2] over time has been from standard use of quite large
stop lists (200-300 terms), to very small stop lists (7-12 terms), to no stop list
whatsoever. Web search engines generally do not use stop lists. Some of the design of
modern IR systems has focused precisely on how to exploit the statistics of language so as
to cope with common words in better ways. For most modern IR systems, the additional cost
of including stop words is not that large, either in terms of index size or query
processing time.
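The collection-frequency strategy above can be sketched as follows: count how often each term occurs across all documents, then take the k most frequent terms as a candidate stop list (which in practice would then be hand-filtered). The class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: build a candidate stop list from collection frequency.
public class StopListBuilder {
    public static List<String> candidateStopList(List<String> documents, int k) {
        // Tally the collection frequency of every term.
        Map<String, Long> freq = new HashMap<>();
        for (String doc : documents) {
            for (String term : doc.toLowerCase().split("\\s+")) {
                freq.merge(term, 1L, Long::sum);
            }
        }
        // Take the k most frequent terms as the candidate stop list.
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

On a tiny collection such as "the cat sat on the mat" and "the dog", the single most frequent term is "the".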
2.3.2 Problems encountered in Search Engines
Stop words can cause problems when using a search engine to search for phrases that
include them, particularly in names such as 'The Who', 'The The', or 'Take That'. There is
no definite list of stop words which all natural language processing (NLP) tools
incorporate. Not all NLP tools use a stop list; some tools specifically avoid using one to
support phrase search. The phrase query "President of the United States", which contains
two stop words, is more precise than President AND "United States". The meaning of
"flights to London" is likely to be lost if the word 'to' is stopped out. A search for
Vannevar Bush's article "As We May Think" will be difficult if the first three words are
stopped out and the system searches simply for documents containing the word think. Some
special query types are disproportionately affected; some song titles and well-known
pieces of verse consist entirely of words that are commonly on stop lists (To be or not to
be, Let It Be, I don't want to be, ...).
In more precise terms, some constraints on stop word removal may be as follows:
• All of the words in a query are stop words. If all the query terms are removed
during stop word processing, then the result set is empty. To ensure that search
results are returned, stop word removal is disabled when all of the query terms are
stop words. For example, if the word car is a stop word and you search for car, then
the search results contain documents that match the word car. If you search for car
xylo, the search results contain only documents that match the word xylo.
• The word in a query is preceded by the plus sign (+).
• The word is part of an exact match.
• The word is inside a phrase, for example, "I love my car".
2.4 Word Frequency Counter
The Word Frequency Counter counts the frequency of words from a single file, multiple
files, or the clipboard. Its many options make it a very useful word counting tool for
language analysis and learning [1]. Search engines and directories now use artificial
intelligence to analyze the web as they produce sorted search results. Knowing the
frequency of the words used in a web page gives the user a better idea of how these tools
work when processing natural language.
Word Frequency Counter enables the user to:
Define words. A word is made up of characters from an alphabet, but there are some
characters that you might or might not want to include in a word definition, such as & or -.
Define word separators. Word separators are used to divide language into individual
words (text segmentation). The space character and punctuation (in the English language)
are the most important word separators, but you also need to decide whether you want to
use characters such as & or - as separators.
Count words from the clipboard, directories and subdirectories. The tool can count word
frequencies from a single file, the clipboard, or all files in a directory and its
subdirectories.
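The counting core, with a user-definable separator set as described above, can be sketched as follows. Reading from files, directories, or the clipboard is omitted, and the class and method names are illustrative, not the tool's real API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Sketch of the counting core: split on a user-supplied set of separator
// characters and tally each word's frequency (case-folded, sorted by word).
public class WordFrequencySketch {
    public static Map<String, Integer> count(String text, String separatorChars) {
        // Build a character class from the user's separators, escaping them safely.
        Pattern splitter = Pattern.compile("[" + Pattern.quote(separatorChars) + "]+");
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : splitter.split(text.toLowerCase())) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

For instance, counting "To be, or not to be" with space and comma as separators gives to=2, be=2, or=1, not=1.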
CHAPTER 3
TOKENIZER TOOL
This chapter presents the various phases, namely design, methodology, implementation and
testing, of the Tokenizer tool. The tool aims at splitting text into simple tokens based
upon a set of delimiters provided by the user, or else a default list is used. Although
the issues concerned are language-specific, this tool tries to minimize the impact of the
distinct signature patterns that most languages have by giving the user freedom over the
textual delimitation units [6].
3.1 Problem Identification and Elimination
This tool provides solutions to the following problems, which were not handled by the
existing tokenizer classes:
• Identification of an abbreviation as a single token
Certain applications require an abbreviation to be identified as a single atomic unit.
This tool returns an abbreviation as a single token even while considering the period as a
delimiter; for example, it will return I.U.P.A.C as the single token IUPAC.
• Distinguishing token and non-token delimiters
There are certain characters and punctuation marks which, though acting as delimiters for
the tokenizing process, need to be returned as tokens, e.g. @ and . in
[email protected]. This tool distinguishes between token and non-token delimiters by
enabling the user to explicitly define the two different kinds of delimiters.
• Ability to return empty tokens
When two delimiters are encountered consecutively, an empty token needs to be returned.
For example, in a database entry statement, the presence of a null token specifies that a
corresponding null entry has to be made in the database. This tool facilitates the
returning of empty tokens by returning a space character whenever required.
3.2 Algorithm and Flowchart
The following steps are used for tokenization:
Step 1. Input the text, the non-token delimiters, the token delimiters and the
empty-returned parameter.
Step 2. Set them to their respective class variables. If there is no specification for
delimiters, then take a default list for the non-token delimiters and set the token
delimiters to null.
Step 3. Set the initial position = 0.
Step 4. Repeat steps 5-7 for the entire text.
Step 5. Set working position = position.
Step 6. Use the advance position function to find the position of the next delimiter.
Step 7. If the position has changed, i.e. position != working position, then return the
substring from working position to position as a token. If the token is null and
empty-returned is true, then return a null token.
Advance Position function
Step 1. For i = position+1 to the entire text length, repeat steps 2 and 3.
Step 2. If the character at index i belongs to the token delimiters or the non-token
delimiters, then return i as the new position.
Step 3. Otherwise, skip the position.
The steps are explained with the help of the flowchart in Fig. 3.1.
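The scanning loop described by these steps can be sketched roughly as below. This is a simplified illustration of the algorithm, not the class's actual implementation: the names, the list-returning interface, and the empty-token handling are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the tokenization loop: scan the text, emit the substring
// between delimiters as a token, return token delimiters themselves as
// tokens, and optionally emit empty tokens between consecutive delimiters.
public class TokenizerSketch {
    public static List<String> tokenize(String text, String nonTokenDelims,
                                        String tokenDelims, boolean returnEmptyTokens) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int position = 0; position < text.length(); position++) {
            char c = text.charAt(position);
            if (nonTokenDelims.indexOf(c) >= 0 || tokenDelims.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                } else if (returnEmptyTokens) {
                    tokens.add(""); // consecutive delimiters yield an empty token
                }
                if (tokenDelims.indexOf(c) >= 0) {
                    tokens.add(String.valueOf(c)); // token delimiters are themselves tokens
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For example, tokenizing a hypothetical address user@mail.com with "." as a non-token delimiter and "@" as a token delimiter yields user, @, mail, com, illustrating the token/non-token distinction from Section 3.1.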
Fig.3.1 Flowchart for Tokenizer
3.3 Class Description
The specification of the attributes, constructors and methods of the class Tokenizer,
represented by Fig. 3.2, is as follows:
3.3.1 Attributes
Text(datatype String)
For storing the string to be tokenized.
filein(datatype BufferedReader)
For reading input from file.
fileout(datatype BufferedWriter)
For storing the result into the file
strlength(datatype Integer)
For storing the length of the string.
nonTokenDelims(datatype String)
For storing the set of non token delimiters.
tokenDelims(datatype String)
For storing the set of token delimiters.
position(datatype Integer)
For representing the position at which we should start looking for the next token.
Value = the position of the character immediately following the end of the last token, or
value = -1 if the entire string has been examined.
emptyReturned(datatype Boolean)
Value =true if an empty token should be returned or if the last token returned was an
empty token.
returnEmptyTokens( datatype Boolean)
For determining whether empty tokens should be returned.
delimsChangedPosition(datatype Integer)
For indicating the position at which the delimiter was last changed.
tokenCount(datatype Integer)
Value = -1 if the tokens have not been counted, else >= 0.
3.3.2 Constructors
Parameters used in constructor are as follows:
• text - a string to be parsed.
• nonTokenDelims - the non-token delimiters, i.e. the delimiters that only separate
tokens and are not returned as separate tokens.
• tokenDelims - the token delimiters, i.e. delimiters that both separate tokens, and
are themselves returned as tokens.
• returnEmptyTokens - true if empty tokens may be returned; false otherwise.
• filein – is the file from which text to be tokenized is read.
• fileout – is the file into which tokenized result is written.
Following are the constructors defined in the class:
public StringTokenizer(String text, String nontokenDelims, String
tokenDelims, boolean returnEmptyTokens)
It is the primary constructor, which constructs a string tokenizer for the specified string.
Both token and non-token delimiters are specified, as well as whether or not empty tokens
are returned. Empty tokens are the tokens that lie between consecutive delimiters. The
current position is set at the beginning of the string.
public StringTokenizer(String text, String nontokenDelims, String tokenDelims)
It is equivalent to public StringTokenizer(String text, String nontokenDelims, String
tokenDelims, false).
public StringTokenizer(String text, String nontokenDelims)
It is equivalent to public StringTokenizer(String text, String nontokenDelims, null,
false).
public StringTokenizer(String text)
It is equivalent to StringTokenizer(text, " \t\n\r\f", null).
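The equivalences above map naturally onto Java constructor delegation. The following sketch shows only this chaining, with the documented defaults filled in; field names follow Section 3.3.1, and the real class initialises more state than is shown here.

```java
// Illustrative sketch of the constructor chaining described above: each
// shorter constructor delegates to the primary four-argument constructor.
public class TokenizerCtor {
    final String text;
    final String nonTokenDelims;
    final String tokenDelims;
    final boolean returnEmptyTokens;

    // Primary constructor: everything is given explicitly.
    public TokenizerCtor(String text, String nonTokenDelims,
                         String tokenDelims, boolean returnEmptyTokens) {
        this.text = text;
        this.nonTokenDelims = nonTokenDelims;
        this.tokenDelims = tokenDelims;
        this.returnEmptyTokens = returnEmptyTokens;
    }

    // Empty tokens are not returned by default.
    public TokenizerCtor(String text, String nonTokenDelims, String tokenDelims) {
        this(text, nonTokenDelims, tokenDelims, false);
    }

    // No token delimiters.
    public TokenizerCtor(String text, String nonTokenDelims) {
        this(text, nonTokenDelims, null, false);
    }

    // Whitespace characters as the default non-token delimiters.
    public TokenizerCtor(String text) {
        this(text, " \t\n\r\f", null);
    }
}
```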
public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String
nontokenDelims, String tokenDelims, boolean returnEmptyTokens)
It is the primary constructor which constructs a string tokenizer for the specified input
and output file.
public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String
nontokenDelims, String tokenDelims)
It is equivalent to public StringTokenizer(BufferedReader filein, BufferedWriter fileout,
String nontokenDelims, String tokenDelims, false).
public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String
nontokenDelims)
It is equivalent to public StringTokenizer(BufferedReader filein, BufferedWriter
fileout, String nontokenDelims, null, false).
public StringTokenizer(BufferedReader filein, BufferedWriter fileout)
It is equivalent to StringTokenizer(BufferedReader filein, BufferedWriter fileout, "
\t\n\r\f", null).
3.3.3 Methods
public void setText(String text)
It sets the text to be tokenized in this StringTokenizer.
• Parameter
text – the string to be tokenized.
private void setDelims(String nontokenDelims, String tokenDelims)
It sets the delimiters for the StringTokenizer.
• Parameters
nontokenDelims: list of delimiters that are not returned as tokens.
tokenDelims: list of delimiters that should be returned as tokens.
public boolean hasMoreTokens()
It returns true if there is at least one token in the string after the current position, false
otherwise. If this method returns true, then a subsequent call to nextToken with no
argument will successfully return a token.
public String nextToken()
It returns the next token from the string tokenizer. The current position is set after the
token returned. It throws a NoSuchElementException if there are no more tokens in
this tokenizer's string.
public boolean skipDelimiters()
It returns true if there are more tokens, false otherwise.
It advances the current position so that it is before the next token. This method skips
non-token delimiters but does not skip token delimiters.
public int countTokens()
It returns the number of tokens remaining in the string using the current delimiter set.
private boolean advancePosition()
It returns true if a token has to be returned, false otherwise.
It advances the state of the tokenizer to the next token or delimiter. This method only
modifies the class variables Position, and emptyReturned. If there are no more tokens,
the state of these variables does not change at all.
private int indexOfNextDelimiter(int start)
It returns the index of the first token or non-token delimiter in the text at or after the
given start index, using the current delimiter set.
• Parameter
start: the index in the text at which to begin the search.
public boolean hasMoreElements()
It returns the same value as the hasMoreTokens() method (true if more tokens are
available). It exists so that this class can implement the Enumeration interface.
public String nextElement()
It returns the same value as the nextToken() method (the next token in the string),
except that its declared return value is Object rather than String. It exists so that this
class can implement the Enumeration interface.
public boolean hasNext()
It returns the same value as the hasMoreTokens() method (true if there are more tokens;
false otherwise). It exists so that this class can implement the Iterator interface.
public String next()
It returns the same value as the nextToken() method (the next token in the string),
except that its declared return value is Object rather than String. It exists so that this
class can implement the Iterator interface.
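The four adapter methods above are pure delegation. A self-contained sketch of that pattern follows; the nextToken() here is deliberately simplified (it splits on single spaces) since the real delimiter logic is not reproduced, and all names are illustrative.

```java
import java.util.Enumeration;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of the adapter methods: the Enumeration and Iterator contracts are
// both satisfied by delegating to the tokenizer's own hasMoreTokens() and
// nextToken(). The token logic itself is simplified for illustration.
public class AdapterSketch implements Enumeration<Object>, Iterator<Object> {
    private final String[] tokens;
    private int index = 0;

    public AdapterSketch(String text) {
        this.tokens = text.split(" ");
    }

    public boolean hasMoreTokens() {
        return index < tokens.length;
    }

    public String nextToken() {
        if (!hasMoreTokens()) {
            throw new NoSuchElementException();
        }
        return tokens[index++];
    }

    // Enumeration interface: pure delegation.
    public boolean hasMoreElements() { return hasMoreTokens(); }
    public Object nextElement()      { return nextToken(); }

    // Iterator interface: pure delegation.
    public boolean hasNext() { return hasMoreTokens(); }
    public Object next()     { return nextToken(); }
}
```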
public void setReturnEmptyTokens(boolean returnEmptyTokens)
It sets whether empty tokens should be returned in the tokenizing process from this
point onwards.
• Parameter
returnEmptyTokens - true if empty tokens should be returned.
public String[] toArray()
It returns a string array of remaining tokens.
public void tokenize()
It writes an output file containing tokens of input file.
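What tokenize() amounts to can be sketched as follows. This is an assumption-laden illustration: it reads the input stream line by line and writes one token per line, using a plain whitespace split in place of the class's full delimiter handling.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

// Rough sketch of the tokenize() behaviour described above: each token of
// the input stream is written on its own line of the output stream.
public class TokenizeSketch {
    public static void tokenize(BufferedReader filein, BufferedWriter fileout)
            throws IOException {
        String line;
        while ((line = filein.readLine()) != null) {
            // Whitespace split stands in for the real delimiter logic.
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    fileout.write(token);
                    fileout.newLine();
                }
            }
        }
        fileout.flush();
    }
}
```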
Fig.3.2 Class diagram for the class tokenizer
3.4 Assumptions and Dependencies
The assumptions and dependencies are listed as follows:
1. It has been assumed that the input text is not null; otherwise an exception is thrown.
2. A default set of non-token delimiters is taken if neither token nor non-token delimiters
are specified by the user.
3.5 Constraints
This tokenizer is only applicable to English language text. It is not designed to follow
any grammatical rules and does not consider the semantics of a language.
3.6 Result
This involves giving an idea about the user interfaces and the various test cases used to
demonstrate the working of the project.
Fig.3.3 shows the main page of the user interface that will be used for each module. It
contains five buttons, each representing one of the NLP tools designed. On the click of a
button, a new user interface corresponding to that tool will appear.
Fig.3.3 Snapshot of user interface
When the user clicks on the tokenizer tool button, the user interface for the tokenizer
opens up. It has four text areas, a submit button and a refresh button. Three text areas
are for input (the string to be tokenized, the token delimiters and the non-token
delimiters) and one text area is for the output tokens. Fig. 3.4 represents the tokenizer
tool interface.
Fig 3.4 Snapshot of Tokenizer Tool Interface
Test Case 1
On click of the submit button, the text entered by the user is tokenized on the basis of
the token and non-token delimiters specified by the user, as represented in Fig 3.5.
Fig 3.5 Test Case 1 Snapshot
Test case 2
On click of the browse button, a window pops up to browse the location of the input file
to be tokenized, as represented in Fig 3.6.
Fig 3.6 Test case 2 snapshot -browsing an input file
On click of the submit button, the input file is tokenized into the selected output file, as
represented in Fig 3.7.
Fig 3.7 Test case 2 snapshot –selecting an output file
This tool tokenizes character streams of input data into a set of single atomic units called
tokens. Existing problems in the tokenization process have been identified and eliminated.
Provision has been made for separate lists of token and non-token delimiters so that the
two types of delimiters can be distinguished. Features like returning empty tokens and
identifying an abbreviation as a single token are included. The tool gives the user the
freedom to choose the delimitation units without textual dependence.
CHAPTER 4
STOP WORD REMOVER
Stop words are words used as fillers in order to construct a proper grammatical sentence.
The stop word remover aims at removing all the stop words by maintaining a list of stop
words in a database which can be manipulated by the user.
4.1 Algorithm and Flowchart
Following are the steps for stop word removal:
Step 1. Input text/file and a list of stop words.
Step 2. Apply tokenization on the input.
Step 3. Check whether the user has loaded a list of stop words. If not, load a default list.
Step 4. Each word in the given text is matched against the list of stop words.
Step 5. If the word matches, then it is eliminated from the text.
Step 6. Output is now a text without any stop words.
Fig. 4.1 shows the flowchart for stop word removal.
Fig.4.1 Flowchart for Stop word removal
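The removal steps above can be sketched in Java. This version works on a plain string rather than files, and the small default list here is a hypothetical subset of Appendix A, purely for illustration; the class and method names are not the project's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the stop word removal algorithm: tokenize on whitespace, fall
// back to a default stop word list when none is supplied, and keep only the
// words that do not match the list. Matching is case-sensitive, as stated
// in the constraints.
public class StopWordSketch {
    private static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of", "to"));

    public static String removeStopWords(String text, Set<String> stopWords) {
        // Step 3: load the default list if the user has not loaded one.
        Set<String> list = (stopWords == null) ? DEFAULT_STOP_WORDS : stopWords;
        StringBuilder out = new StringBuilder();
        // Step 2: tokenize the input on whitespace.
        for (String word : text.split("\\s+")) {
            // Steps 4-5: eliminate words that match the stop word list.
            if (!list.contains(word)) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(word);
            }
        }
        return out.toString();
    }
}
```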
4.2 Class Description
The specification of the attributes, constructors and methods of class StopWordRemover
represented by the Fig. 4.2 is as follows:
4.2.1 Attributes
filein(datatype BufferedReader)
For storing reference of user input text file.
filestop(datatype BufferedReader)
For storing reference of user stop word list text file.
fileout(datatype BufferedWriter)
For storing reference of user output text file.
defaultStopWordsList(datatype String)
For storing the list of default stop words.
4.2.2 Constructors
Parameters used in constructor are as follows:
• filein – is the file from which text to be tokenized is read.
• filestop – is the file from which the stop words to be removed are read.
• fileout – is the file into which tokenized result is written.
Following are the constructors used in StopWordRemover class-
public StopWordRemover(BufferedReader filein, BufferedReader filestop,
BufferedWriter fileout)
It defines a StopWordRemover which stores the reference of an input text file in the
class attribute filein, the list of stop words in the class attribute filestop and reference of
output file in the class attribute fileout.
public StopWordRemover(BufferedReader filein, BufferedWriter
fileout)
It defines a StopWordRemover which stores the reference of an input text file in the
class attribute filein and the reference of the output file in the class attribute fileout;
the list of stop words used is the default stop word list.
4.2.3 Methods
public void removeStopWord()
It removes stop words from the given input text file and writes the final result in the
output file specified by the user.
Fig.4.2 Class diagram for class StopWordRemover
4.4 Assumptions and Dependencies
The assumptions and dependencies are listed as follows:
1. It has been assumed that the input text is not null; otherwise an exception is thrown.
2. Only the space character is used as a non-token delimiter to tokenize the input text file.
3. The input stop words list text file should have only one stop word per line. If the user
does not specify a stop words list, a default stop words list is taken.
4.5 Constraints
This StopWordRemover is only applicable to English language text. It is not designed to
follow any grammatical rules and does not consider the semantics of a language.
Matching between the stop word list and the input text is case-sensitive.
4.6 Result
The class for the stop word remover has been designed with the specification of all basic
constructors, methods and related attributes. Testing of the module will be done in the
next semester, along with any subsequent improvement in the class specification. The
default list of the stop words being used is specified in Appendix A.
CHAPTER 5
WORD FREQUENCY COUNTER
Its aim is to count the frequency of words in the given text. The input text is first tokenized
by a tokenizer and then the corresponding frequencies of the tokens are generated using the
frequency counter.
5.1 Algorithm and Flowchart
The following steps are used for counting the frequency of words:
Step 1. Input the text.
Step 2. Tokenize the input text and save the tokens in an ArrayList, InArrayList.
Step 3. Create an empty ArrayList, OutArrayList, to store the output.
Step 4. Repeat steps 5-8 for each element e of InArrayList.
Step 5. Initialize a flag fc = 0.
Step 6. Repeat step 7 for each element n of OutArrayList.
Step 7. If e equals n, then increment the frequency of element n by 1 and set fc = 1.
Step 8. If fc = 0, then add e to OutArrayList with frequency = 1.
Step 9. The final output is OutArrayList, holding the tokens with their corresponding
frequencies.
The steps are explained with the help of the following flowchart in Fig. 5.1.
Fig.5.1 Flowchart for Word Frequency Counter
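The counting loop above can be sketched as follows. A LinkedHashMap stands in for OutArrayList: it preserves first-seen order while associating each token with its frequency. Names are illustrative, not the project's actual code, and a plain whitespace split stands in for the tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the word frequency counting algorithm: for each token, either
// increment an existing entry (steps 5-7) or add a new entry with
// frequency 1 when the token has not been seen before (step 8).
public class FrequencySketch {
    public static Map<String, Integer> countWordFrequency(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Step 2: tokenize the input on whitespace.
        for (String token : text.split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```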
5.2 Class Description
The specification of the attributes, constructors and methods of the class
FrequencyCounter, represented by Fig. 5.2, is as follows:
Fig.5.2 Class diagram for the class FrequencyCounter
5.2.1 Attributes
filein(datatype BufferedReader)
For storing reference of user input text file.
fileout(datatype BufferedWriter)
For storing reference of user output text file.
5.2.2 Constructors
Parameters used in constructor are as follows:
• filein – is the file from which text to be tokenized is read.
• fileout – is the file into which tokenized result is written.
The constructors used are as described below:
public FrequencyCounter(BufferedReader filein, BufferedWriter fileout)
It defines a FrequencyCounter which stores the reference of an input text file in the
class attribute filein and reference of output file in the class attribute fileout.
5.2.3 Methods
public void countWordFrequency()
It counts the frequency of words in the given input text file and writes the final result in
the output file specified by the user.
The class for the word frequency counter has been designed with the specification of all
basic constructors, methods and related attributes. Testing of the module will be done in
the next semester, along with any subsequent improvement in the class specification.
CHAPTER 6
CONCLUSION
So far, the Tokenizer tool has been implemented and tested. Existing problems have been
identified and eliminated. The basic approaches for the stop word remover and the word
frequency counter have been designed and are being worked upon. Simultaneously,
existing algorithms for the next modules are being studied for the problem identification
and elimination phase.
6.1 Agenda for the next semester
In the next semester, the first target will be the testing of the stop word remover and the
word frequency counter. After completion of these modules, the focus will be on the
design and implementation of a stemmer and an n-gram identifier; if time permits, word
sense disambiguation (WSD) and multigram modules will also be developed.
APPENDIX A
LIST OF STOP WORDS
a
about
all
am
an
and
any
are
as
at
be
because
between
both
but
by
can
could
do
for
from
had
has
have
he
her
here
him
his
how
if
in
is
it
me
my
no
not
of
on
or
she
so
than
that
the
this
to
until
up
was
we
what
when
where
which
who
whom
why
with
you
your
References
1. Christopher D. Manning, Hinrich Schütze, “Foundations of Statistical Natural
Language Processing”, p. xxxi, MIT Press (1999), ISBN 978-0-262-13360-9.
2. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, “Introduction to
Information Retrieval”, Stemming and Lemmatization, Cambridge University Press
(2008). http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
3. Dalton, J., “Java Open Source NLP and Text Mining tools”, 16-March-2008.
http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html
4. Jaiswal, Singh, Gupta, Srivastava, “Word Sense Disambiguation”, B.Tech Report,
U.P. Technical University, 2009.
5. “Natural Language Processing”, MediaWiki version 1.16wmf4 (r71783), 27-August-2010.
http://en.wikipedia.org/wiki/Natural_language_processing
6. Padhi, B., “Improve tokenization of information-rich text”, 15-June-2001.
http://www.javaworld.com/javaworld/javatips/jw-javatip112.html
7. “Tokenizer”, MediaWiki version 1.16wmf4 (r71783), 18-August-2010.
http://en.wikipedia.org/wiki/Lexical_analysis