Automated Classification of Short Message Service (SMS) Using Naïve Bayes Algorithm
ALOYSIUS OCHOLA
MAKERERE UNIVERSITY
ARTIFICIAL INTELLIGENCE GROUP
Artificial Intelligence Seminar . May 30, 2013
Classification
A supervised learning technique that assigns labels to a set of
unlabeled input objects.
Based on the number of classes present, there are two types of
classification:
Binary classification; classifies the members of a given set of objects into one of
two classes.
Multi-class classification; classifies instances into one of more than two classes.
Unlike binary classification, which is better understood, multi-class
classification is more complex and less researched.
Text Classification/Categorization
Text documents are one of several areas where classification can
be applied.
TC (text classification/categorization) is the application of
classification algorithms to text documents in order to
automatically group them into predefined categories.
How to represent text documents
Preprocessing and feature selection
How to build the classifier; compute a classification function.
Training the classifier and classifying
Short Text Documents
Normal documents like emails, journals, etc. are typically
large and rich in content (natural language).
It is easy to apply traditional classification approaches, which rely on
word frequencies, to them.
This is unlike short text documents such as SMS and Twitter messages,
forum posts, etc., where word occurrences are too sparse.
Dealing with short text therefore requires a little more
than traditional techniques,
especially during preprocessing and feature selection.
Applications of TC
Spam filtering, a process which tries to discern E-mail spam messages from
legitimate emails
Email routing, sending an email sent to a general address to a specific address or
mailbox depending on topic.
Language identification, automatically determining the language of a text
Genre classification, automatically determining the genre of a text.
Movie reviewing, automatically classifying reviews as good, bad, or neutral.
Etc . . .
Data Preprocessing
Data captured in the real world is noisy, inconsistent, and of low quality;
some cleaning and transformation is required.
For quality results from short text, most of the major text preprocessing
steps are skipped and some selected ones are modified.
Tokenization and lowercasing: splitting text streams into tokens and forced
lowercasing.
Word boundary detection, using whitespace and punctuation
Note: the prepared corpus was lowercased.
Minor spell-correction: although there's a growing culture of using
shorthand (informal spellings) in SMS texts, some spell corrections can still be done.
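The tokenization and lowercasing steps above can be sketched in Python (the language used with NLTK later in this talk); the exact regular expression here is an illustrative assumption, not necessarily the one used in the project.

```python
import re

def tokenize(text):
    # Force lowercase, then split the stream into word tokens, using
    # whitespace and punctuation as word boundaries (punctuation is dropped).
    return re.findall(r"[a-z0-9']+", text.lower())

tokenize("Hello, World! This is an SMS.")
# ['hello', 'world', 'this', 'is', 'an', 'sms']
```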
Data Preprocessing (cont)
Regular expression replacer: replacing contraction words written without
apostrophes, using matching regular expressions.
A list of (apostrophe-word, correction) pairs is kept, e.g. willnt : will not, didnt : did not, . . .
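A minimal sketch of such a replacer, using the example pairs from the slide; the pattern list here is illustrative, and a real list would be much longer.

```python
import re

# Hypothetical (pattern, correction) pairs; only the examples from the
# slide are shown here.
REPLACEMENT_PATTERNS = [
    (r"\bwillnt\b", "will not"),
    (r"\bdidnt\b", "did not"),
]

def replace_contractions(text):
    # Apply each regular-expression correction in turn.
    for pattern, repl in REPLACEMENT_PATTERNS:
        text = re.sub(pattern, repl, text)
    return text

replace_contractions("she didnt reply")
# 'she did not reply'
```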
Repeat replacer: people are often not strictly grammatical, and may write
"I looooove it" to emphasize the word "love".
Before replacing any characters in the supplied word:
The module reduces any run of more than two repeating characters to just two, since no such
words exist in the English vocabulary, for example goooooooose to goose.
RE: (\w*)(\w)\2(\w*)
It then looks up whether WordNet (a lexical database for the English language) recognizes the supplied
word.
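A minimal sketch of the repeat replacer. For self-containment, the WordNet lookup is replaced here by a small hand-made word set; the stopping behaviour (keep a word as soon as the dictionary recognizes it, so legitimate double letters survive) is the point being illustrated.

```python
import re

KNOWN_WORDS = {"goose", "love"}  # stand-in for a WordNet lookup

def replace_repeats(word):
    # Stop as soon as the (simulated) dictionary recognizes the word,
    # so legitimate double letters like the "oo" in "goose" survive.
    if word in KNOWN_WORDS:
        return word
    # Collapse one repeated-character pair per pass using (\w*)(\w)\2(\w*).
    collapsed = re.sub(r"(\w*)(\w)\2(\w*)", r"\1\2\3", word)
    if collapsed == word:  # nothing left to collapse
        return word
    return replace_repeats(collapsed)

replace_repeats("goooooooose")  # 'goose'
```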
Data Preprocessing (cont)
Otherwise, the regular expression (RE) (\w*)(\w)\2(\w*) is used
to remove extra repeated characters from the word:
(\w*) matches 0 or more starting characters
(\w) matches a single character, followed by another instance of that character, \2
(\w*) then matches 0 or more ending characters
Stop-word filtering: the process of removing the most
frequent words that exist in a document.
Each word is looked up in a file containing stop words, and
only words not in the file/dictionary are returned.
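The lookup step can be sketched as follows; the stop-word set here is a small illustrative stand-in for the file mentioned above.

```python
# Normally loaded from a stop-word file; this small set is illustrative.
STOP_WORDS = {"the", "is", "a", "an", "to", "of"}

def filter_stopwords(tokens):
    # Return only the tokens that are not in the stop-word dictionary.
    return [t for t in tokens if t not in STOP_WORDS]

filter_stopwords(["send", "the", "report", "to", "me"])
# ['send', 'report', 'me']
```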
A Classifier
A classifier is built on a function f which determines the category of an
input feature vector x, given a fixed set of classes C = {c1, c2, ..., cn} and a
description of features x ∈ X,
where X is the feature space mapped to the output class labels.
In simple terms: f(x) ∈ C.
where f(x) is the classification function whose domain is X and whose range is
C. The class labels C can be ordered or unordered (categorical).
A classifier is expected to learn from a set of N input-output
pairs, or simply a training data set, and predict the class of an unseen input. That is
to say, it maps X to C:  f : X → C
Building the Text Classifier
For this particular case, we will deal with a probabilistic text classifier ft based on
the Naïve Bayes classification (NBC) theorem.
Building the classifier will therefore involve a recursive process of creating a
functional classifier by training it with an example data set (NB learning) and running
the trained classifier on unknown content to determine class membership for that
unknown content (Bayesian classification).
A probabilistic classifier, to predict the class membership of a certain new document
X, calculates the probability of a class C given that document, that is:
P(C | X)
Naïve Bayes Algorithm
It is a simple probabilistic learning and classification method built upon
Bayes' probability theory.
It assumes that the presence (or absence) of a particular feature of a class
is not related to the presence (or absence) of any other feature (the naïve
assumption).
Uses the prior probability P(C) of each category, given no information about
an item.
Categorization produces a posterior probability distribution P(C | X) over
the possible categories, given a description of an item.
Naïve Bayes (NB) Probability Theorem
Derived from the definition of conditional probability
the probability that an event will occur, when another event is known to occur or to have occurred:

P(C | X) = P(C ∩ X) / P(X),  P(X) ≠ 0

From the product rule, given events C and X:

P(C ∩ X) = P(X | C) · P(C) = P(C | X) · P(X)

Bayes Rule is then given as:

P(C | X) = [ P(X | C) · P(C) ] / P(X),  P(X) ≠ 0        Equation (1)

P(C): Prior probability, the initial probability that C holds before seeing any evidence
P(X): Probability that X is observed
P(X|C): Likelihood, probability of observing X given that C holds
P(C|X): Posterior probability, the probability that C holds given X is observed
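A worked numeric instance of Equation (1), with made-up probabilities for a hypothetical "spam" class and the single observation "the message contains the word free":

```python
# Assumed numbers, for illustration only.
p_c = 0.4             # P(C): prior probability that a message is spam
p_x_given_c = 0.5     # P(X|C): likelihood of seeing "free" in spam
p_x_given_not_c = 0.05  # likelihood of seeing "free" in non-spam

# P(X): total probability of observing X, over both cases of C.
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes Rule: posterior P(C|X) = P(X|C) * P(C) / P(X).
p_c_given_x = p_x_given_c * p_c / p_x
# p_c_given_x ≈ 0.87
```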
Deriving the NB Classification Algorithm
Given a set of feature vectors for each possible class C, the task of the
NBC (NB classification) algorithm is to approximate the probability of new
input features X being present in C, that is, the class posterior or simply
the greatest P(C=c | X).
Assume C is a boolean random variable and a vector space X containing n
boolean attributes:
If ci is the ith possible value of C and xk denotes the kth attribute of X,
then applying the NB probability theorem (Equation (1)):

P(C=ci | X=xk) = [ P(X=xk | C=ci) · P(C=ci) ] / [ Σ(j) P(X=xk | C=cj) · P(C=cj) ]        Equation (2)
Deriving the NBC Algorithm (cont)
NB Conditional Independence Assumption: features (term presence) are
independent of each other given the class. A new document of n features
can therefore be classified into one of C classes using Equation (2) as:

P(X | C) = Π(k=1..n) P(xk | C)

The aim of the classifier is to return the maximum posterior probability of
c, thus:

c = argmax(ci) [ P(C=ci) · Π(k) P(xk | C=ci) ] / [ Σ(j) P(C=cj) · Π(k) P(xk | C=cj) ]

Further, because the sample space (denominator) is always constant for all
the classes and does not depend on any class ci of C, the NBC theorem is
given as:

c = argmax(ci) P(C=ci) · Π(k) P(xk | C=ci)        Equation (3)
Training the Naïve Bayes Text Classifier
During the training process, the classification
function ft extracts and selects the most useful
features from the example corpus and labels
them with their appropriate class.
Construct and store a mapping of feature-set:label
pair sets (the training dataset), which ft will learn from.
feature-set is a list of preprocessed and unique term
occurrences from the document samples
label is the known class of that feature-set.
Feature Representation
Features describe and represent texts in a format suitable for further machine
processing.
Final performance depends on how descriptive the features used for text
description are.
Supervised learning classifiers can use any sort of feature
URL, email address, punctuation, capitalization, dictionaries, network features
Word-based feature (Bag of Words): a feature extraction process that transforms the
plain documents, which are merely strings of text, into a feature set containing
the (frequency of) occurrence of each word, usable by a classifier.
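A minimal bag-of-words feature extractor; this maps a token list to the feature-dict format that NLTK's classifiers accept ({feature_name: value}).

```python
def bag_of_words(tokens):
    # Mark each distinct token as present; {feature_name: value} is
    # the featureset format NLTK classifiers work with.
    return {token: True for token in tokens}

bag_of_words(["win", "free", "cash"])
# {'win': True, 'free': True, 'cash': True}
```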
Feature Selection
Text collections have a large number of features, yet some classifiers cannot deal
with a very large number of features. Performing feature selection therefore
ensures reduced training time and improves performance, as it eliminates noise from
the features and avoids overfitting.
Term Weighting: each term in a document vector must be associated with a value
(weight) which measures the importance of the term and denotes how much it
contributes to the categorization task of the document.
Depends on information theory: the frequency count of every word
Chi-squared statistical distribution: a score measure for each word's bigram per label
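The simpler frequency-count weighting can be sketched as follows; `top_terms` is a hypothetical helper that keeps only the n most frequent tokens as selected features.

```python
from collections import Counter

def top_terms(tokens, n):
    # Weight each term by its raw frequency count and keep the n
    # highest-weighted terms as the selected features.
    return [term for term, _ in Counter(tokens).most_common(n)]

top_terms(["free", "win", "free", "cash", "win", "free"], n=2)
# ['free', 'win']
```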
Text Classification
A one-step classifier testing process: taking the built text classifier ft and running it
on unknown content to determine class membership for that content.
A new input (test) SMS stream is passed to the classifier.
It preprocesses the stream and compares it with the set of pre-classified examples (the training set).
Numerical underflow
In Equation (3), many conditional probabilities are multiplied, one for each position of X.
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-
point underflow.
Since log(x·c) = log(x) + log(c), it is better to perform all computations by summing natural logs
of the probabilities rather than multiplying them. Therefore, during text classification, a
normalized NBC equation (given below) is used:

c = argmax(ci) [ log P(C=ci) + Σ(k=1..n) log P(xk | C=ci) ]
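The underflow problem and the log-space fix can be demonstrated directly; the numbers here (1000 features, each with probability 0.01) are made up for illustration.

```python
import math

def log_nb_score(prior, cond_probs):
    # log P(C=ci) + sum of log P(xk|C=ci): summing logs avoids the
    # floating-point underflow that multiplying many small numbers causes.
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

probs = [0.01] * 1000
direct_product = 0.5 * math.prod(probs)   # underflows to 0.0
log_score = log_nb_score(0.5, probs)      # stays representable
```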
Implementation Pseudo Algorithm
for a given unknown input document:
    break the input stream into word tokens
    preprocess the tokens
for a given training set:
    count the number of documents in each class
    for every training document:
        for each class:
            if a preprocessed token appears in the document:
                increment the count for that token
for each class:
    for each preprocessed token:
        divide the token count by the total token count to get conditional probabilities
    return log conditional probabilities for each class
for all the individual class log conditional probabilities:
    compute a comparison of the probability values
    return the class with the greatest probability (maximum likelihood hypothesis)
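The pseudo algorithm above can be sketched as a small pure-Python Naïve Bayes trainer/classifier (pure Python rather than NLTK's built-in NaiveBayesClassifier, to keep it self-contained). Add-one (Laplace) smoothing is an added assumption here, since dividing raw counts as in the slides would give log(0) for unseen tokens; the training data is made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # Count documents per class and token occurrences per class.
    doc_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, token_counts, vocab

def classify(tokens, doc_counts, token_counts, vocab):
    # Return the class with the greatest log posterior (Equation (3)
    # in log form), with add-one smoothing for unseen tokens.
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        n_tokens = sum(token_counts[label].values())
        for t in tokens:
            count = token_counts[label][t]
            score += math.log((count + 1) / (n_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [(["free", "prize", "win"], "spam"),
            (["meeting", "at", "noon"], "ham"),
            (["win", "cash", "now"], "spam"),
            (["see", "you", "at", "lunch"], "ham")]
model = train(training)
classify(["win", "free", "cash"], *model)  # 'spam'
```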
Evaluation and Implementation Approach
Evaluation: test SMS text documents to assess classifier
success at predicting the class:

Accuracy = Correct_Predictions / Total_number_of_tests

Implementation: a complete text classification application
with a user-interactive interface.
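The accuracy measure above is a one-liner; this sketch assumes predictions and gold labels are parallel lists.

```python
def accuracy(predictions, gold_labels):
    # Fraction of test messages whose predicted class matches the known class.
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

accuracy(["spam", "ham", "spam"], ["spam", "ham", "ham"])
# 2 correct out of 3
```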
Natural Language Processing approach
The Natural Language ToolKit (NLTK) is used with the Python programming
language.
NLTK is entirely self-contained and provides convenient functions and
wrappers that can be used as building blocks for common NLP tasks.
BIBLIOGRAPHY
Automated Classification of Short Messaging Services (SMS)
Messages for Optimized Handling
Aloysius Ochola, MSc. Computer Science Project
Makerere University Kampala (2013)
DEMO . . .
Training samples collected from manually categorized SMS
messages compiled by Ureport, an SMS-based opinion forum.
Problem: they receive up to 10,000 SMS messages in a day
and are supposed to reply to all the messages, if relevant
and worthy.
smsTextClassificationApplication