Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier


2013 IEEE International Conference on Big Data

Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier

Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen

Outline

✤ Introduction

✤ Naive Bayes Classification

✤ Implementation of Naive Bayes in Hadoop

✤ Experimental Study

Introduction

A typical method to obtain valuable information is to extract the sentiment or opinion from a message

In this paper, the authors aim to evaluate the scalability of the Naive Bayes classifier (NBC) on large datasets.

Introduction

NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput.

The accuracy of NBC improves and approaches 82%.

Naive Bayes Classification

Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

They are a popular method for text categorization (the problem of judging documents as belonging to one category or another).

Naive Bayes Classification

Prior probability: P(A)

Posterior probability: P(A|B)

Naive Bayes Classification

Bayes' theorem:

P(POS|d1) = P(POS) x P(d1|POS) / P(d1)

P(POS|excellent,terrible) = P(POS) x P(excellent,terrible|POS) / P(excellent,terrible)

Naive Bayes Classification

P(POS|excellent,terrible) = P(POS) x P(excellent,terrible|POS) / P(excellent,terrible)

With the (naive) independence assumption: P(excellent,terrible|POS) = P(excellent|POS) x P(terrible|POS)

P(POS|excellent,terrible) = P(POS) x P(excellent|POS) x P(terrible|POS) / P(excellent,terrible)

Naive Bayes Classification

            excellent   terrible
d1   POS        5           1
d2   NEG        2           6

d3 : (excellent,8),(terrible,2)

P(POS|excellent,terrible) = P(POS) x P(excellent|POS) x P(terrible|POS) / P(excellent,terrible)

From the counts: P(POS) = 1/2, P(excellent|POS) = 5/6, P(terrible|POS) = 1/6; P(NEG) = 1/2, P(excellent|NEG) = 2/8, P(terrible|NEG) = 6/8.

Dropping the common denominator P(excellent,terrible):

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2

P(NEG|excellent,terrible) ∝ (1/2) x (2/8)^8 x (6/8)^2

Naive Bayes Classification

d3 : (excellent,8),(terrible,2)

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2 ≈ 0.00323011165

P(NEG|excellent,terrible) ∝ (1/2) x (2/8)^8 x (6/8)^2 ≈ 0.00000429153

0.00323011165 > 0.00000429153, so d3 is classified as POS.
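As a quick check of the arithmetic, here is a minimal Python sketch of the same comparison (the common denominator P(excellent,terrible) is dropped since it does not affect which class wins):

```python
# Training counts from the table: d1 (POS) and d2 (NEG).
pos_counts = {"excellent": 5, "terrible": 1}   # 6 words in the POS class
neg_counts = {"excellent": 2, "terrible": 6}   # 8 words in the NEG class
p_pos = p_neg = 1 / 2                          # one POS and one NEG training document

# Test document d3: "excellent" appears 8 times, "terrible" twice.
d3 = {"excellent": 8, "terrible": 2}

def score(prior, counts, doc):
    total = sum(counts.values())
    s = prior
    for word, n in doc.items():
        s *= (counts[word] / total) ** n       # P(word|class) ** count
    return s

pos_score = score(p_pos, pos_counts, d3)       # (1/2)(5/6)^8(1/6)^2 ~ 0.00323011165
neg_score = score(p_neg, neg_counts, d3)       # (1/2)(2/8)^8(6/8)^2 ~ 0.00000429153
print(pos_score, neg_score, "POS" if pos_score > neg_score else "NEG")
```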

Naive Bayes Classification

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2

Each factor comes from the training counts: the prior 1/2 and the word likelihoods 5/6 and 1/6.

Naive Bayes Classification

N is the total number of documents, Nc is the number of documents in class c, and Nwi is the frequency of a word wi in class c.
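In code, a minimal sketch of how these quantities are estimated on the toy table from the earlier slides (plain maximum-likelihood counts, no smoothing; variable names are illustrative, not from the paper):

```python
from collections import Counter

# Toy training set matching the earlier table: (class, tokens) pairs.
docs = [("POS", ["excellent"] * 5 + ["terrible"] * 1),
        ("NEG", ["excellent"] * 2 + ["terrible"] * 6)]

N = len(docs)                                  # total number of documents
N_c = Counter(c for c, _ in docs)              # number of documents per class c
word_counts = {c: Counter() for c in N_c}      # word frequencies per class (Nwi)
for c, tokens in docs:
    word_counts[c].update(tokens)

prior = {c: N_c[c] / N for c in N_c}           # P(c) = Nc / N
likelihood = {c: {w: n / sum(wc.values()) for w, n in wc.items()}
              for c, wc in word_counts.items()}  # P(wi|c) = Nwi / total words in c

print(prior["POS"], likelihood["POS"]["excellent"])   # 0.5  0.8333... (= 5/6)
```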

Implementation of Naive Bayes in Hadoop

Pre-processing the raw dataset

Implementation of Naive Bayes in Hadoop

1000 positive and 1000 negative reviews
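The slides do not show the pre-processing itself; a plausible minimal version (lowercasing and simple tokenization, an assumption rather than the authors' exact pipeline) could look like this:

```python
import re

def preprocess(raw_review: str) -> list[str]:
    # Lowercase the review and keep simple word tokens; this cleaning step
    # is assumed for illustration, not taken from the paper.
    return re.findall(r"[a-z']+", raw_review.lower())

print(preprocess("An EXCELLENT film, not terrible at all!"))
# ['an', 'excellent', 'film', 'not', 'terrible', 'at', 'all']
```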

Implementation of Naive Bayes in Hadoop

(word,posSum,negSum): the word's frequency across all positive and all negative documents

e.g. (excellent,1000,10)
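A local, MapReduce-style sketch of this counting stage, with plain Python standing in for the authors' Hadoop job (function and variable names are illustrative):

```python
from collections import defaultdict

def count_words(labeled_docs):
    """labeled_docs: iterable of (label, tokens), label is 'pos' or 'neg'.
    Yields (word, posSum, negSum): the word's frequency across all
    positive and all negative documents."""
    pos_sum, neg_sum = defaultdict(int), defaultdict(int)
    for label, tokens in labeled_docs:
        target = pos_sum if label == "pos" else neg_sum
        for w in tokens:
            target[w] += 1
    for w in sorted(set(pos_sum) | set(neg_sum)):
        yield (w, pos_sum[w], neg_sum[w])

# A run over the full review set might yield tuples such as ("excellent", 1000, 10).
print(list(count_words([("pos", ["excellent", "excellent"]), ("neg", ["terrible"])])))
# [('excellent', 2, 0), ('terrible', 0, 1)]
```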

implementation of Naive Bayes in hadoop

(excellent,1000,10) (excellent,20,5)

(word,posSum,negSum) (word,count,docID)

(docID,count,word,posSum,negSum)

(5,20,excellent,1000,10)
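The join stage can be sketched the same way, combining the class-level table with the per-document records on the word key (again a local stand-in for the Hadoop job):

```python
def join_on_word(class_table, doc_records):
    """class_table: (word, posSum, negSum) tuples from the counting stage.
    doc_records: (word, count, docID) tuples, one per word per document.
    Yields (docID, count, word, posSum, negSum)."""
    sums = {word: (pos, neg) for word, pos, neg in class_table}
    for word, count, doc_id in doc_records:
        if word in sums:
            pos, neg = sums[word]
            yield (doc_id, count, word, pos, neg)

print(list(join_on_word([("excellent", 1000, 10)], [("excellent", 20, 5)])))
# [(5, 20, 'excellent', 1000, 10)]
```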

Implementation of Naive Bayes in Hadoop

For each document, the joined tuples (docID,count,word,posSum,negSum), e.g. (5,10,excellent,20,5) and (5,2,terrible,5,20), are reduced to a prediction (docID,predict,correct), e.g. (5,pos,true) or (6,neg,false).

Positive score: 10 x log(20) + 2 x log(5)

Negative score: 10 x log(5) + 2 x log(20)
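A local sketch of this final reduce step, grouping the joined tuples by docID and comparing the two scores exactly as written on the slide (no priors or per-class normalization are added here):

```python
import math
from collections import defaultdict

def classify(joined, true_labels):
    """joined: (docID, count, word, posSum, negSum) tuples from the join stage.
    true_labels: dict of docID -> 'pos' or 'neg'.
    Yields (docID, predict, correct)."""
    pos_score, neg_score = defaultdict(float), defaultdict(float)
    for doc_id, count, word, pos_sum, neg_sum in joined:
        pos_score[doc_id] += count * math.log(pos_sum)   # e.g. 10 x log(20) + 2 x log(5)
        neg_score[doc_id] += count * math.log(neg_sum)   # e.g. 10 x log(5) + 2 x log(20)
    for doc_id in pos_score:
        predict = "pos" if pos_score[doc_id] > neg_score[doc_id] else "neg"
        yield (doc_id, predict, predict == true_labels[doc_id])

joined = [(5, 10, "excellent", 20, 5), (5, 2, "terrible", 5, 20)]
print(list(classify(joined, {5: "pos"})))   # [(5, 'pos', True)]
```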

Experimental Study

The Hadoop cluster has 7 nodes: one name node and six data nodes. Each VM is allocated two virtual CPUs and 4 GB of memory.

The host is a Dell server with 12 Intel Xeon E5-2630 2.3 GHz cores and 32 GB of memory, using Xen Cloud Platform (XCP) 1.6 as the hypervisor.

Experimental Study

[Figure: training data]