Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier


2013 IEEE International Conference on Big Data

Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier

Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen

Outline

✤ Introduction

✤ Naive Bayes Classification

✤ Implementation of Naive Bayes in Hadoop

✤ Experimental Study

Introduction

A typical method to obtain valuable information is to extract the sentiment or opinion from a message

In this paper, the authors aim to evaluate the scalability of the Naive Bayes classifier (NBC) on large datasets.

Introduction

NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput.

The accuracy of NBC improves and approaches 82%.

Naive Bayes Classification

Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

They are a popular method for text categorization (the problem of judging documents as belonging to one category or another).

Naive Bayes Classification

Prior probability: P(A)

Posterior probability: P(A|B)

Naive Bayes Classification

Bayes' theorem:

P(POS|d1) = P(POS) x P(d1|POS) / P(d1)

P(POS|excellent,terrible) = P(POS) x P(excellent,terrible|POS) / P(excellent,terrible)

Naive Bayes Classification

P(POS|excellent,terrible) = P(POS) x P(excellent,terrible|POS) / P(excellent,terrible)

With the (naive) independence assumption: P(excellent,terrible|POS) = P(excellent|POS) x P(terrible|POS)

P(POS|excellent,terrible) = P(POS) x P(excellent|POS) x P(terrible|POS) / P(excellent,terrible)

Naive Bayes Classification

            excellent   terrible
d1   POS        5           1
d2   NEG        2           6

d3 : (excellent,8),(terrible,2)

P(POS|excellent,terrible) = P(POS) x P(excellent|POS) x P(terrible|POS) / P(excellent,terrible)

From the counts: P(POS) = 1/2, P(excellent|POS) = 5/6, P(terrible|POS) = 1/6; P(NEG) = 1/2, P(excellent|NEG) = 2/8, P(terrible|NEG) = 6/8.

Dropping the common denominator P(excellent,terrible):

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2

P(NEG|excellent,terrible) ∝ (1/2) x (2/8)^8 x (6/8)^2

Naive Bayes Classification

d3 : (excellent,8),(terrible,2)

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2 ≈ 0.00323011165

P(NEG|excellent,terrible) ∝ (1/2) x (2/8)^8 x (6/8)^2 ≈ 0.00000429153

0.00323011165 > 0.00000429153, so d3 is classified as POS.
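As a quick check of the arithmetic, here is a minimal Python sketch of the same comparison (the common denominator P(excellent,terrible) is dropped since it does not affect which class wins):

```python
# Training counts from the table: d1 (POS) and d2 (NEG).
pos_counts = {"excellent": 5, "terrible": 1}   # 6 words in the POS class
neg_counts = {"excellent": 2, "terrible": 6}   # 8 words in the NEG class
p_pos = p_neg = 1 / 2                          # one POS and one NEG training document

# Test document d3: "excellent" appears 8 times, "terrible" twice.
d3 = {"excellent": 8, "terrible": 2}

def score(prior, counts, doc):
    total = sum(counts.values())
    s = prior
    for word, n in doc.items():
        s *= (counts[word] / total) ** n       # P(word|class) ** count
    return s

pos_score = score(p_pos, pos_counts, d3)       # (1/2)(5/6)^8(1/6)^2 ~ 0.00323011165
neg_score = score(p_neg, neg_counts, d3)       # (1/2)(2/8)^8(6/8)^2 ~ 0.00000429153
print(pos_score, neg_score, "POS" if pos_score > neg_score else "NEG")
```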

Naive Bayes Classification

P(POS|excellent,terrible) ∝ (1/2) x (5/6)^8 x (1/6)^2

Each factor comes from the training counts: the prior 1/2 and the word likelihoods 5/6 and 1/6.

Naive Bayes Classification

N is the total number of documents, Nc is the number of documents in class c, and Nwi is the frequency of a word wi in class c.
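In code, a minimal sketch of how these quantities are estimated on the toy table from the earlier slides (plain maximum-likelihood counts, no smoothing; variable names are illustrative, not from the paper):

```python
from collections import Counter

# Toy training set matching the earlier table: (class, tokens) pairs.
docs = [("POS", ["excellent"] * 5 + ["terrible"] * 1),
        ("NEG", ["excellent"] * 2 + ["terrible"] * 6)]

N = len(docs)                                  # total number of documents
N_c = Counter(c for c, _ in docs)              # number of documents per class c
word_counts = {c: Counter() for c in N_c}      # word frequencies per class (Nwi)
for c, tokens in docs:
    word_counts[c].update(tokens)

prior = {c: N_c[c] / N for c in N_c}           # P(c) = Nc / N
likelihood = {c: {w: n / sum(wc.values()) for w, n in wc.items()}
              for c, wc in word_counts.items()}  # P(wi|c) = Nwi / total words in c

print(prior["POS"], likelihood["POS"]["excellent"])   # 0.5  0.8333... (= 5/6)
```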

Implementation of Naive Bayes in Hadoop

Pre-processing the raw dataset

Implementation of Naive Bayes in Hadoop

1000 positive and 1000 negative reviews
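The slides do not show the pre-processing itself; a plausible minimal version (lowercasing and simple tokenization, an assumption rather than the authors' exact pipeline) could look like this:

```python
import re

def preprocess(raw_review: str) -> list[str]:
    # Lowercase the review and keep simple word tokens; this cleaning step
    # is assumed for illustration, not taken from the paper.
    return re.findall(r"[a-z']+", raw_review.lower())

print(preprocess("An EXCELLENT film, not terrible at all!"))
# ['an', 'excellent', 'film', 'not', 'terrible', 'at', 'all']
```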

Implementation of Naive Bayes in Hadoop

(word,posSum,negSum): the word's frequency across all positive and all negative documents

e.g. (excellent,1000,10)
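A local, MapReduce-style sketch of this counting stage, with plain Python standing in for the authors' Hadoop job (function and variable names are illustrative):

```python
from collections import defaultdict

def count_words(labeled_docs):
    """labeled_docs: iterable of (label, tokens), label is 'pos' or 'neg'.
    Yields (word, posSum, negSum): the word's frequency across all
    positive and all negative documents."""
    pos_sum, neg_sum = defaultdict(int), defaultdict(int)
    for label, tokens in labeled_docs:
        target = pos_sum if label == "pos" else neg_sum
        for w in tokens:
            target[w] += 1
    for w in sorted(set(pos_sum) | set(neg_sum)):
        yield (w, pos_sum[w], neg_sum[w])

# A run over the full review set might yield tuples such as ("excellent", 1000, 10).
print(list(count_words([("pos", ["excellent", "excellent"]), ("neg", ["terrible"])])))
# [('excellent', 2, 0), ('terrible', 0, 1)]
```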

implementation of Naive Bayes in hadoop

(excellent,1000,10) (excellent,20,5)

(word,posSum,negSum) (word,count,docID)

(docID,count,word,posSum,negSum)

(5,20,excellent,1000,10)
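The join stage can be sketched the same way, combining the class-level table with the per-document records on the word key (again a local stand-in for the Hadoop job):

```python
def join_on_word(class_table, doc_records):
    """class_table: (word, posSum, negSum) tuples from the counting stage.
    doc_records: (word, count, docID) tuples, one per word per document.
    Yields (docID, count, word, posSum, negSum)."""
    sums = {word: (pos, neg) for word, pos, neg in class_table}
    for word, count, doc_id in doc_records:
        if word in sums:
            pos, neg = sums[word]
            yield (doc_id, count, word, pos, neg)

print(list(join_on_word([("excellent", 1000, 10)], [("excellent", 20, 5)])))
# [(5, 20, 'excellent', 1000, 10)]
```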

Implementation of Naive Bayes in Hadoop

For each document, the joined tuples (docID,count,word,posSum,negSum), e.g. (5,10,excellent,20,5) and (5,2,terrible,5,20), are reduced to a prediction (docID,predict,correct), e.g. (5,pos,true) or (6,neg,false).

Positive score: 10 x log(20) + 2 x log(5)

Negative score: 10 x log(5) + 2 x log(20)
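A local sketch of this final reduce step, grouping the joined tuples by docID and comparing the two scores exactly as written on the slide (no priors or per-class normalization are added here):

```python
import math
from collections import defaultdict

def classify(joined, true_labels):
    """joined: (docID, count, word, posSum, negSum) tuples from the join stage.
    true_labels: dict of docID -> 'pos' or 'neg'.
    Yields (docID, predict, correct)."""
    pos_score, neg_score = defaultdict(float), defaultdict(float)
    for doc_id, count, word, pos_sum, neg_sum in joined:
        pos_score[doc_id] += count * math.log(pos_sum)   # e.g. 10 x log(20) + 2 x log(5)
        neg_score[doc_id] += count * math.log(neg_sum)   # e.g. 10 x log(5) + 2 x log(20)
    for doc_id in pos_score:
        predict = "pos" if pos_score[doc_id] > neg_score[doc_id] else "neg"
        yield (doc_id, predict, predict == true_labels[doc_id])

joined = [(5, 10, "excellent", 20, 5), (5, 2, "terrible", 5, 20)]
print(list(classify(joined, {5: "pos"})))   # [(5, 'pos', True)]
```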

Experimental Study

The Hadoop cluster has 7 nodes: one name node and six data nodes. Each VM is allocated two virtual CPUs and 4 GB of memory.

The host is a Dell server with 12 Intel Xeon E5-2630 2.3 GHz cores and 32 GB of memory, using Xen Cloud Platform (XCP) 1.6 as the hypervisor.

Experimental Study

[Figure: training data]