Sentiment Analysis Using Deep Learning Technique
CNN with KMeans
B. Swathi Lakshmi, P. Sini Raj and R. Raj Vikram
Department of Computer Science and Engineering, Amrita School of
Engineering, Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore.
Abstract Sentiment analysis already plays a vital role across social media.
Whether on social networking sites or in video- and audio-based systems,
people are interested in knowing the sentiment of content, that is, whether
a word is positive, negative or neutral. Sentiment analysis in turn helps to
detect emotions. In the proposed work, sentiment analysis is applied to movie
reviews using a novel combination of the deep learning technique CNN and the
unsupervised learning method K-means, which gives a better estimate of the
sentiments than the existing methods. The improvement in accuracy is minimal
on this dataset, but it is expected to become more significant when the
method is applied to a larger corpus of big data.
Key Words: CNN, deep learning, K means, sentiment analysis.
International Journal of Pure and Applied Mathematics
Volume 114 No. 11 2017, 47-57
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue
1. Introduction
Sentiment analysis, also known as opinion mining, is a way to find the
opinions of a user and to categorize them effectively as positive, negative
or neutral. Nowadays sentiment analysis has shown its significance in almost
all fields of media. Natural language processing is closely tied to
sentiment analysis. When a user expresses his views, it is important for the
organization to correctly identify the requirements of the user in order to
retain him as a customer. For that, a deep understanding of the customer's
opinion [1][3] is important. By analyzing customers' product reviews, it is
easier for the company to decide about the future of that product. In the
same way, it is very important to analyze the comments given in social
media [1]. Twitter analytics has become a separate field by itself, and
studies even show the impact of tweets on sensitive fields like market
prediction [9]. Sentiment analysis has diverse applications, ranging from the
field of centrifugal pumps to social media [5].
The centrifugal pump is used in many applications. If an error occurs in a
centrifugal pump, for example a monoblock pump, monitoring is essential.
Neural network algorithms such as ANN (Artificial Neural Network) have been
used for this [6], but the accuracy produced by these algorithms is not
satisfactory. It provided a better outcome in the case of the monoblock pump.
In the algorithmic process, features are extracted and trained on, and their
fault classification is compared. The major advantage is that the operator
may be informed about the status of the pump well in advance. If the status
is negative, the necessary precautions may be taken. In contrast, Support
Vector Machine (SVM) and Proximal Support Vector Machine (PSVM) provide a
better outcome under good and faulty conditions of a monoblock centrifugal
pump. In this machine learning process, a decision tree is used for feature
extraction. The extracted features are fed as inputs to SVM and PSVM, which
are trained and tested, and their fault classification is compared [11].
Raymond Hsu et al. suggested an experimental model which was implemented at
Stanford University [8]. The process involves Raw Data, Parser, Spell
Checker, Synset Features and LIWC features. This helped in knowing the
sentiments of the data [10].
In stock market prediction, it is very difficult to collect the enormous
amount of data in a short span of time, so sentiment analysis plays a vital
part. The prediction was done with two algorithms: the genetic algorithm (GA)
and the support vector machine (SVM). Hybrid systems were proposed in order
to avoid regression problems and to handle the existing problem with
satisfactory accuracy. The algorithm may be applied to the previous day's
records so that the targets can be achieved. The decision tree parameters
were optimized by the GA and SVM for accuracy. Once trading has begun for
the day, trades must be carried out in order to obtain the highest possible
profit [11]. Basically, the next day's prediction is made by keeping track
of the previous records. This helps in achieving the fixed target, and it is
implemented as a part of hybrid systems.
The analysis of tweets [2] to classify them aptly as positive, negative or
neutral is noteworthy. Research on sentiment analysis of blogs has grown
rapidly. As the population grows, the number of users of blogs and
microblogs has also increased in a short span of time. This leads to a lot of
unformatted, bulky and imprecise text. To deal with this, sentiment analysis
is widely used and is considered one of the most effective parts of the deep
learning process. Various methods are involved in sentiment analysis, of
which feature extraction is the most effective. Compared with opinion
mining, however, it has a number of drawbacks [4]. Opinion mining
concentrated only on a one-dimensional feature, unlike sentiment analysis. In
order to avoid these problems, Jeong et al. proposed FEROM (Feature
Extraction and Refinement Method), which extracts the appropriate grammar
and features by scanning the whole blog content. This method checks each
grammar, and features are extracted by merging them with exactly matching
words. This gave rise to a challenge for keyword extraction, which was
proposed by Fan and Chang from the concept of contextual advertising, in
relation to the advertisements on a blog page. Traditional keyword
extraction can only be used for searching or featuring formal documents such
as traditional blogs, newspapers or scientific papers. In addition to
traditional keyword extraction, frequency-based extraction was introduced for
extracting features from microblogs [4]. In addition to frequency, graphical
model extraction was also introduced [12].
The words in sentiment analysis are classified on the basis of semantic
orientation (SO); that is, a word is classified using its weight, polarity
and strength. Semantic orientation is extremely helpful in determining
marketing reviews, compiling reviews, etc. In general, semantic orientation
refers to the strength of words, phrases or texts, in addition to the
sentiment analysis, which is the main goal of our process [16]. Semantic
orientation involves adjectives, phrases, words, texts, adverbs, verbs and
nouns.
First, we start with each tweet. Then each word is checked against the
sentiment dictionary: if an emoticon [12] is found, it is scored as positive,
negative or neutral; else if a contextual word (a Contextual Valence
Shifter [9]) is found, its valence is calculated; otherwise, if a sentiment
word is found, its positive and negative valences are calculated. Finally,
all positive, negative and neutral values are summed for each sentence.
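
The scoring procedure described above can be sketched as follows. This is
only a minimal illustration under stated assumptions: the three small
dictionaries, their weights, and the rule that a valence shifter multiplies
the next sentiment word are all hypothetical choices, not the lexicon or the
exact rules used in the cited works.

```python
# Hypothetical toy lexicons; a real system would load a curated dictionary.
SENTIMENT_WORDS = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
EMOTICONS = {":)": 1.0, ":(": -1.0, ":|": 0.0}
# Assumed behaviour: a shifter multiplies the valence of the next sentiment word.
VALENCE_SHIFTERS = {"not": -1.0, "very": 1.5}


def score_tweet(tweet):
    """Sum positive, negative and neutral values for one tweet."""
    totals = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    shift = 1.0  # pending multiplier from a contextual valence shifter
    for token in tweet.lower().split():
        if token in EMOTICONS:
            value = EMOTICONS[token]
        elif token in VALENCE_SHIFTERS:
            shift *= VALENCE_SHIFTERS[token]
            continue
        elif token in SENTIMENT_WORDS:
            value = SENTIMENT_WORDS[token] * shift
            shift = 1.0  # the shifter is consumed by this sentiment word
        else:
            continue  # token is not in any dictionary
        if value > 0:
            totals["positive"] += value
        elif value < 0:
            totals["negative"] += value
        else:
            totals["neutral"] += 1.0  # count neutral hits
    return totals


result = score_tweet("not good :)")
```

Here "not good" contributes a negative value because the shifter flips the
polarity of "good", while the emoticon contributes a positive value.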
Sentiment analysis is also used in an interesting application: when the user
is talking, it analyzes whether the situation or action has actually occurred
or not. Such terms are called "irrealis" and are applied in non-factual
contexts. These are a set of grammatical moods which predict the
occurrence of an event or not. Here the imperative mood plays the major role
in irrealis blocking [13]. An instance taken here is the validation of a
dictionary, where the granularity of the dictionary is used by the data set,
which provides evidence for the dictionary rankings. The intuitions of
English speakers are also predicted here, and these are valuable in
comparison to the automatically generated ones. Granularity of the scales is
expected in datasets so as to increase efficiency [13].
2. Machine Learning Algorithms used in Sentiment Analysis
Machine learning algorithms play an important role in sentiment analysis.
Specifically, many works in sentiment analysis use classification algorithms
such as Support Vector Machine (SVM), the kernel trick and KNN (K-Nearest
Neighbor) to detect positive, negative or neutral sentiments.
A. Support Vector Machine
Support Vector Machines (SVMs) are supervised learning models that analyze
data for classification and regression analysis. In SVM, the examples are
represented as points in space, and the two separate categories are divided
by a clear gap. SVM also has the advantage that it can perform non-linear
classification using the kernel trick [7], by implicitly mapping the inputs
to high-dimensional feature spaces. SVM is applicable to supervised
(labelled) data sets.
B. Proximal Support Vector Machine
Instead of a standard support vector machine, which classifies points by
assigning them to one of two disjoint half-spaces, PSVM classifies points by
assigning them to the closest of two parallel planes.
C. Kernel Trick
Kernel methods are a class of algorithms designed for pattern analysis. They
are used to find general types of relations in datasets, such as clusters,
rankings, principal components, correlations and classifications [8]. Kernel
functions operate on the data without computing explicit coordinates in the
high-dimensional feature space; they only compute the inner products between
the images of pairs of data points in that space.
This is more efficient than explicit computation of the coordinates. Kernel
methods can operate with Support Vector Machines. Kernel functions have been
introduced for graphs, text, images and vectors. Most kernel algorithms are
based on convex optimization or eigenvalue problems.
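
As a small worked illustration of why the kernel computation avoids explicit
coordinates, the sketch below compares a degree-2 polynomial kernel on 2-D
inputs with the inner product of an explicit 3-D feature map. The choice of
kernel and the function names are assumptions made for illustration only;
the paper does not state which kernel was used.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed directly in the input space."""
    return dot(x, y) ** 2


def explicit_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, y)                       # no feature map needed
mapped_value = dot(explicit_map(x), explicit_map(y))   # same value, more work
```

Both computations give the same number, but the kernel never constructs the
higher-dimensional vectors, which is the saving the kernel trick relies on.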
D. K Nearest Neighbor (KNN)
KNN is one of the simplest and most commonly used classification algorithms.
It is extremely simple and usually gives accurate and competitive results.
Here the whole data set needs to be classified into positive, negative or
neutral. This is done by considering the k nearest neighbors and their
closeness. Closeness can be measured by any distance measure; the Euclidean
distance is mainly used. This classification works better for smaller
datasets.
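
A minimal sketch of KNN classification with the Euclidean distance is given
below. The two-dimensional feature vectors, the labels and the value k = 3
are hypothetical and are chosen only to make the voting step visible.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors with sentiment labels.
train = [
    ((0.9, 0.1), "positive"),
    ((0.8, 0.2), "positive"),
    ((0.7, 0.3), "positive"),
    ((0.1, 0.9), "negative"),
    ((0.2, 0.8), "negative"),
    ((0.5, 0.5), "neutral"),
]

label_a = knn_predict(train, (0.85, 0.15))
label_b = knn_predict(train, (0.15, 0.85))
```

Because every query is compared with every training point, the cost grows
with the size of the training set, which is one reason the method suits
smaller datasets.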
E. Hybrid Model: K Nearest Neighbor (KNN) and SVM
Various works use individual methods for the purpose of classification. Other
works use a hybrid model, in which KNN-SVM is used for better
classification [15]. This method also shows an improvement in sentiment
identification.
3. Deep Learning Technique–Convolution Neural Networks (CNN)
A Convolution Neural Network is a type of feed-forward network which consists
of two or more convolution layers followed by fully connected layers, as in
a multilayer neural network. From the perspective of sentiment analysis, CNN
works by giving each word a weight in the hidden layer. Each word is then
checked for an exact match, and the process is repeated. CNN also works on
the logic of a sliding window. For instance, if an image is given, filters
are chosen and passed over the image as a sliding window. Each window
position gives a corresponding value, which is stored in a matrix; thus a
matrix is calculated for the entire image. In the case of text
classification, every word is given as an input and the sentence is finally
represented in matrix form, as shown in Figure 1. Feature detection is done
by the convolution layers.
Figure 1: Convolution Works
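
The sliding-window idea for text can be sketched as below, assuming each word
has already been mapped to a small vector so that a sentence is a matrix with
one row per word. The word vectors, the filter weights and the max-over-time
pooling step are hypothetical illustrations; the paper does not give its
filter sizes or pooling choice.

```python
def convolve_over_words(sentence_matrix, conv_filter):
    """Slide a filter covering h consecutive word vectors down the sentence.

    sentence_matrix: list of word vectors, one row per word.
    conv_filter: list of h rows, each as wide as a word vector.
    Returns one value per window position (the feature map).
    """
    h = len(conv_filter)
    feature_map = []
    for start in range(len(sentence_matrix) - h + 1):
        window = sentence_matrix[start:start + h]
        total = 0.0
        for word_row, filter_row in zip(window, conv_filter):
            total += sum(a * b for a, b in zip(word_row, filter_row))
        feature_map.append(total)
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest response of the filter (assumed pooling step)."""
    return max(feature_map)


# Hypothetical 2-D word vectors for a four-word sentence.
sentence = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [4.0, 3.0]]
# Hypothetical filter spanning two consecutive words.
bigram_filter = [[1.0, 0.0], [1.0, 1.0]]

feature_map = convolve_over_words(sentence, bigram_filter)
feature = max_over_time(feature_map)
```

With several filters of different heights, the pooled values form the
feature vector that a downstream classifier or clustering step can use.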
4. Sentiment Analysis using Movie Reviews
A. Existing Method
One such method, discussed in the paper by Kim Yong et al., uses the
combination of CNN and KNN to identify the sentiments in movie reviews. The
data file is loaded first, and pre-processing is done so that as much noise
as possible is removed. The existing method uses the deep learning technique
Convolution Neural Network to train on and learn the positive and negative
sentiments from the movie review data sets. A sentence from a movie review is
taken as input and separated into words. It is then passed through the
convolution layers. Multiple layers are set up using the filters. The
features are extracted after the convolution layers [12]. These features are
fed to a KNN classifier to identify whether the reviews are categorized as
positive or negative, as shown in Figure 2. The paper also suggests
converting words into numeric vectors using the word2vec library or any
other word embedding technique.
Figure 2: Sentiment Analysis using CNN – KNN
B. Proposed Method
There are various unsupervised learning algorithms, such as k-means,
hierarchical and agglomerative clustering. As a deviation from the existing
work, the experiment is carried out with a combination of the deep learning
technique CNN and the unsupervised K-means clustering method.
Unsupervised learning works on unlabelled data sets. In K-means clustering,
no labels are known. The number of clusters k has to be decided in advance
according to the application. Once k has been decided, each new data point
is put into a cluster according to the centroid values: the distance of the
point from each centroid is calculated, and the point is assigned to the
cluster whose centroid is nearest. Unsupervised learning is very useful for
datasets where the labels are not reliable, so it shows better results in
the case of novel and unknown data.
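
The clustering step described above can be sketched as a plain K-means loop.
This is a minimal sketch under stated assumptions: the 2-D points stand in
for CNN feature vectors, k = 2 is chosen to match positive and negative
groups, and the initial centroids are fixed by hand rather than chosen by
any particular initialization scheme.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate assignment and centroid update until the centroids settle."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                # Mean of each coordinate over the cluster members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (9.0, 11.0)]
centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])

# A new, unseen point is placed in the cluster with the nearest centroid.
new_cluster = nearest_centroid((2.0, 2.0), centroids)
```

Note that K-means only returns cluster indices; deciding which cluster
corresponds to positive and which to negative reviews needs a separate step.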
Unsupervised learning methods have the advantage of discovering hidden
patterns and groupings. In our proposed model, a movie review dataset is
used which contains a mix of positive and negative reviews. The deep learning
technique CNN is used to train the system. The input to the proposed system
is again sentences, which are converted to a matrix and passed through
multilevel convolutions. The features extracted by the CNN are in turn fed
to a K-means setup, where the reviews are grouped into positive or negative
clusters. Thus the complete data set is grouped accordingly. Whenever a
novel, unknown movie review arrives, it is passed through the trained CNN
and, after feature extraction, the K-means clustering algorithm groups it
into the positive or negative cluster, as in Figure 3. The proposed method
gives a minimal improvement in accuracy on the movie review dataset.
However, this is not a large dataset in comparison with others, as it
consists of only 10,662 instances [15].
Figure 3: Sentiment Analysis using CNN –K Means
5. Experiments and Results
In this paper, a comparative study is made of supervised learning (the
combination of CNN and KNN) and unsupervised learning (the combination of
CNN and K-means). This is implemented in the TensorFlow framework.
TensorFlow is one of the popular frameworks for working with convolution
neural networks and other deep learning techniques. The existing system,
using CNN and KNN, provides better results for smaller datasets. This is
evaluated using the metrics accuracy and precision. As this is supervised
learning, accuracy is higher for smaller datasets. As all the positive and
negative sentiments are trained, learned and labeled by CNN, KNN then
correctly classifies the reviews as positive or negative with a lower error
rate [15].
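
For reference, the two evaluation metrics named above can be computed as in
the following sketch. The label lists are made-up examples used only to show
the arithmetic; they are not results from the experiments in this paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive="pos"):
    """Fraction of reviews predicted positive that are truly positive."""
    truths_of_predicted_pos = [
        t for t, p in zip(y_true, y_pred) if p == positive
    ]
    if not truths_of_predicted_pos:
        return 0.0  # nothing was predicted positive
    hits = sum(t == positive for t in truths_of_predicted_pos)
    return hits / len(truths_of_predicted_pos)


# Hypothetical labels for eight reviews (illustration only).
y_true = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
```

In this toy example six of eight predictions are correct, and three of the
four reviews predicted positive are truly positive.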
The proposed work uses unsupervised learning in combination with CNN, and
the accuracy and precision are seen to improve. TensorFlow usually runs
faster on a GPU (Graphics Processing Unit) setup. If
the system needs to work on a larger corpus, then a normal CPU may not
suffice. It is then suggested to use GPUs so that space and time consumption
can be reduced. Thus our system shows that CNN-KNN works better for a smaller
dataset, and for a larger dataset CNN-K Means is suggested. The comparison of
both algorithms is depicted below. The graph in Figure 4 shows the sentiment
analysis done for various real movies. This is done by taking the review
comments of these movies and analyzing the positive and negative comments.
Figure 4: Sentiment analysis for different movies
The loss and accuracy for various movies were also examined. It is observed
that when we change the filters in the convolution neural network, for a few
movies the accuracy is higher and the loss is lower, which is what is
required. This is achieved with CNN-KNN for smaller datasets. The same is
achieved when we use CNN-K Means for larger datasets. Figure 5 shows that
the accuracy is attained with little loss, or a lower error rate, when we
use our proposed method.
Figure 5: Loss and accuracy trade off
[Figure 4 plot: series Pos and Neg, axis range 0 to 300, for the movies FoodFight, The God Father, House of the Dead and BlackHat.]
[Figure 5 plot: series Loss and Accuracy, axis range 0 to 1, over points 1 to 5.]
6. Conclusion
Sentiment analysis is very essential in our daily routine. It has diverse
applications, in areas of social media such as the analysis of Twitter data
and in mechanical applications such as centrifugal pumps, with the help of
Support Vector Machines. Sentiment analysis also supports marketing
strategy, campaign success, improving product messaging and other areas. In
this paper we have proposed an approach based on the K-means algorithm,
which is also effective for larger sets of data. Sentiment analysis has been
effective in all the cases in which it has been implemented. Deep learning
techniques such as CNN, with their filters, are also used as a part of
sentiment analysis. All these factors make a difference in learning and
improve on the existing work. Algorithms like hierarchical and agglomerative
clustering are also useful for data prediction. These can also be applied to
larger datasets, which improves efficiency and accuracy.
References
[1] Varghese R., Jayasree M., A survey on sentiment analysis and opinion mining, International Journal of Research in Engineering and Technology 2(11) (2013), 312-317.
[2] Agarwal A., Xie B., Vovsha I., Rambow O., Passonneau R., Sentiment analysis of twitter data, Proceedings of the workshop on languages in social media, Association for Computational Linguistics (2011), 30-38.
[3] Vinita Sharma, Literature Survey (2014).
[4] Sahayak V., Shete V., Pathan A., Sentiment Analysis on Twitter Data, International Journal of Innovative Research in Advanced Engineering (IJIRAE) 2(1) (2015), 178-183.
[5] Singh R., Kaur R., Sentiment Analysis on Social Media and Online Review, International Journal of Computer Applications 121(20) (2015).
[6] Medhat W., Hassan A., Korashy H., Sentiment analysis algorithms and applications: A survey, Ain Shams Engineering Journal 5(4) (2014), 1093-1113.
[7] Sources from Wikipedia, Kernel Methods.
[8] Sindhwani V., Melville P., Document-word co-regularization for semi-supervised sentiment analysis, Eighth IEEE International Conference on Data Mining (2008), 1025-1030.
[9] Nair B.B., Mohandas V.P., Sakthivel N.R., A genetic algorithm optimized decision tree-SVM based stock market trend prediction system, International Journal on Computer Science and Engineering 2(9) (2010), 2981-2988.
[10] Nanli Z., Ping Z., Weiguo L., Meng C., Sentiment analysis: A literature review, International Symposium on Management of Technology (ISMOT) (2012), 572-576.
[11] Taboada M., Brooke J., Tofiloski M., Voll K., Stede M., Lexicon-based methods for sentiment analysis, Computational Linguistics 37(2) (2011), 267-307.
[12] Vaitheeswaran G., Arockiam L., A Novel Lexicon Based Approach to Enhance the Accuracy of Sentiment Analysis on Big Data, International Journal of Emerging Research in Management and Technology (IJERMT) 5(2) (2016).
[13] Sivakumar P.B., Mohandas V.P., Sobh T., Evaluating the predictability of financial time series, A case study on SENSEX data, Innovations and Advanced Techniques in Computer and Information Sciences and Engineering (2007), 99-104.
[14] Padmavathi S., Rajalaxmi C., Soman K.P., Texel identification using K-Means clustering method, Advances in Computer Science, Engineering & Applications (2012), 285-294.
[15] Abarna K., Rajamani M., Vasudevan S.K., Big data analytics: A detailed gaze and a technical review, International Journal of Applied Engineering Research 9(9) (2014).
[16] Geethan P., Jithin P., Naveen T., Padminy K.V., Shruthi Krithika J., Vasudevan S.K., Augmented reality X-ray vision with gesture interaction, Indian Journal of Science and Technology 8 (2015), 43-47.
[17] Sankar A., Suresh A., Varun Babu P., Baskar A., Vasudevan S.K., An in-depth analysis of applications of object recognition, Research Journal of Applied Sciences, Engineering and Technology 10(1) (2015), 1-14.
[18] Rajendran A., Kiran M.V.K., Vasudevan S.K., Baskar A., An exhaustive survey on human computer interaction's past, present and future, International Journal of Applied Engineering Research 10(2) (2015), 5091-5105.
[19] Gaurangi Patil, Varsha Galande, Vedant Kekan, Kalpana Dange, Sentiment Analysis Using Support Vector Machine, International Journal of Innovative Research in Computer and Communication Engineering 2(1) (2014).
[20] Yong Yang, Chun Xu, Ge Ren, Sentiment Analysis of Text Using SVM, Electrical, Information Engineering and Mechatronics of the series Lecture Notes in Electrical Engineering 138 (2012), 1133-1139.