A Summer Internship Project Report On
“Sparse Aspect Rating Model for Sentiment Summarization”
Carried out at the Institute for Development and Research in Banking Technology,
Hyderabad
Established by ‘Reserve Bank of India’
Submitted by Agni Besh Chauhan Roll No. – 1301CS04
B.Tech (Computer Science & Engineering) Indian Institute of Technology, Patna
Under the Guidance of Dr. S. Nagesh Bhattu
Asst. Professor Centre of Excellence in Analytics
IDRBT, Hyderabad
DECLARATION
I hereby declare that this dissertation entitled “Sparse Aspect Rating
Model for Sentiment Summarization”, carried out under the guidance and
supervision of Dr. S. Nagesh Bhattu and submitted to the Institute for
Development & Research in Banking Technology, Hyderabad, is a bona fide
record of work that is free from plagiarism. I also declare that it has
not been submitted previously, in part or in full, to this or any other
university or institution for the award of any degree or diploma.
Agni Besh Chauhan
Dept. of Computer science & Engineering Indian Institute of Technology, Patna
CERTIFICATE
This is to certify that the summer internship project report entitled
“Sparse Aspect Rating Model for Sentiment Summarization” submitted
to the Institute for Development & Research in Banking Technology
[IDRBT], Hyderabad is a bona fide record of work done by Agni Besh
Chauhan, Roll No. 1301CS04, B.Tech (Computer Science & Engineering),
2013–18, Indian Institute of Technology, Patna, from 12th May, 2016
to 20th July, 2016 under my supervision.
(Project Guide) Dr. S. Nagesh Bhattu
Asst. Professor Centre of Excellence in Analytics,
IDRBT, Hyderabad
CONTENTS
1. Abstract
2. Introduction
3. Related Work
4. Problem Definition
5. Model Description
6. Bibliography
Abstract
We investigate the aspect mining problem, which aims to deliver
opinion-based summarization of text reviews in the financial domain.
Our goal is to derive the hidden aspects discussed in the text data.
Further, we retrieve the intensity with which the user emphasizes a
given aspect in the review and assign a score to each aspect. Existing
works on aspect rating prediction rely on supervised models and do not
consider the sparsity of aspects. Our model handles aspect sparsity in
an unsupervised manner, which solves the problem efficiently and
produces detailed records of aspect ratings and sentiment
summarization.
Introduction
With the increase in the number of internet services and users, we
have developed a vast repository of various kinds of knowledge. People
contribute their opinions and reviews to such repositories on various
user-centric platforms, and with the growing number of such user
reviews, it is increasingly difficult to sift through them and find the
vital information. A lot of research effort has been made to tackle
this problem via information extraction, user opinion summarization and
sentiment analysis on reviews, but it is still unable to produce
reliable predictions of a user's opinion at a close, detailed,
aspect-level analysis.
We take an instance from a typical bank review, as banks are eminent
entities in the financial domain. A user tamsat17 writes: "A good
bank these days, providing many branches and atms all over the
world. This bank service is very very good, always customer service is
present to help you. There are several assistant managers present in
all branches to help you at your doorstep. This bank providing
investments and savings according to your need. I have an account
with this bank. My account is priority ac. This account can be
continued at zero balance after 2 month period of opening, can
withdraw money from any banks atm whenever you want(no charge is
there), get several options like free demat account, shopping coupons,
5 dd without any bank charge in every month, conference rooms with
previous booking for your business needs etc. Likes- this bank service
is great, whenever I need to deposit they send assistance in my home
for taking it, this is like banking from home, always at your help for
what reason you may go. They suggest good investments schemes,
well experienced financial managers are recruited. Opening account is
much easier in this bank. Same day account opening. Dislikes- few
managers suggest wrong things for their promotions, no place to
complain against higher officials of axis bank, there should be zero
balance account opening with small investments, closing account in
this bank is like waiting for months. Overall I suggest axis for its
service and support. Moreover they offer several types of savings ac,
demat, other investment plans. For opening account don't have to visit
branch call helpline they will send assistance in your home. A good
bank with modern technology and no harassment." and gives an
overall rating of 4 stars. Here, the overall rating given by the user
cannot provide a detailed analysis of all the aspects of the banking
service. In the review he mentions some positive aspects, e.g.,
"service", "deposit", "investments schemes", while he also mentions
some negative aspects, e.g., "promotions", "zero balance account",
"closing account", without an explicit rating on each aspect. The
user's opinion and sentiment about these aspects cannot be identified
from the overall rating alone. Two users giving the same rating to a
bank may differ on individual aspects: one user may like an aspect that
the other dislikes, and vice versa, yet both give the same overall
rating to the bank. To address this problem we developed a
sparsity-based aspect rating model that produces a user's interest
toward a particular aspect in a given domain along with their opinion
and sentiment regarding that aspect.
For evaluation, our model takes as input a dataset containing a
collection of review texts along with their overall ratings and
reviewer identities. These reviews belong to a particular domain, in
our case the financial domain. Our aim is to achieve a detailed
understanding of each review by discovering the aspect set and
predicting a rating for each aspect. Previous models such as the
Latent Aspect Rating Analysis Model (LARAM) address this aspect rating
problem. They rely on the topic model Latent Dirichlet Allocation
(LDA) for modeling word generation in reviews and determine the aspect
rating with a rating regression component. A persistent limitation of
probabilistic topic models such as LDA is that they are not efficient
when dealing with aspect sparsity in reviews. Aspect sparsity refers
to the common observation in review data that a user talks about only
a few of the many aspects in a domain. For example, consider again the
banking review discussed earlier, where the user talks about aspects
such as "service", "deposit", "investments schemes", "promotions",
"zero balance account" and "closing account". There are various
aspects he does not mention, such as "ATM services", "Phone banking"
and "Internet banking". It is quite common in real-life situations
that review data suffers from this sparsity issue, and in order to
gain proper insight into an entity we need to consider all the
aspects.
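To make the sparsity observation concrete, here is a toy illustration of our own (the aspect names and proportion values are invented, not taken from the report): a dense LDA-style model assigns some probability mass to every aspect, while a sparsity-aware model zeroes out the aspects a review never mentions.

```python
# Toy aspect proportions for the bank review above (values are made up).
ASPECTS = ["service", "deposit", "investment", "promotions",
           "zero balance", "closing", "ATM", "phone banking",
           "internet banking"]

lda_like = [0.20, 0.15, 0.14, 0.12, 0.11, 0.10, 0.07, 0.06, 0.05]  # dense
sparse   = [0.28, 0.20, 0.18, 0.14, 0.11, 0.09, 0.0, 0.0, 0.0]     # sparse

def support(props, eps=1e-8):
    """Aspects that actually receive probability mass."""
    return [a for a, p in zip(ASPECTS, props) if p > eps]

print(len(support(lda_like)))  # every aspect gets some mass
print(len(support(sparse)))    # only the aspects the review discusses
```

The sparse vector reflects the review faithfully: the three unmentioned aspects receive exactly zero mass instead of a small, spurious probability.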
We evaluate our Sparse Aspect Rating Model on a bank review dataset
crawled from MouthShut (http://www.mouthshut.com). Experiments show
that our model can control the sparsity of aspect proportions and
produces aspect ratings by considering item and user information.
Another notable result, in addition to aspect rating prediction, is
that our model detects the key terms for each aspect: the learned
dictionary contains the terms associated with each aspect together
with their association strengths.
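Reading key terms out of a learned dictionary can be sketched as follows; the vocabulary, the dictionary values and the aspect labels here are toy inventions of ours, not the report's learned parameters.

```python
# Toy vocabulary and association strengths (all values invented).
VOCAB = ["atm", "card", "loan", "interest", "app", "login", "branch", "queue"]

# beta[k][n] = association strength of vocabulary term n with aspect k
beta = [
    [0.9, 0.7, 0.0, 0.0, 0.0, 0.1, 0.2, 0.0],   # an "ATM services"-like aspect
    [0.0, 0.1, 0.8, 0.9, 0.0, 0.0, 0.0, 0.0],   # a "loans"-like aspect
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.0, 0.1],   # an "internet banking"-like aspect
]

def top_terms(beta_k, m=2):
    """Return the m terms with highest association strength for one aspect."""
    ranked = sorted(range(len(beta_k)), key=lambda n: beta_k[n], reverse=True)
    return [VOCAB[n] for n in ranked[:m]]

for k, beta_k in enumerate(beta):
    print(k, top_terms(beta_k))
```

Each aspect is summarized by its strongest dictionary entries, which is what makes the learned dictionary directly interpretable.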
Related Work
Aspect rating prediction has received vigorous interest in recent
times. The wide coverage of topics and abundance of opinions make it
an important area of research for discovering public opinion on all
sorts of topics. Significant effort has been devoted to sentiment
analysis for customer reviews. The Latent Aspect Rating Analysis Model
(LARAM) jointly identifies latent aspects, aspect ratings and aspect
weights in a review, using a dependency parser to learn product
aspects and aspect-specific opinions by jointly considering aspect
frequency and the reviewer's opinion about each aspect. However, LARAM
does not consider the reviewer's identity or the user's tendencies in
writing reviews, and it learns parameters on a per-review basis; in
contrast, our model considers reviewer identity and user tendencies,
learning hidden parameters by iterating over each review, item and
user. Moreover, LARAM is based on a probabilistic topic model and
fails to handle the aspect sparsity issue.
Several follow-up works have tried to address the limitations of
LARAM, such as the Hidden Topic Sentiment Model (HTSM) and the FACTS
(FACeT and Sentiment extraction) model. HTSM explicitly captures topic
coherence and sentiment consistency in an opinionated text review to
extract hidden aspects and their corresponding sentiment polarities.
In HTSM, topic coherence is achieved by enforcing words in the same
sentence to share the same topic assignment and by modeling topic
transitions between successive sentences. Sentiment consistency is
imposed by constraining topic transitions via tracking sentiment
changes; both topic transitions and sentiment transitions are guided
by a parameterized logistic function based on linguistic signals
directly observable in the document. However, HTSM is based on a
first-order Markov dependency, is a semi-supervised technique, and
does not capture the sparsity of the data. Facet-level sentiment
analysis has also drawn interest over the last few years; it involves
extracting the facets and the associated sentiments. Hu and Liu
formulated this problem and applied association mining to extract
product features, using a seed set of adjectives expanded with WordNet
synsets to identify the polarity of sentiment words, but they made no
attempt to cluster the extracted product features into appropriate
facets.
Some works address the extraction of aspect terms, such as the MG-LDA
model, which extracts ratable aspects automatically. In another work,
Mukherjee et al. applied user-provided seed words for a few aspect
categories to jointly extract and cluster aspect terms with a
semi-supervised model. The joint topic model JST by Lin et al. also
extracts aspects and their corresponding sentiment polarities,
although it does not address the identification of sentiment
orientation or rating prediction for each topical aspect of a
specific item. Various sparsity-based models have also seen widespread
use in different applications: Shashanka et al. used Maximum A
Posteriori estimation to induce sparsity based on Probabilistic Latent
Semantic Analysis, and Zhu et al. incorporated sparse coding to
improve traditional probabilistic models and discover a sparse hidden
representation for each document.
Problem Definition
Our sparse aspect rating problem for sentiment summarization can be
described as follows. We take as input a collection of reviews in the
financial domain; each review has an overall rating, a reviewer's
identity and an item identity. Our aim is to retrieve the latent
aspects of the domain and predict a rating for each aspect of each
review, given a predefined number of aspects; further, we retrieve the
key terms for each aspect. For a domain-specific dataset the input
corpus is represented as R = {r_1, r_2, ..., r_|R|}. We use
A = {1, 2, ..., A} to represent the collection of reviewers and
B = {1, 2, ..., B} for the collection of items. The review r ∈ R is
written by reviewer a_r ∈ A for the item b_r ∈ B. The overall rating
Y_r ∈ R+ is given by the reviewer to denote the emphasis on the item;
this numeric score has the same range as the ground-truth ratings,
typically from 1 to 5. An aspect is a representation of an attribute
belonging to the domain-specific subject, for example ”ATM services”,
”Phone banking” and ”Internet banking” in the banking-service domain.
Let K be the total number of aspects in the given domain. To denote
this set of aspects we use F = {1, 2, ..., K}; each of its elements is
denoted by t ∈ F.
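The input defined above can be sketched as a small data structure; this is our own minimal representation (the class and field names are ours, not the report's), assuming integer identifiers for reviewers and items.

```python
from dataclasses import dataclass

# Each review carries its reviewer identity a_r, item identity b_r,
# overall rating Y_r (same 1..5 range as the ground-truth ratings),
# and the raw review text.
@dataclass
class Review:
    reviewer: int   # a_r ∈ A
    item: int       # b_r ∈ B
    rating: float   # Y_r ∈ R+, typically in [1, 5]
    text: str

# A toy two-review corpus R (contents invented for illustration).
corpus = [
    Review(reviewer=0, item=3, rating=4.0, text="A good bank these days ..."),
    Review(reviewer=1, item=3, rating=2.0, text="closing account took months"),
]

K = 5                  # number of aspects, fixed in advance
F = list(range(K))     # aspect index set F
print(len(corpus), K)
```

The aspect count K must be supplied up front, matching the problem definition's requirement that the number of aspects is predefined.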
Model Description
Our model incorporates two latent variables, namely the user intrinsic
aspect interest and the item intrinsic aspect quality, when modeling
the observed review text and the overall rating. The user intrinsic
aspect interest t_a represents reviewer a's interest in each aspect.
The item aspect quality q_b denotes the intrinsic quality of item
b ∈ B for each aspect, and is user-independent. More description of
these two notions can be found in Section 4. The generative process is
as follows. One first chooses the subset of aspects to comment on and
decides the text proportion for describing each aspect based on the
user intrinsic aspect interest t_a and the item intrinsic aspect
quality q_b. Then, some terms, including opinionated words, are
selected to form the review content; the details of the generation
process for a word are described below. Next, the sentiment
orientation for each aspect, characterized by the aspect rating, is
determined. Finally, the observed overall rating given by the user is
based on the weighted sum of the aspect ratings. The graphical model
of SACM is depicted in Figure 3. The outer rectangle plate represents
the replication for a review; the inner rectangle plate captures each
word in each review. There are two components in this model. The
first component, shown on the lower left, is the review text content
component, including θ_r, s_rn and w_rn. The second component, shown
on the upper right, is the rating mining component.
We first describe the review text content component, which uses a
variant of STC (Section 4.2) to generate the observed words. For a
particular review r ∈ R written by the user a_r ∈ A for the item
b_r ∈ B, the document code θ_r is modeled as the Hadamard product
between the user intrinsic aspect interest t_{a_r} and the item
intrinsic aspect quality q_{b_r}, instead of a Laplace prior.
Precisely, the k-th element θ_rk of the document code represents the
association strength on aspect k: the more word occurrences over the
k-th aspect, the higher the value of θ_rk. Specifically, the dominant
aspect proportions in a review mainly depend on the corresponding
t_{a_r} and q_{b_r}. For instance, in the hotel domain, a user who
likes delicious food will have a high t_{a_r,k} where aspect k is the
Food aspect; this user likely writes about food in detail in his/her
reviews, leading to a high value of θ_rk. Additionally, a hotel
possessing a distinctive environment, i.e. a high q_{b,k} where k is
the Environment aspect, is likely to draw attention from users by its
environment and thus tends to attract comments on this aspect; as a
result, the corresponding θ_rk also has a high value. These examples
show that both t_{a_r} and q_{b_r} contribute to θ_r. Based on this
motivation, we use Eq. (3) below to generate the aspect proportions,
modeled by the document code θ_r for review r:

θ_r = t_{a_r} ∘ q_{b_r}     (3)

where the operator ∘ is the Hadamard product, defined as the
entry-wise product between the vectors t_{a_r} and q_{b_r}. It is
reasonable that the user intrinsic aspect interest t_a, a ∈ A, is
drawn from a Laplace prior, i.e. p(t_a) ∝ exp(−λ||t_a||_1), since a
user usually will not be interested in all possible aspects of a
particular item. Then, we use the STC model to generate the observed
review text. After obtaining the document code θ_r, we sample the word
code s_rn from p(s_rn | θ_r) for each observed word n, where n is the
word index in the vocabulary, and sample the observed word count w_rn
from a distribution with s_rn^T β_·n as its mean, where β_·n
represents the n-th column of β. Unlike the multinomial distribution
adopted in traditional probabilistic topic models, for the sparsity of
the word code, s_rn is drawn from the super-Gaussian distribution
shown below; the ℓ1-norm within it tends to find sparse codes:

p(s_rn | θ_r) ∝ exp(−γ||s_rn − θ_r||_2² − ρ||s_rn||_1)     (4)
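The document-code construction and the word-code prior can be sketched in a few lines; this is our own toy code (the vectors and the γ, ρ values are invented), assuming the Hadamard product of Eq. (3) and the unnormalized log-density of the super-Gaussian prior of Eq. (4).

```python
def hadamard(t_a, q_b):
    """theta_r = t_{a_r} ∘ q_{b_r} (Eq. 3): zero wherever either factor is zero."""
    return [t * q for t, q in zip(t_a, q_b)]

def log_p_word_code(s, theta, gamma=1.0, rho=0.5):
    """Unnormalized log p(s_rn | theta_r) (Eq. 4)."""
    sq = sum((si - ti) ** 2 for si, ti in zip(s, theta))   # ||s - theta||_2^2
    l1 = sum(abs(si) for si in s)                          # ||s||_1
    return -gamma * sq - rho * l1

t_a = [0.9, 0.0, 0.4]        # user cares about aspects 0 and 2 only
q_b = [0.5, 0.8, 0.0]        # item is strong on aspects 0 and 1 only
theta = hadamard(t_a, q_b)   # sparsity in either factor propagates to theta
print(theta)

# A sparse code matching theta scores higher than a dense code:
print(log_p_word_code(theta, theta) > log_p_word_code([0.5, 0.5, 0.5], theta))
```

Note how the entry-wise product keeps an aspect only when both the user is interested in it and the item is notable for it, which is exactly the sparsity mechanism Eq. (3) encodes.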
Then, the word count in each document is sampled from the Poisson
distribution p(w_rn | s_rn, β) = Poiss(w_rn; s_rn^T β_·n). In the
rating mining component, the aspect weight represents the user's
relative weight placed on each aspect when deciding the overall rating
for a particular review. For the review r, we assume that the aspect
weight η_r ∈ R_++^K is generated from the document code θ_r, which
denotes the strength of each aspect. After normalization, each element
of η_r is:

η_rk = exp(θ_rk) / Σ_j exp(θ_rj)     (5)

For a review r written by the user a_r for the item b_r, we assume
that the k-th element of the aspect rating, Y^F_rk, is drawn from a
Gaussian distribution whose mean and variance are q_{b_r,k} and
α² t²_{a_r,k} respectively, where α is a positive scalar:

Y^F_rk ~ N(q_{b_r,k}, α² t²_{a_r,k})     (6)
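The rating side of the generative process can be simulated end to end; this is a toy sketch of ours (the interest/quality vectors and the α, c hyperparameters are invented) that chains the softmax aspect weights of Eq. (5), the per-aspect Gaussian draw of Eq. (6), and the Gaussian overall rating.

```python
import math
import random

random.seed(0)  # deterministic toy run

def aspect_weights(theta):
    """eta_rk = exp(theta_rk) / sum_j exp(theta_rj) (Eq. 5)."""
    e = [math.exp(x) for x in theta]
    z = sum(e)
    return [x / z for x in e]

def sample_ratings(t_a, q_b, alpha=0.3, c=0.2):
    theta = [t * q for t, q in zip(t_a, q_b)]                # Eq. (3)
    eta = aspect_weights(theta)                              # Eq. (5)
    # Eq. (6): aspect rating mean q_{b,k}, std alpha * t_{a,k}
    y_f = [random.gauss(qk, alpha * tk) for tk, qk in zip(t_a, q_b)]
    # Overall rating: mean eta^T y_f, fixed variance c^2
    y = random.gauss(sum(e * r for e, r in zip(eta, y_f)), c)
    return eta, y_f, y

eta, y_f, y = sample_ratings(t_a=[0.9, 0.1, 0.5], q_b=[4.0, 3.0, 5.0])
print(round(sum(eta), 6))   # the aspect weights form a distribution
```

A user with a large t_{a,k} produces a larger standard deviation α·t_{a,k} on aspect k, reproducing the higher rating variance for aspects the user cares about that is described next.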
Consequently, the ratings on the k-th aspect from all reviews of a
particular item b_r should attain the average value determined by the
intrinsic aspect quality q_{b_r} of the item. For a particular user a,
the variance of his/her aspect ratings should be related to this
user's intrinsic aspect interest t_a. For example, in the hotel
domain, a foodie is likely to write more about the Food aspect in
reviews and to be more sensitive to the variation of food quality
across hotels; thus, he would give ratings on the Food aspect with
higher variance. Similarly, a thrifty person would be more sensitive
to the Price aspect and tends to provide a wider range of ratings on
it for different hotels, while for other aspects, about which this
user does not care much, the ratings would exhibit much less variance.
Finally, following the generative process above, we assume that the
overall rating Y_r of the review r is drawn from a Gaussian
distribution whose mean is the weighted sum of aspect ratings
η_r^T Y^F_r and whose variance c² is fixed, i.e.
Y_r ~ N(η_r^T Y^F_r, c²). Since the user intrinsic aspect interest is
modeled with a Laplace prior, we employ Maximum A Posteriori (MAP)
estimation to infer all the latent variables in this model. Let T and
Q be the collections of user intrinsic aspect interests and item
intrinsic aspect qualities respectively, i.e. T = {t_a}_{a∈A} and
Q = {q_b}_{b∈B}, and represent the collections of word codes and
aspect ratings as S = {s_rn}_{r∈R, n∈I_r} and Y = {Y^F_r}_{r∈R},
respectively. Our goal is to infer the latent variable set
Ω = {Y, S, T, Q, β, α}. The objective function is the negative
logarithm of the posterior p(Ω | {w_rn, Y_r}_{r∈R, n∈I_r}). Combining
(3) to (6) with the review text content component, the optimization
problem based on MAP estimation is given as follows:
min_Ω  λ Σ_a ||t_a||_1 + Σ_r Σ_{n∈I_r} (γ||s_rn − θ_r||_2² + ρ||s_rn||_1)
       + Σ_r Σ_{n∈I_r} (s_rn^T β_·n − w_rn log(s_rn^T β_·n))
       + (1/(2c²)) Σ_r (Y_r − Σ_k η_rk Y^F_rk)²
       + Σ_r Σ_k [ log(α t_{a_r,k}) + (Y^F_rk − q_{b_r,k})² / (2α² t²_{a_r,k}) ]

s.t.  t_a ≥ 0,  q_b ≥ 0,  s_rn ≥ 0,  η_rk = exp(θ_rk) / Σ_j exp(θ_rj),
      θ_r = t_{a_r} ∘ q_{b_r},  β_k ∈ S^{N−1},  α > 0,  ∀ r, ∀ n ∈ I_r, ∀ k

where S^{N−1} represents the (N−1)-simplex.
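To make the objective's four ingredients concrete (the Laplace penalty on t_a, the sparse-coding term, the Poisson likelihood term, and the rating terms), here is our own toy evaluation for a single review with one observed word; every value and hyperparameter is invented for illustration, and this is not the report's implementation.

```python
import math

def map_objective(t_a, q_b, s, w, beta, eta, y_f, y,
                  lam=0.1, gamma=1.0, rho=0.5, c=0.2, alpha=0.3):
    """Evaluate the MAP objective above for one review (toy sketch)."""
    theta = [ti * qi for ti, qi in zip(t_a, q_b)]        # theta = t_a ∘ q_b
    obj = lam * sum(abs(x) for x in t_a)                 # λ ||t_a||_1
    for s_n, w_n, beta_n in zip(s, w, beta):             # per observed word n
        obj += gamma * sum((a - b) ** 2 for a, b in zip(s_n, theta))
        obj += rho * sum(abs(a) for a in s_n)            # sparse-coding terms
        mean = sum(a * b for a, b in zip(s_n, beta_n))   # s_rn^T beta_.n
        obj += mean - w_n * math.log(mean)               # Poisson neg. log-lik.
    # Rating reconstruction term: (Y_r - eta^T Y^F_r)^2 / (2 c^2)
    obj += (y - sum(e * r for e, r in zip(eta, y_f))) ** 2 / (2 * c ** 2)
    # Per-aspect rating terms: log(alpha t_k) + (Y^F_k - q_k)^2 / (2 alpha^2 t_k^2)
    for tk, qk, yk in zip(t_a, q_b, y_f):
        obj += math.log(alpha * tk) + (yk - qk) ** 2 / (2 * alpha ** 2 * tk ** 2)
    return obj

val = map_objective(
    t_a=[0.9, 0.4], q_b=[4.0, 3.0],
    s=[[3.0, 0.0]], w=[2.0], beta=[[0.8, 0.1]],   # a single observed word
    eta=[0.7, 0.3], y_f=[4.1, 2.9], y=4.0,
)
print(val > 0)
```

In the actual model this quantity would be minimized over Ω by alternating updates subject to the non-negativity and simplex constraints; here we only evaluate it at a fixed point to show how the terms combine.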
Bibliography
[1] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis on
review text data: a rating regression approach. In KDD, pages 783–792,
2010.
[2] Y. Xu et al. Latent Aspect Mining via Exploring Sparsity and
Intrinsic Information. In ACM CIKM, 2014.
[3] Md Mustafizur Rahman and H. Wang. Hidden Topic Sentiment Model.
In WWW, 2016.