A Summer Internship Project Report On
“Sparse Aspect Rating Model for Sentiment Summarization”
Carried out at the Institute for Development and Research in Banking Technology,
Hyderabad
Established by ‘Reserve Bank of India’
Submitted by Agni Besh Chauhan Roll No. – 1301CS04
B.Tech (Computer Science & Engineering) Indian Institute of Technology, Patna
Under the Guidance of Dr. S. Nagesh Bhattu
Asst. Professor Centre of Excellence in Analytics
IDRBT, Hyderabad
DECLARATION
I hereby declare that this dissertation entitled “Sparse Aspect Rating
Model for Sentiment Summarization”, carried out under the guidance and
supervision of Dr. S. Nagesh Bhattu and submitted to the Institute for
Development & Research in Banking Technology, Hyderabad, is a bona fide
record of work that is free from plagiarism. I also declare that it has
not been submitted previously, in part or in full, to this or any other
university or institution for the award of any degree or diploma.
Agni Besh Chauhan
Dept. of Computer science & Engineering Indian Institute of Technology, Patna
CERTIFICATE
This is to certify that the summer internship project report entitled
“Sparse Aspect Rating Model for Sentiment Summarization” submitted
to the Institute for Development & Research in Banking Technology
[IDRBT], Hyderabad is a bona fide record of work done by Agni Besh
Chauhan, Roll No. 1301CS04, B.Tech (Computer Science & Engineering),
2013–18, Indian Institute of Technology, Patna, from 12th May, 2016
to 20th July, 2016 under my supervision.
(Project Guide) Dr. S. Nagesh Bhattu
Asst. Professor Centre of Excellence in Analytics,
IDRBT, Hyderabad
CONTENTS
1. Abstract
2. Introduction
3. Related Work
4. Problem Definition
5. Model Description
6. Bibliography
Abstract
We investigate the aspect mining problem, which aims to deliver
opinion-based summarization of text reviews in the financial domain.
Our goal is to derive the hidden aspects discussed in the text data.
Further, we retrieve the intensity with which the user emphasizes a
given aspect in the review and assign a score to each aspect. Existing
works on aspect rating prediction rely on supervised models and do not
consider the sparsity of aspects. Our model handles aspect sparsity in
an unsupervised manner, which solves the problem efficiently and
produces detailed records of aspect ratings and sentiment
summarization.
Introduction
With the increase in the number of internet services and users, we
have developed a vast repository of various kinds of knowledge. People
contribute their opinions and reviews to such repositories on various
user-centric platforms, and with the growing number of such user
reviews, it is increasingly difficult to sift through them and find the
vital information. A lot of research effort has been made to tackle
this problem via information extraction, user opinion summarization and
sentiment analysis on reviews, but it is still unable to produce
reliable predictions of a user's opinion at a close, detailed,
aspect-level analysis.
We take an instance from a typical bank review, as banks are eminent
entities in the financial domain. A user tamsat17 writes: "A good
bank these days, providing many branches and atms all over the
world. This bank service is very very good, always customer service is
present to help you. There are several assistant managers present in
all branches to help you at your doorstep. This bank providing
investments and savings according to your need. I have an account
with this bank. My account is priority ac. This account can be
continued at zero balance after 2 month period of opening, can
withdraw money from any banks atm whenever you want(no charge is
there), get several options like free demat account, shopping coupons,
5 dd without any bank charge in every month, conference rooms with
previous booking for your business needs etc. Likes- this bank service
is great, whenever I need to deposit they send assistance in my home
for taking it, this is like banking from home, always at your help for
what reason you may go. They suggest good investments schemes,
well experienced financial managers are recruited. Opening account is
much easier in this bank. Same day account opening. Dislikes- few
managers suggest wrong things for their promotions, no place to
complain against higher officials of axis bank, there should be zero
balance account opening with small investments, closing account in
this bank is like waiting for months. Overall I suggest axis for its
service and support. Moreover they offer several types of savings ac,
demat, other investment plans. For opening account don't have to visit
branch call helpline they will send assistance in your home. A good
bank with modern technology and no harassment." and gives an
overall rating of 4 stars. Here, the overall rating given by the user
cannot provide a detailed analysis of all the aspects of the banking
service. In the review he mentions some positive aspects, e.g.,
"service", "deposit", "investments schemes", while he also mentions
some negative aspects, e.g., "promotions", "zero balance account",
"closing account", without an explicit rating on each aspect. The
user's opinion and sentiment about these aspects cannot be identified
from the overall rating alone. Two users giving the same rating to a
bank may differ on individual aspects: one user may like an aspect that
the other dislikes, and vice versa, yet both give the same overall
rating to the bank. To address this problem we developed a
sparsity-based aspect rating model that produces a user's interest
toward a particular aspect in a given domain along with their opinion
and sentiment regarding that aspect.
For evaluation, our model takes as input a dataset containing a
collection of review texts along with their overall ratings and
reviewer identities. These reviews belong to a particular domain, in
our case the financial domain. Our aim is to achieve a detailed
understanding of each review by discovering the aspect set and
predicting a rating for each aspect. Previous models such as the
Latent Aspect Rating Analysis Model (LARAM) address this aspect rating
problem. They rely on the topic model Latent Dirichlet Allocation
(LDA) for modeling word generation in reviews and determine the aspect
rating with a rating regression component. A persistent limitation of
probabilistic topic models such as LDA is that they are not efficient
when dealing with aspect sparsity in reviews. Aspect sparsity refers
to the common observation in review data that a user talks about only
a few of the many aspects in a domain. For example, consider again the
banking review discussed earlier, where the user talks about aspects
such as "service", "deposit", "investments schemes", "promotions",
"zero balance account" and "closing account". There are various
aspects he does not mention, such as "ATM services", "Phone banking"
and "Internet banking". It is quite common in real-life situations
that review data suffers from this sparsity issue, and in order to
gain proper insight into an entity we need to consider all the
aspects.
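To make the sparsity observation concrete, here is a toy illustration of our own (the aspect names and proportion values are invented, not taken from the report): a dense LDA-style model assigns some probability mass to every aspect, while a sparsity-aware model zeroes out the aspects a review never mentions.

```python
# Toy aspect proportions for the bank review above (values are made up).
ASPECTS = ["service", "deposit", "investment", "promotions",
           "zero balance", "closing", "ATM", "phone banking",
           "internet banking"]

lda_like = [0.20, 0.15, 0.14, 0.12, 0.11, 0.10, 0.07, 0.06, 0.05]  # dense
sparse   = [0.28, 0.20, 0.18, 0.14, 0.11, 0.09, 0.0, 0.0, 0.0]     # sparse

def support(props, eps=1e-8):
    """Aspects that actually receive probability mass."""
    return [a for a, p in zip(ASPECTS, props) if p > eps]

print(len(support(lda_like)))  # every aspect gets some mass
print(len(support(sparse)))    # only the aspects the review discusses
```

The sparse vector reflects the review faithfully: the three unmentioned aspects receive exactly zero mass instead of a small, spurious probability.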
We evaluate our Sparse Aspect Rating Model on a bank review dataset
crawled from MouthShut (http://www.mouthshut.com). Experiments show
that our model can control the sparsity of aspect proportions and
produces aspect ratings by considering item and user information.
Another notable result, in addition to aspect rating prediction, is
that our model detects the key terms for each aspect: the learned
dictionary contains the terms associated with each aspect together
with their association strengths.
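Reading key terms out of a learned dictionary can be sketched as follows; the vocabulary, the dictionary values and the aspect labels here are toy inventions of ours, not the report's learned parameters.

```python
# Toy vocabulary and association strengths (all values invented).
VOCAB = ["atm", "card", "loan", "interest", "app", "login", "branch", "queue"]

# beta[k][n] = association strength of vocabulary term n with aspect k
beta = [
    [0.9, 0.7, 0.0, 0.0, 0.0, 0.1, 0.2, 0.0],   # an "ATM services"-like aspect
    [0.0, 0.1, 0.8, 0.9, 0.0, 0.0, 0.0, 0.0],   # a "loans"-like aspect
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.0, 0.1],   # an "internet banking"-like aspect
]

def top_terms(beta_k, m=2):
    """Return the m terms with highest association strength for one aspect."""
    ranked = sorted(range(len(beta_k)), key=lambda n: beta_k[n], reverse=True)
    return [VOCAB[n] for n in ranked[:m]]

for k, beta_k in enumerate(beta):
    print(k, top_terms(beta_k))
```

Each aspect is summarized by its strongest dictionary entries, which is what makes the learned dictionary directly interpretable.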
Related Work
Aspect rating prediction has received vigorous interest in recent
times. The wide coverage of topics and abundance of opinions make it
an important area of research for discovering public opinion on all
sorts of topics. Significant effort has been devoted to sentiment
analysis for customer reviews. The Latent Aspect Rating Analysis Model
(LARAM) jointly identifies latent aspects, aspect ratings and aspect
weights in a review, using a dependency parser to learn product
aspects and aspect-specific opinions by jointly considering aspect
frequency and the reviewer's opinion about each aspect. However, LARAM
does not consider the reviewer's identity or the user's tendencies in
writing reviews, and it learns parameters on a per-review basis; in
contrast, our model considers reviewer identity and user tendencies,
learning hidden parameters by iterating over each review, item and
user. Moreover, LARAM is based on a probabilistic topic model and
fails to handle the aspect sparsity issue.
Several follow-up works have tried to address the limitations of
LARAM, such as the Hidden Topic Sentiment Model (HTSM) and the FACTS
(FACeT and Sentiment extraction) model. HTSM explicitly captures topic
coherence and sentiment consistency in an opinionated text review to
extract hidden aspects and their corresponding sentiment polarities.
In HTSM, topic coherence is achieved by enforcing words in the same
sentence to share the same topic assignment and by modeling topic
transitions between successive sentences. Sentiment consistency is
imposed by constraining topic transitions via tracking sentiment
changes; both topic transitions and sentiment transitions are guided
by a parameterized logistic function based on linguistic signals
directly observable in the document. However, HTSM is based on a
first-order Markov dependency, is a semi-supervised technique, and
does not capture the sparsity of the data. Facet-level sentiment
analysis has also drawn interest over the last few years; it involves
extracting the facets and the associated sentiments. Hu and Liu
formulated this problem and applied association mining to extract
product features, using a seed set of adjectives expanded with WordNet
synsets to identify the polarity of sentiment words, but they made no
attempt to cluster the extracted product features into appropriate
facets.
Some works address the extraction of aspect terms, such as the MG-LDA
model, which extracts ratable aspects automatically. In another work,
Mukherjee et al. applied user-provided seed words for a few aspect
categories to jointly extract and cluster aspect terms with a
semi-supervised model. The joint topic model JST by Lin et al. also
extracts aspects and their corresponding sentiment polarities,
although it does not address the identification of sentiment
orientation or rating prediction for each topical aspect of a
specific item. Various sparsity-based models have also seen widespread
use in different applications: Shashanka et al. used Maximum A
Posteriori estimation to induce sparsity based on Probabilistic Latent
Semantic Analysis, and Zhu et al. incorporated sparse coding to
improve traditional probabilistic models and discover a sparse hidden
representation for each document.
Problem Definition
Our sparse aspect rating problem for sentiment summarization can be
described as follows. We take as input a collection of reviews in the
financial domain; each review has an overall rating, a reviewer's
identity and an item identity. Our aim is to retrieve the latent
aspects of the domain and predict a rating for each aspect of each
review, given a predefined number of aspects; further, we retrieve the
key terms for each aspect. For a domain-specific dataset the input
corpus is represented as R = {r_1, r_2, ..., r_|R|}. We use
A = {1, 2, ..., A} to represent the collection of reviewers and
B = {1, 2, ..., B} for the collection of items. The review r ∈ R is
written by reviewer a_r ∈ A for the item b_r ∈ B. The overall rating
Y_r ∈ R+ is given by the reviewer to denote the emphasis on the item;
this numeric score has the same range as the ground-truth ratings,
typically from 1 to 5. An aspect is a representation of an attribute
belonging to the domain-specific subject, for example ”ATM services”,
”Phone banking” and ”Internet banking” in the banking-service domain.
Let K be the total number of aspects in the given domain. To denote
this set of aspects we use F = {1, 2, ..., K}; each of its elements is
denoted by t ∈ F.
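The input defined above can be sketched as a small data structure; this is our own minimal representation (the class and field names are ours, not the report's), assuming integer identifiers for reviewers and items.

```python
from dataclasses import dataclass

# Each review carries its reviewer identity a_r, item identity b_r,
# overall rating Y_r (same 1..5 range as the ground-truth ratings),
# and the raw review text.
@dataclass
class Review:
    reviewer: int   # a_r ∈ A
    item: int       # b_r ∈ B
    rating: float   # Y_r ∈ R+, typically in [1, 5]
    text: str

# A toy two-review corpus R (contents invented for illustration).
corpus = [
    Review(reviewer=0, item=3, rating=4.0, text="A good bank these days ..."),
    Review(reviewer=1, item=3, rating=2.0, text="closing account took months"),
]

K = 5                  # number of aspects, fixed in advance
F = list(range(K))     # aspect index set F
print(len(corpus), K)
```

The aspect count K must be supplied up front, matching the problem definition's requirement that the number of aspects is predefined.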
Model Description
Our model incorporates two latent variables, namely the user intrinsic
aspect interest and the item intrinsic aspect quality, when modeling
the observed review text and the overall rating. The user intrinsic
aspect interest t_a represents reviewer a's interest in each aspect.
The item aspect quality q_b denotes the intrinsic quality of item
b ∈ B for each aspect, and is user-independent. More description of
these two notions can be found in Section 4. The generative process is
as follows. One first chooses the subset of aspects to comment on and
decides the text proportion for describing each aspect based on the
user intrinsic aspect interest t_a and the item intrinsic aspect
quality q_b. Then, some terms, including opinionated words, are
selected to form the review content; the details of the generation
process for a word are described below. Next, the sentiment
orientation for each aspect, characterized by the aspect rating, is
determined. Finally, the observed overall rating given by the user is
based on the weighted sum of the aspect ratings. The graphical model
of SACM is depicted in Figure 3. The outer rectangle plate represents
the replication for a review; the inner rectangle plate captures each
word in each review. There are two components in this model. The
first component, shown on the lower left, is the review text content
component, including θ_r, s_rn and w_rn. The second component, shown
on the upper right, is the rating mining component.
We first describe the review text content component, which uses a
variant of STC (Section 4.2) to generate the observed words. For a
particular review r ∈ R written by the user a_r ∈ A for the item
b_r ∈ B, the document code θ_r is modeled as the Hadamard product
between the user intrinsic aspect interest t_{a_r} and the item
intrinsic aspect quality q_{b_r}, instead of a Laplace prior.
Precisely, the k-th element θ_rk of the document code represents the
association strength on aspect k: the more word occurrences over the
k-th aspect, the higher the value of θ_rk. Specifically, the dominant
aspect proportions in a review mainly depend on the corresponding
t_{a_r} and q_{b_r}. For instance, in the hotel domain, a user who
likes delicious food will have a high t_{a_r,k} where aspect k is the
Food aspect; this user likely writes about food in detail in his/her
reviews, leading to a high value of θ_rk. Additionally, a hotel
possessing a distinctive environment, i.e. a high q_{b,k} where k is
the Environment aspect, is likely to draw attention from users by its
environment and thus tends to attract comments on this aspect; as a
result, the corresponding θ_rk also has a high value. These examples
show that both t_{a_r} and q_{b_r} contribute to θ_r. Based on this
motivation, we use Eq. (3) below to generate the aspect proportions,
modeled by the document code θ_r for review r:

θ_r = t_{a_r} ∘ q_{b_r}     (3)

where the operator ∘ is the Hadamard product, defined as the
entry-wise product between the vectors t_{a_r} and q_{b_r}. It is
reasonable that the user intrinsic aspect interest t_a, a ∈ A, is
drawn from a Laplace prior, i.e. p(t_a) ∝ exp(−λ||t_a||_1), since a
user usually will not be interested in all possible aspects of a
particular item. Then, we use the STC model to generate the observed
review text. After obtaining the document code θ_r, we sample the word
code s_rn from p(s_rn | θ_r) for each observed word n, where n is the
word index in the vocabulary, and sample the observed word count w_rn
from a distribution with s_rn^T β_·n as its mean, where β_·n
represents the n-th column of β. Unlike the multinomial distribution
adopted in traditional probabilistic topic models, for the sparsity of
the word code, s_rn is drawn from the super-Gaussian distribution
shown below; the ℓ1-norm within it tends to find sparse codes:

p(s_rn | θ_r) ∝ exp(−γ||s_rn − θ_r||_2² − ρ||s_rn||_1)     (4)
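The document-code construction and the word-code prior can be sketched in a few lines; this is our own toy code (the vectors and the γ, ρ values are invented), assuming the Hadamard product of Eq. (3) and the unnormalized log-density of the super-Gaussian prior of Eq. (4).

```python
def hadamard(t_a, q_b):
    """theta_r = t_{a_r} ∘ q_{b_r} (Eq. 3): zero wherever either factor is zero."""
    return [t * q for t, q in zip(t_a, q_b)]

def log_p_word_code(s, theta, gamma=1.0, rho=0.5):
    """Unnormalized log p(s_rn | theta_r) (Eq. 4)."""
    sq = sum((si - ti) ** 2 for si, ti in zip(s, theta))   # ||s - theta||_2^2
    l1 = sum(abs(si) for si in s)                          # ||s||_1
    return -gamma * sq - rho * l1

t_a = [0.9, 0.0, 0.4]        # user cares about aspects 0 and 2 only
q_b = [0.5, 0.8, 0.0]        # item is strong on aspects 0 and 1 only
theta = hadamard(t_a, q_b)   # sparsity in either factor propagates to theta
print(theta)

# A sparse code matching theta scores higher than a dense code:
print(log_p_word_code(theta, theta) > log_p_word_code([0.5, 0.5, 0.5], theta))
```

Note how the entry-wise product keeps an aspect only when both the user is interested in it and the item is notable for it, which is exactly the sparsity mechanism Eq. (3) encodes.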
Then, the word count in each document is sampled from the Poisson
distribution p(w_rn | s_rn, β) = Poiss(w_rn; s_rn^T β_·n). In the
rating mining component, the aspect weight represents the user's
relative weight placed on each aspect when deciding the overall rating
for a particular review. For the review r, we assume that the aspect
weight η_r ∈ R_++^K is generated from the document code θ_r, which
denotes the strength of each aspect. After normalization, each element
of η_r is:

η_rk = exp(θ_rk) / Σ_j exp(θ_rj)     (5)

For a review r written by the user a_r for the item b_r, we assume
that the k-th element of the aspect rating, Y^F_rk, is drawn from a
Gaussian distribution whose mean and variance are q_{b_r,k} and
α² t²_{a_r,k} respectively, where α is a positive scalar:

Y^F_rk ~ N(q_{b_r,k}, α² t²_{a_r,k})     (6)
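The rating side of the generative process can be simulated end to end; this is a toy sketch of ours (the interest/quality vectors and the α, c hyperparameters are invented) that chains the softmax aspect weights of Eq. (5), the per-aspect Gaussian draw of Eq. (6), and the Gaussian overall rating.

```python
import math
import random

random.seed(0)  # deterministic toy run

def aspect_weights(theta):
    """eta_rk = exp(theta_rk) / sum_j exp(theta_rj) (Eq. 5)."""
    e = [math.exp(x) for x in theta]
    z = sum(e)
    return [x / z for x in e]

def sample_ratings(t_a, q_b, alpha=0.3, c=0.2):
    theta = [t * q for t, q in zip(t_a, q_b)]                # Eq. (3)
    eta = aspect_weights(theta)                              # Eq. (5)
    # Eq. (6): aspect rating mean q_{b,k}, std alpha * t_{a,k}
    y_f = [random.gauss(qk, alpha * tk) for tk, qk in zip(t_a, q_b)]
    # Overall rating: mean eta^T y_f, fixed variance c^2
    y = random.gauss(sum(e * r for e, r in zip(eta, y_f)), c)
    return eta, y_f, y

eta, y_f, y = sample_ratings(t_a=[0.9, 0.1, 0.5], q_b=[4.0, 3.0, 5.0])
print(round(sum(eta), 6))   # the aspect weights form a distribution
```

A user with a large t_{a,k} produces a larger standard deviation α·t_{a,k} on aspect k, reproducing the higher rating variance for aspects the user cares about that is described next.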
Consequently, the ratings on the k-th aspect from all reviews of a
particular item b_r should attain the average value determined by the
intrinsic aspect quality q_{b_r} of the item. For a particular user a,
the variance of his/her aspect ratings should be related to this
user's intrinsic aspect interest t_a. For example, in the hotel
domain, a foodie is likely to write more about the Food aspect in
reviews and to be more sensitive to the variation of food quality
across hotels; thus, he would give ratings on the Food aspect with
higher variance. Similarly, a thrifty person would be more sensitive
to the Price aspect and tends to provide a wider range of ratings on
it for different hotels, while for other aspects, about which this
user does not care much, the ratings would exhibit much less variance.
Finally, following the generative process above, we assume that the
overall rating Y_r of the review r is drawn from a Gaussian
distribution whose mean is the weighted sum of aspect ratings
η_r^T Y^F_r and whose variance c² is fixed, i.e.
Y_r ~ N(η_r^T Y^F_r, c²). Since the user intrinsic aspect interest is
modeled with a Laplace prior, we employ Maximum A Posteriori (MAP)
estimation to infer all the latent variables in this model. Let T and
Q be the collections of user intrinsic aspect interests and item
intrinsic aspect qualities respectively, i.e. T = {t_a}_{a∈A} and
Q = {q_b}_{b∈B}, and represent the collections of word codes and
aspect ratings as S = {s_rn}_{r∈R, n∈I_r} and Y = {Y^F_r}_{r∈R},
respectively. Our goal is to infer the latent variable set
Ω = {Y, S, T, Q, β, α}. The objective function is the negative
logarithm of the posterior p(Ω | {w_rn, Y_r}_{r∈R, n∈I_r}). Combining
(3) to (6) with the review text content component, the optimization
problem based on MAP estimation is given as follows:
min_Ω  λ Σ_a ||t_a||_1 + Σ_r Σ_{n∈I_r} (γ||s_rn − θ_r||_2² + ρ||s_rn||_1)
       + Σ_r Σ_{n∈I_r} (s_rn^T β_·n − w_rn log(s_rn^T β_·n))
       + (1/(2c²)) Σ_r (Y_r − Σ_k η_rk Y^F_rk)²
       + Σ_r Σ_k [ log(α t_{a_r,k}) + (Y^F_rk − q_{b_r,k})² / (2α² t²_{a_r,k}) ]

s.t.  t_a ≥ 0,  q_b ≥ 0,  s_rn ≥ 0,  η_rk = exp(θ_rk) / Σ_j exp(θ_rj),
      θ_r = t_{a_r} ∘ q_{b_r},  β_k ∈ S^{N−1},  α > 0,  ∀ r, ∀ n ∈ I_r, ∀ k

where S^{N−1} represents the (N−1)-simplex.
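To make the objective's four ingredients concrete (the Laplace penalty on t_a, the sparse-coding term, the Poisson likelihood term, and the rating terms), here is our own toy evaluation for a single review with one observed word; every value and hyperparameter is invented for illustration, and this is not the report's implementation.

```python
import math

def map_objective(t_a, q_b, s, w, beta, eta, y_f, y,
                  lam=0.1, gamma=1.0, rho=0.5, c=0.2, alpha=0.3):
    """Evaluate the MAP objective above for one review (toy sketch)."""
    theta = [ti * qi for ti, qi in zip(t_a, q_b)]        # theta = t_a ∘ q_b
    obj = lam * sum(abs(x) for x in t_a)                 # λ ||t_a||_1
    for s_n, w_n, beta_n in zip(s, w, beta):             # per observed word n
        obj += gamma * sum((a - b) ** 2 for a, b in zip(s_n, theta))
        obj += rho * sum(abs(a) for a in s_n)            # sparse-coding terms
        mean = sum(a * b for a, b in zip(s_n, beta_n))   # s_rn^T beta_.n
        obj += mean - w_n * math.log(mean)               # Poisson neg. log-lik.
    # Rating reconstruction term: (Y_r - eta^T Y^F_r)^2 / (2 c^2)
    obj += (y - sum(e * r for e, r in zip(eta, y_f))) ** 2 / (2 * c ** 2)
    # Per-aspect rating terms: log(alpha t_k) + (Y^F_k - q_k)^2 / (2 alpha^2 t_k^2)
    for tk, qk, yk in zip(t_a, q_b, y_f):
        obj += math.log(alpha * tk) + (yk - qk) ** 2 / (2 * alpha ** 2 * tk ** 2)
    return obj

val = map_objective(
    t_a=[0.9, 0.4], q_b=[4.0, 3.0],
    s=[[3.0, 0.0]], w=[2.0], beta=[[0.8, 0.1]],   # a single observed word
    eta=[0.7, 0.3], y_f=[4.1, 2.9], y=4.0,
)
print(val > 0)
```

In the actual model this quantity would be minimized over Ω by alternating updates subject to the non-negativity and simplex constraints; here we only evaluate it at a fixed point to show how the terms combine.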
Bibliography
[1] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis on
review text data: a rating regression approach. In KDD, pages 783–792,
2010.
[2] Y. Xu et al. Latent Aspect Mining via Exploring Sparsity and
Intrinsic Information. In ACM CIKM, 2014.
[3] Md Mustafizur Rahman and H. Wang. Hidden Topic Sentiment Model.
In WWW, 2016.