A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval
Min Zhang, Xinyao Ye
Tsinghua University
SIGIR 2008
2009. 01. 14.
Summarized & presented by Jung-Yeon Yang
IDS Lab.
Copyright 2008 by CEBT
Introduction
A growing interest in finding out people’s opinions from web data
Product survey
Advertisement analysis
Political opinion polls
TREC started a special track on blog data in 2006 – blog opinion retrieval
– It was the track with the most participants in 2007
This paper is focused on the problem of searching opinions over general topics
Related Work
The popular opinion identification approaches
Text classification
Lexicon-based sentiment analysis
Opinion retrieval
Opinion retrieval
To find the sentimentally relevant documents according to a user’s query
One of the key problems
How to combine opinion score with relevance score of each document for ranking
Related Work (cont.)
Topic-relevance search is carried out using relevance ranking (e.g., TF·IDF ranking)
Ad hoc solutions combine relevance ranking with opinion detection results
2 steps: rank by relevance, then re-rank by sentiment score
Most existing approaches use a linear combination: α·Score_rel + β·Score_opn
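The two-step linear-combination baseline can be sketched as follows; the function name, weights, and toy scores are illustrative placeholders, not values from the paper.

```python
# Sketch of the baseline 2-step strategy: retrieve by relevance, then
# re-rank by the linear combination alpha*rel + beta*opn. The weights and
# toy scores below are illustrative, not values from the paper.
def linear_rerank(docs, rel, opn, alpha=0.7, beta=0.3):
    return sorted(docs, key=lambda d: alpha * rel[d] + beta * opn[d],
                  reverse=True)

ranked = linear_rerank(["d1", "d2", "d3"],
                       rel={"d1": 0.9, "d2": 0.5, "d3": 0.7},
                       opn={"d1": 0.1, "d2": 0.9, "d3": 0.6})
```

Note that the weights α and β must be tuned per collection, which is one of the drawbacks the generation model aims to avoid.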
Backgrounds
Statistical Language Model (LM)
A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– Unigram : P(w1,w2,w3) = P(w1)*P(w2)*P(w3)
– Bigram : P(w1,w2,w3) = P(w1)*P(w2|w1)*P(w3|w2)
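The two factorizations can be illustrated with hand-picked toy probabilities (not estimated from any real corpus):

```python
# Toy illustration of unigram vs. bigram factorizations with hand-picked
# probabilities (not estimated from any real corpus).
p_uni = {"Today": 0.1, "is": 0.2, "Wednesday": 0.05}
p_bi = {("Today", "is"): 0.5, ("is", "Wednesday"): 0.3}

def unigram_prob(words):
    # P(w1..wn) = product of P(wi); word order is ignored
    p = 1.0
    for w in words:
        p *= p_uni[w]
    return p

def bigram_prob(words):
    # P(w1..wn) = P(w1) * product of P(wi | w_{i-1}); order matters
    p = p_uni[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= p_bi[(prev, cur)]
    return p
```

Under the unigram model the two example sentences above get the same probability; only a higher-order model can penalize the scrambled word order.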
Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
LM allows us to answer questions like:
Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
Backgrounds (cont.)
The notion of relevance (slide figure, reconstructed as an outline)
Relevance
– Similarity of representations (Rep(q), Rep(d)): different rep & similarity, e.g. vector space model, prob. distribution model, …
– Probability of relevance P(r=1|q,d), r ∈ {0,1}
  Regression model
  Classical prob. model
  Generative model
   Document generation
   Query generation: LM approach
– Probabilistic inference P(d→q) or P(q→d): different inference systems, e.g. inference network model, prob. concept space model
Backgrounds (cont.)
Retrieval as Language Model Estimation
Document ranking based on query likelihood
Retrieval problem → estimation of p(wi|d)
Smoothing is an important issue
– Problem: if tf(w,d) = 0, then p(wi|d) = 0
– Smoothing methods try to
Discount the probability of words seen in a document
Re-allocate the extra probability so that unseen words will have a non-zero probability
– Most use a reference model (collection language model) to discriminate unseen words
log p(q|d) = Σ_{i=1..n} log p(wi|d), where q = w1 w2 … wn
Document language model:
p(w|d) = p_seen(w|d), if w is seen in d (discounted LM estimate)
       = α_d · p(w|C), otherwise (collection language model)
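Query-likelihood scoring with smoothing can be sketched in a few lines. This uses Jelinek-Mercer interpolation as one concrete smoothing choice (equivalent in spirit to the discounting form above); the function name, λ value, and toy data are illustrative assumptions.

```python
import math
from collections import Counter

# Minimal query-likelihood scorer with Jelinek-Mercer smoothing:
#   p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)
# lam, the function name, and the arguments are illustrative choices.
def query_likelihood(query, doc, collection, lam=0.5):
    d_tf, d_len = Counter(doc), len(doc)
    c_tf, c_len = Counter(collection), len(collection)
    score = 0.0
    for w in query:
        p_ml = d_tf[w] / d_len      # maximum-likelihood estimate from d
        p_c = c_tf[w] / c_len       # collection (reference) language model
        score += math.log((1 - lam) * p_ml + lam * p_c)  # log p(q|d)
    return score
```

Smoothing guarantees a non-zero probability for query words unseen in d, as long as they occur somewhere in the collection.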
Generation Model for Opinion Retrieval
The Generation Model
To find documents that are both sentimental and relevant, with ranks
Topic Relevance Ranking
Opinion Generation Model and Ranking
Ranking function of generation model for opinion retrieval
The Proposed Generation Model
Document generation model
How well the document d “fits” the particular query q: p(d|q) ∝ p(q|d)·p(d)
In opinion retrieval, the quantity of interest becomes p(d|q,s)
Users’ information need is restricted to an opinionated subset of the relevant documents
This subset is characterized by sentiment expressions s towards topic q
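A hedged reconstruction of the factorization this implies (the full derivation is in the paper; notation follows the slides):

```latex
p(d \mid q, s) \;\propto\; p(q, s \mid d)\,p(d)
             \;=\; p(q \mid d)\; p(s \mid q, d)\; p(d)
```

The first factor corresponds to the topic-relevance score and the second to the opinion generation score.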
The Proposed Generation Model (cont.)
In this work, lexicon-based sentiment analysis is discussed
Assume
– The latent variable s is estimated with a pre-constructed bag-of-word sentiment thesaurus
– All sentiment words si are uniformly distributed
The final generation model
– The relevance and opinion components combine in a quadratic relationship
Topic Relevance Ranking I_rel(d,q)
The Binary Independence Retrieval (BIR) model is one of the best-known models in this branch
Heuristic ranking function BM25
– TREC tests have shown this to be the best of the known probabilistic weighting schemes
– BM25 combines IDF(w) with term-frequency saturation and document-length normalization
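One common BM25 formulation (several variants exist) can be sketched as follows; the function name and default parameter values are illustrative, not taken from the paper.

```python
import math

# One common BM25 formulation (several variants exist); k1 and b are the
# usual free parameters. Arguments: tf = term frequency in d, df = document
# frequency of the term, N = number of documents, dl/avgdl = document lengths.
def bm25_term(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The per-term scores are summed over the query terms to give the document's relevance score.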
Opinion Generation Model I_op(d,q,s)
I_op(d,q,s) focuses on how likely document d is to generate a sentiment expression s, given query q
Sparseness problem → smoothing
– Jelinek-Mercer smoothing is applied
Opinion Generation Model I_op(d,q,s) (cont.)
Use the co-occurrence of si and q inside d, within a window W, as the ranking measure of Pml(si|d,q)
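The windowed co-occurrence count can be sketched as below; the function and variable names are illustrative, not from the paper.

```python
# Sketch of the windowed co-occurrence count behind Pml(si|d,q): count
# occurrences of the sentiment word within W tokens of any query term in d.
# Function and variable names are illustrative, not from the paper.
def window_cooccur(doc_tokens, query_terms, sentiment_word, W=10):
    q_pos = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    hits = 0
    for i, t in enumerate(doc_tokens):
        if t == sentiment_word and any(abs(i - j) <= W for j in q_pos):
            hits += 1
    return hits
```

The intuition is that a sentiment word far from any query term likely expresses an opinion about something other than the query topic.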
Ranking function of generation model for opinion retrieval
The final ranking function
To reduce the impact of the imbalance between the number of sentiment words and the number of query terms → logarithm normalization
Special case: use only the topic relevance
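A hedged sketch of how the pieces fit together, assuming the relevance and opinion scores combine multiplicatively (the "quadratic relationship") with a log-normalized opinion side; the paper's exact formula may differ, and all names here are assumptions.

```python
import math

# Hedged sketch of the combined score: relevance and opinion combine
# multiplicatively (the "quadratic relationship"), with the opinion side
# log-normalized against the sentiment-lexicon size. The paper's exact
# formula may differ; i_rel, i_op, and lexicon_size are assumed inputs.
def opinion_score(i_rel, i_op, lexicon_size):
    norm = math.log(1 + lexicon_size)   # logarithm normalization
    return i_rel * (i_op / norm) if norm > 0 else 0.0
```

Setting the opinion component to a constant recovers pure topic-relevance ranking, the special case noted above.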
Experimental Setup
Data set
TREC blog 06 & 07
100,649 blogs collected over 2.5 months
Strategy: retrieve the top 1,000 relevant documents, then re-rank the list with the proposed model
Models
General linear combination
Proposed generation model with smoothing
Proposed generation model with smoothing and normalization
Experimental Setup (cont.)
Sentimental Lexicons
Thesaurus name | Size | Description
1 HowNet | 4,621 | English translations of pos/neg Chinese words from HowNet
2 WordNet | 7,426 | Selected words from WordNet with seeds
3 Intersection | 1,413 | 1 ∩ 2
4 Union | 10,634 | 1 ∪ 2
5 General Inquirer | 3,642 | All words in the positive and negative categories
6 SentiWordNet | 3,133 | Words with a positive or negative score above 0.6
Experiment Results
Conclusion
Proposed a formal generation model for opinion retrieval
Topic relevance & sentiment scores are integrated with a quadratic combination
Opinion generation ranking functions are derived
Discussed the roles of the sentiment lexicon and the matching window
It is a general model for opinion retrieval
Uses domain-independent lexicons
No assumption is made about the nature of blog-structured text
My opinions
Good points
Proposed an opinion retrieval model and ranking function
Conducted various experiments
Lacking points
Only finds documents that contain some opinions
Does not identify what kinds of opinions the documents contain
Does not use the sentiment polarity of words