A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval
Min Zhang, Xinyao Ye
Tsinghua University
SIGIR 2008
2009. 01. 14.
Summarized & presented by Jung-Yeon Yang
IDS Lab.
Copyright 2008 by CEBT
Introduction
A growing interest in finding out people’s opinions from web data
Product survey
Advertisement analysis
Political opinion polls
TREC started a special track on blog data in 2006 – blog opinion retrieval
– It was the track with the most participants in 2007
This paper is focused on the problem of searching opinions over general topics
Related Work
The popular opinion identification approaches
Text classification
Lexicon-based sentiment analysis
Opinion retrieval
Opinion retrieval
To find the sentimentally relevant documents according to a user’s query
One of the key problems
How to combine opinion score with relevance score of each document for ranking
Related Work (cont.)
Topic-relevance search is carried out using relevance ranking (e.g., TF·IDF ranking)
Ad hoc solutions combine relevance ranking with opinion detection results
2 steps: rank by relevance, then re-rank by sentiment score
Most existing approaches use a linear combination: α·Score_rel + β·Score_opn
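The two-step linear-combination baseline can be sketched as follows; the function name, weights, and toy scores are illustrative placeholders, not values from the paper.

```python
# Sketch of the baseline 2-step strategy: retrieve by relevance, then
# re-rank by the linear combination alpha*rel + beta*opn. The weights and
# toy scores below are illustrative, not values from the paper.
def linear_rerank(docs, rel, opn, alpha=0.7, beta=0.3):
    return sorted(docs, key=lambda d: alpha * rel[d] + beta * opn[d],
                  reverse=True)

ranked = linear_rerank(["d1", "d2", "d3"],
                       rel={"d1": 0.9, "d2": 0.5, "d3": 0.7},
                       opn={"d1": 0.1, "d2": 0.9, "d3": 0.6})
```

Note that the weights α and β must be tuned per collection, which is one of the drawbacks the generation model aims to avoid.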
Backgrounds
Statistical Language Model (LM)
A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– Unigram : P(w1,w2,w3) = P(w1)*P(w2)*P(w3)
– Bigram : P(w1,w2,w3) = P(w1)*P(w2|w1)*P(w3|w2)
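The two factorizations can be illustrated with hand-picked toy probabilities (not estimated from any real corpus):

```python
# Toy illustration of unigram vs. bigram factorizations with hand-picked
# probabilities (not estimated from any real corpus).
p_uni = {"Today": 0.1, "is": 0.2, "Wednesday": 0.05}
p_bi = {("Today", "is"): 0.5, ("is", "Wednesday"): 0.3}

def unigram_prob(words):
    # P(w1..wn) = product of P(wi); word order is ignored
    p = 1.0
    for w in words:
        p *= p_uni[w]
    return p

def bigram_prob(words):
    # P(w1..wn) = P(w1) * product of P(wi | w_{i-1}); order matters
    p = p_uni[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= p_bi[(prev, cur)]
    return p
```

Under the unigram model the two example sentences above get the same probability; only a higher-order model can penalize the scrambled word order.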
Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
LM allows us to answer questions like:
Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
Backgrounds (cont.)
The notion of relevance (slide figure, reconstructed as an outline)
Relevance
– Similarity of representations (Rep(q), Rep(d)): different rep & similarity, e.g. vector space model, prob. distribution model, …
– Probability of relevance P(r=1|q,d), r ∈ {0,1}
  Regression model
  Classical prob. model
  Generative model
   Document generation
   Query generation: LM approach
– Probabilistic inference P(d→q) or P(q→d): different inference systems, e.g. inference network model, prob. concept space model
Backgrounds (cont.)
Retrieval as Language Model Estimation
Document ranking based on query likelihood
Retrieval problem → estimation of p(wi|d)
Smoothing is an important issue
– Problem: if tf(w,d) = 0, then p(wi|d) = 0
– Smoothing methods try to
Discount the probability of words seen in a document
Re-allocate the extra probability so that unseen words will have a non-zero probability
– Most use a reference model (collection language model) to discriminate unseen words
log p(q|d) = Σ_{i=1..n} log p(wi|d), where q = w1 w2 … wn
Document language model:
p(w|d) = p_seen(w|d), if w is seen in d (discounted LM estimate)
       = α_d · p(w|C), otherwise (collection language model)
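Query-likelihood scoring with smoothing can be sketched in a few lines. This uses Jelinek-Mercer interpolation as one concrete smoothing choice (equivalent in spirit to the discounting form above); the function name, λ value, and toy data are illustrative assumptions.

```python
import math
from collections import Counter

# Minimal query-likelihood scorer with Jelinek-Mercer smoothing:
#   p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)
# lam, the function name, and the arguments are illustrative choices.
def query_likelihood(query, doc, collection, lam=0.5):
    d_tf, d_len = Counter(doc), len(doc)
    c_tf, c_len = Counter(collection), len(collection)
    score = 0.0
    for w in query:
        p_ml = d_tf[w] / d_len      # maximum-likelihood estimate from d
        p_c = c_tf[w] / c_len       # collection (reference) language model
        score += math.log((1 - lam) * p_ml + lam * p_c)  # log p(q|d)
    return score
```

Smoothing guarantees a non-zero probability for query words unseen in d, as long as they occur somewhere in the collection.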
Generation Model for Opinion Retrieval
The Generation Model
To find documents that are both sentimental and relevant, with ranks
Topic Relevance Ranking
Opinion Generation Model and Ranking
Ranking function of generation model for opinion retrieval
The Proposed Generation Model
Document generation model
How well the document d “fits” the particular query q: p(d|q) ∝ p(q|d)·p(d)
In opinion retrieval, the quantity of interest becomes p(d|q,s)
Users’ information need is restricted to an opinionated subset of the relevant documents
This subset is characterized by sentiment expressions s towards topic q
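A hedged reconstruction of the factorization this implies (the full derivation is in the paper; notation follows the slides):

```latex
p(d \mid q, s) \;\propto\; p(q, s \mid d)\,p(d)
             \;=\; p(q \mid d)\; p(s \mid q, d)\; p(d)
```

The first factor corresponds to the topic-relevance score and the second to the opinion generation score.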
The Proposed Generation Model (cont.)
In this work, lexicon-based sentiment analysis is discussed
Assume
– The latent variable s is estimated with a pre-constructed bag-of-word sentiment thesaurus
– All sentiment words si are uniformly distributed
The final generation model
– The relevance and opinion components combine in a quadratic relationship
Topic Relevance Ranking I_rel(d,q)
The Binary Independence Retrieval (BIR) model is one of the best-known models in this branch
Heuristic ranking function BM25
– TREC tests have shown this to be the best of the known probabilistic weighting schemes
– BM25 combines IDF(w) with term-frequency saturation and document-length normalization
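One common BM25 formulation (several variants exist) can be sketched as follows; the function name and default parameter values are illustrative, not taken from the paper.

```python
import math

# One common BM25 formulation (several variants exist); k1 and b are the
# usual free parameters. Arguments: tf = term frequency in d, df = document
# frequency of the term, N = number of documents, dl/avgdl = document lengths.
def bm25_term(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The per-term scores are summed over the query terms to give the document's relevance score.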
Opinion Generation Model I_op(d,q,s)
I_op(d,q,s) focuses on how likely document d is to generate a sentiment expression s, given query q
Sparseness problem → smoothing
– Jelinek-Mercer smoothing is applied
Opinion Generation Model I_op(d,q,s) (cont.)
Use the co-occurrence of si and q inside d, within a window W, as the ranking measure of Pml(si|d,q)
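The windowed co-occurrence count can be sketched as below; the function and variable names are illustrative, not from the paper.

```python
# Sketch of the windowed co-occurrence count behind Pml(si|d,q): count
# occurrences of the sentiment word within W tokens of any query term in d.
# Function and variable names are illustrative, not from the paper.
def window_cooccur(doc_tokens, query_terms, sentiment_word, W=10):
    q_pos = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    hits = 0
    for i, t in enumerate(doc_tokens):
        if t == sentiment_word and any(abs(i - j) <= W for j in q_pos):
            hits += 1
    return hits
```

The intuition is that a sentiment word far from any query term likely expresses an opinion about something other than the query topic.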
Ranking function of generation model for opinion retrieval
The final ranking function
To reduce the impact of the imbalance between the number of sentiment words and the number of query terms → logarithm normalization
Special case: use only the topic relevance
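A hedged sketch of how the pieces fit together, assuming the relevance and opinion scores combine multiplicatively (the "quadratic relationship") with a log-normalized opinion side; the paper's exact formula may differ, and all names here are assumptions.

```python
import math

# Hedged sketch of the combined score: relevance and opinion combine
# multiplicatively (the "quadratic relationship"), with the opinion side
# log-normalized against the sentiment-lexicon size. The paper's exact
# formula may differ; i_rel, i_op, and lexicon_size are assumed inputs.
def opinion_score(i_rel, i_op, lexicon_size):
    norm = math.log(1 + lexicon_size)   # logarithm normalization
    return i_rel * (i_op / norm) if norm > 0 else 0.0
```

Setting the opinion component to a constant recovers pure topic-relevance ranking, the special case noted above.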
Experimental Setup
Data set
TREC blog 06 & 07
100,649 blogs collected over 2.5 months
Strategy: retrieve the top 1,000 relevant documents, then re-rank the list with the proposed model
Models
General linear combination
Proposed generation model with smoothing
Proposed generation model with smoothing and normalization
Experimental Setup (cont.)
Sentimental Lexicons
Thesaurus name | Size | Description
1 HowNet | 4,621 | English translations of pos/neg Chinese words from HowNet
2 WordNet | 7,426 | Selected words from WordNet with seeds
3 Intersection | 1,413 | 1 ∩ 2
4 Union | 10,634 | 1 ∪ 2
5 General Inquirer | 3,642 | All words in the positive and negative categories
6 SentiWordNet | 3,133 | Words with a positive or negative score above 0.6
Experiment Results
Conclusion
Proposed a formal generation model for opinion retrieval
Topic relevance & sentiment scores are integrated with a quadratic combination
Opinion generation ranking functions are derived
Discussed the roles of the sentiment lexicon and the matching window
It is a general model for opinion retrieval
Uses domain-independent lexicons
No assumption is made about the nature of blog-structured text
My opinions
Good points
Proposed an opinion retrieval model and ranking function
Conducted various experiments
Lacking points
Only finds documents that contain some opinions
Does not identify what kinds of opinions the documents contain
Does not use the sentiment polarity of words