Review Topic Discovery with Phrases Using the Pólya Urn Model
Geli Fei, Zhiyuan Chen, Bing Liu (University of Illinois at Chicago)
Presenter: Alan Akbik
IBM Research Almaden / Berlin Institute of Technology
Product Aspects
Large collection of product reviews
◦ Example domain: Smartphones
Task: Discover aspects that are being discussed in the reviews
◦ Battery - Battery life, AAA batteries ("The battery life of this smartphone is great." "It uses AAA batteries.")
◦ Screen - Screen size, touch screen
◦ Camera - Resolution, image quality
Topic Models
Widely used in review topic / aspect discovery
Most models regard each topic as a distribution over individual terms (unigrams)
Terms in each document are assigned to topics
◦ Documents assigned to topics via terms
The generation of topics is mostly governed by “higher order co-occurrence” (Heinrich 2009)
◦ i.e., how often words co-occur in different contexts
Topic Models
Major issue: individual words may not convey the same information as natural phrases
◦ e.g. "battery life" vs. "life"
This leads to three problems:
◦ Interpretability - Topics are hard for users to interpret unless they are domain experts
◦ Ambiguity - Hard to make direct use of the topical words
◦ False evidence - Causes extra or wrong co-occurrences in topic generation, leading to poorer topics
Possible Solutions (1)
Treat each whole phrase as one term
“The battery life of this smartphone is great”
<the> <battery_life> <of> <this> <smartphone> <is> <great>
Problems:
◦ Many phrases are very rare
◦ Important words are removed: "battery life" may not end up in the same topic as "battery", because we no longer observe their co-occurrence
Possible Solutions (2)
Keep individual words, add extra terms for phrases
"The battery life of this smartphone is great"
<the> <battery> <life> <battery_life> <of> <this> <smartphone> <is> <great>
Problems:
◦ False evidence still exists
◦ Many phrases are rare: "battery life" is much less frequent than "life", so it is unlikely to be ranked at the top of a topic
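The two tokenization alternatives above can be made concrete with a small sketch (not the paper's code; the `tokenize` helper and its mode names are hypothetical):

```python
def tokenize(sentence, phrases, mode):
    """Illustrates the tokenization choices for a sentence with a detected
    noun phrase. mode: 'words' (plain unigrams), 'phrase_only' (Solution 1),
    'words_plus_phrase' (Solution 2)."""
    tokens = sentence.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        matched = None
        for p in phrases:                    # greedy phrase match at position i
            pw = p.split()
            if tokens[i:i + len(pw)] == pw:
                matched = pw
                break
        if matched and mode != "words":
            if mode == "words_plus_phrase":
                out.extend(matched)          # keep the component words
            out.append("_".join(matched))    # add the phrase as one term
            i += len(matched)
        else:
            out.append(tokens[i])
            i += 1
    return out

sent = "the battery life of this smartphone is great"
print(tokenize(sent, ["battery life"], "phrase_only"))
# ['the', 'battery_life', 'of', 'this', 'smartphone', 'is', 'great']
```

With `words_plus_phrase`, both `battery`/`life` and `battery_life` appear, reproducing the false-evidence problem described above.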
Challenge
How to retain connections between phrases and words while removing wrong co-occurrences?
Related Work
Using n-grams in topic modeling (Mukherjee and Liu 2013; Mukherjee et al. 2013).
Identifying key phrases in the post-processing step based on the discovered topical unigrams (Blei and Lafferty 2009; Liu et al. 2010; Zhao et al. 2011).
Directly modeling word order in topic model (Wallach 2006; Wang et al. 2007).
◦ Breaking the "bag-of-words" assumption
◦ Although the "bag-of-words" assumption does not always hold, it offers a great computational advantage
◦ Our method still follows the "bag-of-words" assumption
Gibbs Sampling for LDA
One of the most commonly used inference techniques for topic models
Considers each term in the documents in turn
Samples a topic for the current term, conditioned on the topic assignments of all other terms
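A minimal collapsed Gibbs sampler for standard LDA, as a sketch of the procedure described above (not the paper's implementation; variable names are mine):

```python
import random
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word
    ids in [0, V). Returns topic-word and doc-topic count matrices."""
    ndk = np.zeros((len(docs), K))      # doc-topic counts
    nkw = np.zeros((K, V))              # topic-word counts
    nk = np.zeros(K)                    # total count per topic
    z = []                              # topic assignment per token
    for d, doc in enumerate(docs):      # random initialization
        zd = []
        for w in doc:
            k = random.randrange(K)
            zd.append(k)
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | all other assignments), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = np.random.choice(K, p=p / p.sum())
                z[d][i] = k             # re-add with the sampled topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw, ndk
```

The GPU-based model changes only the counting step: instead of incrementing a single count by 1, related terms are also promoted.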
Simple Pólya Urn Model (SPU)
Designed in the context of colored balls and urns
In the context of topic models:
◦ A ball with a certain color: a term
◦ The urn: contains a mixture of balls with various colors (terms)
Topic-word (topic-term) distribution is reflected by the proportion of balls with a certain color in the urn
Simple Pólya Urn Model (SPU)
Left: initial state
Middle: draw a ball of a certain color
Right: put two balls of the same color back
Self-reinforcing property known as “the rich get richer”
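The SPU scheme can be simulated in a few lines (a sketch; the urn contents and seed are illustrative):

```python
import random

def simple_polya_urn(initial_counts, draws, seed=0):
    """Simple Pólya urn: repeatedly draw a ball in proportion to current
    counts, then return it together with one extra ball of the same color.
    In topic models, colors are terms and counts are topic-term mass."""
    rng = random.Random(seed)
    urn = dict(initial_counts)          # color -> number of balls
    for _ in range(draws):
        colors = list(urn)
        c = rng.choices(colors, weights=[urn[x] for x in colors])[0]
        urn[c] += 1                     # "the rich get richer"
    return urn

urn = simple_polya_urn({"red": 1, "blue": 1}, draws=100)
```

Each draw adds one ball, so after 100 draws the urn holds 102 balls, typically skewed toward whichever color got ahead early.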
Generalized Pólya Urn Model (GPU)
GPU vs. SPU: apart from two balls with the same color being put back, a certain number of balls with some other colors are also put in the urn.
We call this the promotion of these colored balls
Using the idea in the sampling process:
◦ SPU: seeing “staff” under a topic only increases the chance of seeing it again under the same topic
◦ GPU: also increases the chance of seeing “hotel staff” under the topic
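The GPU count update can be sketched as follows (not the paper's code; the `promotion` map and `gpu_increment` helper are hypothetical names):

```python
def gpu_increment(nkw, k, term, promotion, amount=1.0):
    """Generalized Pólya urn update: assigning topic k to `term` adds the
    usual count, and also promotes related terms. `promotion` maps a term
    to a list of (other_term, fraction_of_amount) pairs."""
    nkw[k][term] = nkw[k].get(term, 0.0) + amount          # the drawn ball
    for other, frac in promotion.get(term, []):
        nkw[k][other] = nkw[k].get(other, 0.0) + frac * amount

# Seeing "staff" under topic 0 also promotes "hotel_staff" a little.
nkw = {0: {}}
gpu_increment(nkw, 0, "staff", {"staff": [("hotel_staff", 0.1)]})
```

With an empty promotion map this reduces exactly to the SPU update.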
Generalized Pólya Urn Model (GPU)
In our application:
◦ We use each whole phrase as a term to remove wrong co-occurrences
◦ And use GPU to regain the connection between phrases and words
Two directions of promotion:
◦ Word to phrase: when a topic is assigned to an individual word, phrases containing the word are promoted
◦ Phrase to word: when a topic is assigned to a phrase, each component word is promoted
Datasets and Preprocessing
Data sets:
◦ 30 categories of electronics reviews from Amazon (1,000 reviews in each category)
◦ Hotel reviews from TripAdvisor (101,234 reviews)
◦ Restaurant reviews from Yelp (25,459 reviews)
Preprocessing:
◦ Review sentences as documents
Standard topic models cannot discover product aspects well when directly applied to reviews (Titov and McDonald, 2008)
◦ Rule-based method for noun phrase detection (chosen for efficiency)
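The slides only say the noun-phrase detector is rule-based; one plausible sketch (my assumption, not the paper's actual rules) extracts adjective/noun runs ending in a noun from POS-tagged text:

```python
def noun_phrases(tagged):
    """Extract multi-word noun phrases from (word, POS) pairs: maximal
    runs of ADJ/NOUN tags, trimmed to end in a noun, of length >= 2.
    A simplified stand-in for the paper's rule-based detector."""
    phrases, run = [], []
    for word, pos in tagged + [(None, None)]:   # sentinel flushes last run
        if pos in ("ADJ", "NOUN"):
            run.append((word, pos))
        else:
            while run and run[-1][1] != "NOUN":  # phrase must end in a noun
                run.pop()
            if len(run) >= 2:
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases

tagged = [("the", "DET"), ("battery", "NOUN"), ("life", "NOUN"),
          ("of", "ADP"), ("this", "DET"), ("smartphone", "NOUN"),
          ("is", "VERB"), ("great", "ADJ")]
print(noun_phrases(tagged))  # ['battery life']
```

In practice the POS tags would come from a standard tagger; a fixed rule pass like this is much cheaper than full parsing, matching the efficiency argument above.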
Experiments
Four sets of experiments on 32 domains
◦ Baseline #1, LDA(w): without considering phrases
◦ Baseline #2, LDA(p): considers phrases, uses each whole phrase as a term
◦ Baseline #3, LDA(w_p): considers phrases, keeps individual component words, and adds phrases as extra terms
◦ LDA(p_GPU): Our proposed method
Parameter Setting
Use the same set of parameters for all experiments
◦ Set Dirichlet priors as in (Griffiths and Steyvers, 2004)
Set document-topic prior 𝛼 = 50/𝐾, where 𝐾 is the number of topics
Set topic-term prior 𝛽 = 0.1
◦ Set number of topics 𝐾 = 15
◦ Posterior inference was drawn after 2,000 Gibbs sampling iterations, with 400 burn-in iterations
Parameters for GPU Model
Not all words in a phrase are equally important
◦ e.g. "staff" is more important than "hotel" in "hotel staff"
Determine head nouns
◦ Following (Wang et al., 2007), we take the last word of a noun phrase to be the head noun
GPU promotion
◦ Word to phrase: promote a phrase by virtualcount when a topic is assigned to its head noun
◦ Phrase to word: promote 0.5 * virtualcount to the head noun and 0.25 * virtualcount to all other words when a topic is assigned to a phrase
◦ Set virtualcount = 0.1 empirically, based on how much to promote phrases
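The promotion scheme with these settings can be written out directly (a sketch; `promotions` and the `w1_w2` phrase encoding are my notation):

```python
import math

VIRTUAL_COUNT = 0.1  # the promotion amount from the slide

def promotions(term, phrase_vocab):
    """Return (promoted_term, amount) pairs for a topic assignment.
    Phrases are encoded as 'w1_w2_...'; the head noun is the last word."""
    if "_" in term:                       # phrase -> component words
        words = term.split("_")
        promo = [(words[-1], 0.5 * VIRTUAL_COUNT)]          # head noun
        promo += [(w, 0.25 * VIRTUAL_COUNT) for w in words[:-1]]
        return promo
    # word -> every phrase whose head noun is this word
    return [(p, VIRTUAL_COUNT) for p in phrase_vocab
            if p.split("_")[-1] == term]

print(promotions("hotel_staff", []))                    # phrase to words
print(promotions("staff", ["hotel_staff", "hotel_room"]))  # word to phrases
```

So assigning a topic to "hotel_staff" promotes "staff" by 0.05 and "hotel" by 0.025, while assigning a topic to "staff" promotes "hotel_staff" by 0.1 but leaves "hotel_room" untouched.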
Statistical Evaluation
Two commonly used evaluation statistics:
◦ Perplexity: measures the likelihood of unseen documents
◦ KL-divergence: measures the distinctiveness of topics
◦ Neither of them correlates well with human judgments
We use topic coherence (Mimno et al. 2011)
◦ It measures the degree of co-occurrence of topical words under a topic
◦ Has been shown to correlate quite well with human judgment
◦ Produces a negative value; the higher (closer to zero), the better
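The coherence measure of Mimno et al. (2011) can be computed as below (a sketch of the standard formula; it assumes every top word occurs in at least one document):

```python
import math

def topic_coherence(top_words, docs):
    """Topic coherence (Mimno et al. 2011): for each pair of top words,
    add log((co-document frequency + 1) / document frequency of the
    higher-ranked word). docs: list of sets of terms. Result is <= 0;
    higher means the top words co-occur more often."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            d_l = sum(1 for d in docs if w_l in d)           # doc freq
            d_ml = sum(1 for d in docs if w_l in d and w_m in d)  # co-doc freq
            score += math.log((d_ml + 1) / d_l)
    return score
```

On the evaluation side, "top 15 topical terms" corresponds to `top_words` containing a topic's 15 highest-probability terms.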
Statistical Evaluation
Topic Coherence using top 15 topical terms
Statistical Evaluation
Topic Coherence using top 30 topical terms
Human Evaluation
Done by two annotators in two sequential stages
◦ Topic labeling (Kappa score: 0.838)
◦ Topical-term labeling by computing precision@n (Kappa score: 0.846)
◦ We compute average p@15 and p@30 for each model on each domain
Human Evaluation
Human evaluation on five domains
◦ Hotel, Restaurant, Watch, Tablet, MP3Player
Example Topics
Example topics by LDA(w) and LDA(p_GPU)
Future Work
Design a topic quality metric for topics with phrases
Systematically set the amount of promotion based on the designed metric
Thank You!