Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
Topic Models
• Discover hidden themes that pervade the collection
• Tag the documents on the basis of these themes
• Organize, summarize and search the documents on the basis of these themes
Takeaways from this tutorial
• What are probabilistic topic models?
• What kind of things can they do?
• How do we train/infer a topic model?
• How do we evaluate a topic model?
Tools
• Topic models are a special application of probability theory. In particular, they touch
– Probabilistic graphical models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
The Key Steps in every Topic Model
1. Make assumptions
2. Collect data
3. Infer posterior
4. Evaluate
5. Predict
Outline
• Latent Dirichlet Allocation
– Application of the key steps
– Graphical model encoding the assumptions
– Inference algorithms: Gibbs sampling
• Topic models for more complex tasks
– Rating prediction
• A completely novel topic model incorporating sentiments (that we’ll develop!)
Latent Dirichlet Allocation
• Already covered in the course
• Application of the key steps
– Make assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from a topic
• (A small generative sketch of these assumptions follows)
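To make the three assumptions concrete, here is a minimal generative sketch in Python/NumPy. The corpus sizes and the hyperparameters alpha and eta are illustrative assumptions, not values from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not from the tutorial)
K, V, D, N = 5, 1000, 100, 50   # topics, vocabulary size, documents, words per document
alpha, eta = 0.1, 0.01          # Dirichlet hyperparameters

# Each topic is a distribution over words
beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

docs = []
for d in range(D):
    # Each document is a mixture of topics
    theta_d = rng.dirichlet(np.full(K, alpha))       # shape (K,)
    words = []
    for n in range(N):
        # Each word is drawn from a topic
        z = rng.choice(K, p=theta_d)                 # topic assignment
        w = rng.choice(V, p=beta[z])                 # word drawn from that topic
        words.append(w)
    docs.append(words)
```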
Latent Dirichlet Allocation
• Graphical Model
• Encodes the assumptions
• Allows us to break down the joint probability into a product of conditionals, as written out below
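Written out, the factorization the plate diagram encodes is the standard LDA joint distribution. Here $\eta$ denotes the Dirichlet prior on the topic-word distributions $\beta_k$ (the Gibbs-sampling slides later reuse the symbol $\beta$ for that prior):

$$p(\beta_{1:K}, \theta_{1:D}, z, w \mid \alpha, \eta) \;=\; \prod_{k=1}^{K} p(\beta_k \mid \eta)\; \prod_{d=1}^{D} \Big[\, p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big]$$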
Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., multinomial and Dirichlet are conjugate distributions
Aside: Conjugate Distributions
• Dirichlet distribution: $\theta \sim \mathrm{Dir}(\alpha)$, where $\theta$ is the probability of seeing the different sides of a die
• Multinomial distribution: $p(W \mid \theta)$ is multinomial; the number of occurrences of the different sides $W$ of the die follows a multinomial distribution
• Posterior distribution: $p(\theta \mid W) = \mathrm{Dir}(\alpha_1 + x_1, \ldots, \alpha_K + x_K)$, where $x_i$ is the number of times side $i$ was observed
• (A quick numerical check of this conjugacy follows)
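A minimal numerical check of this conjugacy in NumPy. The die probabilities and the number of rolls below are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.ones(6)                    # Dirichlet prior over the 6 sides of a die
theta = rng.dirichlet(alpha)          # "true" side probabilities

# Observe W: counts of each side after 100 rolls (multinomial likelihood)
x = rng.multinomial(100, theta)

# Conjugacy: the posterior is again Dirichlet, with the counts added to the prior
posterior_alpha = alpha + x
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(theta.round(3), posterior_mean.round(3))   # posterior mean tracks theta
```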
Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., multinomial and Dirichlet are conjugate distributions
– Collect data
• A corpus on which you want to detect themes
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior
• Probabilistic graphical models provide algorithms
– Mean field variational methods
– Expectation propagation (similar to EM)
– Gibbs sampling (most popular)
– Variational inference
Aside: Gibbs Sampling
– Used when samples need to be drawn from a joint distribution that is difficult to sample from directly
– Sample X = (x1, …, xn) from the joint pdf p(x1, …, xn)
– The conditional distributions are relatively straightforward to sample from
– Procedure (a toy example follows):
• Begin with some initial X(i)
• For each j, sample $x_j^{(i+1)} \sim p\!\left(x_j \mid x_1^{(i+1)}, \ldots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \ldots, x_n^{(i)}\right)$
• Repeat
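A minimal sketch of this procedure for a toy target: a bivariate standard normal with correlation rho, where both conditionals are known in closed form. The target distribution and the value of rho are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n_iter = 0.8, 5000

x, y = 0.0, 0.0          # initial state X^(0)
samples = []
for _ in range(n_iter):
    # Sample each coordinate from its conditional given the latest values of the others
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # p(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # p(y | x)
    samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples[1000:].T))   # empirical correlation approaches rho after burn-in
```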
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• Here, X is all the parameters to be inferred
– Per-word topic assignments $z_{d,n}$
– Per-document topic proportions $\theta_d$
– Per-corpus topic-word distributions $\beta_k$
• Extremely high dimensional!
• Solution:
– Integrate out $\theta$ and $\beta$ (collapsed Gibbs sampling)
– Conjugate distributions make the integration straightforward!
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation:

$$P\!\left(z_{d,n}=k \mid Z_{-(d,n)}, W; \alpha, \beta\right) \;\propto\; \left(n_{d,k}^{-(d,n)} + \alpha_k\right)\frac{n_{k,v}^{-(d,n)} + \beta_v}{\sum_{r=1}^{V}\left(n_{k,r}^{-(d,n)} + \beta_r\right)}$$

• $n_{d,k}^{-(d,n)}$: the number of words in document d that belong to topic k, excluding the n-th word
• $n_{k,v}^{-(d,n)}$: the number of times vocabulary word v is assigned to topic k across the corpus, excluding the n-th word of document d
• $v$: index of the n-th word of the d-th document in the vocabulary
• Linear time in the number of tokens!
• (A minimal implementation of this update follows)
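A minimal collapsed Gibbs sweep implementing this update, sketched under simplifying assumptions: symmetric priors alpha and beta, and `docs` given as a list of lists of vocabulary indices. None of the parameter values come from the tutorial:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric Dirichlet priors."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))                # times vocabulary word v assigned to topic k
    n_k = np.zeros(K)                      # total words assigned to topic k
    z = [[0] * len(doc) for doc in docs]   # per-word topic assignments

    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = rng.integers(K)
            z[d][n] = k
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts (the "-(d,n)" terms)
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # P(z_dn = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return z, n_dk, n_kv
```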
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation: the same sampling update as above
• Linear time in the number of tokens!
• Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large
Topic Models: Evaluation
• The underlying topics are subjective
– Makes the evaluation difficult
– Workaround: look at an application and evaluate
• Document classification
• Information retrieval
• Rating prediction
Topic Models: Evaluation
• Use the trained model to predict the probabilities of unseen documents
– Better models give the held-out documents higher probability
• Even better:
– Predict the probability of the second half of each document using the first halves as the corpus
– Does not require documents to be held out
• (A minimal held-out evaluation example follows)
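For instance, a minimal held-out evaluation with gensim. The variables `train_texts` and `test_texts` are assumed to be lists of tokenized documents, and the model parameters are illustrative:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(doc) for doc in train_texts]
test_corpus = [dictionary.doc2bow(doc) for doc in test_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=20, passes=10)

# Per-word log-likelihood bound on held-out documents; less negative is better
print(lda.log_perplexity(test_corpus))
```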
Beyond LDA: Rating Prediction
• Predict ratings associated with text
• Additional assumption:
– The rating is conditional on the topic assignments of the different words
• Graphical model
Beyond LDA: Rating Prediction
• Topics
– least, problem, unfortunately, supposed, worse, flat, dull
– bad, guys, watchable, not, one, movie
– both, motion, simple, perfect, fascinating, power
– cinematography, screenplay, performances, pictures, effective, sound
• Notice how the assumption affects the extracted topics
– Because the overall rating depends on the number of words in the different topics, topics become collections of words that appear in similarly rated documents
– Topics express sentiment but lose their original meaning!
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling
• Generative model:
1. Choose aspects and the words for each aspect, $W_{dij}$
2. Calculate each aspect rating from the aspect words: $s_{di} = \sum_{j=1}^{n} \beta_{ij} W_{dij}$
3. The overall rating is a weighted sum of the aspect ratings: $r_d \sim N\!\left(\sum_{i=1}^{k} \alpha_{di} \sum_{j=1}^{n} \beta_{ij} W_{dij},\ \delta^2\right)$
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling
• Inference with EM:
– E-step: infer the aspect ratings $s_d$ and aspect weights $\alpha_d$
– M-step: update $(\mu, \Sigma, \beta, \delta)$
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Results
• Detects sentiments without supervision
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Results
• Requires keyword supervision – Any way to remove? (Think LDA!)
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction without aspect keyword supervision
– The aspect modelling module from LDA is included
Beyond LDA: Topic Phrase Mining
• Motivation:
– "machine learning" is a phrase and should be assigned to one topic
• Assigning "machine" to “Industry” and "learning" to “Education” is incorrect
• Approach (a counting sketch follows this list):
– Extract high-frequency phrases
• If a phrase is infrequent, so is any super-phrase
• If a document does not contain a frequent phrase of length n, it also does not contain any frequent phrase of length > n
• Use hierarchical clustering to find frequent phrases
– Apply LDA on the phrase tokens
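The anti-monotonicity property above lets us prune the search: a phrase of length n+1 can only be frequent if its length-n prefix is. A minimal sketch of contiguous frequent-phrase counting using this pruning (the corpus format, minimum-support threshold, and maximum length are illustrative assumptions; the full approach in the tutorial additionally uses hierarchical clustering/merging):

```python
from collections import Counter

def frequent_phrases(docs, min_support=5, max_len=4):
    """docs: list of token lists. Returns counts of frequent contiguous phrases."""
    frequent = {}
    # Length-1 phrases first
    counts = Counter((w,) for doc in docs for w in doc)
    current = {p: c for p, c in counts.items() if c >= min_support}
    frequent.update(current)

    n = 1
    while current and n < max_len:
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n):
                phrase = tuple(doc[i:i + n + 1])
                # Prune: a phrase can only be frequent if its length-n prefix is frequent
                if phrase[:-1] in current:
                    counts[phrase] += 1
        current = {p: c for p, c in counts.items() if c >= min_support}
        frequent.update(current)
        n += 1
    return frequent
```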
Sentiment Analysis
• Let’s build our own simple model using the key steps!
• Use case: restaurant reviews (e.g., Yelp)
Sentiment Analysis
• Make assumptions
– Each (topic, sentiment) pair has a vocabulary
• ‘quick delivery’ has a higher probability for (service, +) than for (service, −) or (food quality, +)
– Each (topic, rating) pair has a sentiment distribution
• Positive sentiments for food quality are more likely to appear in highly rated reviews
• A 4-star rated restaurant is likely to have good food quality even if it does not provide wireless internet
– Each review has
• An overall rating
• A topic distribution: different users might talk about different aspects in their reviews
Sentiment Analysis
• Graphical Model
• Generative process (sketched in code below):
1. Choose a word distribution for every (topic, sentiment) pair
2. Choose a sentiment distribution for every (topic, rating) pair
3. For each review:
• Choose a rating
• Choose a topic distribution
• For each word in the review:
• Choose a topic
• Choose a sentiment
• Choose a word
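A minimal generative sketch of this process in NumPy. All sizes, hyperparameters, and the uniform rating prior below are illustrative assumptions; the slides do not specify the model's priors in detail:

```python
import numpy as np

rng = np.random.default_rng(3)
K, S, R, V = 10, 2, 5, 2000      # topics, sentiments, rating levels, vocabulary size
alpha, gamma, eta = 0.1, 1.0, 0.01

# 1. Word distribution for every (topic, sentiment) pair
phi = rng.dirichlet(np.full(V, eta), size=(K, S))        # shape (K, S, V)
# 2. Sentiment distribution for every (topic, rating) pair
pi = rng.dirichlet(np.full(S, gamma), size=(K, R))       # shape (K, R, S)

def generate_review(n_words=80):
    # 3. For one review
    r = rng.integers(R)                                   # choose a rating (uniform here)
    theta = rng.dirichlet(np.full(K, alpha))              # choose a topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                        # choose a topic
        s = rng.choice(S, p=pi[z, r])                     # choose a sentiment
        words.append(rng.choice(V, p=phi[z, s]))          # choose a word
    return r, words
```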
Sentiment Analysis
• Inference
– Parameters to be inferred:
1. Per-document topic distributions
2. Rating distribution
3. Sentiment distributions
4. Word distributions
– Use collapsed Gibbs sampling! Integrate out $\phi$ and $\pi$
Sentiment Analysis
• Evaluation – Yelp
– Sandwich: sandwich, slaw, primanti, coleslaw, cole, market, pastrami, reuben, bro, mayo, famous, cheesesteak, rye, zucchini, swiss, sammy, peppi, burgh, messi
– Vietnamese: pho, noodl, bowl, soup, broth, sprout, vermicelli, peanut, lemongrass, leaf
– Payment options: server, check, custom, card, return, state, credit, coupon, accept, tip, treat, gift, refill
– Location: locat, park, street, drive, hill, window, south, car, downtown, number, corner, distance
– Ambience: crowd, fun, group, rock, play, loud, music, young, sing, club, ticket, meet, entertain, dance, band, song
Sentiment Analysis
• Evaluation – Yelp – Rating prediction
Sentiment Analysis
• Evaluation – Yelp
– Opinion summarization
• For all reviews of one restaurant:
– 15% of words assigned to the topic “Vegetarian”
– 5% to “Breakfast” (Eggs), with sentiment 0.78
– 3% to “Staff Attitude”, with sentiment 0.82
Topic Modelling: Future Work
• Missing links
– Model selection: which model to pick for which application
– Incorporating linguistic structure/NLP: how can our knowledge of language help?
– Bag of words:
• Most models are based on the unigram bag-of-words model
• Context is lost – words like good or nice are often associated with particular words in context, e.g. ‘good standard of living’, ‘nice view from the hotel’
Topic Modelling
Questions?

Topic Modelling
Thank You!