Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
Topic Models
• Discover hidden themes that pervade the collection
• Tag the documents on the basis of these themes
• Organize, summarize and search the documents on the basis of these themes
Takeaways from this tutorial
• What are probabilistic topic models?
• What kind of things can they do?
• How do we train/infer a topic model?
• How do we evaluate a topic model?
Tools
• Topic models are a special application of probability theory. In particular, they touch
– Probabilistic graphical models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
The Key Steps in every Topic Model
1. Make assumptions
2. Collect data
3. Infer posterior
4. Evaluate
5. Predict
Outline
• Latent Dirichlet Allocation
– Application of the key steps
– Graphical model encoding the assumptions
– Inference algorithms: Gibbs sampling
• Topic models for more complex tasks
– Rating prediction
• A completely novel topic model incorporating sentiments (that we’ll develop!)
Latent Dirichlet Allocation
• Already covered in the course
• Application of the key steps
– Make assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from a topic
• (A small generative sketch of these assumptions follows)
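To make the three assumptions concrete, here is a minimal generative sketch in Python/NumPy. The corpus sizes and the hyperparameters alpha and eta are illustrative assumptions, not values from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not from the tutorial)
K, V, D, N = 5, 1000, 100, 50   # topics, vocabulary size, documents, words per document
alpha, eta = 0.1, 0.01          # Dirichlet hyperparameters

# Each topic is a distribution over words
beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

docs = []
for d in range(D):
    # Each document is a mixture of topics
    theta_d = rng.dirichlet(np.full(K, alpha))       # shape (K,)
    words = []
    for n in range(N):
        # Each word is drawn from a topic
        z = rng.choice(K, p=theta_d)                 # topic assignment
        w = rng.choice(V, p=beta[z])                 # word drawn from that topic
        words.append(w)
    docs.append(words)
```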
Latent Dirichlet Allocation
• Graphical Model
• Encodes the assumptions
• Allows us to break down the joint probability into a product of conditionals, as written out below
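Written out, the factorization the plate diagram encodes is the standard LDA joint distribution. Here $\eta$ denotes the Dirichlet prior on the topic-word distributions $\beta_k$ (the Gibbs-sampling slides later reuse the symbol $\beta$ for that prior):

$$p(\beta_{1:K}, \theta_{1:D}, z, w \mid \alpha, \eta) \;=\; \prod_{k=1}^{K} p(\beta_k \mid \eta)\; \prod_{d=1}^{D} \Big[\, p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big]$$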
Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., multinomial and Dirichlet are conjugate distributions
Aside: Conjugate Distributions
• Dirichlet distribution: $\theta \sim \mathrm{Dir}(\alpha)$, where $\theta$ is the probability of seeing the different sides of a die
• Multinomial distribution: $p(W \mid \theta)$ is multinomial; the number of occurrences of the different sides $W$ of the die follows a multinomial distribution
• Posterior distribution: $p(\theta \mid W) = \mathrm{Dir}(\alpha_1 + x_1, \ldots, \alpha_K + x_K)$, where $x_i$ is the number of times side $i$ was observed
• (A quick numerical check of this conjugacy follows)
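A minimal numerical check of this conjugacy in NumPy. The die probabilities and the number of rolls below are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.ones(6)                    # Dirichlet prior over the 6 sides of a die
theta = rng.dirichlet(alpha)          # "true" side probabilities

# Observe W: counts of each side after 100 rolls (multinomial likelihood)
x = rng.multinomial(100, theta)

# Conjugacy: the posterior is again Dirichlet, with the counts added to the prior
posterior_alpha = alpha + x
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(theta.round(3), posterior_mean.round(3))   # posterior mean tracks theta
```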
Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., multinomial and Dirichlet are conjugate distributions
– Collect data
• A corpus on which you want to detect themes
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior
• Probabilistic graphical models provide algorithms
– Mean field variational methods
– Expectation propagation (similar to EM)
– Gibbs sampling (most popular)
– Variational inference
Aside: Gibbs Sampling
– Used when samples need to be drawn from a joint distribution that is difficult to sample from directly
– Sample X = (x1, …, xn) from the joint pdf p(x1, …, xn)
– The conditional distributions are relatively straightforward to sample from
– Procedure (a toy example follows):
• Begin with some initial X(i)
• For each j, sample $x_j^{(i+1)} \sim p\!\left(x_j \mid x_1^{(i+1)}, \ldots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \ldots, x_n^{(i)}\right)$
• Repeat
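A minimal sketch of this procedure for a toy target: a bivariate standard normal with correlation rho, where both conditionals are known in closed form. The target distribution and the value of rho are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n_iter = 0.8, 5000

x, y = 0.0, 0.0          # initial state X^(0)
samples = []
for _ in range(n_iter):
    # Sample each coordinate from its conditional given the latest values of the others
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # p(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # p(y | x)
    samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples[1000:].T))   # empirical correlation approaches rho after burn-in
```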
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• Here, X is all the parameters to be inferred
– Per-word topic assignments $z_{d,n}$
– Per-document topic proportions $\theta_d$
– Per-corpus topic-word distributions $\beta_k$
• Extremely high dimensional!
• Solution:
– Integrate out $\theta$ and $\beta$ (collapsed Gibbs sampling)
– Conjugate distributions make the integration straightforward!
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation:

$$P\!\left(z_{d,n}=k \mid Z_{-(d,n)}, W; \alpha, \beta\right) \;\propto\; \left(n_{d,k}^{-(d,n)} + \alpha_k\right)\frac{n_{k,v}^{-(d,n)} + \beta_v}{\sum_{r=1}^{V}\left(n_{k,r}^{-(d,n)} + \beta_r\right)}$$

• $n_{d,k}^{-(d,n)}$: the number of words in document d that belong to topic k, excluding the n-th word
• $n_{k,v}^{-(d,n)}$: the number of times vocabulary word v is assigned to topic k across the corpus, excluding the n-th word of document d
• $v$: index of the n-th word of the d-th document in the vocabulary
• Linear time in the number of tokens!
• (A minimal implementation of this update follows)
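A minimal collapsed Gibbs sweep implementing this update, sketched under simplifying assumptions: symmetric priors alpha and beta, and `docs` given as a list of lists of vocabulary indices. None of the parameter values come from the tutorial:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric Dirichlet priors."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))                # times vocabulary word v assigned to topic k
    n_k = np.zeros(K)                      # total words assigned to topic k
    z = [[0] * len(doc) for doc in docs]   # per-word topic assignments

    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = rng.integers(K)
            z[d][n] = k
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts (the "-(d,n)" terms)
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # P(z_dn = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return z, n_dk, n_kv
```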
Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation: the same sampling update as above
• Linear time in the number of tokens!
• Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large
Topic Models: Evaluation
• The underlying topics are subjective
– Makes the evaluation difficult
– Workaround: look at an application and evaluate
• Document classification
• Information retrieval
• Rating prediction
Topic Models: Evaluation
• Use the trained model to predict the probabilities of unseen documents
– Better models give the held-out documents higher probability
• Even better:
– Predict the probability of the second half of each document using the first halves as the corpus
– Does not require documents to be held out
• (A minimal held-out evaluation example follows)
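For instance, a minimal held-out evaluation with gensim. The variables `train_texts` and `test_texts` are assumed to be lists of tokenized documents, and the model parameters are illustrative:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(doc) for doc in train_texts]
test_corpus = [dictionary.doc2bow(doc) for doc in test_texts]

lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=20, passes=10)

# Per-word log-likelihood bound on held-out documents; less negative is better
print(lda.log_perplexity(test_corpus))
```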
Beyond LDA: Rating Prediction
• Predict ratings associated with text
• Additional assumption:
– The rating is conditional on the topic assignments of the different words
• Graphical model
Beyond LDA: Rating Prediction
• Topics
– least, problem, unfortunately, supposed, worse, flat, dull
– bad, guys, watchable, not, one, movie
– both, motion, simple, perfect, fascinating, power
– cinematography, screenplay, performances, pictures, effective, sound
• Notice how the assumption affects the extracted topics
– Because the overall rating depends on the number of words in the different topics, topics become collections of words that appear in similarly rated documents
– Topics express sentiment but lose their original meaning!
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling
• Generative model:
1. Choose aspects and the words for each aspect, $W_{dij}$
2. Calculate each aspect rating from the aspect words: $s_{di} = \sum_{j=1}^{n} \beta_{ij} W_{dij}$
3. The overall rating is a weighted sum of the aspect ratings: $r_d \sim N\!\left(\sum_{i=1}^{k} \alpha_{di} \sum_{j=1}^{n} \beta_{ij} W_{dij},\ \delta^2\right)$
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling
• Inference with EM:
– E-step: infer the aspect ratings $s_d$ and aspect weights $\alpha_d$
– M-step: update $(\mu, \Sigma, \beta, \delta)$
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Results
• Detects sentiments without supervision
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction – Results
• Requires keyword supervision – Any way to remove? (Think LDA!)
Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction without aspect keyword supervision
– The aspect modelling module from LDA is included
Beyond LDA: Topic Phrase Mining
• Motivation:
– "machine learning" is a phrase and should be assigned to one topic
• Assigning "machine" to “Industry” and "learning" to “Education” is incorrect
• Approach (a counting sketch follows this list):
– Extract high-frequency phrases
• If a phrase is infrequent, so is any super-phrase
• If a document does not contain a frequent phrase of length n, it also does not contain any frequent phrase of length > n
• Use hierarchical clustering to find frequent phrases
– Apply LDA on the phrase tokens
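The anti-monotonicity property above lets us prune the search: a phrase of length n+1 can only be frequent if its length-n prefix is. A minimal sketch of contiguous frequent-phrase counting using this pruning (the corpus format, minimum-support threshold, and maximum length are illustrative assumptions; the full approach in the tutorial additionally uses hierarchical clustering/merging):

```python
from collections import Counter

def frequent_phrases(docs, min_support=5, max_len=4):
    """docs: list of token lists. Returns counts of frequent contiguous phrases."""
    frequent = {}
    # Length-1 phrases first
    counts = Counter((w,) for doc in docs for w in doc)
    current = {p: c for p, c in counts.items() if c >= min_support}
    frequent.update(current)

    n = 1
    while current and n < max_len:
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n):
                phrase = tuple(doc[i:i + n + 1])
                # Prune: a phrase can only be frequent if its length-n prefix is frequent
                if phrase[:-1] in current:
                    counts[phrase] += 1
        current = {p: c for p, c in counts.items() if c >= min_support}
        frequent.update(current)
        n += 1
    return frequent
```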
Sentiment Analysis
• Let’s build our own simple model using the key steps!
• Use case: restaurant reviews (e.g., Yelp)
Sentiment Analysis
• Make assumptions
– Each (topic, sentiment) pair has a vocabulary
• ‘quick delivery’ has a higher probability for (service, +) than for (service, −) or (food quality, +)
– Each (topic, rating) pair has a sentiment distribution
• Positive sentiments for food quality are more likely to appear in highly rated reviews
• A 4-star rated restaurant is likely to have good food quality even if it does not provide wireless internet
– Each review has
• An overall rating
• A topic distribution: different users might talk about different aspects in their reviews
Sentiment Analysis
• Graphical Model
• Generative process (sketched in code below):
1. Choose a word distribution for every (topic, sentiment) pair
2. Choose a sentiment distribution for every (topic, rating) pair
3. For each review:
• Choose a rating
• Choose a topic distribution
• For each word in the review:
• Choose a topic
• Choose a sentiment
• Choose a word
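A minimal generative sketch of this process in NumPy. All sizes, hyperparameters, and the uniform rating prior below are illustrative assumptions; the slides do not specify the model's priors in detail:

```python
import numpy as np

rng = np.random.default_rng(3)
K, S, R, V = 10, 2, 5, 2000      # topics, sentiments, rating levels, vocabulary size
alpha, gamma, eta = 0.1, 1.0, 0.01

# 1. Word distribution for every (topic, sentiment) pair
phi = rng.dirichlet(np.full(V, eta), size=(K, S))        # shape (K, S, V)
# 2. Sentiment distribution for every (topic, rating) pair
pi = rng.dirichlet(np.full(S, gamma), size=(K, R))       # shape (K, R, S)

def generate_review(n_words=80):
    # 3. For one review
    r = rng.integers(R)                                   # choose a rating (uniform here)
    theta = rng.dirichlet(np.full(K, alpha))              # choose a topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                        # choose a topic
        s = rng.choice(S, p=pi[z, r])                     # choose a sentiment
        words.append(rng.choice(V, p=phi[z, s]))          # choose a word
    return r, words
```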
Sentiment Analysis
• Inference
– Parameters to be inferred:
1. Per-document topic distributions
2. Rating distribution
3. Sentiment distributions
4. Word distributions
– Use collapsed Gibbs sampling! Integrate out $\phi$ and $\pi$
Sentiment Analysis
• Evaluation – Yelp
– Sandwich: sandwich, slaw, primanti, coleslaw, cole, market, pastrami, reuben, bro, mayo, famous, cheesesteak, rye, zucchini, swiss, sammy, peppi, burgh, messi
– Vietnamese: pho, noodl, bowl, soup, broth, sprout, vermicelli, peanut, lemongrass, leaf
– Payment options: server, check, custom, card, return, state, credit, coupon, accept, tip, treat, gift, refill
– Location: locat, park, street, drive, hill, window, south, car, downtown, number, corner, distance
– Ambience: crowd, fun, group, rock, play, loud, music, young, sing, club, ticket, meet, entertain, dance, band, song
Sentiment Analysis
• Evaluation – Yelp – Rating prediction
Sentiment Analysis
• Evaluation – Yelp
– Opinion summarization
• For all reviews of one restaurant:
– 15% of words assigned to the topic “Vegetarian”
– 5% to “Breakfast” (Eggs), with sentiment 0.78
– 3% to “Staff Attitude”, with sentiment 0.82
Topic Modelling: Future Work
• Missing links
– Model selection: which model to pick for which application
– Incorporating linguistic structure/NLP: how can our knowledge of language help?
– Bag of words:
• Most models are based on the unigram bag-of-words model
• Context is lost – words like good or nice are often associated with particular words in context, e.g. ‘good standard of living’, ‘nice view from the hotel’
Topic Modelling
Questions?

Topic Modelling
Thank You!