CMSC 25025 / Stat 37601
Machine Learning and Large Scale Data Analysis
Tuesday, April 21
For Today
• Mixtures (redux)
• Bayesian inference (redux)
• Topic models
Mixtures
• Key technique: Mixture models
• Mixtures have latent variables
• Flexible tool
• Simple and difficult at the same time
Gaussian Mixture
[Figure: density p(x) of a two-component Gaussian mixture.]

p(x) = (2/5) φ(x; −1.25, 1) + (3/5) φ(x; 2.95, 1)
Bumps and More Bumps (MacKay and Williams)
A mixture of k Gaussians can have more than k modes.
[Figure: two two-dimensional Gaussian mixture examples whose densities have more modes than components.]
Mixtures
• Mixture of f and g:

p(x) = η f(x) + (1 − η) g(x)

Simplest, most common kind of latent variable model.

• Hidden variable representation: Define Z ∼ Bernoulli(η) and

p(x) = Σ_{z=0,1} p(x | z) p(z)

with p(x | 1) = f(x), p(x | 0) = g(x), and p(z) = η^z (1 − η)^{1−z}.
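A minimal NumPy/SciPy sketch of this latent-variable view (not from the slides), using the two-component Gaussian mixture from the earlier figure with the reconstructed 2/5 and 3/5 weights:

```python
import numpy as np
from scipy.stats import norm

# p(x) = (2/5) phi(x; -1.25, 1) + (3/5) phi(x; 2.95, 1),
# written in the latent-variable form p(x) = sum_z p(x | z) p(z).
eta = 2 / 5                    # P(Z = 1), the weight on component f
means = {1: -1.25, 0: 2.95}    # p(x | 1) = f, p(x | 0) = g

def density(x):
    """Marginal density p(x), summing out the latent z."""
    return eta * norm.pdf(x, means[1], 1) + (1 - eta) * norm.pdf(x, means[0], 1)

def sample(n, seed=0):
    """Ancestral sampling: first z, then x given z."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, eta, size=n)      # Z ~ Bernoulli(eta)
    x = rng.normal(np.where(z == 1, means[1], means[0]), 1.0)
    return x, z

x, z = sample(10_000)
print(z.mean(), density(0.0))   # z.mean() should be close to eta
```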
Gaussian Mixture: All the Key Concepts
[Figure: the two-component Gaussian mixture density from before.]
Bayesian Inference
The parameter θ of a model is viewed as a random variable.
Inference is usually carried out as follows:
• Choose a generative model p (x | θ) for the data.
• Choose a prior distribution π(θ) that expresses beliefs about the parameter before seeing any data.
• After observing data D_n = {x_1, …, x_n}, update beliefs and calculate the posterior distribution p(θ | D_n).
Bayes’ Theorem
The posterior distribution can be written as
p(θ | x_1, …, x_n) = p(x_1, …, x_n | θ) π(θ) / p(x_1, …, x_n) = L_n(θ) π(θ) / c_n ∝ L_n(θ) π(θ)

where L_n(θ) = ∏_{i=1}^n p(x_i | θ) is the likelihood function and

c_n = p(x_1, …, x_n) = ∫ p(x_1, …, x_n | θ) π(θ) dθ = ∫ L_n(θ) π(θ) dθ

is the normalizing constant, which is also called the evidence.
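A quick numerical sketch of this identity (not from the slides): approximate the posterior on a grid, computing c_n by numerical integration. The Beta(2, 2) prior and the toy data are illustrative.

```python
import numpy as np
from scipy.stats import beta

# Posterior on a grid: posterior = likelihood * prior / c_n.
theta = np.linspace(1e-4, 1 - 1e-4, 1000)
prior = beta.pdf(theta, 2, 2)

x = np.array([1, 0, 1, 1, 0, 1])              # toy Bernoulli observations
s, n = x.sum(), len(x)
likelihood = theta**s * (1 - theta)**(n - s)  # L_n(theta) = prod_i p(x_i | theta)

dtheta = theta[1] - theta[0]
c_n = np.sum(likelihood * prior) * dtheta     # evidence, by numerical integration
posterior = likelihood * prior / c_n
print(np.sum(posterior) * dtheta)             # ~1: the posterior integrates to one
```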
Example
X ∼ Bernoulli(θ) with data D_n = {x_1, …, x_n}. Prior Beta(α, β) distribution:

π_{α,β}(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1}

Let s = Σ_{i=1}^n x_i be the number of "successes."

The posterior distribution θ | D_n is Beta(α + s, β + n − s). The posterior mean is a mixture:

θ̄ = (α + s)/(α + β + n) = [n/(α + β + n)] θ̂ + [(α + β)/(α + β + n)] θ_0

where θ̂ = s/n is the MLE and θ_0 = α/(α + β) is the prior mean.
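A short sketch of this conjugate update (hyperparameters and data are illustrative, not from the slides):

```python
import numpy as np

# Beta-Bernoulli conjugate update and the posterior-mean mixture identity.
alpha, beta_ = 2.0, 2.0                 # prior Beta(alpha, beta), illustrative
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.4, size=15)       # data: X ~ Bernoulli(theta = 0.4)
s, n = x.sum(), len(x)

alpha_post, beta_post = alpha + s, beta_ + n - s   # posterior Beta(alpha+s, beta+n-s)

mle = s / n
prior_mean = alpha / (alpha + beta_)
post_mean = (alpha + s) / (alpha + beta_ + n)
# Posterior mean = convex combination of the MLE and the prior mean:
assert np.isclose(post_mean,
                  n / (alpha + beta_ + n) * mle
                  + (alpha + beta_) / (alpha + beta_ + n) * prior_mean)
print(alpha_post, beta_post, post_mean)
```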
Example
n = 15 points sampled from Bernoulli(θ = 0.4), with s = 7 heads.
[Figure: two panels, "good prior" and "bad prior." Prior distribution (black, dashed), likelihood function (blue, dotted), posterior distribution (red, solid).]
Dirichlet
The multinomial model with a Dirichlet prior is the generalization of the Bernoulli/Beta model:

Dirichlet_α(θ) = [Γ(Σ_{j=1}^K α_j) / ∏_{j=1}^K Γ(α_j)] ∏_{j=1}^K θ_j^{α_j − 1}

where α = (α_1, …, α_K) ∈ R_+^K is a non-negative vector.
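A sketch of the corresponding conjugate update (with prior Dirichlet(α) and observed category counts c, the posterior is Dirichlet(α + c)); the numbers below are illustrative:

```python
import numpy as np

# Dirichlet-multinomial conjugacy, generalizing the Beta-Bernoulli update.
alpha = np.array([6.0, 6.0, 6.0])          # prior concentration, illustrative
rng = np.random.default_rng(0)
theta_true = np.array([0.5, 0.3, 0.2])
counts = rng.multinomial(20, theta_true)   # n = 20 categorical observations

alpha_post = alpha + counts                # conjugate update
post_mean = alpha_post / alpha_post.sum()  # posterior mean of theta
print(counts, alpha_post, post_mean)
```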
Example
[Figure: contour plots. Top row: prior with Dirichlet(6, 6, 6); likelihood function with n = 20. Bottom row: posterior distribution with n = 20; posterior distribution with n = 200.]
Summary
• Mixtures are latent variable models
• The mixing weight encodes a hidden variable
• Computing with mixtures uses basic probabilistic reasoning
• But can get complicated
• Topic models are flexible mixture models for complex data like documents and images (next)
Ball and Elephants
Captioning
Captions generated for a bird image:
• there is a large bird on the water
• a small bird sitting on top of a lake
• a large white bird standing on the water on a beach
• a bird is on the water on a beach
• a bird that is standing in the water

Captions generated for a baseball image:
• a professional baseball game is played in the middle of the field
• several players at the end of a baseball game
• a group of players playing a baseball game
• the baseball players are playing games at the field
• a baseball players are playing with a game and fans

www.cs.toronto.edu/~nitish/nips2014demo/
Intro to Topic Modeling
Some of the following slides are from Dave Blei's 2011 tutorial on Topic Modeling:

http://www.cs.princeton.edu/~blei/topicmodeling.html

A survey paper describing many of these ideas in more detail is here:

http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

See also:

http://awards.acm.org/award_winners/blei_3974465.cfm
Discover topics from a corpus
human         evolution      disease        computer
genome        evolutionary   host           models
dna           species        bacteria       information
genetic       organisms      diseases       data
genes         life           resistance     computers
sequence      origin         bacterial      system
gene          biology        new            network
molecular     groups         strains        systems
sequencing    phylogenetic   control        model
map           living         infectious     parallel
information   diversity      malaria        methods
genetics      group          parasite       networks
mapping       new            parasites      software
project       two            united         new
sequences     common         tuberculosis   simulations
Model the evolution of topics over time
[Figure: word frequencies over time, 1880–2000, for two topics: "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON).]
Model connections between topics
[Figure: a graph of correlated topics discovered from Science, with nodes labeled by their top words (e.g., "sequence, genome, dna, sequencing"; "p53, cell cycle, activity, cyclin"; "magnetic, magnetic field, spin, superconductivity"; "stars, astronomers, universe, galaxies") and edges connecting related topics.]
Annotate images
[Figure: images with predicted annotations:]
• SKY WATER TREE MOUNTAIN PEOPLE
• SCOTLAND WATER FLOWER HILLS TREE
• SKY WATER BUILDING PEOPLE WATER
• FISH WATER OCEAN TREE CORAL
• PEOPLE MARKET PATTERN TEXTILE DISPLAY
• BIRDS NEST TREE BRANCH LEAVES
Discover influential articles
[Figure: weighted influence of articles over time, 1880–2000.]

Jared M. Diamond, Distributional Ecology of New Guinea Birds, Science (1973) [296 citations]

W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916) [3 citations]

William K. Gregory, The New Anthropogeny: Twenty-Five Stages of Vertebrate Evolution, from Silurian Chordate to Man, Science (1933) [3 citations]

Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003) [178 citations]
Predict links between articles
Query article: Markov chain Monte Carlo convergence diagnostics: A comparative review

Links predicted by the RTM (ψ_e):
• Minorization conditions and convergence rates for Markov chain Monte Carlo
• Rates of convergence of the Hastings and Metropolis algorithms
• Possible biases induced by MCMC convergence diagnostics
• Bounding convergence time of the Gibbs sampler in Bayesian image restoration
• Self regenerative Markov chain Monte Carlo
• Auxiliary variable methods for Markov chain Monte Carlo with applications
• Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
• Diagnosing convergence of Markov chain Monte Carlo algorithms
• Exact Bound for the Convergence of Metropolis Chains

Links predicted by LDA + Regression:
• Self regenerative Markov chain Monte Carlo
• Minorization conditions and convergence rates for Markov chain Monte Carlo
• Gibbs-markov models
• Auxiliary variable methods for Markov chain Monte Carlo with applications
• Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
• Mediating instrumental variables
• A qualitative framework for probabilistic inference
• Adaptation for Self Regenerative MCMC
Characterize political decisions
dod, defense, defense and appropriation, military, subtitle
veteran, veterans, bills, care, injury
people, woman, american, nation, school
producer, eligible, crop, farm, subparagraph
coin, inspector, designee, automobile, lebanon
bills, iran, official, company, sudan
human, vietnam, united nations, call, people
drug, pediatric, product, device, medical
child, fire, attorney, internet, bills
surveillance, director, court, electronic, flood
energy, bills, price, commodity, market
land, site, bills, interior, river
child, center, poison, victim, abuse
coast guard, vessel, space, administrator, requires
science, director, technology, mathematics, bills
computer, alien, bills, user, collection
head, start, child, technology, award
loss, crop, producer, agriculture, trade
bills, tax, subparagraph, loss, taxable
cover, bills, bridge, transaction, following
transportation, rail, railroad, passenger, homeland security
business, administrator, bills, business concern, loan
defense, iraq, transfer, expense, chapter
medicare, medicaid, child, chip, coverage
student, loan, institution, lender, school
energy, fuel, standard, administrator, lamp
housing, mortgage, loan, family, recipient
bank, transfer, requires, holding company, industrial
county, eligible, ballot, election, jurisdiction
tax credit, budget authority, energy, outlays, tax
Organize and browse large corpora
This tutorial
• What are topic models?
• What kinds of things can they do?
• How do I compute with a topic model?
• What are some unanswered questions in this field?
• How can I learn more?
Uber Topics
Hi Prof. Lafferty,
I took your ML+LSDA course last Spring. The course was super helpful,
and I just wanted to let you know that I’m currently using Latent Dirichlet
Allocation at my current job at Uber!
We’re using LDA to discover topics in rider feedback – when riders write
comments about their driver after the trip. We’re trying to find topics such
as ’unprofessional driver’, ’driver no-show’, ’sexual harassment’, etc. LDA
has worked really well with this – so thank you for covering it in much
detail in your course.
Bag Demo
Introduction to Topic Modeling
Probabilistic modeling
1 Data are assumed to be observed from a generative probabilistic
process that includes hidden variables.
• In text, the hidden variables are the thematic structure.
2 Infer the hidden structure using posterior inference.
• What are the topics that describe this collection?
3 Situate new data into the estimated model.
• How does a new document fit into the topic structure?
Latent Dirichlet allocation (LDA)
Simple intuition: Documents exhibit multiple topics.
Generative model for LDA
[Figure: four example topics with their highest-probability words, alongside a document with its topic proportions and per-word assignments:]

gene 0.04, dna 0.02, genetic 0.01, …
life 0.02, evolve 0.01, organism 0.01, …
brain 0.04, neuron 0.02, nerve 0.01, …
data 0.02, number 0.02, computer 0.01, …
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
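A minimal NumPy sketch of this generative process (all sizes and hyperparameters below are illustrative, not from the slides):

```python
import numpy as np

# LDA's generative process, following the bullets above.
rng = np.random.default_rng(0)
K, V, D, N = 4, 1000, 100, 50   # topics, vocabulary, documents, words per document
eta, alpha = 0.1, 0.5           # Dirichlet parameters for topics and proportions

beta = rng.dirichlet(np.full(V, eta), size=K)   # each topic: a distribution over words
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
    z = rng.choice(K, size=N, p=theta)          # per-word topic assignments
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # words from assigned topics
    docs.append(w)
```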
The posterior distribution
[Figure: the same schematic, with only the documents visible.]

• In reality, we only observe the documents
• The other structure consists of hidden variables
The posterior distribution
[Figure: the same schematic.]

• Our goal is to infer the hidden variables
• I.e., compute their distribution conditioned on the documents

p(topics, proportions, assignments | documents)
LDA as a graphical model
[Graphical model: α (proportions parameter) → θ_d (per-document topic proportions) → z_{d,n} (per-word topic assignment) → w_{d,n} (observed word), with η (topic parameter) → β_k (topics) → w_{d,n}.]
• Encodes our assumptions about the data
• Connects to algorithms for computing with data
• See Pattern Recognition and Machine Learning (Bishop, 2006).
LDA as a graphical model
[Graphical model, as above.]
• Nodes are random variables; edges indicate dependence.
• Shaded nodes are observed.
• Plates indicate replicated variables.
LDA as a graphical model
[Graphical model, as above.]

∏_{i=1}^K p(β_i | η) ∏_{d=1}^D p(θ_d | α) ∏_{n=1}^N p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n})
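To make the factorization concrete, here is a small sketch (not from the slides) that evaluates this log-joint for a tiny, randomly generated configuration; all sizes and hyperparameters are made up for illustration.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(1)
K, V, D, N = 2, 5, 3, 4                 # topics, vocab, docs, words per doc
eta, alpha = np.full(V, 0.5), np.full(K, 0.5)

beta = rng.dirichlet(eta, size=K)       # topics beta_1..beta_K
theta = rng.dirichlet(alpha, size=D)    # per-document proportions theta_d
z = rng.integers(K, size=(D, N))        # per-word topic assignments z_{d,n}
w = rng.integers(V, size=(D, N))        # observed words w_{d,n}

log_joint = sum(dirichlet.logpdf(beta[i], eta) for i in range(K))
for d in range(D):
    log_joint += dirichlet.logpdf(theta[d], alpha)
    for n in range(N):
        # p(z_{d,n} | theta_d) * p(w_{d,n} | beta_{1:K}, z_{d,n})
        log_joint += np.log(theta[d, z[d, n]]) + np.log(beta[z[d, n], w[d, n]])
print(log_joint)
```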
LDA
[Plate diagram: α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η, with plates over the N words, D documents, and K topics.]

• This joint defines a posterior.
• From a collection of documents, infer
  • per-word topic assignments z_{d,n}
  • per-document topic proportions θ_d
  • per-corpus topic distributions β_k
• Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, …
LDA
[Plate diagram, as above.]

Approximate posterior inference algorithms:

• Mean field variational methods (Blei et al., 2001, 2003)
• Expectation propagation (Minka and Lafferty, 2002)
• Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
• Collapsed variational inference (Teh et al., 2006)
• Online variational inference (Hoffman et al., 2010)
Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).
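Of these, collapsed Gibbs sampling is the easiest to sketch in a few lines. The following is a minimal illustrative implementation in the spirit of the Griffiths–Steyvers sampler, not an optimized or official one; `docs` (a list of word-id arrays), `V`, and the hyperparameter values are assumed inputs.

```python
import numpy as np

def gibbs_lda(docs, V, K=10, alpha=0.5, eta=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initial assignments
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkv = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # topic totals
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for w, k in zip(ws, zs):
            ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, (ws, zs) in enumerate(zip(docs, z)):
            for i, w in enumerate(ws):
                k = zs[i]            # remove the current assignment from the counts
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k | rest) ∝ (ndk+alpha)(nkv+eta)/(nk+V*eta)
                p = (ndk[d] + alpha) * (nkv[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                zs[i] = k            # resample and restore the counts
                ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    return z, ndk, nkv
```

Point estimates then come from normalizing the counts: θ̂_d ∝ ndk[d] + alpha and β̂_k ∝ nkv[k] + eta.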
Example inference
[Plate diagram, as above.]

• Data: The OCR'ed collection of Science from 1990–2000
  • 17K documents
  • 11M words
  • 20K unique terms (stop words and rare words removed)
• Model: 100-topic LDA model using variational inference.
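A fit of this general kind can be reproduced with scikit-learn's variational LDA; the sketch below is illustrative, with a tiny stand-in corpus rather than the actual Science data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny stand-in corpus (the real example used 17K Science articles).
raw_docs = [
    "gene dna genome genetic sequencing",
    "dna sequence gene molecular genetics",
    "neuron brain nerve cortex synapse",
    "brain neuron memory cortex stimulus",
    "data computer model network algorithm",
    "computer system data software network",
]

vec = CountVectorizer()
X = vec.fit_transform(raw_docs)                    # document-term count matrix

lda = LatentDirichletAllocation(n_components=3, max_iter=50,
                                learning_method="batch", random_state=0)
theta = lda.fit_transform(X)                       # per-document topic proportions

vocab = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):         # topic-word weights
    top = comp.argsort()[-5:][::-1]
    print(k, [vocab[i] for i in top])
```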
Example inference
[Figure: inferred topic proportions for one document, over the 100 topics.]
Example inference (II)
problem          model          selection      species
problems         rate           male           forest
mathematical     constant       males          ecology
number           distribution   females        fish
new              time           sex            ecological
mathematics      number         species        conservation
university       size           female         diversity
two              values         evolution      population
first            value          populations    natural
numbers          average        population     ecosystems
work             rates          sexual         populations
time             data           behavior       endangered
mathematicians   density        evolutionary   tropical
chaos            measured       genetic        forests
chaotic          models         reproductive   ecosystem
Used to explore and browse document collections
Aside: The Dirichlet distribution
• The Dirichlet distribution is an exponential family distribution over
the simplex, i.e., positive vectors that sum to one
p(θ | α) = [Γ(Σ_i α_i) / ∏_i Γ(α_i)] ∏_i θ_i^{α_i − 1}
• It is conjugate to the multinomial. Given a multinomial
observation, the posterior distribution of θ is a Dirichlet.
• The parameter α controls the mean shape and sparsity of θ.
• The topic proportions are a K-dimensional Dirichlet. The topics are a V-dimensional Dirichlet.
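The figures that follow show draws at several values of α. A short NumPy sketch (an assumed setup, not from the slides) reproduces their flavor:

```python
import numpy as np

# Draws from a symmetric 10-dimensional Dirichlet at several concentrations,
# summarizing each by the average size of the largest coordinate.
rng = np.random.default_rng(0)
for a in [100, 10, 1, 0.1, 0.01]:   # 0.001 omitted: draws can underflow numerically
    theta = rng.dirichlet(np.full(10, a), size=1000)
    # Large alpha: near-uniform vectors; small alpha: mass piles on few items.
    print(a, round(theta.max(axis=1).mean(), 3))
```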
α = 1
[Figure: 15 independent draws from a 10-dimensional Dirichlet with α = 1; each panel plots one draw θ over items 1–10.]
α = 10
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 10.]
α = 100
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 100.]
α = 1
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 1, shown again.]
α = 0.1
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 0.1.]
α = 0.01
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 0.01.]
α = 0.001
[Figure: 15 draws from a 10-dimensional Dirichlet with α = 0.001.]
Why does LDA “work”?
Why does the LDA posterior put “topical” words together?
• Word probabilities are maximized by dividing the words among
the topics. (More terms means more mass to be spread around.)
• In a mixture, this is enough to find clusters of co-occurring words.
• In LDA, the Dirichlet on the topic proportions can encourage
sparsity, i.e., a document is penalized for using many topics.
• Loosely, this can be thought of as softening the strict definition of "co-occurrence" in a mixture model.
• This flexibility leads to sets of terms that more tightly co-occur.
Summary of LDA
[Plate diagram, as above.]

• LDA can
  • visualize the hidden thematic structure in large corpora
  • generalize new data to fit into that structure
• Builds on Deerwester et al. (1990) and Hofmann (1999)
• It is a mixed membership model (Erosheva, 2004)
• Relates to multinomial PCA (Jakulin and Buntine, 2002)
• Was independently invented for genetics (Pritchard et al., 2000)
Implementations of LDA
There are many available implementations of topic modeling:
LDA-C∗       A C implementation of LDA
HDP∗         A C implementation of the HDP ("infinite LDA")
Online LDA∗  A Python package for LDA on massive data
LDA in R∗    An R package for many topic models
LingPipe     Java toolkit for NLP and computational linguistics
Mallet       Java toolkit for statistical NLP
TMVE∗        A Python package to build browsers from topic models

∗ available at www.cs.princeton.edu/~blei/
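For completeness, a minimal usage sketch with gensim, a Python library not on the list above; `texts` here is a toy list of tokenized documents standing in for real data.

```python
from gensim import corpora, models

# Toy corpus of tokenized documents.
texts = [
    ["gene", "dna", "genome", "sequencing"],
    ["dna", "sequence", "gene", "genetics"],
    ["brain", "neuron", "nerve", "memory"],
    ["neuron", "brain", "cortex", "stimulus"],
]

dictionary = corpora.Dictionary(texts)                # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in texts]      # bag-of-words corpus
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
print(lda.show_topics(num_topics=2, num_words=4))
```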