1
LING 696B: Mixture model and linear dimension reduction
2
Statistical estimation Basic setup:
The world: distributions p(x; θ), θ -- the parameters; “all models may be wrong, but some are useful”
Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x|θ))
Observations: X = {x1, x2, …, xN} generated from some p(x; θ). N is the number of observations
Model-fitting: based on some examples X, make guesses (learning, inference) about θ
3
Statistical estimation Example:
Assume people’s height follows a normal distribution, so θ = (mean, variance)
p(x; θ) = the probability density function of the normal distribution
Observation: measurements of people’s height
Goal: estimate the parameters θ of the normal distribution
4
Maximum likelihood estimate (MLE) Likelihood function: the examples xi
are independent of one another, so L(θ) = p(x1, …, xN; θ) = ∏i p(xi; θ)
Among all the possible values of θ, choose the θ̂ that makes L(θ) the biggest: θ̂ = argmaxθ L(θ)
Consistency: as N grows, the maximum-likelihood estimate converges to the true θ -- provided the true distribution lies in the hypothesis space H!
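As a minimal numerical sketch of the height example (assuming the data really are Gaussian), the maximum-likelihood estimates are just the sample mean and the 1/N sample variance:

```python
import numpy as np

# Simulated "height" measurements from a normal distribution with known parameters
rng = np.random.default_rng(0)
X = rng.normal(loc=170.0, scale=8.0, size=500)   # true mu = 170, sigma = 8

# MLE for a Gaussian: sample mean and the 1/N (not 1/(N-1)) sample variance
mu_hat = X.mean()
var_hat = ((X - mu_hat) ** 2).mean()

print(mu_hat, np.sqrt(var_hat))   # approaches 170 and 8 as N grows (consistency)
```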
5
H matters a lot! Example: curve fitting with
polynomials
6
Clustering Need to divide x1, x2, …, xN into
clusters, without a priori knowledge of where clusters are
An unsupervised learning problem: fitting a mixture model to x1, x2, …, xN
Example: the heights of males and females follow two different distributions, but we don’t know the gender behind x1, x2, …, xN
7
The K-means algorithm Start with a random assignment,
calculate the means
8
The K-means algorithm Re-assign members to the closest
cluster according to the means
9
The K-means algorithm Update the means based on the
new assignments, and iterate
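These three steps fit in a few lines; the following is a minimal NumPy sketch (random initial assignment, K given), not an optimized implementation:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Cluster the rows of X into K clusters by alternating the two steps above."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))                  # random initial assignment
    for _ in range(n_iter):
        # Update step: compute the mean of each cluster (re-seed a cluster if it went empty)
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else X[rng.integers(len(X))] for k in range(K)])
        # Assignment step: re-assign each point to the closest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):             # no change: converged
            return labels, means
        labels = new_labels
    return labels, means
```

For the height example, kmeans(heights.reshape(-1, 1), 2) would return a hard gender guess for every measurement.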
10
Why does K-means work? In the beginning, the centers are poorly
chosen, so the clusters overlap a lot. But if the centers are moving away from each
other, then the clusters tend to separate better. Vice versa, if the clusters are well separated,
then the centers will stay away from each other.
Intuitively, these two steps “help each other”
11
Interpreting K-means as statistical estimation Equivalent to fitting a mixture of
Gaussians with: spherical covariance; uniform prior (equal weights on each Gaussian)
Problems: ambiguous data should have gradient
membership; the shape of the clusters may not be
spherical; the size of the cluster should play a role
12
Multivariate Gaussian 1-D: N(μ, σ²); N-D: N(μ, Σ), μ ~ N×1 vector, Σ ~ N×N
matrix with Σ(i,j) = σij ~ correlation Probability calculation:
p(x; μ, Σ) = C |Σ|^(-1/2) exp{-(1/2)(x-μ)^T Σ^(-1) (x-μ)}, with C = (2π)^(-N/2)
Intuitive meaning of Σ^(-1): how to
calculate the distance from x to μ
13
Multivariate Gaussian: log likelihood and distance The log likelihood is, up to a constant, -(1/2)(x-μ)^T Σ^(-1) (x-μ), so Σ^(-1) defines the distance:
Spherical covariance matrix: Σ = σ²I, so Σ^(-1) = (1/σ²)I -- plain Euclidean distance in every direction
Diagonal covariance matrix: Σ^(-1) = diag(1/σ1², …, 1/σN²) -- each dimension weighted by its own variance
Full covariance matrix: Σ^(-1) is a general symmetric, positive-definite matrix -- a tilted, stretched “pancake”
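A small sketch of the distance term (x - μ)^T Σ^(-1) (x - μ) in the three cases; only the covariance matrix changes:

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared distance (x - mu)^T cov^{-1} (x - mu) from the Gaussian log likelihood."""
    d = x - mu
    return d @ np.linalg.inv(cov) @ d

x, mu = np.array([1.0, 2.0]), np.zeros(2)
spherical = 2.0 * np.eye(2)                        # sigma^2 * I: scaled Euclidean distance
diagonal  = np.diag([1.0, 4.0])                    # each dimension has its own variance
full      = np.array([[2.0, 1.5], [1.5, 2.0]])     # correlated dimensions: tilted pancake

for cov in (spherical, diagonal, full):
    print(mahalanobis_sq(x, mu, cov))
```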
14
Learning mixture of Gaussians: EM algorithm Expectation: putting “soft” labels
on data -- a pair (γ, 1-γ), e.g. (0.5, 0.5), (0.05, 0.95), (0.8, 0.2)
15
Learning mixture of Gaussians: EM algorithm Maximization: doing Maximum-
Likelihood with weighted data
Notice everyone is wearing a hat! (the updated quantities are all estimates, hence the hats)
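Both steps together for a two-component 1-D Gaussian mixture; a minimal sketch (γ is the soft label, and the hatted re-estimates are just weighted means and variances):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=50):
    """EM for a mixture of two 1-D Gaussians; returns weights, means, standard deviations."""
    w  = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])              # crude initialization to break symmetry
    sd = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: soft labels gamma[i, k] = P(component k | x_i)
        p = w * norm.pdf(x[:, None], mu, sd)       # N x 2 weighted densities
        gamma = p / p.sum(axis=1, keepdims=True)
        # M-step: maximum likelihood with weighted data
        Nk = gamma.sum(axis=0)
        w  = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return w, mu, sd
```

On the pooled male/female height data, the two recovered (weight, mean, sd) triples would approximate the gender-specific distributions without ever seeing the gender labels.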
16
EM vs. K-means Same:
Iterative optimization, provably converges (see demo)
EM better captures the intuition: ambiguous data are assigned gradient
membership; clusters can be arbitrarily shaped
“pancakes”; the size of the cluster is a parameter; allows for flexible control based on
prior knowledge (see demo)
17
EM is everywhere Our problem: the labels are important,
yet not observable – “hidden variables” This situation is common for complex
models, and Maximum likelihood --> EM: Bayesian Networks, Hidden Markov Models, Probabilistic Context-Free Grammars, Linear Dynamical Systems
18
Beyond Maximum likelihood? Statistical parsing Interesting remark from Mark Johnson:
Initialize a PCFG with treebank counts Train the PCFG on the treebank with EM
A large amount of NLP research tries to dump the first, and improve the second
(Plot: log likelihood vs. the measure of success over EM iterations)
19
What’s wrong with this? Mark Johnson’s idea:
Wrong data: humans don’t just learn from strings
Wrong model: human syntax isn’t context-free
Wrong way of calculating likelihood: p(sentence | PCFG) isn’t informative
(Maybe) wrong measure of success?
20
End of excursion: Mixture of many things Any generative model can be combined
with a mixture model to deal with categorical data
Examples: Mixture of Gaussians, Mixture of HMMs, Mixture of Factor Analyzers, Mixture of Expert networks
It all depends on what you are modeling
21
Applying to the speech domain
Speech signals have high dimensions Using front-end acoustic modeling from
speech recognition: Mel-Frequency Cepstral Coefficients (MFCC)
Speech sounds are dynamic Dynamic acoustic modeling: MFCC-delta Mixture components are Hidden Markov
Models (HMM)
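A hedged sketch of such a front-end using the librosa library; the file name and the frame/coefficient settings here are placeholders, not the lecture's actual configuration:

```python
import numpy as np
import librosa

# Load a waveform, compute 13 MFCCs per frame, then append their deltas (MFCC-delta)
# so that each frame also encodes how the spectrum is changing over time.
y, sr = librosa.load("utterance.wav", sr=16000)      # "utterance.wav" is a placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
delta = librosa.feature.delta(mfcc)                  # frame-to-frame change of each coefficient
features = np.vstack([mfcc, delta])                  # 26-dimensional vector per frame
```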
22
Clustering speech with K-means Phones from TIMIT
23
Clustering speech with K-means Diphones
Words
24
What’s wrong here Longer sound sequences are more
distinguishable for people, yet doing K-means on static feature
vectors misses the change over time
Mixture components must be able to capture dynamic data
Solution: mixture of HMMs
25
Mixture of HMMs (Diagram: HMMs as mixture components, with states labeled silence, burst, transition)
Learning: EM for HMM + EM for mixture
26
Mixture of HMMs Model-based clustering Front-end (MFCC+delta) Algorithm: initial guess by K-means, then EM
Gaussian mixture for single frames
HMM mixture for whole sequences
27
Mixture of HMMs vs. K-means
Phone clustering: 7 phones from 22 speakers
*1 – 5: cluster index
28
Mixture of HMMs vs. K-means
Diphone clustering: 6 diphones from 300+ speakers
29
Mixture of HMMs vs. K-means
Word clustering: 3 words from 300+ speakers
30
Growing the model Guessing 6 at once is hard, but 2 is easy; Hill-climbing strategy: start with 2,
then 3, 4, ... Implementation: split the cluster with
the maximum gain in likelihood; Intuition: discriminate within the
biggest pile.
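One way the splitting strategy could be written, sketched here with scikit-learn Gaussian mixtures standing in for the HMM mixtures of the lecture (and in the lecture's procedure each accepted split is also followed by retraining on all the data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grow_clusters(X, max_K=6):
    """Start at 2 clusters; repeatedly split the cluster whose split gains the most likelihood."""
    labels = GaussianMixture(n_components=2, random_state=0).fit(X).predict(X)
    K = 2
    while K < max_K:
        best_gain, best_k, best_split = -np.inf, None, None
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) < 10:                                   # too small to split
                continue
            one = GaussianMixture(n_components=1, random_state=0).fit(Xk)
            two = GaussianMixture(n_components=2, random_state=0).fit(Xk)
            gain = (two.score(Xk) - one.score(Xk)) * len(Xk)   # total log-likelihood gain
            if gain > best_gain:
                best_gain, best_k, best_split = gain, k, two
        if best_k is None:
            break
        sub = best_split.predict(X[labels == best_k])          # 0/1 labels within the chosen cluster
        labels[labels == best_k] = np.where(sub == 0, best_k, K)
        K += 1
    return labels
```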
31
Learning categories and features with mixture model Procedure: apply the mixture model
and EM algorithm, inductively finding clusters
Each split is followed by a retraining step using all data
(Tree: Data splits into clusters 1 and 2, which split further into 11, 12, 21, 22)
32
(Figure: % classified as Cluster 1 vs. % classified as Cluster 2, for all data; Cluster 1 = obstruent, Cluster 2 = sonorant; sounds labeled in IPA and TIMIT)
33
(Figure: % classified as Cluster 11 vs. % classified as Cluster 12, splitting Cluster 1; Cluster 11 = fricative)
34
(Figure: % classified as Cluster 21 vs. % classified as Cluster 22, splitting Cluster 2; Cluster 21 = back sonorant)
35
(Figure: % classified as Cluster 121 vs. % classified as Cluster 122, splitting Cluster 12; Cluster 121 = oral stop, Cluster 122 = nasal stop)
36
(Figure: % classified as Cluster 221 vs. % classified as Cluster 222, splitting Cluster 22 into front low sonorants and front high sonorants; the other leaves of the tree are fricative, oral stop, nasal stop, and back sonorant)
37
Summary: learning features
Discovered features: distinctions between natural classes based on spectral properties
(Feature tree: all data → [+sonorant] / [-sonorant]; [-sonorant] → [+fricative] / [-fricative]; [-fricative] → [-nasal] / [+nasal]; [+sonorant] → [+back] / [-back]; [-back] → [+high] / [-high])
For individual sounds, the feature values are gradient rather than binary (Ladefoged, 2001)
38
Evaluation: phone classification How do the “soft” classes fit into “hard” ones?
(Tables: classification results on the training set and on the test set)
Are “errors” really errors?
39
Level 2: Learning segments + phonotactics
Segmentation is a kind of hidden structure The iterative strategy works here too
Optimization -- the augmented model: p(words | units, phonotactics, segmentation)
Units: argmax_U p({wi} | U, P, {si}); clustering = argmax p(segments | units) -- Level 1
Phonotactics: argmax_P p({wi} | U, P, {si}); estimating the transitions of a Markov chain
Segmentation: argmax_{si} p({wi} | U, P, {si}); Viterbi decoding
40
Iterative learning as coordinate-wise ascent
Each step increases the likelihood score and eventually reaches a local maximum
(Figure: level curves of the likelihood score over the two coordinates, segmentation vs. units + phonotactics; the initial value comes from Level-1 learning)
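The same picture on a toy two-variable objective (x and y stand in for the segmentation and the units + phonotactics coordinates): each coordinate update maximizes the score along one axis, so the score never decreases.

```python
import numpy as np

def score(x, y):
    """A toy concave 'likelihood score' with a single maximum at (1, 2)."""
    return -(x - 1.0) ** 2 - 2.0 * (y - 2.0) ** 2 - (x - 1.0) * (y - 2.0)

x, y = 5.0, -3.0                         # initial value (from "Level-1 learning")
for step in range(10):
    x = 1.0 - 0.5 * (y - 2.0)            # maximize over x with y held fixed (d/dx = 0)
    y = 2.0 - 0.25 * (x - 1.0)           # maximize over y with x held fixed (d/dy = 0)
    print(step, round(score(x, y), 6))   # monotonically climbs to the maximum at (1, 2)
```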
41
Level 3:Lexicon can be mixtures too
Re-clustering of words using the mixture-based lexical model
Initial values (mixture components, weights) come from bottom-up learning (Stage 2)
Iterating steps: Classify each word as the best exemplar of
the given lexical item (also infer segmentation)
Update lexical weights + units + phonotactics
42
Big question: How to choose K? Basic problem:
Nested hypothesis spaces: H_{K-1} ⊂ H_K ⊂ H_{K+1} ⊂ …
As K goes up, the likelihood always goes up.
Recall the polynomial curve fitting; the same holds for mixture models
(see demo)
43
Big question: How to choose K? Idea #1: don’t just look at the likelihood,
look at a combination of the likelihood and something else (d = number of parameters, N = number of observations):
Bayesian Information Criterion: -2 log L(θ̂) + (log N)·d
Minimum Description Length: -log L(θ̂) + description length of the model
Akaike Information Criterion: -2 log L(θ̂) + 2d
In practice, you often need magical “weights” in front of the something else
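Idea #1 in code, using scikit-learn's Gaussian mixtures and their built-in BIC/AIC on synthetic data with three true clusters (a sketch; for the mixture of HMMs the penalty term would have to be computed by hand from the model's parameter count):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 1, (200, 2)),     # three well-separated clusters
               rng.normal((5, 0), 1, (200, 2)),
               rng.normal((0, 6), 1, (200, 2))])

for K in range(1, 8):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    log_lik = gm.score(X) * len(X)      # total log likelihood: keeps going up with K
    # BIC/AIC bottom out near the true K = 3
    print(K, round(log_lik, 1), round(gm.bic(X), 1), round(gm.aic(X), 1))
```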
44
Big question: How to choose K? Idea #2: use one set of data for
learning, and one for testing generalization
Cross-validation: run EM until the likelihood starts to hurt on the test set (see demo)
What if you have a bad test set? Jack-knife procedure: cut the data into 10 parts, and do 10
rounds of training and testing
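Idea #2 sketched in the same setting: fit on one part of the data and watch the held-out likelihood, which flattens out (and starts to hurt) once K is too large:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 1, (300, 2)),      # two true clusters
               rng.normal((4, 4), 1, (300, 2))])
rng.shuffle(X)                                       # mix before splitting
train, test = X[:400], X[400:]

for K in range(1, 8):
    gm = GaussianMixture(n_components=K, random_state=0).fit(train)
    # Per-sample log likelihood: training tends to keep rising, test stops improving past K = 2
    print(K, round(gm.score(train), 3), round(gm.score(test), 3))
```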
45
Big question: How to choose K? Idea #3: treat K as a “hyper” parameter,
and do Bayesian learning on K More flexible: K can grow and shrink
depending on the amount of data Allow K to grow to infinity: Dirichlet /
Chinese restaurant process mixture Need “hyper-hyper” parameters to
control how likely K is to grow Also computationally intensive
46
Big question: How to choose K? There is really no elegant universal
solution One view: statistical learning looks
within H_K, but does not come up with H_K itself
How do people choose K? (also see later reading)
47
Dimension reduction Why dimension reduction? Example: estimate a continuous
probability distribution by counting histograms on samples
(Figure: the same sample histogrammed with 10, 20, and 30 bins)
48
Dimension reduction Now think about 2D, 3D …
How many bins do you need? Estimate the density of the distribution
with a Parzen window: p(x) ≈ (number of data points in the window) / (N × window size)
How big does the window size (r) need to grow?
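The Parzen estimate in 1-D, as a sketch; in d dimensions the window volume becomes r^d, which is why r has to grow so fast to keep any data inside the window:

```python
import numpy as np

def parzen_density(x0, X, r):
    """Estimate p(x0) as (# samples inside a window of width r) / (N * window size)."""
    inside = np.abs(X - x0) <= r / 2
    return inside.sum() / (len(X) * r)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 10_000)                  # samples from a standard normal
for r in (0.1, 0.5, 2.0):
    print(r, parzen_density(0.0, X, r))           # true density at 0 is 1/sqrt(2*pi) ~ 0.399
```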
49
Curse of dimensionality Discrete distributions:
Phonetics experiment: M speakers × N sentences × P stresses × Q segments × …
Decision rules: (K) Nearest-neighbor How big a K is safe? How long do you have to wait until you
are really sure they are your nearest neighbors?
50
One obvious solution Assume we know something about
the distribution; this translates to a parametric approach
Example: counting histograms for 10-D data needs lots of bins (d^10 of them, with d bins per dimension), but knowing it’s a pancake allows us to fit a Gaussian: d^10 parameters vs. how many? (a 10-D Gaussian needs only 10 mean + 55 covariance parameters)
51
Linear dimension reduction Principal Components Analysis, Multidimensional Scaling, Factor Analysis, Independent Component Analysis As we will see, we still need to
assume we know something…
52
Principal Component Analysis Many names (eigen modes, KL
transform, etc.) and relatives The key is to understand how to
make a pancake: Centering, rotating and smashing
Step 1: moving the dough to the center: X <-- X - μ (the sample mean)
53
Principal Component Analysis Step 2: finding a direction of
projection that has the maximal “stretch”
Linear projection of X onto vector w: Proj_w(X) = X_{N×d} * w_{d×1} (X centered)
Now measure the stretch: this is the sample variance, Var(X*w)
54
Principal Component Analysis Step 3: formulate this as a
constrained optimization problem Objective of optimization: Var(X*w) Need a constraint on w (otherwise it can
explode), so only consider the direction So formally:
argmax_{||w||=1} Var(X*w)
55
Principal Component Analysis Some algebra (homework):
Var(x) = E[(x - E[x])²]
= E[x²] - (E[x])²
Apply to matrices (homework): Var(X*w) = w^T X^T X w = w^T Cov(X) w (why?)
Cov(X) is a d×d matrix (homework): symmetric (easy); for any y, y^T Cov(X) y >= 0 (tricky)
56
Principal Component Analysis Going back to the optimization
problem: w1 = argmax_{||w||=1} Var(X*w) = argmax_{||w||=1} w^T Cov(X) w
The solution is the eigenvector of Cov(X) with the largest eigenvalue
-- the first Principal Component!
57
More principal components We keep looking for w2 in all the
directions perpendicular to w1
Formally: argmax_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2
This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue
w1 and w2 form the new coordinates!
58
Rotation Can keep going until we pick up all d
eigenvectors, perpendicular to each other
Putting these eigenvectors together, we have a big matrix W=(w1,w2,…,wd)
W is called an orthogonal matrix This corresponds to a rotation of the
pancake The rotated pancake has no correlation
between dimensions
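The whole centering-and-rotating recipe in a few lines of NumPy; a sketch based on eigendecomposition of the sample covariance (in practice an SVD of the centered data is the numerically safer route):

```python
import numpy as np

def pca(X, n_components):
    """Rotate the centered data X (N x d) onto its top principal components."""
    Xc = X - X.mean(axis=0)                      # Step 1: move the dough to the center
    C = np.cov(Xc, rowvar=False)                 # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]            # sort directions by "stretch", largest first
    W = eigvecs[:, order[:n_components]]         # orthogonal columns w1, w2, ...
    return Xc @ W                                # new coordinates: uncorrelated dimensions

# Example: a tilted 2-D "pancake"; after rotation the covariance is (nearly) diagonal
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.4], [1.4, 1.0]], size=1000)
Z = pca(X, 2)
print(np.round(np.cov(Z, rowvar=False), 3))
```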